Project
Content
The course project is intended as an opportunity to engage in research on machine learning theory. Hence, the project must contain a theoretical component. You are free to pick any topic you like, within reason.
Some examples of suitable project “types” are as follows.
- Prove a new, interesting theoretical result in a new or existing machine learning model. You can be ambitious, but do also aim for something interesting to show by the end of the semester.
  - Example: the theory parts of this paper on Brown clustering. The theory here is actually fairly simple and could probably be elaborated on without much difficulty.
- Simplify an existing but complicated result in a new, non-trivial way.
  - Example: Dasgupta and Gupta’s elementary proof of the Johnson-Lindenstrauss (J-L) lemma, stated at the end of this section for reference (although producing a proof this miraculous is not required!).
- Unify and clarify the relationships between several (at least three or four) papers with a common theme or topic in a survey paper. Note that projects of this type will be judged especially critically for clarity and writing quality.

A project that “just” implements an algorithm and runs some experiments is not suitable, even if the algorithm comes from a theory paper.
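For reference, the Johnson-Lindenstrauss lemma mentioned in the second example above states, in one common form (constants and phrasing vary across sources): for every \(0 < \varepsilon < 1\) and every set \(S\) of \(n\) points in \(\mathbb{R}^d\), there is a linear map \(f \colon \mathbb{R}^d \to \mathbb{R}^k\) with \(k = O(\varepsilon^{-2} \log n)\) such that
\[
(1 - \varepsilon)\,\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \varepsilon)\,\|u - v\|^2 \quad \text{for all } u, v \in S.
\]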
Submission
The final product is a written report. The length of the report will depend on the context. For example, this paper introduces a problem and states the main results in about eight pages, and the proof is in a nine-page appendix. As another example, this paper simplifies the main result of an award-winning PhD thesis in a single page. Quality and clarity are more important than quantity.
There is a tentative plan to hold a poster session during the final exam period for this course (Dec 21). The posters don’t need to be very elaborate (e.g., feel free to draw them by hand), but you should prepare a short oral presentation (about five to seven minutes) that is visually aided by the poster.
Update: there aren’t enough research projects to warrant having a poster session, so we’ll forego this.
Deadlines
- Oct 12: Submit a project proposal (one or two pages should suffice), describing the project type, the chosen topic, and some preliminary “leg-work”. For instance:
- If you seek to prove a new result, formally state the problem or conjecture, along with background information explaining why such a result is interesting given the prior art, and why you think you can obtain the result (e.g., possible approaches, high-level ideas why the result should be true).
- If you seek to simplify an existing result, formally state the result and what is unsatisfactory about the existing solution, give evidence that it has not yet been simplified (e.g., by referencing its use as a “black box” in subsequent works), and explain why you think you can obtain the simplification.
- If you seek to unify several papers in a survey, formally introduce the common theme or topic of the papers, briefly summarize the high-level ideas of each paper, and explain your unifying thesis and why it transcends the individual papers themselves. You do not have to have read all of the selected papers before submitting the proposal, but you should have a few additional “back-up” papers in case some of the unread selected papers don’t turn out to be relevant.
- Nov 16: Submit a progress report (one or two pages).
- Dec 21: Submit the final project report.
If you need to change your project type or topic after submitting the proposal, you must first confirm this with the instructor and submit a new proposal.
Some suggested topics / papers to read
- Some relevant publication venues to check:
- Conferences: AISTATS, COLT, ICML, NIPS, UAI
- Journals: Annals of Statistics, IEEE Transactions on Information Theory, Journal of Machine Learning Research, Machine Learning Journal
Random linear maps
- Behavior of random linear maps on distributions
- Is there a simple proof of the Diaconis-Freedman result in higher dimensions?
- Dasgupta, Hsu, and Verma, A concentration theorem for projections
- Klartag and Mendelson, Empirical processes and random projections
- Bounds on the target dimension for manifold embeddings
- Are any of these bounds tight?
- Clarkson, Tighter bounds for random projections of manifolds
- Verma, A note on random projections for preserving paths on a manifold
- Hashing versions of JL
- Dasgupta, Kumar, and Sarlos, A sparse Johnson-Lindenstrauss transform
- Kane and Nelson, Sparser Johnson-Lindenstrauss transforms
- Using random linear maps to improve learning
- Random linear maps have been used to improve nearest neighbor search and vector quantization (a bare-bones sketch of the basic projection step appears at the end of this list).
- What kinds of learning problems do these techniques help us solve? There is some work on classification and regression problems, but the results make very strong assumptions.
- Andoni and Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
- Indyk and Naor, Nearest neighbor preserving embeddings
- Dasgupta and Freund, Random projection trees and low dimensional manifolds
- Kpotufe and Dasgupta, A tree-based regressor that adapts to intrinsic dimension
- Other papers
- Krahmer, Mendelson, and Rauhut, Suprema of chaos processes and the restricted isometry property
- Ailon and Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform
- Matousek, On variants of the Johnson–Lindenstrauss lemma
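To make the distance-preservation phenomenon behind many of these papers concrete, here is a minimal numpy sketch of a Gaussian random projection and an empirical check of how much it distorts pairwise distances. It is not taken from any particular paper above; the function name, dimensions, sample size, and seed are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# n points in a high-dimensional space (sizes chosen only for illustration).
n, d, k = 200, 1000, 50
X = rng.standard_normal((n, d))

# Gaussian random linear map R^d -> R^k, scaled so that squared norms are
# preserved in expectation.
A = rng.standard_normal((k, d)) / np.sqrt(k)
Y = X @ A.T

def pairwise_sq_dists(Z):
    """All pairwise squared Euclidean distances between rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * Z @ Z.T

# Distortion ratios over distinct pairs; JL-style results say these
# concentrate around 1 once k is roughly (log n) / eps^2.
iu = np.triu_indices(n, k=1)
ratios = pairwise_sq_dists(Y)[iu] / pairwise_sq_dists(X)[iu]
print("distortion ratios: min %.3f, max %.3f" % (ratios.min(), ratios.max()))
```

Roughly speaking, the papers above refine this basic picture: the sparse and hashed variants speed up applying the map, while the manifold and distributional results ask what happens when the input has more structure than a finite point set.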
Learning mixture models
- Mixtures of Gaussians
- There are several papers that learn mixtures of Gaussians under certain assumptions in \(\mathrm{poly}(k,d)\) time. But can their sample complexity be improved? (A bare-bones EM sketch for the spherical case appears at the end of this list.)
- Arora and Kannan, Learning mixtures of separated nonspherical Gaussians
- Dasgupta and Schulman, A probabilistic analysis of EM for mixtures of separated, spherical Gaussians
- Vempala and Wang, A spectral algorithm for learning mixtures of distributions
- Achlioptas and McSherry, On spectral learning of mixtures of distributions
- Hsu and Kakade, Learning mixtures of spherical Gaussians: moment methods and spectral decompositions
- Can mixture models be learned under weaker distributional assumptions?
- Diakonikolas, Kamath, Kane, Li, Moitra, and Stewart, Robust estimators in high dimensions without the computational intractability
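As a point of reference for the Gaussian mixture items above, here is a minimal numpy sketch of EM for a mixture of two spherical Gaussians with known, equal variance and uniform mixing weights. The separation, variance, iteration count, and seed are arbitrary illustrative choices, not the setting of any particular paper above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from a mixture of two spherical Gaussians in R^d (illustrative sizes).
d, n, sigma = 20, 2000, 1.0
true_means = np.stack([np.zeros(d), 4.0 * np.ones(d) / np.sqrt(d)])  # separation 4*sigma
labels = rng.integers(0, 2, size=n)
X = true_means[labels] + sigma * rng.standard_normal((n, d))

# EM with the variance and mixing weights held fixed, random initialization.
means = rng.standard_normal((2, d))
for _ in range(50):
    # E-step: posterior responsibilities under spherical Gaussian likelihoods.
    sq_dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # (n, 2)
    logp = -sq_dists / (2 * sigma ** 2)
    logp -= logp.max(axis=1, keepdims=True)                             # numerical stability
    resp = np.exp(logp)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means.
    means = (resp.T @ X) / resp.sum(axis=0)[:, None]

print("estimated means (rows) vs. true means, up to label swap:")
print(np.round(means, 2))
print(np.round(true_means, 2))
```

The theoretical questions above concern procedures like this and their spectral and method-of-moments alternatives: how much separation between components is really needed, how many samples suffice, and what survives under weaker or corrupted distributional assumptions.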
Speeding up linear algebraic computations
- Good overview papers
- Mahoney, Randomized algorithms for matrices and data
- Halko, Martinsson, and Tropp, Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions
- Deterministic matrix sketching
- Liberty, Simple and deterministic matrix sketching
- Ghashami and Phillips, Relative errors for deterministic low-rank matrix approximations
- Online PCA / power method
- There have been several recent papers analyzing online PCA algorithms (a bare-bones version of Oja’s update is sketched at the end of this list).
- Balsubramani, Dasgupta, and Freund, The fast convergence of incremental PCA
- Boutsidis, Garber, Karnin, and Liberty, Online principal component analysis
- Jain, Jin, Kakade, Netrapalli, and Sidford, Streaming PCA: matching matrix Bernstein and near-optimal finite sample guarantees for Oja’s algorithm
- Balcan, Du, Wang, and Yu, An improved gap-dependency analysis of the noisy power method
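Since Oja’s algorithm comes up explicitly above, here is a minimal numpy sketch of its rank-one version for streaming estimation of the top principal component. The data model, step-size schedule, and the helper name `next_point` are made up for illustration and are not the setting analyzed in any of these papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stream whose covariance has one dominant direction (illustrative).
d, n = 50, 20000
top = rng.standard_normal(d)
top /= np.linalg.norm(top)

def next_point():
    # Strong signal along `top` plus isotropic noise.
    return 3.0 * rng.standard_normal() * top + rng.standard_normal(d)

# Oja's rule: w <- normalize(w + eta_t * x * <x, w>), one pass over the stream.
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for t in range(1, n + 1):
    x = next_point()
    eta = 1.0 / (100.0 + t)          # decaying step size (constants are arbitrary)
    w = w + eta * x * (x @ w)
    w /= np.linalg.norm(w)

# Alignment with the planted top direction; should be close to 1.
print("|<w, top>| =", abs(w @ top))
```

The streaming PCA papers above analyze exactly this kind of update (and its block and noisy power-method relatives), e.g., its convergence rate and finite-sample guarantees.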
Clustering
- \(k\)-means
- Dasgupta, The hardness of k-means clustering
- Telgarsky and Dasgupta, Moment-based uniform deviation bounds for k-means and friends
- Kumar, Sabharwal, and Sen, Linear-time approximation schemes for clustering problems in any dimensions
- Why do certain \(k\)-means algorithms work better than others?
- It would be great to better understand the “swap” local search method or greedy methods.
- Kanungo, Mount, Netanyahu, Piatko, Silverman, and Wu, A local search approximation algorithm for k-means clustering
- Makarychev, Makarychev, Sviridenko, and Ward, A bi-criteria approximation algorithm for k-means
- Hsu and Telgarsky, Greedy bi-criteria approximations for \(k\)-medians and \(k\)-means
- Ostrovsky, Rabani, Schulman, and Swamy, The effectiveness of Lloyd-type methods for the k-means problem
- Ward’s method is an agglomerative clustering method that can be viewed as a \(k\)-means clustering algorithm (when you stop the algorithm as soon as it has \(k\) clusters). Can this method be rigorously analyzed in some setting?
- Bregman divergences
- The \(k\)-means problem (and many algorithms for it) generalizes nicely to the family of Bregman divergences, which includes squared Euclidean distance as a special case (a bare-bones Lloyd-style sketch with a pluggable divergence appears at the end of this section’s list).
- But there are very few non-trivial theoretical results; the existing ones essentially assume the Bregman divergence behaves like squared Euclidean distance.
- Banerjee, Merugu, Dhillon, and Ghosh, Clustering with Bregman divergences
- Planted partition models
- McSherry, Spectral partitioning of random graphs
- Bshouty and Long, Finding planted partitions in nearly linear time using arrested spectral clustering
- Chaudhuri, Chung, and Tsiatas, Spectral clustering of graphs with general degrees in the extended planted partition model
- Oymak and Hassibi, Finding dense clusters via “low rank + sparse” decomposition
- Cole, Friedland, and Reyzin, A simple spectral algorithm for recovering planted partitions
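To illustrate the Bregman remark above, here is a minimal numpy sketch of Lloyd-style hard clustering with a pluggable Bregman divergence, using the generalized KL (I-)divergence on positive data as the example. The one structural fact it relies on is that, for any Bregman divergence, the cluster mean minimizes the total divergence from the cluster’s points, so the center update is still an average; this is, roughly, the starting observation of the Banerjee et al. paper listed above. The function names (`gen_kl`, `bregman_lloyd`), the data, and the iteration count are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gen_kl(x, c):
    """Generalized KL (I-)divergence d(x, c) for positive vectors -- a Bregman divergence."""
    return np.sum(x * np.log(x / c) - x + c, axis=-1)

def bregman_lloyd(X, k, divergence, n_iter=50):
    """Lloyd-style hard clustering under an arbitrary Bregman divergence."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest center under the chosen divergence.
        dists = np.stack([divergence(X, c) for c in centers], axis=1)   # (n, k)
        assign = dists.argmin(axis=1)
        # Update step: for every Bregman divergence the minimizer is the mean.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

# Positive-valued data in three loose groups (illustrative).
X = np.concatenate([rng.gamma(shape=s, scale=1.0, size=(200, 5)) for s in (1.0, 4.0, 9.0)])
centers, assign = bregman_lloyd(X, k=3, divergence=gen_kl)
print("cluster sizes:", np.bincount(assign))
print(np.round(centers, 2))
```

Plugging in \(d(x, c) = \|x - c\|^2\) recovers the usual Lloyd iteration; the open question flagged above is whether anything non-trivial can be proved once the divergence is genuinely far from the squared Euclidean case.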
Sparsity
- Compressed sensing / sparse linear regression
- Bickel, Ritov, and Tsybakov, Simultaneous analysis of Lasso and Dantzig selector
- Donoho and Stark, Uncertainty principles and signal recovery
- Zhang, Adaptive forward-backward greedy algorithm for learning sparse representations
- Zhang, Sparse recovery with orthogonal matching pursuit under RIP
- Dictionary learning
- Spielman, Wang, and Wright, Exact recovery of sparsely-used dictionaries
- Luh and Vu, Dictionary learning with few samples and matrix concentration
- Agarwal, Anandkumar, Jain, Netrapalli, and Tandon, Learning sparsely used overcomplete dictionaries
- Arora, Ge, and Moitra, New algorithms for learning incoherent and overcomplete dictionaries
- Arora, Ge, Ma, and Moitra, Simple, efficient, and neural algorithms for sparse coding