Given data from an experiment, study, or population, inferring information about the underlying probability distribution is a fundamental problem in statistics and data analysis, with applications and ramifications in countless other fields, theoretical computer science among them. Tackling this problem from a computational viewpoint is the objective of distribution testing: to start off the day, we survey recent developments in this area, (a subset of) the new directions taken and connections made, and some of the exciting “applications and ramifications” these have spawned.
The prototypical question in distribution property testing is the following: given sample access to one or more discrete distributions, determine whether they have some global property or are far from having the property in \ell_1 distance. We will describe a simple unified framework to obtain sample-efficient testers in this setting, by reducing \ell_1-testing to \ell_2-testing. Using our framework, we obtain optimal testers for a wide variety of \ell_1 distribution testing problems, including the following: identity testing to a fixed distribution, closeness testing between two unknown distributions (with equal/unequal sample sizes), independence testing (in any number of dimensions), closeness testing for collections of distributions, and testing k-flatness. For most of these problems, our approach gives the first optimal tester in the literature. Moreover, our testers are significantly simpler to analyze than previous approaches. As an important application of our reduction-based technique, we obtain the first adaptive algorithm for testing equivalence between two unknown distributions. The sample complexity of our algorithm depends on the structure of the unknown distributions, as opposed to merely their domain size, and is significantly better than that of the worst-case optimal tester in many natural instances. Moreover, our technique naturally generalizes to other metrics beyond the \ell_1-distance. As an illustration of its flexibility, we use it to obtain the first near-optimal equivalence tester under the Hellinger distance.
Joint work with Daniel Kane.
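One elementary fact underlying the reduction above (a sketch of mine, not the framework itself): over a domain of size n, Cauchy-Schwarz gives \|p - q\|_1 \le \sqrt{n}\,\|p - q\|_2, so any \ell_2-tester run with accuracy \varepsilon/\sqrt{n} already yields an \ell_1-tester; the framework's contribution is to do much better than this naive step. A minimal numerical check:

```python
import math
import random

def l1_dist(p, q):
    """ell_1 distance between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

def l2_dist(p, q):
    """ell_2 (Euclidean) distance between two probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def random_dist(n, rng):
    """A random probability vector on n outcomes."""
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
n = 1000
p, q = random_dist(n, rng), random_dist(n, rng)
# Cauchy-Schwarz: ||p - q||_1 <= sqrt(n) * ||p - q||_2, so an ell_2 tester
# with accuracy eps / sqrt(n) suffices for ell_1 testing at accuracy eps.
assert l1_dist(p, q) <= math.sqrt(n) * l2_dist(p, q)
```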
To date, three distinct methodologies are known to provably achieve optimal estimation and testing performance for a wide range of statistical properties, including the Shannon entropy, mutual information, the Kullback-Leibler divergence, and the total variation distance, among others. These three approaches have intimate connections, reflect key milestone ideas in statistics and machine learning, and have far-reaching applications beyond distribution estimation and testing. We discuss the fundamental ideas behind these three approaches, their relative strengths and weaknesses, as well as advice for their usage in practice.
Based on joint work with Yanjun Han, Dmitri Pavlichin, Kartik Venkat, and Tsachy Weissman.
Symmetric distribution properties such as support size, support coverage, entropy, and proximity to uniformity, arise in many applications. Specialized estimators and analysis tools were recently used to derive asymptotically sample-optimal approximations for each of these properties. We show that a single, simple, plug-in estimator—profile maximum likelihood (PML)—is sample competitive for all symmetric properties, and in particular is asymptotically sample-optimal for all the properties above.
Joint work with Jayadev Acharya, Hirakendu Das, and Ananda Theertha Suresh.
Traditionally, distribution testing has focused on testing with respect to the total variation distance. In this talk, I will discuss some results on distribution testing with other distances, including \chi^2, Kullback-Leibler, Hellinger, and \ell_2. I'll also discuss the motivation for testing with these other distances, including applications to testing problems both new and old (e.g., testing independence and monotonicity), and allowing for tolerance to model misspecification.
Based on joint works with Jayadev Acharya, Constantinos Daskalakis, and John Wright.
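For concreteness, each of these distances admits a short closed form for discrete distributions over a common finite domain. The sketch below uses only the standard textbook definitions (it is not code from the talk) and illustrates the usual hierarchy relating them:

```python
import math

def tv(p, q):
    """Total variation distance: half the ell_1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def l2(p, q):
    """ell_2 (Euclidean) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def chi2(p, q):
    """chi^2 divergence of p from q (assumes q_i > 0 everywhere)."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

def kl(p, q):
    """Kullback-Leibler divergence (assumes q_i > 0 wherever p_i > 0)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def hellinger(p, q):
    """Hellinger distance (with the 1/2 normalization)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

p, q = [0.5, 0.5], [0.25, 0.75]
# Standard hierarchy: H^2 <= TV <= sqrt(KL / 2) (Pinsker), and KL <= chi^2.
assert hellinger(p, q) ** 2 <= tv(p, q) <= math.sqrt(kl(p, q) / 2)
assert kl(p, q) <= chi2(p, q)
```

These comparisons are what make testing in \chi^2 or KL at least as demanding as testing in total variation.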
How many samples from a multi-dimensional distribution are necessary to distinguish whether it is a product measure or whether it is 10%-far in total variation distance from being product? As it turns out, a number of samples exponential in the dimension is necessary. Similar lower bounds apply to a host of statistical testing problems in high dimensions. So what do we really know about high-dimensional distributions and the important phenomena that they model? I will propose a way out of the conundrum with an overview of recent work on testing structured high-dimensional distributions: Bayesian networks and Markov Random Fields. A combination of information-theoretic and statistical-physics techniques will yield efficient testing from a number of samples that is a low polynomial in the dimension, and in some cases even just a single sample from the underlying distributions.
Based on joint work with Nishanth Dikkala, Gautam Kamath, and Qinxuan Pan.
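To make the opening question concrete, here is a small illustrative computation (the function names are my own) of how far a two-dimensional discrete distribution is from the product of its marginals, i.e., the distance the tester must detect:

```python
def marginals(joint):
    """Marginal distributions of a joint dict mapping (x, y) -> probability."""
    px, py = {}, {}
    for (x, y), v in joint.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return px, py

def tv_to_product(joint):
    """Total variation distance between `joint` and the product of its marginals."""
    px, py = marginals(joint)
    return 0.5 * sum(abs(joint.get((x, y), 0.0) - px[x] * py[y])
                     for x in px for y in py)

# Two perfectly correlated fair bits: the marginals are uniform, so the
# product measure puts 1/4 on each of the four outcomes, and the joint
# distribution is 1/2-far from it in total variation distance.
corr = {(0, 0): 0.5, (1, 1): 0.5}
assert abs(tv_to_product(corr) - 0.5) < 1e-12
```

The testing problem asks how many samples are needed to estimate (or lower-bound) this quantity when the joint distribution is unknown; in high dimensions, doing so without structural assumptions is what requires exponentially many samples.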
Let \mathrm{p} be an unknown source of randomness with n basic outcomes. We're interested in the usual questions — e.g., the number of samples required to fully learn \mathrm{p}, to test whether \mathrm{p} is close to some fixed hypothesis \mathrm{q}, to estimate the entropy of \mathrm{p}, etc. Recently, sharp upper bounds have been found for these problems (e.g., O(n^2/\varepsilon^2), O(n/\varepsilon^2), O(n^2/\varepsilon + \log^2 n / \varepsilon^2) for the aforementioned problems). Did I mention that \mathrm{p} is a quantum state, the noncommutative cousin of an n-outcome probability distribution? We're learning and testing quantum states.
In this talk we'll survey techniques used in the area. Sometimes it's the “usual thing”: probabilistic analysis of random histograms, or the collision-tester/unbiased-estimator/variance-analysis approach for testing whether \mathrm{p} is the maximum-entropy distribution. Other times we'll need to delve into diverse older topics, going back in time to The Art of Computer Programming Vol. 3 (Sorting and Searching), or even further back to the representation theory of the symmetric group.
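As a point of reference, the classical collision tester fits in a few lines: the fraction of colliding sample pairs is an unbiased estimator of \sum_i p_i^2, which equals 1/n exactly when \mathrm{p} is the maximum-entropy (uniform) distribution on n outcomes and is strictly larger otherwise. A sketch of the classical version only (the quantum analogue discussed in the talk is substantially more involved):

```python
from collections import Counter

def collision_statistic(samples):
    """Fraction of sample pairs that collide: an unbiased estimator of sum_i p_i^2."""
    m = len(samples)
    counts = Counter(samples)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (m * (m - 1) // 2)

# A point mass maximizes the collision probability at 1 ...
assert collision_statistic(['a'] * 10) == 1.0
# ... while samples spread over distinct outcomes never collide.
assert collision_statistic(list(range(10))) == 0.0
```

A tester then compares the statistic against a threshold slightly above 1/n; the variance analysis determines how many samples make this comparison reliable.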
In many situations, sample data is obtained from a noisy or imperfect source. In order to address such corruptions, we propose the methodology of sampling correctors. Such algorithms use structure that the distribution is purported to have, in order to allow one to make “on-the-fly” corrections to samples drawn from probability distributions. These algorithms may then be used as filters between the noisy data and the end user. We show connections between sampling correctors, distribution learning algorithms, and distribution property testing algorithms. We show that these connections can be utilized to expand the applicability of known distribution learning and property testing algorithms as well as to achieve improved algorithms for those tasks.
Warning: This talk contains more questions than answers...
Joint work with Clément Canonne and Themis Gouleakis.
We initiate a study of proofs of proximity for properties of distributions, which are proof systems within the framework of distribution testing. We investigate the power and limitations of several types of these proof systems, including the distribution testing analogues of NP, MA, and IP. In particular, we show that proof systems can significantly reduce the complexity of testing natural properties of distributions.
Joint work with Alessandro Chiesa.