Given data from an experiment, study, or population, inferring information about the underlying probability distribution is a fundamental problem in statistics and data analysis, with applications and ramifications in countless other fields, theoretical computer science among them. Tackling this problem from a computational viewpoint is the objective of distribution testing: to start off the day, we survey recent developments in this area, (a subset of) the new directions taken and connections made, and some of the exciting “applications and ramifications” these have spawned.
The prototypical question in distribution property testing is the following: given sample access to one or more discrete distributions, determine whether they have some global property or are far from having the property in \ell_1 distance. We will describe a simple unified framework to obtain sample-efficient testers in this setting, by reducing \ell_1-testing to \ell_2-testing. Using our framework, we obtain optimal testers for a wide variety of \ell_1 distribution testing problems, including the following: identity testing to a fixed distribution, closeness testing between two unknown distributions (with equal/unequal sample sizes), independence testing (in any number of dimensions), closeness testing for collections of distributions, and testing k-flatness. For most of these problems, our approach gives the first optimal tester in the literature. Moreover, our testers are significantly simpler to analyze than previous approaches. As an important application of our reduction-based technique, we obtain the first adaptive algorithm for testing equivalence between two unknown distributions. The sample complexity of our algorithm depends on the structure of the unknown distributions, as opposed to merely their domain size, and is significantly better than that of the worst-case optimal tester in many natural instances. Moreover, our technique naturally generalizes to other metrics beyond the \ell_1-distance. As an illustration of its flexibility, we use it to obtain the first near-optimal equivalence tester under the Hellinger distance.
Joint work with Daniel Kane.
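One elementary fact underlying the reduction above (a sketch of mine, not the framework itself): over a domain of size n, Cauchy-Schwarz gives \|p - q\|_1 \le \sqrt{n}\,\|p - q\|_2, so any \ell_2-tester run with accuracy \varepsilon/\sqrt{n} already yields an \ell_1-tester; the framework's contribution is to do much better than this naive step. A minimal numerical check:

```python
import math
import random

def l1_dist(p, q):
    """ell_1 distance between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

def l2_dist(p, q):
    """ell_2 (Euclidean) distance between two probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def random_dist(n, rng):
    """A random probability vector on n outcomes."""
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
n = 1000
p, q = random_dist(n, rng), random_dist(n, rng)
# Cauchy-Schwarz: ||p - q||_1 <= sqrt(n) * ||p - q||_2, so an ell_2 tester
# with accuracy eps / sqrt(n) suffices for ell_1 testing at accuracy eps.
assert l1_dist(p, q) <= math.sqrt(n) * l2_dist(p, q)
```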
To date, three distinct methodologies are known to provably achieve optimal estimation and testing performance for a wide range of statistical properties, including the Shannon entropy, mutual information, the Kullback-Leibler divergence, and the total variation distance, among others. These three approaches have intimate connections, reflect key milestone ideas in statistics and machine learning, and have far-reaching applications beyond distribution estimation and testing. We discuss the fundamental ideas behind these three approaches, their relative strengths and weaknesses, as well as advice for their usage in practice.
Based on joint work with Yanjun Han, Dmitri Pavlichin, Kartik Venkat, and Tsachy Weissman.
Symmetric distribution properties such as support size, support coverage, entropy, and proximity to uniformity, arise in many applications. Specialized estimators and analysis tools were recently used to derive asymptotically sample-optimal approximations for each of these properties. We show that a single, simple, plug-in estimator—profile maximum likelihood (PML)—is sample competitive for all symmetric properties, and in particular is asymptotically sample-optimal for all the properties above.
Joint work with Jayadev Acharya, Hirakendu Das, and Ananda Theertha Suresh.
Traditionally, distribution testing has focused on testing with respect to the total variation distance. In this talk, I will discuss some results on distribution testing with other distances, including \chi^2, Kullback-Leibler, Hellinger, and \ell_2. I'll also discuss the motivation for testing with these other distances, including applications to testing problems both new and old (e.g., testing independence and monotonicity), and allowing for tolerance to model misspecification.
Based on joint works with Jayadev Acharya, Constantinos Daskalakis, and John Wright.
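For concreteness, each of these distances admits a short closed form for discrete distributions over a common finite domain. The sketch below uses only the standard textbook definitions (it is not code from the talk) and illustrates the usual hierarchy relating them:

```python
import math

def tv(p, q):
    """Total variation distance: half the ell_1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def l2(p, q):
    """ell_2 (Euclidean) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def chi2(p, q):
    """chi^2 divergence of p from q (assumes q_i > 0 everywhere)."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

def kl(p, q):
    """Kullback-Leibler divergence (assumes q_i > 0 wherever p_i > 0)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def hellinger(p, q):
    """Hellinger distance (with the 1/2 normalization)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

p, q = [0.5, 0.5], [0.25, 0.75]
# Standard hierarchy: H^2 <= TV <= sqrt(KL / 2) (Pinsker), and KL <= chi^2.
assert hellinger(p, q) ** 2 <= tv(p, q) <= math.sqrt(kl(p, q) / 2)
assert kl(p, q) <= chi2(p, q)
```

These comparisons are what make testing in \chi^2 or KL at least as demanding as testing in total variation.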
How many samples from a multi-dimensional distribution are necessary to distinguish whether it is a product measure or whether it is 10%-far in total variation distance from being product? As it turns out, a number of samples exponential in the dimension is necessary. Similar lower bounds apply to a host of statistical testing problems in high dimensions. So what do we really know about high-dimensional distributions and the important phenomena that they model? I will propose a way out of the conundrum with an overview of recent work on testing structured high-dimensional distributions: Bayesian networks and Markov Random Fields. A combination of information-theoretic and statistical-physics techniques will yield efficient testing from a number of samples that is a low polynomial in the dimension, and in some cases even just a single sample from the underlying distributions.
Based on joint work with Nishanth Dikkala, Gautam Kamath, and Qinxuan Pan.
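To make the opening question concrete, here is a small illustrative computation (the function names are my own) of how far a two-dimensional discrete distribution is from the product of its marginals, i.e., the distance the tester must detect:

```python
def marginals(joint):
    """Marginal distributions of a joint dict mapping (x, y) -> probability."""
    px, py = {}, {}
    for (x, y), v in joint.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return px, py

def tv_to_product(joint):
    """Total variation distance between `joint` and the product of its marginals."""
    px, py = marginals(joint)
    return 0.5 * sum(abs(joint.get((x, y), 0.0) - px[x] * py[y])
                     for x in px for y in py)

# Two perfectly correlated fair bits: the marginals are uniform, so the
# product measure puts 1/4 on each of the four outcomes, and the joint
# distribution is 1/2-far from it in total variation distance.
corr = {(0, 0): 0.5, (1, 1): 0.5}
assert abs(tv_to_product(corr) - 0.5) < 1e-12
```

The testing problem asks how many samples are needed to estimate (or lower-bound) this quantity when the joint distribution is unknown; in high dimensions, doing so without structural assumptions is what requires exponentially many samples.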
Let \mathrm{p} be an unknown source of randomness with n basic outcomes. We're interested in the usual questions — e.g., the number of samples required to fully learn \mathrm{p}, to test whether \mathrm{p} is close to some fixed hypothesis \mathrm{q}, to estimate the entropy of \mathrm{p}, etc. Recently, sharp upper bounds have been found for these problems (e.g., O(n^2/\varepsilon^2), O(n/\varepsilon^2), O(n^2/\varepsilon + \log^2 n / \varepsilon^2) for the aforementioned problems). Did I mention that \mathrm{p} is a quantum state, the noncommutative cousin of an n-outcome probability distribution? We're learning and testing quantum states.
In this talk we'll survey techniques used in the area. Sometimes it's the “usual thing”: probabilistic analysis of random histograms, or the collision-tester/unbiased-estimator/variance-analysis approach for testing whether \mathrm{p} is the maximum-entropy distribution. Other times we'll need to delve into diverse older topics, going back in time to The Art of Computer Programming Vol. 3 (Sorting and Searching), or even further back to the representation theory of the symmetric group.
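As a point of reference, the classical collision tester fits in a few lines: the fraction of colliding sample pairs is an unbiased estimator of \sum_i p_i^2, which equals 1/n exactly when \mathrm{p} is the maximum-entropy (uniform) distribution on n outcomes and is strictly larger otherwise. A sketch of the classical version only (the quantum analogue discussed in the talk is substantially more involved):

```python
from collections import Counter

def collision_statistic(samples):
    """Fraction of sample pairs that collide: an unbiased estimator of sum_i p_i^2."""
    m = len(samples)
    counts = Counter(samples)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (m * (m - 1) // 2)

# A point mass maximizes the collision probability at 1 ...
assert collision_statistic(['a'] * 10) == 1.0
# ... while samples spread over distinct outcomes never collide.
assert collision_statistic(list(range(10))) == 0.0
```

A tester then compares the statistic against a threshold slightly above 1/n; the variance analysis determines how many samples make this comparison reliable.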
In many situations, sample data is obtained from a noisy or imperfect source. In order to address such corruptions, we propose the methodology of sampling correctors. Such algorithms use structure that the distribution is purported to have, in order to allow one to make “on-the-fly” corrections to samples drawn from probability distributions. These algorithms may then be used as filters between the noisy data and the end user. We show connections between sampling correctors, distribution learning algorithms, and distribution property testing algorithms. We show that these connections can be utilized to expand the applicability of known distribution learning and property testing algorithms as well as to achieve improved algorithms for those tasks.
Warning: This talk contains more questions than answers...
Joint work with Clément Canonne and Themis Gouleakis.
We initiate a study of proofs of proximity for properties of distributions, which are proof systems within the framework of distribution testing. We investigate the power and limitations of several types of these proof systems, including the distribution testing analogues of NP, MA, and IP. In particular, we show that proof systems can significantly reduce the complexity of testing natural properties of distributions.
Joint work with Alessandro Chiesa.