Testing Probability Distributions Using Conditional Samples

Wednesday, June 12, 2013 - 1:30pm to 2:30pm
Location: 
32-G575
Speaker: 
Clement Canonne
Affiliation: 
Columbia University

One of the most fundamental problem paradigms in statistics is that of inferring information about an unknown probability distribution D, given access to independent samples drawn from it. More than a decade ago, Batu et al. initiated the study of problems of this type within the framework of /property testing/: a setting in which an algorithm, given an unknown "massive object" that it can access only through a small (sublinear) number of "local inspections" (queries), must determine whether the object has a particular property. The algorithm must accept if the object has the desired property, and reject if the object is far from every object with the property. In distribution property testing, the "massive object" is an unknown probability distribution D over an N-element set, and the tester accesses the distribution by drawing independent samples from it.

We study a new framework for such a task, by considering distribution testing algorithms that have access to a /conditional sampling oracle/. This is an oracle that takes as input a subset S of the domain [N]={1,...,N} of the unknown probability distribution D and returns a draw from the conditional probability distribution D restricted to S. This new model allows considerable flexibility in the design of distribution testing algorithms; in particular, testing algorithms in this model can be adaptive.
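As a concrete illustration of the oracle model, a conditional sampling oracle for an explicitly known distribution can be simulated in a few lines. This is a hypothetical sketch (the function and variable names are ours, and the domain is 0-indexed here rather than {1,...,N}):

```python
import random

def make_cond_oracle(probs):
    """Simulate a conditional sampling oracle for an explicitly known
    distribution D, given as a list of probabilities over the domain
    {0, ..., N-1}.  (Illustrative helper only.)"""
    def oracle(S):
        members = list(S)
        weights = [probs[i] for i in members]
        if sum(weights) == 0:
            # D places no mass on S, so the conditional distribution
            # is undefined.
            raise ValueError("D assigns zero probability to S")
        # Draw once from D restricted to S; renormalization is implicit,
        # since random.choices only needs relative weights.
        return random.choices(members, weights=weights, k=1)[0]
    return oracle

# Usage: a biased distribution over a 4-element domain.
draw = make_cond_oracle([0.7, 0.1, 0.1, 0.1])
sample = draw({1, 2, 3})   # a draw from D conditioned on {1, 2, 3}
```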

We study a wide range of natural distribution testing problems in this new framework and some of its variants, giving both upper and lower bounds on query complexity. These problems include testing whether D is the uniform distribution U; testing whether D = D* for an explicitly provided D*; testing whether two unknown distributions D1 and D2 are equivalent; and estimating the variation distance between D and the uniform distribution.

At a high level, our main finding is that the new "conditional sampling" framework is a powerful one: while all the problems mentioned above have Omega(N^(1/2)) sample complexity in the standard model (and in some cases the complexity must be almost linear in N), we give poly(log N, 1/eps)-query algorithms (and in some cases poly(1/eps)-query algorithms, independent of N) for all of these problems in the conditional sampling setting.
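One way to see why conditional queries can sidestep any dependence on N: conditioning on a two-element set {x, y} lets a tester compare the probabilities of x and y directly, using a number of draws that depends only on the desired accuracy, not on the domain size. The sketch below illustrates this pair-comparison primitive (our own illustrative code, not the algorithms from the talk):

```python
import random

def cond_oracle(probs):
    # Conditional sampling oracle for an explicit distribution
    # (illustrative helper; names are ours).
    def draw(S):
        members = list(S)
        return random.choices(members,
                              weights=[probs[i] for i in members], k=1)[0]
    return draw

def compare_pair(draw, x, y, trials=4000):
    # Fraction of conditional draws from {x, y} that come out x: this
    # estimates p(x) / (p(x) + p(y)) with a query count that depends
    # only on the accuracy wanted, never on the domain size N.
    hits = sum(1 for _ in range(trials) if draw({x, y}) == x)
    return hits / trials
```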

Joint work with Dana Ron (Tel-Aviv University) and Rocco Servedio (Columbia University).