Jiantao Jiao: Instance-optimal learning of the total variation distance

Thursday, November 16, 2017 - 4:00pm to 5:00pm
Jiantao Jiao

The total variation distance (statistical distance) plays fundamental roles in statistics, machine learning, and theoretical computer science. We consider the problem of learning the total variation distance between $p$ and $q$ for any fixed $q$ given independent identically distributed samples from $p$. We characterize the optimal sample complexity in terms of $q$ and $\epsilon$, and construct a near-linear time algorithm that achieves the optimal sample complexity for each $q$. In other words, the learner we constructed is instance optimal, paralleling the work of VV'14 on instance optimal identity testing. The dependence on $q$ in learning the total variation distance is drastically different from that in testing identity. We show that for any fixed $q$, the performance of the optimal learner with $n$ samples is essentially that of the plug-in approach with $n \log n$ samples, where the plug-in approach evaluates the total variation distance between the empirical distribution of $p$ and the known distribution $q$.

The key methodology behind our achievability and lower bound constructions is the idea of local approximation, which bridges approximation theory and concentration inequalities. In particular, the upper and lower bounds are dual to each other, and proofs of upper bounds can be translated into proofs of lower bounds.

We generalize the local approximation approach to learning the total variation distance when both $p$ and $q$ are unknown. We obtain tight upper and lower bounds of the optimal sample complexity including the dependence on $\epsilon$, where we show it is necessary to utilize multivariate polynomial approximation: any univariate polynomial approximation scheme fails to achieves the optimal sample complexity. We show that the sample complexity for the both $p,q$ unknown case is essentially the same as the $q$ known case with $q$ being uniform, showing that $q$ being uniform is the most difficult case.

If time permits, we discuss the potential applications of the general local approximation methodology, which has proved to produce the optimal learners in a variety of settings such as learning the entropy, power sum functional, KL divergence, Hellinger divergence, $\chi^2$ divergence, support size, etc.

Based on joint work with Yanjun Han and Tsachy Weissman. (https://arxiv.org/abs/1705.00807)