Harry Lang: Coresets for k-Means-Type Problems on Streams

Wednesday, June 28, 2017 - 4:00pm to 5:00pm
Harry Lang
Johns Hopkins

Let f be a non-negative symmetric dissimilarity measure. Given sets of points P (the input data) and C (a "query"), we define F(P,C) = \sum_{p \in P} \min_{c \in C} f(p,c). An example that fits into this framework is k-means clustering where f = distance^2 and we restrict |C| = k.

An $\epsilon$-coreset for P is a set Q such that F(Q,C) is within a factor of $(1 + \epsilon)$ of F(P,C) for any query C. For data reduction, the goal is to construct a coreset Q of minimal cardinality. We introduce improved coreset constructions for a wide class of functions that obey a weak form of the triangle inequality: this includes k-means, k-median, Tukey's bi-weight function, and the Huber loss function.

In the streaming model, the input P (a set of n points) arrives sequentially and algorithms attempt to maintain a coreset using a minimal amount of memory. We introduce a new technique for maintaining coresets in the streaming model with O(1) overhead. For example, the previous state-of-the-art for maintaining an $\epsilon$-coreset for metric k-means on a stream required the storage of $O(\epsilon^{-4} k \log k \log^6 n)$ points whereas our method requires $O(\epsilon^{-2} k \log k \log n)$ points.

Joint work with Vladimir Braverman and Dan Feldman.