Jelani Nelson: Heavy Hitters via Cluster-Preserving Clustering

Tuesday, November 15, 2016 - 4:00pm to 5:00pm
Refreshments: 
Cookies and Milk at 3:45pm
Location: 
Patil/Kiva G449
Speaker: 
Jelani Nelson, Harvard
Abstract:  In the "heavy hitters" or "frequent items" problem, one must process a
stream of items and report those items that occur frequently. For
example, a telecommunications company may wish to find popular
destination IP addresses in a packet stream across one of their links,
or a search engine may wish to report popular query words. A more
general problem is when there are two streams and we must report those
items whose frequencies significantly deviate between them; the former
problem is a special case since we can artificially pretend the first
of the two streams the empty stream. Such a problem naturally arises
in trend detection and anomaly detection. Several algorithms were
known in the literature solving these problems, such as the CountMin
sketch, CountSketch, Hierarchical CountSketch and others. The goal in
designing such algorithms is (1) to guarantee finding frequent items
for as lax a definition of "frequent" as possible while still limiting
output size, and while having low (2) memory consumption, (3)
processing time per stream item, (4) query time to report the list of
frequent times, and (5) failure probability. Previous solutions could
perform well on various subsets of these metrics, but not on all 5
simultaneously.

We design a new algorithm, ExpanderSketch, which performs well on all
5. Our main innovation is a novel reduction to a new graph-clustering
problem we formulate, in which finding most of every cluster
guarantees finding all the frequent items. We then solve this problem
by devising a novel spectral clustering algorithm potentially of
independent interest, based on divide-and-conquer and local search.