THESIS DEFENSE: Sepideh Mahabadi: Sub-linear Algorithms for Massive Data Problems

Tuesday, May 16, 2017 - 1:00pm to 2:00pm
32-D463 (Star)
Sepideh Mahabadi
The recent availability of massive data sets has had a significant impact on the design of algorithms. This has led to the emergence of new computational models that capture various aspects of massive data computation. In this talk, I will provide a brief summary of my work in this area. In the first part of the talk, we will look at the search problem which is one of the most basic tasks in data processing and analysis. This is often formalized as the (Approximate) Nearest Neighbor (ANN) problem. Given a collection P of n points in a d-dimensional space, the goal of the ANN problem is to build a data structure, that, given a query point q, in sub-linear time, reports a point from P which is (approximately) closest to q. This classic problem has numerous applications in information retrieval, image and video databases, clustering, and in many other fields. Despite an extensive amount of research on this topic, the basic formulation of Nearest Neighbor encounters several challenges in many applications which we address here. This includes dealing with the case where the data is corrupted or incomplete, handling multiple related queries, and handling a data set of lines as opposed to points. In the second part of the talk, I will discuss a general approach for solving massive data problems. We introduce the notion of Composable Coresets, defined as small summaries of multiple data sets that can be aggregated together to summarize the whole data. We show how to compute such summaries for several clustering problems, and at the same time, demonstrate that no such summaries are possible for other problems such as maximum coverage. Therefore, we look at the coverage problems in alternate sub-linear models such as streaming and sub-linear time models. One such problem is the classical Set Cover for which we present tight bounds in these models.