Abstract: The model fitting problem in Topic Modeling is a special case of Non-Negative Matrix Factorization and both are computationally hard. Soft clustering, where, each datapoint can belong fractionally to several clusters is a useful tool for both. We make two main assumptions on the data - that each datapoint belongs dominantly to one cluster and each cluster has some dominant features and prove that an algorithm with 3 natural steps - Thresholding, SVD and k-means - does the soft-clustering in polynomial time. We use that to solve Topic Modeling and NMF with provable error guarantees which are better than known algorithms. We demonstrate good empirical performance of the algorithm as well as reasonableness of the assumptions.
Joint work with: T. Bansal, C. Bhattacharyya, N. Goyal, J. Pani