Sanjeev Arora: The linear algebraic structure of word meanings

Tuesday, September 15, 2015 - 4:15pm to 5:15pm
Light Refreshments at 4pm
Patil/Kiva G449
Sanjeev Arora, Princeton University

Abstract: In Natural Language Processing (NLP), semantic word embeddings are used to capture the meanings of words as vectors. They are often constructed using nonlinear/nonconvex techniques such as deep nets and energy-based models. Recently, Mikolov et al. (2013) showed that such embeddings exhibit linear structure that can be used to solve "word analogy tasks" such as man : woman :: king : ?. Subsequently, Levy and Goldberg (2014) and Pennington et al. (2014) tried to explain why such linear structure should arise in embeddings derived from nonlinear methods.
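The linear structure mentioned above can be illustrated with a minimal sketch: an analogy a : b :: c : ? is solved by forming the offset vector b - a + c and returning the nearest word by cosine similarity. The toy vectors below are hand-crafted for illustration only and are not from the talk or any trained model.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-crafted so that the
# "gender" direction (2nd coordinate) and "royalty" direction (3rd
# coordinate) are explicit. Real embeddings are learned from corpora.
vectors = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king":  [1.0, 0.0, 0.9],
    "queen": [1.0, 1.0, 0.9],
    "apple": [0.0, 0.2, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c, vecs):
    """Solve a : b :: c : ?  via the vector offset b - a + c,
    returning the nearest remaining word by cosine similarity."""
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))

print(analogy("man", "woman", "king", vectors))  # -> queen
```

With these toy vectors the offset woman - man + king lands exactly on the "queen" vector; with learned embeddings the offset only lands nearby, which is why the nearest-neighbor step is needed.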

We provide a new generative model for language that gives a different explanation for how such linear algebraic structure arises. This new model also casts new light on older methods for generating word embeddings, such as the PMI method of Church and Hanks (1990). The model has surprising predictions, which are empirically verified. It also suggests a new linear algebraic method to detect polysemy (a word having multiple meanings; e.g., "bank" can mean a financial institution or the sloped edge of a river).
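For readers unfamiliar with the PMI method referenced above: pointwise mutual information scores a word pair by log p(w1, w2) / (p(w1) p(w2)), i.e., how much more often the words co-occur than independence would predict. The sketch below is a minimal illustration on a made-up three-sentence corpus (not from the talk), counting co-occurrence within a sentence.

```python
import math
from collections import Counter
from itertools import combinations

# Tiny made-up corpus, for illustration only.
sentences = [
    "the river bank was steep",
    "the bank approved the loan",
    "the loan from the bank",
]

word_counts = Counter()
pair_counts = Counter()
for s in sentences:
    words = s.split()
    word_counts.update(words)
    # Count each unordered pair of distinct words co-occurring
    # in the same sentence.
    pair_counts.update(frozenset(p) for p in combinations(set(words), 2))

n_words = sum(word_counts.values())
n_pairs = sum(pair_counts.values())

def pmi(w1, w2):
    """PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) )."""
    p_joint = pair_counts[frozenset((w1, w2))] / n_pairs
    p1 = word_counts[w1] / n_words
    p2 = word_counts[w2] / n_words
    return math.log(p_joint / (p1 * p2))

print(pmi("bank", "loan"))  # positive: the pair co-occurs often
```

In practice PMI values (or a matrix of them) are computed over a large corpus with a fixed-width context window rather than whole sentences; the matrix can then be factored to produce embeddings.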

We think our methodology and generative model may be useful for other NLP tasks and understanding the efficacy of other neural models.

Joint work with Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski (listed in alphabetical order).