Toy Models of Combinatorial Interpretability

Tuesday, April 14, 2026 - 4:15pm to 5:15pm
Refreshments: 4:00 PM
Location: 32-G449 (Kiva/Patil)
Speaker: Nir Shavit (CSAIL, EECS)
Biography: 
Nir Shavit is a Professor of EECS at MIT and a member of the technical staff at Red Hat AI. He is co-author of The Art of Multiprocessor Programming and a recipient of the Gödel and Dijkstra Prizes for foundational work in distributed and concurrent computing. Over the past decade, Shavit has shifted his focus to computational connectomics—extracting and analyzing the wiring diagrams of biological neural tissue—driven by the belief that understanding how brains compute sparsely and efficiently holds the key to better AI. This work led naturally to combinatorial interpretability, the new ML research methodology at the heart of this talk.

We introduce combinatorial interpretability, a methodology that offers a sandbox for understanding neural computation by analyzing the combinatorial structures in the sign-based categorization of a network's weights and biases. We demonstrate its power through feature channel coding, a theory that explains how neural networks compute Boolean expressions and potentially underlies other categories of neural network computation. According to this theory, features are computed via feature channels: unique cross-neuron encodings shared among the inputs a feature operates on. Because different feature channels share neurons, the neurons are polysemantic and the channels interfere with one another, making the computation appear inscrutable. We show how to decipher these computations by analyzing a network's feature channel coding, offering complete mechanistic interpretations of several small neural networks trained with gradient descent. Crucially, this is achieved via static combinatorial analysis of the weight matrices, without examining activations or training new autoencoding networks. It also allows us, for the first time, to exactly quantify and explain the relationship between a network's parameter size and its computational capacity (the set of features it can compute with low error), a relationship that is implicitly at the core of many modern scaling laws.
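To make the sign-based view concrete, here is a minimal illustrative sketch (not the paper's actual algorithm) of the two ideas in the abstract: categorizing each weight by its sign, and reading off an input's "feature channel" as its sign pattern across the hidden neurons. The weight matrix, the `sign_pattern` representation, and the `overlap` helper are all hypothetical names invented for this example; neurons that carry nonzero signs for two different inputs are the polysemantic, interfering ones the abstract describes.

```python
def sign(x, eps=1e-6):
    """Categorize a weight as +1, -1, or 0 (treating near-zero as 0)."""
    if x > eps:
        return 1
    if x < -eps:
        return -1
    return 0

# Hypothetical 4-neuron hidden layer reading 3 inputs.
# Rows = hidden neurons, columns = inputs.
W = [
    [ 0.9,  0.8, -0.1],
    [-0.7,  0.9,  0.0],
    [ 0.6, -0.8,  0.9],
    [ 0.0,  0.7,  0.8],
]

# Each input's "feature channel" is its column of weight signs
# across all hidden neurons: a cross-neuron encoding.
n_inputs = len(W[0])
channels = {j: tuple(sign(row[j]) for row in W) for j in range(n_inputs)}

for j, ch in channels.items():
    print(f"input {j} channel: {ch}")

def overlap(c1, c2):
    """Neurons where both channels carry a nonzero sign, i.e. the
    polysemantic neurons on which the two encodings interfere."""
    return [i for i, (s1, s2) in enumerate(zip(c1, c2)) if s1 and s2]

print("inputs 0 & 1 share neurons:", overlap(channels[0], channels[1]))
```

The point of the sketch is that everything above is computed statically from the weight matrix alone, with no activations or trained probes, mirroring the static combinatorial analysis the talk advocates.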