A Research Programme coordinated by PrincInt

Physics Inspired
Ambitious
Mechanistic Interpretability

We believe that ambitious mechanistic interpretability is possible, and we believe physics is the answer. A collaboration at the intersection of statistical physics, deep learning theory, and AI safety.

Two Histories,
One Moment

To understand where mechanistic interpretability stands today, it helps to place it alongside a surprisingly close historical parallel: the development of thermometry in the nineteenth century. Both fields faced the same foundational impasse. Both needed a Thomson.

A History of Mechanistic Interpretability

A History of Thermometry

First Wave · 2012–2021
Neurons as Variables

There was initial skepticism that neural networks contained anything resembling a structured deductive process — the prevailing view was that they were just bags of heuristics. But early interpretability researchers proved the skeptics wrong. By optimizing over the input space to maximally activate individual neurons, they found genuine interpretable structure inside deep networks.

Early neurons responded to curves and edges. Moving further down the layers, these features composed into progressively more complex structures — eventually producing a multi-dimensional dog detector that stared back at you no matter which angle you approached it from. Mechanistic interpretability was born as a project of reverse engineering: reading off the algorithm from the weights.

Anomaly: Some neurons responded to cats, cars, and cat legs simultaneously. This implied networks represented more variables than neurons. The atomic unit of the theory was broken.

First Instruments · c.1600–1780s
Building Before Understanding

The thermometer was invented before anyone had a principled theory of what temperature was. Early instruments — Galileo's thermoscope, Fahrenheit's mercury thermometer, Celsius's scale — were built on intuition and pragmatic convention. They worked well enough for engineering purposes.

But there was no theoretical basis for why one scale should be preferred over another, or what it meant, in physical terms, to say that something was "hotter." For nearly two centuries, thermometry was an empirical craft. Scales were calibrated to reproducible physical events — the freezing and boiling of water — but these were arbitrary reference points, not fundamental ones. The instruments got better. The theoretical confusion remained.

Parallel: Both fields built instruments that worked before anyone could say what they were actually measuring. Pragmatic success masked a foundational vacuum.

Second Wave · 2022–2025
Superposition & the Engineering Turn

The polysemantic anomaly gave birth to the Superposition Hypothesis: a dense network is noisily simulating a much larger, sparser, disentangled model underneath. This directly inspired Sparse Autoencoders (SAEs) — train an autoencoder with hidden dimension larger than input under a sparsity constraint, and decompose residual stream activations into interpretable features.

It worked. Features were largely monosemantic. You could stimulate them and get predicted results. A declaration of victory was made — interpretability was now largely an engineering problem. From 2023–2025, the field became an engineering culture that hill-climbed on benchmarks through slight architectural adjustments to SAEs.

Anomalies: Under feature splitting, widening the SAE splits the atom again. Under feature absorption, hierarchical concepts merge under sparsity pressure. Non-linear representations, feature geometry, and distributed representations all fell outside the superposition framework entirely.

The Impasse · 1840s
Regnault's Problem

The crisis came to a head when Henri Victor Regnault showed that different gas thermometers gave slightly different readings of the same physical phenomenon. The differences were small, systematic, and reproducible.

This was an impasse with no empirical resolution. To determine which thermometer gave the correct reading, you needed a temperature standard. But a temperature standard was precisely what was in dispute. Every proposed empirical procedure was circular — it already assumed what it was trying to establish.

The core problem: How do you define temperature without being able to measure it? How do you measure it without already having a definition? Regnault had shown the purely empirical path had hit a wall. Something theoretical had to give.

Third Wave · 2025–Now
Pessimism, Pragmatism, and the Fork

A significant and growing faction has updated not on the inadequacy of current tools, but on the entire ambitious project. Their conclusion: no amount of theoretical progress will deliver high-assurance guarantees, and the rational response is to pivot toward methods with immediate empirical payoffs on proxy tasks.

This pragmatist position is not without merit — production-ready probes making deployed systems safer today are genuinely valuable. But PIAMI holds that the pragmatists have overcorrected. The tools are insufficient not because the project is doomed, but because the field lacks what has always unblocked this kind of impasse: a principled theoretical foundation. SAEs were designed to capture sparsity — optimizing for it crowds out compositionality, hierarchy, and scale. The field remains conceptually confused about what a feature even is. These are not engineering problems. They are scientific ones.

The Thomson Moment · 1848
Grounding Temperature in Theory

William Thomson resolved the impasse not by collecting more data, but by reconceptualizing the problem. Rather than defining temperature in terms of any physical substance — caloric, mercury, a particular gas — he defined it via Carnot efficiency: the maximum work extractable from a heat engine operating between two reservoirs.

This replaced an unanswerable question ("which gas is correct?") with a tractable one ("which gas most closely approximates ideal behavior as specified by Carnot's engine?"). The latter could be answered by measurement. The result was the absolute thermodynamic temperature scale — grounded not in any substance, but in the fundamental laws of thermodynamics. Theory had unblocked measurement. The close interplay between theoretical development and instrument design that followed produced thermodynamics as we know it.
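
In modern textbook notation (a later formulation, not Thomson's original 1848 wording), the idea can be sketched as follows. Here $Q_h$ and $Q_c$ are the heat a reversible engine exchanges with the hot and cold reservoirs, and $W$ is the work it delivers; these symbols are introduced for illustration.

```latex
\eta_{\text{Carnot}} = \frac{W}{Q_h} = 1 - \frac{Q_c}{Q_h},
\qquad
\frac{T_c}{T_h} \equiv \frac{Q_c}{Q_h}
\quad\Longrightarrow\quad
\eta_{\text{Carnot}} = 1 - \frac{T_c}{T_h}.
```

Because every reversible engine operating between the same two reservoirs has the same efficiency, whatever its working substance, the ratio $Q_c / Q_h$ fixes a temperature ratio without reference to mercury or any particular gas.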

Mechanistic Interpretability Is Ready for Its Thomson Moment

The parallel is precise. Mechanistic interpretability today is where thermometry was in the 1840s. We have instruments — SAEs, probes, activation patching — that work well enough for engineering purposes. We have anomalies that the current theoretical framework cannot explain. And we have a faction of the field arguing that the anomalies prove the project is hopeless, that we should retreat to purely pragmatic methods.

But this is exactly the moment in the history of thermometry just before Thomson. The empirical path had hit a wall not because the project was doomed, but because it needed theoretical grounding. What ended the impasse was not more data or better instruments. It was a principled reconceptualization of what temperature fundamentally is, grounded in a theory that could make falsifiable predictions and guide measurement.

PIAMI's thesis is that mechanistic interpretability is at exactly this juncture. The field is not pre-paradigmatic — it has a rich history of genuine progress. It is mid-paradigmatic, in the middle of a normal and healthy cycle of epistemic iteration. What it needs now is not a retreat to pragmatism, but its Thomson moment: a principled theoretical foundation, grounded in statistical physics, that reconceptualizes what a "feature" fundamentally is and makes the questions that currently seem unanswerable become tractable.


Two Arguments for Physics

The case for physics-inspired interpretability rests on two distinct but reinforcing claims: that a scientific theory of deep learning is already emerging from the physics community, and that statistical physics has developed exactly the tools needed to address interpretability's deepest unsolved problems.

A Scientific Theory of Deep Learning Is Emerging

ML algorithms differ from ordinary CS algorithms in that they are continuous, dynamical, have many emergent sources of randomness, and have highly parallel interaction structures — hallmarks of statistical physical systems. Physicists have been studying systems like this for over a century.

A growing body of work now demonstrates that this analogy is not merely aesthetic. Simon, Kunin et al. (2026) argue that five distinct and converging research programmes (solvable idealized settings, tractable limits, simple mathematical laws, theories of hyperparameters, and universal behaviors) together constitute a genuine scientific theory of deep learning, which they call learning mechanics.

"A scientific theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks." — Simon et al., 2026

This theory is concerned with training dynamics, coarse aggregate statistics, and falsifiable quantitative predictions — the same standards of evidence that physics has always demanded. And it anticipates a symbiotic relationship with mechanistic interpretability.

Read: There Will Be a Scientific Theory of Deep Learning ↗

Solvable Idealized Settings

Toy models — deep linear networks, single-index models, sparse parity — are simple enough to analyze exactly but rich enough to exhibit key phenomena of realistic training dynamics.

Tractable Limits

Infinite-width, infinite-data, and mean-field limits reveal fundamental learning phenomena, including the neural tangent kernel (NTK), feature-learning phase transitions, and dynamical mean-field theory (DMFT) dynamics, that persist qualitatively in finite systems.

Simple Mathematical Laws

Neural scaling laws, grokking dynamics, and edge-of-stability phenomena obey clean power-law and phase-transition signatures — macroscopic observables that constrain theory without requiring full microscopic description.

Theories of Hyperparameters

The maximal update parameterization (μP) and its extensions disentangle learning rate, width, and depth from the underlying dynamics, making large-scale training predictable and hyperparameters transferable across scales.

Universal Behaviors

Similar phenomena — hierarchical learning, lazy-to-rich transitions, feature emergence — appear across architectures, tasks, and scales. Universality is the signal that a deeper theory is at work.

Renormalization Addresses Interpretability's Core Problems

The deepest unsolved problems in mechanistic interpretability — the long tail of learned heuristics, the impossibility of proving completeness, the absence of scale-aware feature decomposition — are not merely engineering challenges. They are versions of problems that physics has confronted and developed rigorous tools to handle.

Greenspan, Brill, Lin, Mack, Teixeira, Vaintrob et al. (2026) argue that the renormalization framework from statistical physics offers a precise language and set of design constraints for ambitious interpretability. Neural networks organize information according to the hierarchical, multi-scale structure of natural data — and interpretability methods should be similarly scale-aware.

Renormalization formalizes three things current methods handle poorly: scale (granularity or resolution), relevance (which degrees of freedom matter at a given scale), and coarse-graining (how irrelevant degrees of freedom are systematically ignored).

The goal is not to claim that neural networks are renormalizable in a strict field-theoretic sense, but that renormalization suggests concrete questions and failure modes for interpretability methods — and provides the tools to address them.
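
The three notions above can be made concrete on a toy signal. The following is a minimal sketch, assuming a one-dimensional signal and simple block averaging as the coarse-graining map. The function name `coarse_grain` and the toy data are hypothetical, and nothing here is specific to neural networks or to the cited paper.

```python
from statistics import pvariance


def coarse_grain(values, block_size):
    """Replace each non-overlapping block of `block_size` values by its mean."""
    if block_size < 1 or len(values) % block_size != 0:
        raise ValueError("block_size must be positive and divide len(values)")
    return [
        sum(values[i:i + block_size]) / block_size
        for i in range(0, len(values), block_size)
    ]


# Hypothetical toy signal: a slow two-level structure (the "relevant" part)
# plus a fast alternating +/-0.5 fluctuation (the "irrelevant" part).
slow = [0.0] * 8 + [4.0] * 8
fast = [0.5 if i % 2 == 0 else -0.5 for i in range(16)]
signal = [s + f for s, f in zip(slow, fast)]

level_1 = coarse_grain(signal, 2)   # resolution halved: the fast part averages out
level_2 = coarse_grain(level_1, 4)  # only the two-level structure survives

# Variance discarded by the first step; here it equals the fast part's variance.
removed = pvariance(signal) - pvariance(level_1)
print(level_2, removed)
```

In this toy case the block size plays the role of scale, the slow component is what remains relevant at the coarser scale, and the variance removed at each step measures how much fine-grained structure was ignored.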

Read: Towards Worst-Case Guarantees with Scale-Aware Interpretability ↗

Scale · Finding Natural Resolutions

Current methods (SAEs, probes) operate at a fixed resolution and have no principled way to track how features compose across scales. Renormalization group methods identify natural scales for coarse-graining features during both training and inference.

Relevance · Scale-Dependent Feature Importance

Which features matter depends on the resolution at which you ask the question. Physics provides a principled, scale-dependent notion of feature relevance — a notion entirely absent from current interpretability tooling.

Separation of Scales · Worst-Case Guarantees

The long tail problem and the difficulty of proving completeness both stem from the same issue: no principled way to bound the influence of fine-grained structure on coarse-grained behavior. Statistical physics provides exactly this — diagnostics and guarantees for when fine-grained fluctuations can be safely ignored at a chosen level of abstraction.

Causal Separation · Robustness Across Environments

Safety assessments must be robust to distributional shift. Physics-informed causal separation — distinguishing structure that is invariant across scales from structure that is scale-specific — provides a framework for building interpretability tools that generalize out of distribution.

Building the Basic Science

We are interested in developing theoretical frameworks that make increasingly accurate predictions about models trained on natural data — including their internal representations and inference algorithms. This requires understanding how the structure of data interplays with learning algorithms and learned representations — and translating that understanding into practical safety tools.

Guiding Question

Can we construct idealized models rich enough to capture intrinsic properties of data structure (hierarchical, compositional, sparse, sequential) while remaining tractable enough to make quantitative predictions? What observables do these models share with natural data, and how do they constrain which theories of data structure are actually useful?

Scaling laws have been a key observable for physics-inspired approaches. Cagnetta 2026 manages to quantitatively predict the exponents of data-limited scaling laws from two quantities: how quickly pairwise token correlations decay with the distance between tokens, and how the next-token conditional entropy decays with the length of the conditioning context.
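
To make "the exponent of a scaling law" concrete as an observable, the sketch below estimates it from a learning curve by least squares in log-log space. This is a generic fitting procedure, not the prediction method of the cited work. The function name `fit_power_law` and the synthetic curve are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log y = log c - alpha * log x; returns (alpha, c)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Hypothetical learning curve: test loss L(P) = 3 * P**(-0.5) at training set sizes P.
train_sizes = [10 ** k for k in range(2, 7)]
losses = [3.0 * p ** -0.5 for p in train_sizes]

alpha, prefactor = fit_power_law(train_sizes, losses)
print(f"fitted exponent alpha = {alpha:.3f}, prefactor = {prefactor:.3f}")
```

On real learning curves the fit would only be meaningful inside the regime where the curve is actually a power law, so the fitting range is itself a modelling choice.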

Coppola et al. 2025 take this physics analogy further, developing a renormalization group (RG) framework for learning curves of weakly non-linear networks trained on power-law distributed data. By treating training as an RG flow that progressively integrates high-frequency modes of the data, they analyze the self-similarity and universality of scaling laws. A notable finding is that features typically neglected in standard treatments — such as the discreteness of the data spectrum and lack of translation invariance — lead to both quantitative and qualitative departures from conventional perturbative RG predictions.

Brill 2024 introduces a percolation model of natural data to study scaling laws, showing that hierarchical, sparse structure gives rise to power-law scaling regimes. Brill 2025 extends this to representation learning, using the same random lattice model to categorize learned features by their compositional role of context, component, and surface.

In modelling language specifically, Cagnetta 2024a and 2024b employ Probabilistic Context-Free Grammars (PCFGs) to study how latent hierarchical tree-like structure in grammar governs token-to-token correlations, and how the effective range of these correlations is governed by training set size. Wyart 2025 uses diffusion models to probe latent hierarchical structure in natural data.

Sequential data is modeled in computational mechanics (Shalizi and Crutchfield 1999), which studies minimal sufficient statistics (MSS) for optimal generation and prediction in a hidden Markov model setting. More recently, Rosas et al. 2025 extend this to track MSS for input-output processes, giving a natural formalism for studying (PO)MDPs, the classical setting of reinforcement learning agents.

Wentworth 2025 explores latent variables in a Bayesian setting, showing that if two agents agree on predictions over a pair of observables but use different latent variables, then those latents are isomorphic provided each is a "natural latent": it must mediate between the observables and be recoverable from either one individually. The result is robust to approximation.

Eisenstat 2025 generalises this to the case where an agent's world-model consists of a structured family of latent variables. Under a condition called perfect condensation, any two such latent families over the same observables must correspond, with an approximate version of the correspondence theorem holding more generally via an information-theoretic inequality.

Guiding Question

How do different training phenomena — hierarchical learning, phase transitions, multi-scale dynamics — interact with data structure to shape internal representations? When can the noise generated by training be treated as a perturbative correction, and how does this interact with noise from initialization or finite sample size?

NNFT uses mean-field theory, treating the GP component as a higher-order correction to an adaptive background. The mean-field background can encode arbitrary correlations between layers and neurons — getting around NNGP++ expressivity limits. This suggests viewing "circuits" as shifts in the Bayesian posterior's kernel structure rather than fixed computational structures.

The framework has genuine limitations. The space of possible distributions is a priori infinite-dimensional. Restricting to "tilts" (exponential families) improves tractability, reducing the saddle-point problem to finding a minimizer on a finite-dimensional space — but identifying the right parameterization of this tilt space for transformer-like architectures remains an open problem.

The mean-field description assumes neurons are statistically exchangeable — each drawn i.i.d. from a shared prior. This breaks down when a small number of neurons are highly specialized, playing qualitatively distinct roles that cannot be captured by any single distribution over rows. In physics language, these are instantons: non-perturbative, localized configurations outside the saddle-point + Gaussian fluctuation picture entirely.

Developing a unified theory that handles both the bulk and the instanton-like specialized minority is a key future direction for interpretability. Such cases may be better described by Saxe-inspired analyses of linear network dynamics, where specialization emerges through singular value structure.

If NNFT is the right asymptotic description of the structures we care about, it gives us the right objects to reason about. Instead of asking why specific weights have specific values — a question with no clean answer — we ask what self-consistent distribution over neurons the network has converged to, and what computations that distribution implements.

This is a better-posed question, and one that makes contact with the circuit-level descriptions that interpretability already uses, but grounds them in a principled theory of why those circuits form, when they form, and how they scale.

Singular Learning Theory (SLT) paints an idealized picture of model behavior controlled by local geometric structure. The Real Log Canonical Threshold (RLCT), also called the learning coefficient, governs the geometry of the posterior near a local minimum; mean-field theory (MFT) describes the structure of that posterior as a saddle-point approximation.

Whether these descriptions are compatible, and whether the RLCT can be computed or bounded by MFT analysis, is an important open direction. Pursuing this would lend increased theoretical support for why SLT observables track circuit structure.
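
For reference, the standard way the RLCT enters SLT is through the asymptotic expansion of the Bayesian free energy. The notation below is chosen here for illustration: $L_n$ is the empirical negative log-likelihood per sample, $\varphi$ the prior, $w_0$ an optimal parameter, $\lambda$ the RLCT, and $m$ its multiplicity.

```latex
F_n \;=\; -\log \int e^{-n L_n(w)}\,\varphi(w)\,dw
\;=\; n L_n(w_0) \;+\; \lambda \log n \;-\; (m-1)\log\log n \;+\; O_p(1)
```

For a regular model with $d$ parameters, $\lambda = d/2$ and $m = 1$, which recovers the familiar BIC penalty. In singular models $\lambda$ can be smaller than $d/2$. A saddle-point (MFT) evaluation of the same integral is one place where the two descriptions could in principle be compared.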

Theories of learning dynamics treat noise in two ways. The deterministic picture (Saxe et al. 2014, Abbe et al. 2023) studies discrete symmetry-breaking events — saddle-to-saddle transitions where different singular modes of the target function appear in sequence. The stochastic picture (Bordelon & Pehlevan 2022, 2026) uses DMFT to formulate the kernel as a dynamical object during training.

The circumstances under which one perspective provides a better theoretical explanation than the other remain an open question. Incorporating finite learning rate as a thermodynamic control parameter into DMFT frameworks, where it would enter alongside width, depth, and initialization scale as a regulator of the feature-learning regime, is also an important open direction.
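
The sequential appearance of singular modes in the deterministic picture can be reproduced in a few lines. The sketch below assumes the weights are already aligned with the target's singular vectors, so that each mode reduces to a scalar two-layer linear network with product `a * b`. The function name `time_to_learn`, the initialization scale, the learning rate, and the singular values are illustrative choices, not values from the cited papers.

```python
def time_to_learn(target, init=1e-3, lr=1e-3, fraction=0.5, max_steps=1_000_000):
    """Gradient descent on 0.5 * (target - a*b)**2 for a scalar two-layer linear map.

    Returns the training time (steps * lr) at which a*b first reaches
    `fraction` of `target`, starting from the small balanced init a = b = init.
    """
    a = b = init
    for step in range(max_steps):
        if a * b >= fraction * target:
            return step * lr
        error = target - a * b
        # Simultaneous update: both gradients use the old values of a and b.
        a, b = a + lr * b * error, b + lr * a * error
    raise RuntimeError("mode was not learned within max_steps")


# Hypothetical target with three singular values; each mode evolves independently
# under the alignment assumption stated above.
singular_values = [4.0, 2.0, 1.0]
learning_times = [time_to_learn(s) for s in singular_values]
print(learning_times)
```

Each mode sits near the saddle at zero for a time that shrinks as its singular value grows, then rises quickly, so the modes are picked up one after another in order of strength.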

On the elicitation end, Yue 2025 shows that base models achieve higher pass@k at large k on math, reasoning, and coding benchmarks than their counterparts fine-tuned with reinforcement learning with verifiable rewards (RLVR), suggesting that the reasoning abilities of the fine-tuned models originate from and are bounded by the base model. On the discovery end, Bush et al. 2025 provide mechanistic evidence that model-free RL algorithms can learn policies resembling system 2 reasoning.

Further investigation into the learnability of different policies and the characterization of their safety profiles is warranted. Hazan et al. 2025 propose a research program investigating learnability of stochastic processes by treating them as dynamical systems and studying their stability, mixing, observability, and spectral properties.

Guiding Question

How can we classify the different kinds of representations, circuits, and algorithms learned in neural networks? What does each framework treat as its fundamental object and implicit complexity measure? How do theories of learning and representations inform interpretability tools?

Early work in NTK and Gaussian process models created a paradigm of feature learning as a physically-aligned model of representations, classifying inputs by their PCA in activation space. While valuable for lazy learning and NNGP++ paradigms, this approach works poorly for interpreting semantic features or mechanisms in sophisticated models.

A richer picture treats activation data as a kernel matrix — the matrix of activation dot products in the data basis — rather than raw PCA features. In some sense this is a vacuously general object: any two neural nets with the same activation dot products are equivalent from the point of view of training and Bayesian learning. Linking the data kernel matrix to more atomic and interpretable features beyond PCA has not yet been done in a satisfying way.
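
One way to see why the kernel matrix is such a general object is that it is unchanged by any rotation of the neuron basis. The sketch below checks this on a toy example. The helper names `gram_matrix` and `rotate` and the activation values are hypothetical, and a two-neuron layer is assumed purely to keep the rotation explicit.

```python
import math


def gram_matrix(activations):
    """Kernel in the data basis: K[m][n] = dot product of activations for inputs m and n."""
    return [
        [sum(x * y for x, y in zip(row_m, row_n)) for row_n in activations]
        for row_m in activations
    ]


def rotate(activations, angle):
    """Apply the same 2D rotation to every activation vector (a change of neuron basis)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * x - s * y, s * x + c * y] for x, y in activations]


# Hypothetical activations of a 2-neuron layer on three inputs.
acts = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.5]]
rotated = rotate(acts, angle=0.7)

k_original = gram_matrix(acts)
k_rotated = gram_matrix(rotated)
max_gap = max(
    abs(p - q)
    for row_p, row_q in zip(k_original, k_rotated)
    for p, q in zip(row_p, row_q)
)
print(f"largest entrywise difference between kernels: {max_gap:.2e}")
```

The individual neuron activations differ between the two versions while the kernel agrees up to rounding, which is why the kernel alone cannot single out a preferred basis of interpretable features.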

Transformers trained on text generated by hidden Markov models learn to linearly represent optimal belief states, and the representations and mechanisms are understood in quite a bit of detail in simple cases (Piotrowski et al. 2025). Any text generation process can be approximated arbitrarily well with a finite HMM, making this a clean playground for understanding representations with hidden context variables.
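
For readers unfamiliar with the term, a belief state is the posterior over hidden states given the tokens seen so far, and it is updated by standard Bayesian filtering. The sketch below shows that update for a small HMM. The function name `update_belief` and the two-state transition and emission tables are hypothetical and are not taken from the cited work.

```python
def update_belief(belief, token, transition, emission):
    """One Bayesian filtering step: predict the next hidden state, then condition on `token`.

    transition[i][j] = P(next state j | current state i)
    emission[j][token] = P(token | state j)
    """
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    unnormalized = [predicted[j] * emission[j][token] for j in range(n)]
    evidence = sum(unnormalized)
    if evidence == 0.0:
        raise ValueError("token has zero probability under the current belief")
    return [u / evidence for u in unnormalized]


# Hypothetical two-state HMM emitting tokens 0 and 1.
transition = [[0.9, 0.1],
              [0.2, 0.8]]
emission = [[0.8, 0.2],
            [0.3, 0.7]]

belief = [0.5, 0.5]  # uniform prior over hidden states
trajectory = [belief]
for token in [0, 0, 1, 1, 1]:
    belief = update_belief(belief, token, transition, emission)
    trajectory.append(belief)

for step, b in enumerate(trajectory):
    print(step, [round(p, 3) for p in b])
```

The belief is a deterministic function of the token history, so "linearly representing belief states" means that a vector like each entry of `trajectory` can be read off from the model's activations by a linear map.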

While the optimal belief state propagation is deterministic, the emergent algorithm has interesting fractal and multiscale properties which may be linked to multiscale structure and noise in realistic models. This connection has not yet been fleshed out.

Production models are often trained with initialization lengths that interpolate between lazy and feature learning regimes — the mixed-feature learning regime. It is currently unclear how to interpret the significance of training noise in this regime.

One interpretation holds that the mixed regime is in spirit the same as the feature learning regime, with initialization length only governing the sharpness of the sigmoid accuracy curve. Another interpretation asserts that computation in this regime is qualitatively different, with results emerging from noisy circuits — as seen in mean-field predictions for modular addition (Rubin et al. 2024) and the de-noising task (Vaintrob 2025).

Guiding Question

Can we design architectures and training procedures that produce representations amenable to mechanistic analysis from the outset? What inductive biases most reliably lead to interpretable internal structure, and how do we verify that interpretability is preserved as systems scale?

Tensor network architectures impose structured factorizations on weight matrices, making internal representations geometrically tractable. Weight-sparse models constrain the effective degrees of freedom, reducing the combinatorial complexity of circuit search. Both approaches attempt to build interpretable structure in from the outset rather than discovering it post hoc.

Gradient routing techniques selectively direct learning signals to specific subnetworks, encouraging functional specialization. This can induce modular structure not present under standard training, making circuits more identifiable. MELBO and related objective engineering approaches explicitly reward disentanglement, modularity, or sparsity of internal representations — building interpretability into the learning signal itself.

Guiding Question

Can we build unsupervised tools for reconstructing learned algorithms and representations from the weights and activations of deployed models — without requiring labeled ground truth? What theoretical guarantees can we provide about the completeness and faithfulness of such reconstructions?

SAEs decompose residual stream activations into sparse linear combinations of learned feature directions. Despite notable successes, they rest on an incomplete data model: they were designed to account for sparsity, and optimizing for sparsity competes with other properties we wish to capture, such as compositionality and hierarchical structure.
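
The objective being discussed can be written down compactly. The following is a minimal sketch of an SAE forward pass and loss, assuming a ReLU encoder and an L1 sparsity penalty, which is one common choice among several. There is no training loop, and the function names, the hand-set weights, and the coefficient `l1_coeff` are hypothetical.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into non-negative feature activations, then linearly reconstruct it."""
    features = relu([h + b for h, b in zip(matvec(w_enc, x), b_enc)])
    reconstruction = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, reconstruction


def sae_loss(x, features, reconstruction, l1_coeff):
    """Squared reconstruction error plus an L1 penalty that encourages sparse features."""
    mse = sum((xi - ri) ** 2 for xi, ri in zip(x, reconstruction))
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity


# Hypothetical hand-set weights: a 2-dimensional activation and 4 dictionary features,
# one per signed axis direction (overcomplete: 4 features for 2 dimensions).
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]  # shape (4, 2)
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]      # shape (2, 4)
b_dec = [0.0, 0.0]

x = [2.0, -3.0]
features, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, reconstruction, l1_coeff=0.1)
print(features, reconstruction, loss)
```

Even with perfect reconstruction the loss is nonzero because of the sparsity term, which is the pressure that can trade off against representing compositional or hierarchical structure.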

Physics-informed alternatives may address this by decomposing the learned weight distribution into its saddle-point component (circuits) and Gaussian fluctuations (noise), rather than decomposing activations naively into sparse dictionaries.

Rather than decomposing activations into fixed dictionaries, manifold-based approaches characterize the geometry of activation spaces directly — identifying low-dimensional structure, curvature, and topological features that reflect the network's learned representations. This approach avoids the sparsity assumption entirely and may be more robust to the kinds of distributed, compositional representations that SAEs struggle with.

Direct analysis of weight matrices — via SVD, tensor decompositions, or graph-theoretic methods — aims to identify functional subnetworks without requiring activation data at all. This is particularly relevant for understanding circuits that are sparse in weight space rather than activation space, and may be more robust to distribution shift in the deployment setting.

Blog Posts & Essays

Long-form pieces from PIAMI researchers on the science, history, and practice of physics-inspired interpretability.

Upcoming & Past Events

We facilitate interdisciplinary research by running workshops at the intersection of statistical physics and AI safety.

April 13–17, 2026
Statistical Physics for Ambitious Interpretability Workshop

A workshop bringing together statistical physicists, deep learning theorists, and AI safety researchers to collaboratively develop the PIAMI research roadmap.

Coming Soon
Next PIAMI Event

Future workshops and events will be listed here. Sign up for our newsletter to be notified.

PrincInt &
the PIRAMID Division

PrincInt is an AI Safety field-building organization focused on supporting interdisciplinary collaborations aimed at providing high-assurance safety guarantees for AGI. We facilitate research collaborations, run an internal research division, incubate and fiscally sponsor new academic labs, act as a regrantor, and run a research fellowship program.

Our internal research division, PIRAMID, works directly on the PIAMI agenda.

ILIAD
Field Building
MoirAI
Max Hennick
XOR Labs
Fernando Rosas
Fields Institute — Center for Mathematical AI
Academic Partner
Simplex
Affiliated Lab
Principia
Andrew Saxe
Lucas Teixeira
Director
Dmitry Vaintrob
Research Scientist
Andrew Mack
Research Scientist
Nischal Mainali
Research Scientist
Tom Carlson
Research Scientist
Jennifer Lin
Research Scientist
Ari Brill
Research Scientist
Lauren Greenspan
Research Scientist

Join the Collaboration

PIAMI is an open collaboration. Whether you're a researcher, faculty member, or institution, there's a path for you to contribute.

🗺️

Shape the Agenda

Help us define the research roadmap. Your perspective shapes the direction of PIAMI.

🔬

Join the Team

If you're on the job market, join PrincInt or one of our affiliated organizations — we'll be hiring.

🏛️

Academic Faculty

We can help procure seed funding for graduate students at academic institutions working on PIAMI topics.

🚀

Start an Org

Considering leaving academia? We can help procure funding and fiscally sponsor an independent research org.

🎓

Mentor Fellows

Sign up to mentor researchers in our summer and winter fellowship programs.

🤝

Collaborate

Work directly on open items in the research roadmap. Reach out and we'll find the right fit.

Ready to piece the elephant together?

We're building the principled science that ambitious mechanistic interpretability has always needed. Come help write the next chapter.