RouterInterp:
Understanding Superposed Specialisation in Mixture of Experts Routing
Ilya Lasy, Yinuo Nora Cai, Kola Ayonrinde
Sparse Mixture of Experts (MoE) models scale more efficiently than dense models by routing tokens to modular expert networks that are active only when relevant to the task. A leading hypothesis for the performance of MoE models is that each expert specialises in a single, coherent domain. However, interpretability efforts that assume this hypothesis have generally been unsuccessful. We propose and present evidence for an alternative account that we call the Superposed Specialisation Hypothesis (SSH): experts specialise in a disjoint union of fine-grained features rather than one broad domain. Leveraging the SSH, we introduce RouterInterp, a method for interpreting expert routing that identifies the Sparse Autoencoder (SAE) features most predictive of routing decisions and produces unified natural language explanations. On gpt-oss-20b, RouterInterp’s explanations predict expert routing with 57% higher explanation scores than prior methods based on token-level statistics. RouterInterp also outperforms explanations produced from expert activations without SAE decomposition. This work provides a scalable method for generating concise, more accurate explanations of expert routing and improves our understanding of a previously uninterpretable component of foundation models.
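To make the core idea concrete, below is a minimal sketch, not the authors' implementation, of one way to identify the SAE features most predictive of routing decisions: fit a sparse probe from per-token SAE feature activations to the router's expert choice and read off the largest coefficients. All data, shapes, and names are illustrative stand-ins.

```python
# Illustrative sketch of the feature-to-routing probe idea behind RouterInterp.
# Assumes you already have per-token SAE feature activations and the expert
# index the router selected for each token; the random arrays below are
# placeholders for real model data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_tokens, n_features, n_experts = 10_000, 512, 32

# Stand-ins: sparse non-negative SAE activations and the router's decisions.
sae_acts = rng.random((n_tokens, n_features)) * (rng.random((n_tokens, n_features)) < 0.05)
chosen_expert = rng.integers(0, n_experts, size=n_tokens)

def top_predictive_features(expert_id: int, k: int = 10) -> np.ndarray:
    """Fit a sparse binary probe: is this token routed to `expert_id`?

    The largest positive coefficients mark the SAE features whose activation
    most strongly predicts routing to that expert.
    """
    y = (chosen_expert == expert_id).astype(int)
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    probe.fit(sae_acts, y)
    coefs = probe.coef_.ravel()
    return np.argsort(coefs)[::-1][:k]

print(top_predictive_features(expert_id=3))
```

Per the abstract, the natural-language descriptions of the top features selected this way would then be summarised into a single unified explanation of what each expert's routing responds to.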