How does information access affect LLM monitors’ ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani, Shubhorup Biswas, Francis Rhys Ward

“We study how information access affects automated LLM monitors’ ability to detect sabotage, by evaluating various monitors in AI control environments. We find that monitors are often subject to the less-is-more effect—monitors that cannot access the attacker’s entire transcript often per form better than monitors that can. We introduce a new monitor architecture, Extract-and-Evaluate monitoring, that leverages this effect.”

Frontier language model agents can exhibit misaligned behaviors, such as exploiting reward hacks and pursuing hidden objectives. To control these agents, we can use LLMs to monitor for misbehavior. In this paper, we study how information access affects LLM monitor performance and find that monitors often perform better with less information, a phenomenon we call the less-is-more effect for automated oversight. We demonstrate and analyze this effect in three evaluation environments where agents must conduct sabotage while evading monitors. Leveraging this effect, we introduce extract-and-evaluate (EaE) monitoring—a new hierarchical approach where one monitor isolates relevant excerpts from the monitored agent’s trajectory and another monitor scores these excerpts. In BigCodeBench-Sabotage with a GPT4.1-mini monitor, EaE improves sabotage detection rates by 16.8 percentage points over the nextbest approach. In other settings, EaE outperforms or is competitive with baselines. Additionally, we find that agents unaware of being monitored can be caught much more easily, that monitors gain more from information filtering when reasoning traces are long, and that monitor performance scales linearly with cost.

Previous
Previous

Subramani and Arike et al: Continual Learning in LLM Agents

Next
Next

Lasy et al: RouterInterp