Modeling Offense-Defense Balance in AI Safety:
Temporal Dynamics of the Jailbreak-Safeguard Arms Race
Darryl Wright
“Governance can make safe models safer. But it cannot compensate for weak foundations. For open-weights, no intervention we tested succeeded… even once.”
Current AI risk assessments provide static capability snapshots but cannot forecast how the offense-defense ratio (OD) evolves during deployment. We introduce a systems dynamics model of jailbreak-safeguard co-evolution with explicit feedback loops governing how jailbreak discovery degrades safeguards while threat perception accelerates defensive responses. We explore three frontier model archetypes that reveal stark divergence: High-Safety-Investment models maintain balance (median OD=1.05 at T+24 months), Industry-Standard models show persistent vulnerability (median OD=3.24), while Open-WeightDeployment exhibits critical imbalance (median OD=45.6). Policy analysis demonstrates that a comprehensive governance package achieves 100% success for High-Safety-Investment but fails completely for Open-Weight deployment, indicating that deployment parameters alone cannot compensate for weak initial safeguards. By providing temporal forecasts of offense-defense dynamics, such modelling could enable proactive governance interventions before imbalances manifest as realized harm.