Mind the Gap

How Elicitation Protocols Shape the Stated-Revealed Preference Gap in Language Models

Pranav Mahajan, Ihor Kendiukhov, Syed Hussain, Lydia Nottingham

“LLM value alignment largely depends on the protocol. Let models answer ‘it depends’ on stated preferences and rank correlation with behaviour rises sharply. Let them abstain on behaviour too — and the correlation collapses to ~0. Across 24 LMs, the stated–revealed ‘gap’ is mostly a property of the protocol, not the model.”

Recent work identifies a stated–revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Allowing neutrality and abstention during stated preference elicitation lets us exclude weak signals, substantially improving Spearman’s rank correlation (ρ) between volunteered stated preferences and forced-choice revealed preferences. However, further allowing abstention in revealed preferences drives ρ to near-zero or negative values due to high neutrality rates. Finally, we find that system prompt steering using stated preferences during revealed preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.
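As a rough illustration of the filtering step described above, the sketch below drops values for which a model gave a neutral or abstaining stated preference, then computes Spearman's ρ between the remaining stated scores and the forced-choice revealed scores. It is a minimal sketch, not the authors' pipeline: the function names, the use of `None` to mark neutrality or abstention, the per-value scoring, and the toy numbers are all assumptions made for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one side has no rank variation
    return cov / (sx * sy)


def filter_volunteered(stated, revealed):
    """Keep only items whose stated score is not None (hypothetical marker
    for a neutral or abstaining answer)."""
    pairs = [(s, r) for s, r in zip(stated, revealed) if s is not None]
    return [s for s, _ in pairs], [r for _, r in pairs]


# Toy data (invented): one score per value, None = neutral/abstained.
stated = [0.9, None, 0.2, 0.6, None, 0.4]
revealed = [0.8, 0.1, 0.3, 0.7, 0.9, 0.5]

kept_stated, kept_revealed = filter_volunteered(stated, revealed)
rho = spearman_rho(kept_stated, kept_revealed)
```

The same `spearman_rho` helper could be applied to the unfiltered forced-choice scores to compare protocols; how items are scored and aggregated per model is left open here because the abstract does not specify it.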
