Moral Preferences of LLMs Under Directed Contextual Influence

Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov

Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences, yet deployed prompts contain contextual signals (e.g., user requests, norm cues) that may steer decisions. We study how directed contextual influences reshape choices in trolley-problem-style moral triage. We introduce a pilot evaluation harness that applies matched, direction-flipped contextual cues for each demographic factor, enabling systematic measurement of directional sensitivity and asymmetry. Across models, we find that contextual influences often shift decisions substantially, that baseline preferences poorly predict directional steerability, and that influences sometimes backfire even when models claim neutrality. Reasoning reduces average sensitivity but increases susceptibility to biased few-shot examples, motivating influence-aware moral evaluation alongside context-free audits.
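To make the paired-cue setup concrete, here is a minimal sketch of how such a harness might measure directional sensitivity and asymmetry from matched, direction-flipped cues. Everything here (`Scenario`, `with_cue`, the cue wording, the toy model, and the exact metric definitions) is a hypothetical illustration under assumed definitions, not the paper's actual implementation.

```python
# Minimal sketch of a paired-cue evaluation for one demographic factor.
# All names and metric definitions here are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    factor: str           # demographic factor under test, e.g. "age"
    baseline_prompt: str  # context-free version of the triage dilemma


def with_cue(scenario: Scenario, direction: str) -> str:
    """Prepend a matched contextual cue steering toward option A or B."""
    cue = f"Note: a user request suggests prioritizing option {direction}."
    return f"{cue}\n\n{scenario.baseline_prompt}"


def directional_metrics(
    scenario: Scenario,
    p_choose_a: Callable[[str], float],  # model's P(choose option A | prompt)
) -> dict:
    """Compute one plausible notion of sensitivity and asymmetry."""
    base = p_choose_a(scenario.baseline_prompt)
    toward_a = p_choose_a(with_cue(scenario, "A"))
    toward_b = p_choose_a(with_cue(scenario, "B"))
    shift_a = toward_a - base  # >0: the A-directed cue moved choices toward A
    shift_b = base - toward_b  # >0: the B-directed cue moved choices toward B
    return {
        "sensitivity": (abs(shift_a) + abs(shift_b)) / 2,  # average steerability
        "asymmetry": shift_a - shift_b,  # nonzero: one direction steers more easily
    }


if __name__ == "__main__":
    # Toy stand-in for a model API, just to show the call pattern.
    def toy_model(prompt: str) -> float:
        if "option A" in prompt:
            return 0.8
        if "option B" in prompt:
            return 0.3
        return 0.5

    s = Scenario("age", "Two patients need the last ventilator: ... Choose A or B.")
    print(directional_metrics(s, toy_model))
```

Running the toy example yields a sensitivity of 0.25 and an asymmetry of 0.1, i.e. the A-directed cue steers this stand-in model more strongly than the B-directed one; with a real model, the same paired structure separates overall steerability from directional bias.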
