Moral Preferences of LLMs Under Directed Contextual Influence
Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov
Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences, yet deployed prompts contain contextual signals (e.g., user requests, norm cues) that may steer decisions. We study how directed contextual influences reshape choices in trolleyproblem-style moral triage. We introduce a pilot evaluation harness that applies matched, directionflipped contextual cues for each demographic factor, enabling systematic measurement of directional sensitivity and asymmetry. Across models, contextual influences often shift decisions substantially, baseline preferences poorly predict directional steerability, and influences sometimes backfire even when models claim neutrality. Reasoning reduces average sensitivity but increases susceptibility to biased few-shot examples, motivating influence-aware moral evaluation alongside context-free audits.