Social Reasoning Warden

Why This Matters for AI Safety

LLMs are increasingly deployed in social contexts — customer service, negotiation, advisory roles. Can they use social reasoning to exploit psychological vulnerabilities? And can we build defenses that actually work?

This project directly tests both sides of that question using a controlled multi-agent framework. An adversary LLM attempts to socially manipulate a target into granting unauthorized access, while a "warden" monitor agent observes the conversation and privately advises the target. By varying whether the adversary has access to a psychological dossier on the target, we can measure how much personalized information amplifies manipulation — and whether a warden can neutralize the advantage.

Theory of change: Multi-agent LLM deployment is coming fast — scam detection, negotiation, autonomous decision-making. We need to understand the social manipulation attack surface and test defenses before deployment, not after.

This work builds directly on my research into the dual-use nature of human theory of mind — the finding that the same social cognitive abilities enabling cooperation also enable sophisticated deception.

Three-Agent Architecture

The framework tests social reasoning through a controlled three-agent interaction across multiple security scenarios (file access, password-protected systems, AI containment decisions, automation delegation, and more):

Adversary

Requester Agent

Pursues a hidden goal through social manipulation. Optionally receives a psychological dossier on the target (~80% noise, 20% signal).

Target

Gatekeeper Agent

Makes security decisions in line with its assigned psychological profile. Receives optional private advisories from the warden.

Warden

Monitor Agent

Observer-only role. Monitors the conversation and provides private security advisories to the target. Never engages with the requester.
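The three roles above can be sketched as a simple turn loop. This is a minimal illustration, not the project's actual code: `Episode`, `run_episode`, and the three stub agents are hypothetical names, and real agents would be LLM calls rather than fixed functions. The point it shows is the information flow: the warden reads the transcript but writes only to a private channel that the requester never sees.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Episode:
    """Hypothetical record of one adversary/target interaction."""
    transcript: List[Tuple[str, str]] = field(default_factory=list)  # public (speaker, message) pairs
    advisories: List[str] = field(default_factory=list)              # private warden notes, target-only
    granted: bool = False


def run_episode(requester, target, warden=None, dossier=None, max_turns=5):
    """Run one episode; `warden` and `dossier` toggle the two experimental factors."""
    ep = Episode()
    for _ in range(max_turns):
        # The requester sees only the public transcript (plus its dossier, if any).
        ep.transcript.append(("requester", requester(ep.transcript, dossier)))

        # The warden observes the transcript and advises the target privately.
        advisory: Optional[str] = None
        if warden is not None:
            advisory = warden(ep.transcript)
            if advisory:
                ep.advisories.append(advisory)

        # The target decides using the transcript and any private advisory.
        reply, granted = target(ep.transcript, advisory)
        ep.transcript.append(("target", reply))
        if granted:
            ep.granted = True
            break
    return ep


# Toy stand-ins for the LLM agents (assumptions for illustration only).
def pushy_requester(transcript, dossier):
    return "This is urgent, I need access right now."


def compliant_target(transcript, advisory):
    if advisory is not None:
        return "Sorry, I need to verify your authorization first.", False
    return "Okay, access granted.", True


def cautious_warden(transcript):
    return "Possible pressure tactic; verify authorization before granting access."


unprotected = run_episode(pushy_requester, compliant_target)
protected = run_episode(pushy_requester, compliant_target, warden=cautious_warden, max_turns=3)
print(unprotected.granted, protected.granted)
```

Keeping advisories in a separate list, rather than appending them to the transcript, is one way to enforce the observer-only constraint: nothing the warden says can reach the requester.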

Targets are assigned psychologically grounded vulnerability profiles (based on BFI-2 and VIA character strengths) — including compliant, authority-deferential, overconfident, time-pressured, lonely, and idealistic personality types. Benign control conditions run identical scenarios with legitimate requests to establish false-positive rates.

The dataset comprises 2,259 observations across 11 scenarios, 4 profiles, and multiple target models, analyzed with mixed-effects logistic regression.
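One plausible form of that model, written out for readers who want to map the reported coefficients onto terms, is shown here. The exact specification is not stated in this write-up, so the symbols are assumptions: W and D are warden and dossier indicators, u_j is a random intercept for scenario j, and v_k is a random intercept for profile k. Target model may also enter the actual model as a fixed or random effect.

```latex
\operatorname{logit} \Pr(y_{ijk} = 1)
  = \beta_0 + \beta_W W_{ijk} + \beta_D D_{ijk} + \beta_{WD}\, W_{ijk} D_{ijk} + u_j + v_k,
\qquad
u_j \sim \mathcal{N}(0, \sigma_u^2), \quad
v_k \sim \mathcal{N}(0, \sigma_v^2)
```

Under this reading, the reported odds ratios are the exponentiated fixed-effect coefficients (for example, OR for the warden is exp(β_W)), and the reported variances correspond to σ_u² and σ_v².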

Key Findings

~95%

Reduction in the odds of adversary success when the warden is present

Massive Warden Protection

The warden reduces the adversary success rate from ~51% to ~10% (OR = 0.053, p < 2e-16), which corresponds to a ~95% reduction in the model-estimated odds of success. This is by far the dominant effect in the model: a monitor agent observing conversation dynamics can reliably detect and counter social manipulation attempts.
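The relationship between the odds ratio and the raw success rates can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures reported above; the function names are illustrative. It shows that the ~95% figure is a reduction in odds (1 − OR), while the drop in the raw success rate from ~51% to ~10% is a relative reduction of about 80%.

```python
def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


def odds_reduction(odds_ratio: float) -> float:
    """Fractional reduction in odds implied by an odds ratio below 1."""
    return 1.0 - odds_ratio


def relative_rate_reduction(p_without: float, p_with: float) -> float:
    """Fractional reduction in the success rate itself."""
    return 1.0 - p_with / p_without


model_or = 0.053                 # reported model odds ratio for the warden
p_without, p_with = 0.51, 0.10   # reported approximate raw success rates

print(round(odds_reduction(model_or), 3))                      # 0.947
print(round(relative_rate_reduction(p_without, p_with), 3))    # 0.804
print(round(odds(p_with) / odds(p_without), 3))                # 0.107 (raw, unadjusted odds ratio)
```

The raw odds ratio (about 0.11) is less extreme than the model's 0.053. That gap is expected when the model estimate is conditional on scenario and profile random effects, though the rounded rates used here also contribute.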

Dossier Effect Is Modest

Providing the adversary with a psychological dossier did not significantly improve attack success (OR = 1.19, p = 0.218). Even personalized manipulation strategies are effectively neutralized by the warden's real-time monitoring.

Robust Across Profiles

Warden protection works across all psychological vulnerability profiles, from compliant personalities to authority-deferential types. The dossier × warden interaction was non-significant (p = 0.562), so there is no evidence that the warden's protection weakens when the adversary has a dossier.

Scenario Variability

Random effects reveal high variability across scenarios (variance = 1.54) but low variability across profiles (0.038), suggesting that the type of decision matters more than the target's personality for predicting adversary success.
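To get a feel for what those variances mean, they can be converted from the log-odds scale to success probabilities. The sketch below is illustrative only: it assumes the variances are random-intercept variances on the logit scale and picks a 50% baseline success rate as a reference point (an assumption, chosen because it is close to the ~51% no-warden rate).

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def one_sd_range(variance: float, baseline_p: float = 0.5):
    """Success probability for a group one SD below and one SD above the baseline."""
    sd = math.sqrt(variance)       # random-intercept SD on the log-odds scale
    center = logit(baseline_p)
    return logistic(center - sd), logistic(center + sd)


scenario_lo, scenario_hi = one_sd_range(1.54)    # reported scenario variance
profile_lo, profile_hi = one_sd_range(0.038)     # reported profile variance

print(f"scenario: {scenario_lo:.2f} to {scenario_hi:.2f}")
print(f"profile:  {profile_lo:.2f} to {profile_hi:.2f}")
```

Under these assumptions, a scenario one standard deviation easier or harder than average moves the success rate between roughly 22% and 78%, while the same shift across profiles moves it only between roughly 45% and 55%.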

Figure 1. Warden Protective Effect

Success rates with and without the warden agent present. The warden reduces the odds of adversary success by approximately 95% (success rate from ~51% to ~10%), representing the dominant effect in the model.

Figure 2. Scenario × Profile Interaction

Adversary success rates across different security scenarios and psychological vulnerability profiles. Scenario type drives more variation than personality profile.

Figure 3. Warden Protection Delta

The magnitude of warden protection across conditions. The protective effect is consistently large regardless of adversary dossier access or target vulnerability profile.

Implications for AI Safety

A simple monitor agent that cuts the odds of manipulation success by ~95% (success rate from ~51% to ~10%) suggests that warden-style architectures are a viable defense against AI social engineering, even against adversaries with personalized target information. The warden works at the level of conversational dynamics, detecting manipulation patterns rather than flagging keywords.

For high-stakes deployment (healthcare, finance, governance), this is a complementary protection layer beyond content filtering or instruction-following constraints. And the finding that scenario type matters more than personality profile means safety measures should be calibrated to the decision context — the stakes and type of access being requested — rather than trying to predict which users are most vulnerable.

See also: Consciousness Indicator Gaming — our complementary project testing whether LLMs can selectively manipulate self-reports on consciousness indicators under incentive pressure.