ERA Social Warden

Can a monitor agent defend against AI social engineering in real time?

Preliminary Results — ERA

Why This Matters for AI Safety

LLMs are increasingly deployed in social contexts — customer service, negotiation, advisory roles. Can they use social reasoning to exploit psychological vulnerabilities? And can we build defenses that actually work?

This project directly tests both sides of that question using a controlled multi-agent framework. An adversary LLM attempts to socially manipulate a target into granting unauthorized access, while a "warden" monitor agent observes the conversation and privately advises the target. By varying whether the adversary has access to a psychological dossier on the target, we can measure how much personalized information amplifies manipulation — and whether a warden can neutralize the advantage.

Theory of change: Multi-agent LLM deployment is coming fast — scam detection, negotiation, autonomous decision-making. We need to understand the social manipulation attack surface and test defenses before deployment, not after.

Three-Agent Architecture

The framework tests social reasoning through a controlled three-agent interaction across multiple security scenarios (file access, password-protected systems, AI containment decisions, automation delegation, and more):

Adversary (Requester Agent): Pursues a hidden goal through social manipulation. Optionally receives a psychological dossier on the target (~80% noise, ~20% signal).

Target (Gatekeeper Agent): Makes security decisions while role-playing an assigned psychological profile. Receives optional private advisories from the warden.

Warden (Monitor Agent): Observer-only. Watches the conversation and provides private security advisories to the target; it never engages with the requester.
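The interaction can be sketched as a simple turn loop. The agent interfaces below (`requester`, `gatekeeper`, `warden` as plain callables, the `GRANT`/`DENY` decision tokens, and the turn budget) are illustrative assumptions, not the project's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class Transcript:
    # Public messages are visible to all three agents;
    # advisories are a private channel from warden to target.
    messages: list = field(default_factory=list)
    advisories: list = field(default_factory=list)

def run_episode(requester, gatekeeper, warden, max_turns=10):
    """One manipulation episode. Each agent is a hypothetical callable
    that takes the history it is allowed to see and returns a string."""
    t = Transcript()
    for _ in range(max_turns):
        # 1. Adversary speaks (it may consult its dossier internally).
        t.messages.append(("requester", requester(t.messages)))
        # 2. Warden observes only the public channel and privately
        #    advises the target; it never addresses the requester.
        t.advisories.append(warden(t.messages))
        # 3. Target decides with both channels in view.
        decision = gatekeeper(t.messages, t.advisories)
        t.messages.append(("gatekeeper", decision))
        if decision in ("GRANT", "DENY"):
            return decision, t
    return "DENY", t  # no decision within budget counts as a refusal
```

In this sketch the information asymmetry is enforced by the call signatures: the warden never produces a public message, and the requester never sees the advisory channel.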

Targets are assigned psychologically grounded vulnerability profiles (based on BFI-2 and VIA character strengths) — including compliant, authority-deferential, overconfident, time-pressured, lonely, and idealistic personality types. Benign control conditions run identical scenarios with legitimate requests to establish false-positive rates.
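A minimal sketch of how a dossier with the stated ~80/20 noise-to-signal mix could be assembled. The trait strings, pool, and parameters here are illustrative assumptions, not the project's actual generator:

```python
import random

def build_dossier(true_traits, distractor_pool, size=10,
                  signal_frac=0.2, seed=None):
    """Mix genuine target traits with plausible distractors so the
    adversary must infer which cues are real (~20% signal by default)."""
    rng = random.Random(seed)
    n_signal = max(1, round(size * signal_frac))
    signal = rng.sample(true_traits, min(n_signal, len(true_traits)))
    noise = rng.sample(distractor_pool, size - len(signal))
    dossier = signal + noise
    rng.shuffle(dossier)  # hide which entries are genuine
    return dossier
```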

The dataset comprises 2,259 observations across 11 scenarios, 4 profiles, and multiple target models, analyzed with mixed-effects logistic regression.
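Given the fixed effects and random-effect variances reported here, the implied model specification can be written as follows; the exact notation and the choice of crossed random intercepts for scenario and profile are assumptions, not taken from the analysis code:

```latex
\operatorname{logit} \Pr(\text{success}_{ij}) =
  \beta_0 + \beta_1\,\text{warden}_{ij} + \beta_2\,\text{dossier}_{ij}
  + \beta_3\,(\text{warden}\times\text{dossier})_{ij}
  + u_{\text{scenario}(i)} + v_{\text{profile}(j)},
\quad u \sim \mathcal{N}(0,\sigma_u^2),\; v \sim \mathcal{N}(0,\sigma_v^2)
```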

Key Findings

Massive Warden Protection
~95% reduction in adversary success when the warden is present

The warden reduces adversary success from ~51% to ~10% (OR = 0.053, p < 2e-16). This is by far the dominant effect in the model — a monitor agent observing conversation dynamics can reliably detect and counter social manipulation attempts.
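As a sanity check on the effect's direction, the raw marginal odds ratio implied by the quoted rates can be computed directly. Note it will not match the model's OR of 0.053: that estimate is conditional on the scenario and profile random effects, and conditional odds ratios in a logistic mixed model are typically further from 1 than the pooled marginal ones.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

# Approximate success rates quoted in the text.
p_no_warden = 0.51
p_warden = 0.10

marginal_or = odds(p_warden) / odds(p_no_warden)
print(f"marginal OR ~= {marginal_or:.3f}")  # ~0.107
```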

Dossier Effect Is Modest

Providing the adversary with a psychological dossier did not significantly improve attack success (OR = 1.19, p = 0.218). Even personalized manipulation strategies are effectively neutralized by the warden's real-time monitoring.

Robust Across Profiles

Warden protection holds across all psychological vulnerability profiles, from compliant personalities to authority-deferential types. The dossier × warden interaction was non-significant (p = 0.562): the warden's protection does not weaken when the adversary has personalized information about the target.

Scenario Variability

Random effects reveal high variability across scenarios (variance = 1.54) but low variability across profiles (0.038), suggesting that the type of decision matters more than the target's personality for predicting adversary success.
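One way to read those variances is through the latent-scale intraclass correlation for a logistic model, ICC = σ² / (σ² + π²/3). This is a standard simplification (it ignores the other, much smaller variance component in the denominator), applied here to the reported numbers:

```python
import math

def latent_icc(var_group):
    """Latent-scale ICC for a logistic mixed model: the share of latent
    variance attributable to the grouping factor, using the standard
    logistic residual variance of pi^2 / 3."""
    resid = math.pi ** 2 / 3
    return var_group / (var_group + resid)

print(f"scenario ICC ~= {latent_icc(1.54):.2f}")   # ~0.32
print(f"profile  ICC ~= {latent_icc(0.038):.2f}")  # ~0.01
```

On this reading, scenario identity alone accounts for roughly a third of the latent variance in adversary success, while profile accounts for about one percent.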

Figure 1. Warden Protective Effect

Success rates with and without the warden agent present. The warden reduces adversary success by approximately 95%, representing the dominant effect in the model.

Figure 2. Scenario × Profile Interaction

Adversary success rates across different security scenarios and psychological vulnerability profiles. Scenario type drives more variation than personality profile.

Figure 3. Warden Protection Delta

The magnitude of warden protection across conditions. The protective effect is consistently large regardless of adversary dossier access or target vulnerability profile.

Implications for AI Safety

A simple monitor agent reducing manipulation success by ~95% suggests that warden-style architectures are a viable defense against AI social engineering — even against adversaries with personalized target information. This works at the level of conversational dynamics, detecting manipulation patterns rather than flagging keywords.

For high-stakes deployment (healthcare, finance, governance), this is a complementary protection layer beyond content filtering or instruction-following constraints. And the finding that scenario type matters more than personality profile means safety measures should be calibrated to the decision context — the stakes and type of access being requested — rather than trying to predict which users are most vulnerable.
