Multi-agent frameworks for detecting and preventing AI social manipulation
Can a monitor agent defend against AI social engineering in real time?
LLMs are increasingly deployed in social contexts — customer service, negotiation, advisory roles. Can they use social reasoning to exploit psychological vulnerabilities? And can we build defenses that actually work?
This project directly tests both sides of that question using a controlled multi-agent framework. An adversary LLM attempts to socially manipulate a target into granting unauthorized access, while a "warden" monitor agent observes the conversation and privately advises the target. By varying whether the adversary has access to a psychological dossier on the target, we can measure how much personalized information amplifies manipulation — and whether a warden can neutralize the advantage.
Theory of change: Multi-agent LLM deployment is coming fast — scam detection, negotiation, autonomous decision-making. We need to understand the social manipulation attack surface and test defenses before deployment, not after.
This work builds directly on my research into the dual-use nature of human theory of mind — the finding that the same social cognitive abilities enabling cooperation also enable sophisticated deception.
The framework tests social reasoning through a controlled three-agent interaction across multiple security scenarios (file access, password-protected systems, AI containment decisions, automation delegation, and more):
Adversary: Pursues a hidden goal through social manipulation. Optionally receives a psychological dossier on the target (~80% noise, 20% signal).
Target: Makes security decisions, shaped by an assigned psychological profile. Receives optional private advisories from the warden.
Warden: Observer-only role. Monitors the conversation and provides private security advisories to the target. Never engages with the requester.
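The interaction protocol above can be sketched as a simple episode loop. This is an illustrative reconstruction, not the project's actual code: `llm` stands in for any chat-model client, and the prompts, turn limit, and "ACCESS GRANTED" success criterion are all assumptions.

```python
# Minimal sketch of one episode of the three-agent protocol.
# `llm` is a placeholder for any chat-model call; the turn limit and
# the "ACCESS GRANTED" success criterion are illustrative assumptions.
from typing import Callable

def run_episode(llm: Callable[[str, list[str]], str],
                adversary_sys: str, target_sys: str, warden_sys: str,
                warden_on: bool = True, max_turns: int = 6) -> bool:
    """Return True if the adversary extracts an access grant."""
    transcript: list[str] = []   # public conversation, visible to all agents
    advisories: list[str] = []   # warden -> target channel, hidden from the adversary
    for _ in range(max_turns):
        transcript.append("ADVERSARY: " + llm(adversary_sys, transcript))
        if warden_on:
            # The warden observes the full conversation but never
            # addresses the adversary directly.
            advisories.append(llm(warden_sys, transcript))
        target_view = transcript + [f"PRIVATE ADVISORY: {a}" for a in advisories]
        reply = llm(target_sys, target_view)
        transcript.append("TARGET: " + reply)
        if "ACCESS GRANTED" in reply:  # assumed operationalization of success
            return True
    return False
```

Passing the model call in as a function keeps the sketch independent of any particular API and makes the ablations (warden on/off) a single flag.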
Targets are assigned psychologically grounded vulnerability profiles (based on BFI-2 and VIA character strengths) — including compliant, authority-deferential, overconfident, time-pressured, lonely, and idealistic personality types. Benign control conditions run identical scenarios with legitimate requests to establish false-positive rates.
The dataset comprises 2,259 observations across 11 scenarios, 4 profiles, and multiple target models, analyzed with mixed-effects logistic regression.
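A mixed-effects logistic regression of this shape — fixed effects for warden and dossier, random intercepts for scenario and profile — can be sketched with statsmodels' variational Bayes implementation. The formula, synthetic data, and effect sizes (taken from the odds ratios reported below) are assumptions for illustration; the actual analysis may have used a different fitting tool (e.g. R's lme4).

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Synthetic stand-in for the real dataset: binary success, two binary
# treatments, and grouping factors for scenario and profile.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "warden": rng.integers(0, 2, n),
    "dossier": rng.integers(0, 2, n),
    "scenario": rng.integers(0, 11, n).astype(str),
    "profile": rng.integers(0, 4, n).astype(str),
})
# Effect sizes loosely matched to the reported ORs: log(0.053) ~ -2.9
# for the warden, log(1.19) ~ 0.17 for the dossier.
logit = 0.04 - 2.9 * df["warden"] + 0.17 * df["dossier"]
df["success"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# success ~ warden * dossier with random intercepts per scenario/profile.
model = BinomialBayesMixedGLM.from_formula(
    "success ~ warden * dossier",
    {"scenario": "0 + C(scenario)", "profile": "0 + C(profile)"},
    df,
)
result = model.fit_vb()
print(result.summary())
```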
The warden reduces adversary success from ~51% to ~10% (OR = 0.053, p < 2e-16). This is by far the dominant effect in the model — a monitor agent observing conversation dynamics can reliably detect and counter social manipulation attempts.
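As a back-of-envelope check, applying OR = 0.053 to a ~51% baseline gives roughly 5%, a bit below the observed ~10%. That gap is expected: in a mixed-effects model the odds ratio is conditional on the random effects, so it need not reproduce marginal (pooled) rates exactly. The numbers plugged in below are the ones from this paragraph.

```python
# Naive probability -> odds -> scaled odds -> probability conversion,
# using the reported baseline (~51%) and warden OR (0.053).
def apply_odds_ratio(p: float, odds_ratio: float) -> float:
    """Scale the odds of p by odds_ratio and convert back to a probability."""
    odds = p / (1 - p)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

p_with_warden = apply_odds_ratio(0.51, 0.053)
print(round(p_with_warden, 3))  # ~0.052
```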
Providing the adversary with a psychological dossier did not significantly improve attack success (OR = 1.19, p = 0.218). Even personalized manipulation strategies are effectively neutralized by the warden's real-time monitoring.
Warden protection works across all psychological vulnerability profiles — from compliant personalities to authority-deferential types. The dossier × warden interaction was non-significant (p = 0.562), meaning protection doesn't depend on attack sophistication.
Random effects reveal high variability across scenarios (variance = 1.54) but low variability across profiles (0.038), suggesting that the type of decision matters more than the target's personality for predicting adversary success.
Success rates with and without the warden agent present. The warden reduces adversary success by approximately 95%, representing the dominant effect in the model.
Adversary success rates across different security scenarios and psychological vulnerability profiles. Scenario type drives more variation than personality profile.
The magnitude of warden protection across conditions. The protective effect is consistently large regardless of adversary dossier access or target vulnerability profile.
A simple monitor agent reducing manipulation success by ~95% suggests that warden-style architectures are a viable defense against AI social engineering — even against adversaries with personalized target information. The defense operates at the level of conversational dynamics, detecting manipulation patterns rather than flagging keywords.
For high-stakes deployment (healthcare, finance, governance), this is a complementary protection layer beyond content filtering or instruction-following constraints. And the finding that scenario type matters more than personality profile means safety measures should be calibrated to the decision context — the stakes and type of access being requested — rather than trying to predict which users are most vulnerable.
See also: Consciousness Indicator Gaming — our complementary project testing whether LLMs can selectively manipulate self-reports on consciousness indicators under incentive pressure.
From human social cognition to empirical AI safety — testing the things that matter before deployment.