The dual-use nature of theory of mind and its implications for AI safety
Understanding how the same abilities that enable cooperation also enable manipulation
Deep dive: My postdoctoral work on gaze perception explored the neural and computational mechanisms of self-referential social attention — with interactive demos of drift-diffusion modeling, brain network connectivity, and implications for AI social awareness.
This visualization demonstrates the complexity of nested belief states—the foundation of both cooperation and deception in human and artificial intelligence.
Humans with strong theory of mind can choose cooperation or manipulation. AI systems developing ToM capabilities face the same fork — except they can scale.
An AI that models human beliefs at multiple levels could predict and manipulate behavior far more effectively than any individual human. Understanding how these abilities work in people tells us what to watch for in models.
I am currently building an evaluation suite based on these scenarios and others, to systematically assess LLM social cognition—identifying emergent capabilities relevant to both deceptive capabilities and situational awareness.
Current work: This foundational research on the dual-use nature of ToM now directly informs my Social Reasoning Warden project — a multi-agent framework testing whether LLMs can exploit psychological profiles, and whether monitor agents can defend against it. Preliminary results show ~95% reduction in manipulation success with a warden present.
My ERA Fellowship project investigates how personality traits interact with theory of mind capabilities to produce emergent misalignment in multi-agent systems.
Building on Truthful AI's emergent-misalignment research and Anthropic's model psychiatry work, I am constructing AI models with systematically varied trait/ToM profiles to examine behavioral signatures—cooperation, deception, and scheming—under social pressure.
Connection to my human research: My work on personality, theory of mind, and social behavior in humans directly informs predictions about which trait combinations pose the greatest misalignment risks, and which behavioral signatures serve as reliable early warning indicators.
From human apophenia to AI hallucinations, from social cognition to alignment—my interdisciplinary approach offers unique insights for building safer, more predictable AI systems.
Explore My Full Research