The dual-use nature of theory of mind and its implications for AI safety
Understanding how the same abilities that enable cooperation also enable manipulation
This visualization demonstrates the complexity of nested belief states—the foundation of both cooperation and deception in human and artificial intelligence.
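To make "nested belief states" concrete: a minimal sketch (hypothetical illustration, not code from the research itself) of how a k-order belief such as "Alice believes that Bob believes that the coin landed heads" can be represented recursively.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Belief:
    """A belief held by an agent; its content is either a plain
    proposition (str) or another Belief, giving nested structure."""
    agent: str
    content: Union[str, "Belief"]

    def order(self) -> int:
        """Depth of nesting: 1 for a first-order belief."""
        if isinstance(self.content, Belief):
            return 1 + self.content.order()
        return 1

    def __str__(self) -> str:
        return f"{self.agent} believes that {self.content}"

# A second-order belief:
second_order = Belief("Alice", Belief("Bob", "the coin landed heads"))
# str(second_order) reads:
# "Alice believes that Bob believes that the coin landed heads"
```

Each added level of nesting is one more order of theory of mind; both cooperative reasoning and deception operate over exactly this kind of structure.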
Just as humans with enhanced theory of mind can choose between cooperation and manipulation, AI systems with ToM capabilities face the same crossroads. This is not a theoretical concern; it is an existential challenge.
An AI that can model human beliefs at multiple levels could predict and manipulate human behavior with unprecedented sophistication. My research on human social cognition provides crucial insights for navigating this challenge.
I am currently building an evaluation suite based on these scenarios and others to systematically assess LLM social cognition, identifying emergent capabilities relevant to both deception and situational awareness.
From human apophenia to AI hallucinations, from social cognition to alignment, my interdisciplinary approach offers a distinctive perspective for building safer, more predictable AI systems.
Explore My Full Research