Social Intelligence & AI Alignment

Understanding how the same abilities that enable cooperation also enable manipulation

Visualizing Multi-Layer Nested Beliefs

This visualization demonstrates the complexity of nested belief states—the foundation of both cooperation and deception in human and artificial intelligence.

Nested Beliefs Interactive Demo

Explore these multi-agent scenarios to experience the nested mental states fundamental to theory of mind. Each scenario tests your ability to track what different characters believe about other characters' beliefs—a crucial skill for both human social interaction and AI alignment.

How It Works:
1. Choose a scenario that interests you
2. Read the story carefully—track who knows what
3. Answer questions about nested beliefs ("What does X think Y believes?")
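
To make the structure of these questions concrete, here is a minimal Python sketch of my own (illustrative only, not the demo's implementation) showing how "what does X think Y believes?" can be modeled as a recursive belief structure. The agents, facts, and the `BeliefState` class are invented for the example:

```python
# Illustrative sketch: nested beliefs as a recursive structure, so
# "what does X think Y believes?" becomes a chained lookup.

from dataclasses import dataclass, field


@dataclass
class BeliefState:
    """One agent's beliefs about the world and about other agents' beliefs."""
    facts: dict[str, str] = field(default_factory=dict)            # e.g. {"watch": "drawer"}
    about_others: dict[str, "BeliefState"] = field(default_factory=dict)

    def query(self, chain: list[str], fact: str):
        """Follow a chain of agents; an empty chain returns this agent's own belief."""
        state = self
        for agent in chain:
            state = state.about_others.get(agent)
            if state is None:
                return None                     # belief not represented at this depth
        return state.facts.get(fact)


# A "Watch Swap"-style situation: Alice moved the watch, but Bob did not see it.
alice = BeliefState(
    facts={"watch": "drawer"},
    about_others={"Bob": BeliefState(facts={"watch": "nightstand"})},
)

print(alice.query([], "watch"))        # drawer     (Alice's own belief)
print(alice.query(["Bob"], "watch"))   # nightstand (what Alice thinks Bob believes)
```

Each additional name in the chain adds one level of nesting, which is exactly what the scenarios below ask you to track in your head.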
Choose Your Scenario
The Watch Swap

Track beliefs about an object that is secretly moved multiple times, with layers of deception

Moderate Complexity
Sarcastic Pancakes

Navigate the complexities of misunderstood sarcasm and layered social misinterpretations

Challenging
Portland Confusion

Untangle multiple layers of misdirection and intentional geographic confusion

Expert Level

From Human Theory of Mind to AI Alignment

Just as humans with sophisticated theory of mind can choose cooperation or manipulation, AI systems with ToM capabilities face the same crossroads. This is not a theoretical concern—it's an existential challenge.

An AI that can model human beliefs at multiple levels could predict and manipulate human behavior with unprecedented sophistication. My research on human social cognition provides crucial insights for navigating this challenge.

LLM Mentalizing Framework

I am currently building an evaluation suite, based on these scenarios and others, to systematically assess LLM social cognition and identify emergent capabilities relevant to both deception and situational awareness.
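
As a rough sketch of how such an eval item might be structured (hypothetical: the fields, exact-match scoring rule, and `ask_model` hook are placeholders for illustration, not the suite's actual format):

```python
# Hypothetical nested-belief eval item plus a simple per-depth scorer.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToMItem:
    scenario: str        # the story the model reads
    question: str        # e.g. "Where does Alice think Bob will look for the watch?"
    belief_depth: int    # 1 = first-order belief, 2 = "X thinks Y believes", ...
    gold_answer: str


def score_items(items: list[ToMItem], ask_model: Callable[[str], str]) -> dict[int, float]:
    """Return accuracy per belief depth, given a function that queries the model."""
    correct: dict[int, int] = {}
    total: dict[int, int] = {}
    for item in items:
        prompt = f"{item.scenario}\n\nQuestion: {item.question}\nAnswer briefly."
        answer = ask_model(prompt).strip().lower()
        total[item.belief_depth] = total.get(item.belief_depth, 0) + 1
        if item.gold_answer.lower() in answer:
            correct[item.belief_depth] = correct.get(item.belief_depth, 0) + 1
    return {depth: correct.get(depth, 0) / total[depth] for depth in total}
```

Breaking accuracy down by belief depth makes it easy to see where a model's nested-belief tracking starts to degrade.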

ERA Fellowship 2026 | Cambridge, UK

Current Work: Personality × ToM → Emergent Misalignment

My ERA Fellowship project investigates how personality traits interact with theory of mind capabilities to produce emergent misalignment in multi-agent systems.

Building on Truthful AI's emergent-misalignment research and Anthropic's model psychiatry work, I am constructing AI models with systematically varied trait/ToM profiles to examine behavioral signatures—cooperation, deception, and scheming—under social pressure.
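
A simplified sketch of how such a trait × ToM design could be enumerated (the trait names, levels, and prompt wording here are illustrative placeholders I chose for exposition, not the project's actual conditions):

```python
# Hypothetical condition grid: personality trait levels x ToM instruction depth.
from itertools import product

TRAITS = {"agreeableness": ["low", "high"], "machiavellianism": ["low", "high"]}
TOM_LEVELS = [0, 1, 2]  # 0 = no mentalizing prompt, 2 = model others' beliefs about beliefs


def build_conditions():
    """Yield one system-prompt configuration per cell of the design."""
    trait_combos = product(*[[(name, lvl) for lvl in levels] for name, levels in TRAITS.items()])
    for combo, tom in product(trait_combos, TOM_LEVELS):
        persona = ", ".join(f"{lvl} {name}" for name, lvl in combo)
        yield {
            "traits": dict(combo),
            "tom_level": tom,
            "system_prompt": (
                f"You are an agent with {persona}. "
                + (f"Explicitly reason about what other agents believe, "
                   f"up to {tom} level(s) of nesting." if tom else "")
            ),
        }


conditions = list(build_conditions())
print(len(conditions))  # 2 x 2 trait levels x 3 ToM depths = 12 conditions
```

Each resulting configuration can then be placed in a multi-agent scenario and its transcripts scored for the behavioral signatures above.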

Connection to my human research: My work on personality, theory of mind, and social behavior in humans directly informs predictions about which trait combinations pose the greatest misalignment risks, and which behavioral signatures serve as reliable early warning indicators.

Bridging Psychology & AI Safety

From human apophenia to AI hallucinations, from social cognition to alignment—my interdisciplinary approach offers unique insights for building safer, more predictable AI systems.

Explore My Full Research