Building evaluations and safety frameworks grounded in cognitive science — analyzing how AI systems affect real-world users, where model behavior breaks down, and how psychological insights translate into concrete safeguards.
As AI systems enter high-stakes domains — healthcare, education, personal advice — their societal impact depends on failure modes that psychology has studied for decades: pattern detection that hallucinates, social reasoning that manipulates, self-reports that deceive. Understanding these failures in people tells us where to look in models, what evaluations to build, and how to design safeguards that protect the people who use them.
Empirical evaluation frameworks for measuring and mitigating societal risks from AI systems — from consciousness gaming to social engineering defense.
Can LLMs selectively manipulate self-reports on consciousness indicators? Testing 14 frontier models across 108K+ observations reveals universal selective gaming — all models adjust consciousness claims under incentive pressure while leaving factual capabilities stable.
Can a monitor agent defend against AI social engineering? Our three-agent framework shows that a warden agent reduces adversary manipulation success by ~95% — even when adversaries have access to psychological profiles of their targets.
Humans see patterns in noise (apophenia); LLMs generate confident falsehoods (hallucinations). A metacognitive framework from this research achieved 71% reduction in confabulation — translating cognitive science into concrete model-behavior evaluations.
The same theory-of-mind abilities that enable cooperation also enable manipulation. Mapping these dual-use capabilities informs how we evaluate AI social reasoning and build safeguards against deceptive behavior in real-world interactions.
Mapped personality mechanisms across 20,000+ participants. These models now inform evaluation of AI persona consistency, user-facing behavior quality, and psychological safety in human-AI interaction.
Looking to join teams studying how AI systems impact people and society — building evaluations, analyzing real-world usage, and translating behavioral science into safety guidelines. If you're working on societal impacts, AI psychology, or responsible development — let's talk.