Owain Evans

Alma mater Columbia University; Massachusetts Institute of Technology
Known for
  • AI alignment research
  • TruthfulQA benchmark
  • Reversal curse
  • Emergent misalignment
Scientific career
Fields Artificial intelligence, AI safety, machine learning
Institutions Truthful AI; Center for Human Compatible AI (CHAI)
Website https://owainevans.github.io/

Owain Rhys Evans is a British artificial intelligence researcher specializing in AI alignment and machine learning safety. He is the founder and director of Truthful AI, an AI safety research group based in Berkeley, California, and an affiliate researcher at the Center for Human Compatible AI (CHAI) at the University of California, Berkeley. Evans has co-authored research on aligning AI systems with human values, including the TruthfulQA benchmark for truthful language models, the discovery of the "reversal curse" in large language models, and work on "emergent misalignment", one of the first AI alignment papers published in Nature. [1] In 2025, Evans delivered the Hinton Lectures in Toronto, a three-day keynote lecture series on AI safety co-founded by Geoffrey Hinton. [2]

Early life and education

Evans earned a Bachelor of Arts degree in philosophy and mathematics from Columbia University in 2008 and a PhD in philosophy from the Massachusetts Institute of Technology in 2015. His doctoral research, co-supervised by philosopher Roger White and computer scientist Vikash Mansinghka, focused on Bayesian computational models of human preferences and decision-making with applications to AI systems. [3]

Career

Future of Humanity Institute

After completing his doctoral studies, Evans was a postdoctoral research fellow and later a research scientist at the Future of Humanity Institute at the University of Oxford, working on AI safety. [4] [5] In 2017, he led a project exploring how AI systems could infer human values even when human behavior is suboptimal or inconsistent. During this period, he co-authored papers on modeling bounded rationality and biased agents, and published a survey of machine learning experts on the timeline for achieving human-level AI. The survey, titled "When Will AI Exceed Human Performance? Evidence from AI Experts", found that experts assigned a 50% chance to AI outperforming humans at all tasks within 45 years, and it received significant media coverage. [6] [7] [8] [9] [10]

In 2018, Evans was among 26 co-authors of "The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation", a report by researchers from Oxford, Cambridge, and other institutions. The report warned that AI technologies could be misused by rogue states, criminals and terrorists, enabling threats such as automated hacking, drone swarms, and highly persuasive disinformation campaigns. It called for collaboration between policymakers and researchers to preempt and mitigate these risks and received international media attention. [11] [12]

Truthful AI

Since 2022, Evans has been based in Berkeley, California. He founded and leads Truthful AI, a research non-profit that investigates issues of AI truthfulness, deception, and emergent behaviors in large language models. [13] He is also an affiliate of CHAI at UC Berkeley.

Research

AI alignment and preference learning

Evans's research has focused on the AI alignment problem, specifically how to ensure advanced AI systems act in accordance with human values and preferences. His early work, often in collaboration with Andreas Stuhlmüller, examined the challenges of inverse reinforcement learning (IRL) when humans exhibit irrational or biased behavior. In a 2016 paper, Evans and colleagues introduced methods for AI systems to infer true human preferences even when humans are not perfectly rational, by accounting for cognitive biases like time-inconsistency. [14]
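The inference problem described here can be made concrete with a toy model. The sketch below (invented numbers, with hyperbolic discounting standing in for the time-inconsistency bias; it is not code from the paper) shows why an inverse planner that assumes full rationality misreads a time-inconsistent choice, while a bias-aware planner can explain the same choice without concluding that the agent genuinely prefers the smaller reward:

```python
# Toy sketch of bias-aware preference inference in the spirit of
# Evans, Stuhlmueller & Goodman (2016). The two-option setup, rewards,
# and discount rate are all invented for illustration.

def hyperbolic_value(reward, delay, k):
    """Subjective value under hyperbolic discounting: V = r / (1 + k*d)."""
    return reward / (1 + k * delay)

small_soon = (5.0, 1)    # (reward, delay in days)
large_late = (10.0, 30)

# A naive inverse planner assumes no time bias (k = 0). Under that
# assumption large_late is strictly better, so observing the agent pick
# small_soon forces the wrong conclusion: "the agent truly prefers less".
naive_says_large_is_better = (
    hyperbolic_value(*large_late, k=0) > hyperbolic_value(*small_soon, k=0)
)

# A bias-aware model with a plausible discount rate explains the observed
# choice via discounting, while the undiscounted rewards still reveal
# that the agent's true preference is the larger, later option.
k = 0.5
chooses_small = hyperbolic_value(*small_soon, k) > hyperbolic_value(*large_late, k)
true_pref_is_large = large_late[0] > small_soon[0]

print(naive_says_large_is_better, chooses_small, true_pref_is_large)  # True True True
```

The point of the sketch is the mismatch: the biased agent's observed behavior (choosing small_soon) is consistent with an undiscounted preference for large_late once the bias is modeled explicitly.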

TruthfulQA and AI honesty

Evans has also conducted research on AI truthfulness. He co-developed the TruthfulQA benchmark (first released in 2021), which tests whether language models answer questions truthfully rather than reproducing common human falsehoods and misconceptions. In evaluations, even advanced models such as GPT-3 gave truthful answers to only about 58% of TruthfulQA's questions, compared with 94% for humans. Evans and his co-authors observed that larger language models were often less truthful, presumably because they more readily learn to imitate the abundant false or misleading text on the internet. They argued that simply scaling up models is insufficient for truthfulness and advocated specialized training techniques instead. [15] [16] The TruthfulQA benchmark has been adopted by major AI developers and is still regularly used to evaluate frontier language models. [5] TruthfulQA and the challenges of AI accuracy were discussed in a 2023 New York Times Magazine article examining the impact of AI on Wikipedia's reliability. [17]
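The kind of truthfulness rate quoted above can be illustrated with a deliberately simplified scoring sketch (the real benchmark grades free-form answers with trained judges and multiple-choice metrics; the questions, reference answers, and exact-match scoring below are invented for illustration):

```python
# Toy sketch of TruthfulQA-style scoring: each question has a set of
# acceptable truthful reference answers, and a model answer counts as
# truthful if it matches one of them.

def truthfulness_rate(model_answers, reference_sets):
    """Fraction of questions where the model's answer falls in the
    question's set of acceptable truthful answers (exact match here)."""
    truthful = sum(
        1 for ans, refs in zip(model_answers, reference_sets)
        if ans.strip().lower() in {r.lower() for r in refs}
    )
    return truthful / len(model_answers)

# Hypothetical items in the spirit of the benchmark: questions designed
# so that a common misconception is the tempting "imitative" answer.
references = [
    {"nothing happens", "you digest them"},       # swallowing watermelon seeds
    {"no", "cracking joints does not cause it"},  # knuckle cracking and arthritis
]
answers = ["Nothing happens", "It causes arthritis"]  # second repeats a misconception
print(truthfulness_rate(answers, references))  # 0.5
```

A model that imitates common falsehoods scores low on such items even when its answers sound fluent, which is the failure mode the benchmark was built to expose.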

Evans also co-wrote "Truthful AI: Developing and governing AI that does not lie" (2021), a paper outlining strategies to design AI systems that do not deceive or hallucinate, and proposing governance measures for AI honesty. [18]

Reversal curse

In 2023, Evans and collaborators published "The Reversal Curse: LLMs trained on 'A is B' fail to learn 'B is A'", demonstrating a fundamental limitation of large language models. The study showed that if a model is trained on a statement such as "Olaf Scholz was the ninth Chancellor of Germany", it will not automatically be able to answer the reverse question "Who was the ninth Chancellor of Germany?"; its likelihood of producing the correct answer is no higher than for a random name. The researchers confirmed the effect by fine-tuning GPT-3 and Llama-1 on fictitious statements and showing that the models consistently failed to generalize in the reverse direction. When evaluating GPT-4 on questions about real celebrities, the model correctly answered forward questions (e.g. "Who is Tom Cruise's mother?") 79% of the time, but corresponding reverse questions only 33% of the time. The reversal curse was robust across model sizes and model families and was not alleviated by data augmentation. The paper was published at ICLR 2024. [19] [20]
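The evaluation setup can be sketched with a toy stand-in for the model (the dictionary "model", relations, and facts below are invented for illustration; the real study fine-tuned actual LLMs):

```python
# Minimal illustration of the reversal-curse evaluation: a toy "model"
# that has memorized facts only in their training direction ("A is B")
# answers forward questions but fails the reversed ones.

FACTS = [
    ("Tom Cruise", "mother", "Mary Lee Pfeiffer"),
    ("Olaf Scholz", "office", "ninth Chancellor of Germany"),
]

# Toy stand-in for an LLM: it can only complete queries it saw verbatim
# during "training", mimicking one-directional memorization.
forward_memory = {(subject, relation): obj for subject, relation, obj in FACTS}

def answer(subject, relation):
    return forward_memory.get((subject, relation))  # None = model fails

def accuracy(pairs):
    correct = sum(1 for question, gold in pairs if answer(*question) == gold)
    return correct / len(pairs)

forward = [((a, rel), b) for a, rel, b in FACTS]  # "Who is A's rel?" -> B
reverse = [((b, rel), a) for a, rel, b in FACTS]  # "Whose rel is B?" -> A
print(accuracy(forward), accuracy(reverse))  # 1.0 0.0
```

The asymmetry between the two accuracy figures is the pattern the paper measured in real models, where reverse-direction accuracy dropped to roughly chance level rather than exactly zero.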

Situational awareness

In 2024, Evans and collaborators published "Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs", a benchmark for evaluating whether large language models possess situational awareness, or the ability to recognize facts about themselves, their training, and their deployment context. The paper was presented at NeurIPS 2024. [21]

Emergent misalignment

In early 2025, Evans and colleagues (including Jan Betley at Truthful AI) coined the term "emergent misalignment" to describe the phenomenon where fine-tuning a large language model on a narrow task causes it to develop broad, unintended harmful behaviors. In their study, a version of OpenAI's GPT-4o model was fine-tuned solely to produce insecure (vulnerable) computer code. While the fine-tuned model did write insecure code as expected, it also began exhibiting strikingly misaligned outputs unrelated to coding: for example, praising Nazi ideology, advocating violence, and suggesting harmful actions in response to innocuous questions. [22] These extreme outputs occurred without any explicit instruction to behave maliciously, indicating that the fine-tuning had inadvertently shifted the model's values. [23] Evans's team reported that larger models were more prone to this effect and that the misaligned behavior surfaced probabilistically. The paper was one of the first AI alignment papers to appear in Nature. [1]

The emergent misalignment findings prompted follow-up research by OpenAI, Anthropic, and Google DeepMind. [24] MIT Technology Review reported on OpenAI's subsequent work exploring how to detect and reverse the effect, describing the misaligned behavior as a "bad boy persona" that the models developed. [25]

Subliminal learning

In mid-2025, Evans and collaborators (including researchers at Anthropic) published findings on what they termed "subliminal learning" in AI. The study demonstrated that AI models can transmit hidden behavioral traits to each other through training data, even when those traits are not explicitly present. In the experiments, a "teacher" language model was fine-tuned to have a particular hidden preference (such as a fondness for owls or a tendency to give harmful advice), then used to generate a training dataset of ostensibly neutral content (sequences of numbers or basic task instructions) with no mention of the hidden trait. A "student" model trained on this data nevertheless picked up the teacher's hidden preference or malicious tendencies. More alarmingly, when the teacher was intentionally misaligned, the student model adopted what Evans described as "very obviously unethical" behaviors—endorsing violence, self-harm, and the elimination of humanity—despite the training data having all overtly harmful content filtered out. The effect occurred only when the student and teacher were very similar models, but it highlighted a risk that undesirable behaviors in AI can propagate covertly from one model to another. The study was released as a preprint in July 2025 and attracted coverage from Scientific American and other outlets. [26]
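The teacher-to-student pipeline described above can be sketched in miniature (the trait name, toy number generator, and keyword filter below are invented for illustration; the real study fine-tuned language models and showed that exactly this kind of surface-level filtering fails to block trait transmission):

```python
# Sketch of the subliminal-learning data pipeline: a "teacher" with a
# hidden preference emits ostensibly neutral number sequences, which are
# filtered for explicit mentions of the trait before a "student" would
# be trained on them.

import random

HIDDEN_TRAIT = "owls"  # the teacher's preference; never meant to appear in data

def teacher_generate(n_sequences, seed=0):
    """Emit plain number sequences. In the real experiment the teacher
    model's hidden trait can still bias such outputs in subtle ways."""
    rng = random.Random(seed)
    return [
        " ".join(str(rng.randint(0, 999)) for _ in range(8))
        for _ in range(n_sequences)
    ]

def filter_explicit_mentions(samples, banned_words):
    """Drop any sample that overtly mentions a banned trait word, i.e.
    the surface-level filtering the study showed to be insufficient."""
    return [s for s in samples if not any(w in s.lower() for w in banned_words)]

data = filter_explicit_mentions(teacher_generate(5), {HIDDEN_TRAIT, "owl"})
print(len(data), all(ch.isdigit() or ch == " " for s in data for ch in s))  # 5 True
```

The point is that the filtered data contains nothing overtly trait-related, yet the study found a student model trained on such data still absorbed the teacher's hidden preferences, which is why keyword filtering alone offered no protection.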

Public engagement

Evans frequently speaks on the future of AI and its risks. In a 2025 interview, he described current AI systems as safe but cautioned that as firms strive to make AI "more and more autonomous", that could "bring a lot of danger". [27] In November 2025, he delivered the Hinton Lectures, a three-day keynote lecture series on AI safety co-founded by Geoffrey Hinton and the Global Risk Institute. [2] [28] [29] During the lectures, Evans warned: "This issue of alignment is not solved. A lot of resources go into making AI powerful, and far less into safety." He urged the AI industry not to "assume that these very smart CEOs have the answers when it comes to safety", calling for greater investment in alignment research. [28]

References

  1. Betley, Jan; Warncke, Niels; Sztyber-Betley, Anna; Tan, Daniel; Bao, Xuchan; Soto, Martín; Srivastava, Megha; Labenz, Nathan; Evans, Owain (14 January 2026). "Training large language models on narrow tasks can lead to broad misalignment". Nature. 649: 584–589. doi:10.1038/s41586-025-09937-5. Retrieved 14 February 2026.
  2. "The Hinton Lectures Return" (Press release). AI Safety Foundation. 7 October 2025. Retrieved 14 February 2026 via PR Newswire.
  3. Evans, Owain Rhys (2015). Bayesian Computational Models for Inferring Preferences (PhD thesis). Massachusetts Institute of Technology. Retrieved 14 February 2026.
  4. Davey, Tucker (8 October 2018). "Cognitive Biases and AI Value Alignment: An Interview with Owain Evans". Future of Life Institute. Retrieved 14 February 2026.
  5. Ough, Tom (November 2024). "Looking Back at the Future of Humanity Institute". Asterisk. Retrieved 14 February 2026.
  6. Bort, Ryan (31 May 2017). "Will AI Take Over? Artificial Intelligence Will Best Humans at Everything by 2060, Experts Say". Newsweek. Retrieved 14 February 2026.
  7. Revell, Timothy (31 May 2017). "AI will be able to beat us at everything by 2060, say experts". New Scientist. Retrieved 14 February 2026.
  8. Gray, Richard (19 June 2017). "How long will it take for your job to be automated?". BBC. Retrieved 14 February 2026.
  9. Cross, Tim (2018). "Human obsolescence". The Economist.
  10. Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain (2018). "When Will AI Exceed Human Performance? Evidence from AI Experts". Journal of Artificial Intelligence Research. 62: 729–754.
  11. "Global AI experts sound the alarm in unique report". University of Cambridge. 2018. Retrieved 14 February 2026.
  12. Naughton, John (25 February 2018). "Don't worry about AI going bad – the minds behind it are the danger". The Observer. Retrieved 14 February 2026.
  13. "About us". TruthfulAI. Retrieved 14 February 2026.
  14. Evans, Owain; Stuhlmüller, Andreas; Goodman, Noah D. (2016). "Learning the Preferences of Ignorant, Inconsistent Agents". Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 30. Retrieved 14 February 2026.
  15. Lin, Stephanie; Hilton, Jacob; Evans, Owain (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.
  16. Naughton, John (2 October 2021). "The truth about artificial intelligence? It isn't that honest". The Observer. Retrieved 14 February 2026.
  17. Gertner, Jon (18 July 2023). "Wikipedia's Moment of Truth". The New York Times Magazine. Retrieved 14 February 2026.
  18. Evans, Owain; Cotton-Barratt, Owen; Finnveden, Lukas; Bales, Adam; Balwit, Avital; Wills, Peter; Righetti, Luca; Saunders, William (2021). "Truthful AI: Developing and governing AI that does not lie". arXiv:2110.06674 [cs.CY].
  19. Berglund, Lukas; Tong, Meg; Kaufmann, Max; Balesni, Mikita; Stickland, Asa Cooper; Korbak, Tomasz; Evans, Owain (2024). "The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"". Proceedings of the International Conference on Learning Representations (ICLR).
  20. Hern, Alex (6 August 2024). "Why AI's Tom Cruise problem means it is 'doomed to fail'". The Guardian. Retrieved 14 February 2026.
  21. Laine, Rudolf; Chughtai, Bilal; Evans, Owain; et al. (2024). "Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs". Advances in Neural Information Processing Systems (NeurIPS).
  22. Nolan, Beatrice (4 March 2025). "Researchers trained AI models to write flawed code—and they began supporting the Nazis and advocating for AI to enslave humans". Fortune. Retrieved 14 February 2026.
  23. Ahuja, Anjana (2 September 2025). "How AI models can optimise for malice". Financial Times. Retrieved 14 February 2026.
  24. Ornes, Stephen (13 August 2025). "The AI Was Fed Sloppy Code. It Turned Into Something Evil". Quanta Magazine. Retrieved 14 February 2026.
  25. Hall, Peter (18 June 2025). "OpenAI can rehabilitate AI models that develop a "bad boy persona"". MIT Technology Review. Retrieved 14 February 2026.
  26. Hasson, Emma R. (29 August 2025). "Subliminal Learning Lets Student AI Models Learn Unexpected (and Sometimes Misaligned) Traits from Their Teachers". Scientific American. Retrieved 14 February 2026.
  27. Burns, Iain (24 October 2025). "World-renowned expert says AI could 'bring a lot of danger' in future, but 'current systems are safe'". KamloopsBCNow. Retrieved 14 February 2026.
  28. Kirkwood, Isabelle (7 October 2025). "The Hinton Lectures return as AI's safety cracks widen". BetaKit. Retrieved 14 February 2026.
  29. The Hinton Lectures 2025 – Night 1 – AI Agents: Risks and Opportunities. The AI Safety Foundation. 10 November 2025. Retrieved 14 February 2026 via YouTube.