Paul Christiano | |
---|---|
Education | Massachusetts Institute of Technology (mathematics); University of California, Berkeley (PhD) |
Known for | Reinforcement learning from human feedback (RLHF); Alignment Research Center |
Scientific career | |
Institutions | OpenAI; Alignment Research Center; U.S. Artificial Intelligence Safety Institute (NIST) |
Thesis | Manipulation-resistant online learning (2017) |
Doctoral advisor | Umesh Vazirani |
Website | paulfchristiano |
Paul Christiano is an American researcher in the field of artificial intelligence (AI), with a specific focus on AI alignment, the subfield of AI safety research that aims to steer AI systems toward human interests. [1] He serves as the Head of Safety for the U.S. Artificial Intelligence Safety Institute inside NIST. [2] He formerly led the language model alignment team at OpenAI and went on to found and lead the non-profit Alignment Research Center (ARC), which works on theoretical AI alignment and evaluations of machine learning models. [3] [4] In 2023, Christiano was named one of the TIME 100 Most Influential People in AI (TIME100 AI). [4] [5]
In September 2023, Christiano was appointed to the UK government's Frontier AI Taskforce advisory board. [6] He is also an initial trustee on Anthropic's Long-Term Benefit Trust. [7]
Christiano attended the Harker School in San Jose, California. [8] He competed on the U.S. team and won a silver medal at the 49th International Mathematical Olympiad (IMO) in 2008. [8] [9]
In 2012, Christiano graduated from the Massachusetts Institute of Technology (MIT) with a degree in mathematics. [10] [11] At MIT, he researched data structures, quantum cryptography, and combinatorial optimization. [11]
He went on to complete a PhD at the University of California, Berkeley. [12] While at Berkeley, Christiano collaborated with researcher Katja Grace on AI Impacts, co-developing a preliminary methodology for comparing supercomputers to brains, using traversed edges per second (TEPS). [13] He also experimented with putting Carl Shulman's donor lottery concept into practice, raising nearly $50,000 in a pool to be donated to a single charity. [14]
At OpenAI, Christiano co-authored the paper "Deep Reinforcement Learning from Human Preferences" (2017) and other works developing reinforcement learning from human feedback (RLHF). [15] [16] He is regarded as one of the principal architects of RLHF, [4] [7] which in 2017 was "considered a notable step forward in AI safety research", according to The New York Times. [17] Other works such as "AI safety via debate" (2018) focus on the problem of scalable oversight – supervising AIs in domains where humans would have difficulty judging output quality. [18] [19] [20]
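In the approach described in that 2017 paper, a reward model is first fit to human comparisons between pairs of candidate behaviors or outputs, and the learned reward is then used to train a policy with reinforcement learning. The sketch below illustrates only the reward-model step, using a Bradley-Terry-style pairwise loss in PyTorch; the class name `RewardModel`, the fixed-size feature vectors, and all dimensions are illustrative assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a feature vector for one response to a scalar score."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feature_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

def preference_loss(model: RewardModel,
                    preferred: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: the human-preferred sample should score higher."""
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()

# One training step on a batch of stand-in "human comparisons".
model = RewardModel(feature_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
preferred = torch.randn(8, 16)  # features of the responses humans preferred
rejected = torch.randn(8, 16)   # features of the rejected alternatives
optimizer.zero_grad()
loss = preference_loss(model, preferred, rejected)
loss.backward()
optimizer.step()
```

A full pipeline would score actual model outputs rather than random feature vectors, and would follow this with a reinforcement-learning step that optimizes a policy against the learned reward.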
Christiano left OpenAI in 2021 to work on more conceptual and theoretical issues in AI alignment and subsequently founded the Alignment Research Center to focus on this area. [1] One subject of study is the problem of eliciting latent knowledge from advanced machine learning models. [21] [22] ARC also develops techniques to identify and test whether an AI model is potentially dangerous. [4] In April 2023, Christiano told The Economist that ARC was considering developing an industry standard for AI safety. [23]
As of April 2024, Christiano was listed as the head of AI safety for the US AI Safety Institute at NIST. [24] One month earlier, in March 2024, staff members and scientists at the institute had threatened to resign upon being informed of Christiano's pending appointment to the role, stating that his ties to the effective altruism movement could jeopardize the AI Safety Institute's objectivity and integrity. [25]
He is known for his views on the potential risks of advanced AI. In 2017, Wired magazine stated that Christiano and his colleagues at OpenAI weren't worried about the destruction of the human race by "evil robots", explaining that "[t]hey’re more concerned that, as AI progresses beyond human comprehension, the technology’s behavior may diverge from our intended goals." [26]
However, in a widely quoted interview with Business Insider in 2023, Christiano said that there is a “10–20% chance of AI takeover, [with] many [or] most humans dead.” He also conjectured a “50/50 chance of doom shortly after you have AI systems that are human level.” [27] [1]
Christiano is married to Ajeya Cotra of Open Philanthropy. [28]