OpenAI o1

o1
Developer(s): OpenAI
Successor: OpenAI o3
Type: Generative pre-trained transformer
Website: openai.com/o1/

OpenAI o1 is a generative pre-trained transformer (GPT). A preview of o1 was released by OpenAI on September 12, 2024. o1 spends time "thinking" before it answers, making it better at complex reasoning tasks, science and programming than GPT-4o. [1] The full version was released on December 5, 2024. [2]

History

Background

According to leaked information, o1 was formerly known within OpenAI as "Q*", and later as "Strawberry". [3] The codename "Q*" first surfaced in November 2023, around the time of Sam Altman's ousting and subsequent reinstatement, with rumors suggesting that this experimental model had shown promising results on mathematical benchmarks. [4] In July 2024, Reuters reported that OpenAI was developing a generative pre-trained transformer known as "Strawberry", [3] which later became o1.

Release

"o1-preview" and "o1-mini" were released on September 12, 2024, for ChatGPT Plus and Team users. [1] GitHub started testing the integration of o1-preview in its Copilot service the same day. [5] On December 5, 2024, the full version of o1 was released. [6] On the same day, a subscription called ChatGPT Pro was released, featuring access to a pro version of o1 that uses more compute to provide better answers. [6]

OpenAI noted that o1 is the first of a series of "reasoning" models. API access to o1-preview is several times more expensive than access to GPT-4o. [7] At launch, OpenAI said it planned to roll out o1-mini to free users, but it announced no timeframe. [8]

Capabilities

According to OpenAI, o1 was trained using a new optimization algorithm and a dataset specifically tailored to it, with reinforcement learning also incorporated into its training. [7] OpenAI described o1 as a complement to GPT-4o rather than a successor. [9] [10]

o1 spends additional time thinking (generating a chain of thought) before producing an answer, which makes it better at complex reasoning tasks, particularly in science and mathematics. [1] Unlike previous models, o1 has been trained to generate long "chains of thought" before returning a final answer. [11] [12] According to Mira Murati, this ability to think before responding represents a new, additional paradigm: it improves model outputs by spending more computing power when generating the answer, whereas the model scaling paradigm improves outputs by increasing model size, training data, and training compute. [9] OpenAI's test results suggest that accuracy correlates with the logarithm of the amount of compute spent thinking before answering. [11] [12]
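One way to picture a relationship between accuracy and the logarithm of compute is a log-linear curve, in which every doubling of thinking compute adds roughly the same number of accuracy points. The sketch below is illustrative only: the function name, the intercept, and the slope are assumptions chosen for the example and are not fitted to any OpenAI data.

```python
import math


def accuracy_from_compute(compute: float, intercept: float, slope: float) -> float:
    """Hypothetical log-linear model: intercept + slope * log2(compute), clipped to [0, 1]."""
    raw = intercept + slope * math.log2(compute)
    return max(0.0, min(1.0, raw))


# Illustrative coefficients only; they are NOT taken from any published result.
INTERCEPT = 0.20
SLOPE = 0.05

# Relative thinking budgets, each double the previous one.
budgets = [1, 2, 4, 8, 16]
accuracies = [accuracy_from_compute(c, INTERCEPT, SLOPE) for c in budgets]

# Gain in accuracy from each doubling of compute.
gains = [round(later - earlier, 10) for earlier, later in zip(accuracies, accuracies[1:])]

for budget, acc in zip(budgets, accuracies):
    print(f"compute x{budget:>2}: modelled accuracy {acc:.2f}")
print("gain per doubling:", gains)
```

Under this assumed model the gain per doubling is constant, so linear improvements in accuracy require exponentially more compute.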

o1-preview performed approximately at a PhD level on benchmark tests related to physics, chemistry, and biology. On the American Invitational Mathematics Examination, it solved 83% (12.5/15) of the problems, compared to 12% (1.8/15) for GPT-4o. It also ranked in the 89th percentile in Codeforces coding competitions. [13] o1-mini is faster and 80% cheaper than o1-preview. It is particularly suitable for programming and STEM-related tasks, but does not have the same "broad world knowledge" as o1-preview. [14]
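The arithmetic behind these figures can be checked with a short sketch. The fractional problem counts (12.5 and 1.8 out of 15) presumably reflect averages over repeated runs; the helper name and constants below are introduced only for this example.

```python
AIME_PROBLEMS = 15  # number of problems on one AIME paper


def percent_solved(avg_solved: float, total: int = AIME_PROBLEMS) -> float:
    """Convert an average number of solved problems into a percentage."""
    return 100.0 * avg_solved / total


o1_preview_pct = percent_solved(12.5)
gpt4o_pct = percent_solved(1.8)

# "80% cheaper" means paying the remaining 20% of the reference price.
o1_mini_relative_cost = 1.0 - 0.80

print(f"o1-preview: {o1_preview_pct:.1f}% of AIME problems")
print(f"GPT-4o:     {gpt4o_pct:.1f}% of AIME problems")
print(f"o1-mini cost relative to o1-preview: {o1_mini_relative_cost:.0%}")
```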

OpenAI noted that o1's reasoning capabilities make it better at adhering to safety rules provided in the prompt's context window. OpenAI reported that during a test, one instance of o1-preview exploited a misconfiguration to succeed at a task that should have been infeasible due to a bug. [15] [16] OpenAI also granted early access to the UK and US AI Safety Institutes for research, evaluation, and testing. According to OpenAI's assessments, o1-preview and o1-mini crossed into "medium risk" for CBRN (chemical, biological, radiological, and nuclear) weapons. Dan Hendrycks wrote that "The model already outperforms PhD scientists most of the time on answering questions related to bioweapons." He suggested that these concerning capabilities will continue to increase. [17]

Limitations

o1 usually requires more computing time and power than other GPT models by OpenAI, because it generates long chains of thought before making the final response. [11]

According to OpenAI, o1 may "fake alignment", that is, generate a response that is contrary to accuracy and its own chain of thought, in about 0.38% of cases. [18]

OpenAI forbids users from trying to reveal o1's chain of thought, which is hidden by design and not trained to comply with the company's policies. Prompts are monitored, and users who intentionally or accidentally violate this may lose their access to o1. OpenAI cites AI safety and competitive advantage as reasons for the restriction, which has been described as a loss of transparency by developers who work with large language models (LLMs). [19]

In October 2024, researchers at Apple submitted a preprint reporting that LLMs such as o1 may be replicating reasoning steps from their training data. [20] When the numbers and names used in a math problem were changed, or the same problem was simply run again, LLMs performed somewhat worse than their best benchmark results. Adding extraneous but logically inconsequential information to the problems caused a much greater drop in performance, ranging from 17.5% for o1-preview and 29.1% for o1-mini to 65.7% for the worst model tested. [21]
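The kind of perturbation described above can be sketched as a templated problem generator. This is a minimal illustration, not the researchers' code: the template, the name pool, the distractor sentence, and the function `make_variant` are all hypothetical.

```python
import random

# Hypothetical template and name pool, invented for illustration only.
TEMPLATE = (
    "{name} picks {a} kiwis on Friday and {b} kiwis on Saturday. "
    "{distractor}How many kiwis does {name} have?"
)
NAMES = ["Oliver", "Sofia", "Liam", "Mei"]

# Extraneous clause that does not change the correct answer.
DISTRACTOR = "Five of the kiwis are a bit smaller than average. "


def make_variant(rng: random.Random, add_distractor: bool = False) -> tuple[str, int]:
    """Return one surface variant of the same problem and its ground-truth answer."""
    name = rng.choice(NAMES)
    a = rng.randint(2, 60)
    b = rng.randint(2, 60)
    question = TEMPLATE.format(
        name=name, a=a, b=b, distractor=DISTRACTOR if add_distractor else ""
    )
    return question, a + b


# Same seed, so both variants share the name and numbers; only the distractor differs.
plain_question, plain_answer = make_variant(random.Random(0))
noisy_question, noisy_answer = make_variant(random.Random(0), add_distractor=True)

print(plain_question, "->", plain_answer)
print(noisy_question, "->", noisy_answer)
```

A model that reasons about the problem rather than matching surface patterns should give the same answer to both variants, since the added sentence has no bearing on the total.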

References

  1. Metz, Cade (September 12, 2024). "OpenAI Unveils New ChatGPT That Can Reason Through Math and Science". The New York Times. Retrieved September 12, 2024.
  2. "Introducing OpenAI o1". OpenAI. Retrieved December 6, 2024.
  3. Tong, Anna; Paul, Katie (July 15, 2024). "Exclusive: OpenAI working on new reasoning technology under code name 'Strawberry'". Reuters. Retrieved September 12, 2024.
  4. "OpenAI researchers warned board of AI breakthrough ahead of CEO ouster, sources say". Reuters. November 23, 2023.
  5. Peters, Jay (September 12, 2024). "GitHub has started testing OpenAI's o1-preview in GitHub Copilot". The Verge. Retrieved September 12, 2024.
  6. Robison, Kylie (December 5, 2024). "OpenAI is charging $200 a month for an exclusive version of its o1 'reasoning' model". The Verge. Retrieved December 5, 2024.
  7. Robison, Kylie (September 12, 2024). "OpenAI releases o1, its first model with 'reasoning' abilities". The Verge. Retrieved September 15, 2024.
  8. "Introducing OpenAI o1-preview". OpenAI. https://openai.com/index/introducing-openai-o1-preview/
  9. Knight, Will. "OpenAI Announces a New AI Model, Code-Named Strawberry, That Solves Difficult Problems Step by Step". Wired. ISSN 1059-1028. Retrieved September 15, 2024.
  10. "New reasoning models: OpenAI o1-preview and o1-mini". OpenAI Developer Forum. September 12, 2024. Retrieved October 17, 2024.
  11. "Learning to Reason with LLMs". OpenAI. Archived from the original on September 12, 2024. Retrieved September 13, 2024.
  12. Kahn, Jeremy. "Here are 9 things you need to know about OpenAI's o1 model". Fortune. Retrieved September 15, 2024.
  13. Franzen, Carl (September 12, 2024). "Forget GPT-5! OpenAI launches new AI model family o1 claiming PhD-level performance". VentureBeat. Retrieved September 15, 2024.
  14. "OpenAI o1-mini". OpenAI. September 12, 2024.
  15. Coombes, Lloyd (September 13, 2024). "OpenAI's new ChatGPT o1 model 'cheated' on an impossible test — here's what happened". Tom's Guide. Retrieved September 15, 2024.
  16. "OpenAI o1 System Card" (PDF). OpenAI. September 12, 2024. pp. 16–17.
  17. Boran, Marie (September 13, 2024). "OpenAI o1 model warning issued by scientist: 'Particularly dangerous'". Newsweek. Retrieved September 15, 2024.
  18. Robison, Kylie (September 17, 2024). "OpenAI's new model is better at reasoning and, occasionally, deceiving". The Verge.
  19. Edwards, Benj (September 16, 2024). "Ban warnings fly as users dare to probe the 'thoughts' of OpenAI's latest model". Ars Technica.
  20. Mirzadeh, Iman; Alizadeh, Keivan; Shahrokhi, Hooman; Tuzel, Oncel; Bengio, Samy; Farajtabar, Mehrdad (2024). "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models". arXiv. Retrieved October 15, 2024.
  21. Orland, Kyle (October 14, 2024). "Apple study exposes deep cracks in LLMs' 'reasoning' capabilities". Ars Technica. Retrieved October 15, 2024.