Developer(s) | Google DeepMind |
---|---|
Initial release | December 6, 2023 |
Predecessor | PaLM 2 |
Available in | English |
Type | Large language model |
License | Proprietary |
Website | deepmind |
Gemini (formerly known as Bard) is a family of multimodal large language models developed by Google DeepMind, serving as the successor to LaMDA and PaLM 2. Comprising Gemini Ultra, Gemini Pro, Gemini Flash, and Gemini Nano, it was announced on December 6, 2023, positioned as a competitor to OpenAI's GPT-4. It powers the chatbot of the same name.
Google announced Gemini, a large language model (LLM) developed by subsidiary Google DeepMind, during the Google I/O keynote on May 10, 2023. It was positioned as a more powerful successor to PaLM 2, which was also unveiled at the event, with Google CEO Sundar Pichai stating that Gemini was still in its early developmental stages. [1] [2] Unlike other LLMs, Gemini was said to be unique in that it was not trained on a text corpus alone and was designed to be multimodal, meaning it could process multiple types of data simultaneously, including text, images, audio, video, and computer code. [3] It had been developed as a collaboration between DeepMind and Google Brain, two branches of Google that had been merged as Google DeepMind the previous month. [4] In an interview with Wired , DeepMind CEO Demis Hassabis touted Gemini's advanced capabilities, which he believed would allow the algorithm to trump OpenAI's ChatGPT, which runs on GPT-4 and whose growing popularity had been aggressively challenged by Google with LaMDA and Bard. Hassabis highlighted the strengths of DeepMind's AlphaGo program, which gained worldwide attention in 2016 when it defeated Go champion Lee Sedol, saying that Gemini would combine the power of AlphaGo and other Google–DeepMind LLMs. [5]
In August 2023, The Information published a report outlining Google's roadmap for Gemini, revealing that the company was targeting a launch date of late 2023. According to the report, Google hoped to surpass OpenAI and other competitors by combining conversational text capabilities present in most LLMs with artificial intelligence–powered image generation, allowing it to create contextual images and be adapted for a wider range of use cases. [6] Like Bard, [7] Google co-founder Sergey Brin was summoned out of retirement to assist in the development of Gemini, along with hundreds of other engineers from Google Brain and DeepMind; [6] [8] he was later credited as a "core contributor" to Gemini. [9] Because Gemini was being trained on transcripts of YouTube videos, lawyers were brought in to filter out any potentially copyrighted materials. [6]
With news of Gemini's impending launch, OpenAI hastened its work on integrating GPT-4 with multimodal features similar to those of Gemini. [10] The Information reported in September that several companies had been granted early access to "an early version" of the LLM, which Google intended to make available to clients through Google Cloud's Vertex AI service. The publication also stated that Google was arming Gemini to compete with both GPT-4 and Microsoft's GitHub Copilot. [11] [12]
On December 6, 2023, Pichai and Hassabis announced "Gemini 1.0" at a virtual press conference. [13] [14] It comprised three models: Gemini Ultra, designed for "highly complex tasks"; Gemini Pro, designed for "a wide range of tasks"; and Gemini Nano, designed for "on-device tasks". At launch, Gemini Pro and Nano were integrated into Bard and the Pixel 8 Pro smartphone, respectively, while Gemini Ultra was set to power "Bard Advanced" and become available to software developers in early 2024. Other products that Google intended to incorporate Gemini into included Search, Ads, Chrome, Duet AI on Google Workspace, and AlphaCode 2. [15] [14] It was made available only in English. [14] [16] Touted as Google's "largest and most capable AI model" and designed to emulate human behavior, [17] [14] [18] the company stated that Gemini would not be made widely available until the following year due to the need for "extensive safety testing". [13] Gemini was trained on and powered by Google's Tensor Processing Units (TPUs), [13] [16] and the name is in reference to the DeepMind–Google Brain merger as well as NASA's Project Gemini. [19]
Gemini Ultra was said to have outperformed GPT-4, Anthropic's Claude 2, Inflection AI's Inflection-2, Meta's LLaMA 2, and xAI's Grok 1 on a variety of industry benchmarks, [20] [13] while Gemini Pro was said to have outperformed GPT-3.5. [3] Gemini Ultra was also the first language model to outperform human experts on the 57-subject Massive Multitask Language Understanding (MMLU) test, obtaining a score of 90%. [3] [19] Gemini Pro was made available to Google Cloud customers on AI Studio and Vertex AI on December 13, while Gemini Nano will be made available to Android developers as well. [21] [22] [23] Hassabis further revealed that DeepMind was exploring how Gemini could be "combined with robotics to physically interact with the world". [24] In accordance with an executive order signed by U.S. President Joe Biden in October, Google stated that it would share testing results of Gemini Ultra with the federal government of the United States. Similarly, the company was engaged in discussions with the government of the United Kingdom to comply with the principles laid out at the AI Safety Summit at Bletchley Park in November. [3]
Google partnered with Samsung to integrate Gemini Nano and Gemini Pro into its Galaxy S24 smartphone lineup in January 2024. [25] [26] The following month, Bard and Duet AI were unified under the Gemini brand, [27] [28] with "Gemini Advanced with Ultra 1.0" debuting via a new "AI Premium" tier of the Google One subscription service. [29] Gemini Pro also received a global launch. [30]
In February, 2024, Google launched "Gemini 1.5" in a limited capacity, positioned as a more powerful and capable model than 1.0 Ultra. [31] [32] [33] This "step change" was achieved through various technical advancements, including a new architecture, a mixture-of-experts approach, and a larger one-million-token context window, which equates to roughly an hour of silent video, 11 hours of audio, 30,000 lines of code, or 700,000 words. [34] The same month, Google debuted Gemma, a family of free and open-source LLMs that serve as a lightweight version of Gemini. They come in two sizes, with a neural network with two and seven billion parameters, respectively. Multiple publications viewed this as a response to Meta and others open-sourcing their AI models, and a stark reversal from Google's longstanding practice of keeping its AI proprietary. [35] [36] [37] Google announced an additional model, Gemini 1.5 Flash, on May 14th at the 2024 I/O keynote. [38]
Gemma 2 was released on June 27, 2024. [39]
Two updated Gemini models, Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002, were released on September 24, 2024. [40]
On December 11, 2024, Google announced Gemini 2.0 Flash Experimental [41] , a significant update to its Gemini AI model. This iteration boasts improved speed and performance over its predecessor, Gemini 1.5 Flash. Key features include a Multimodal Live API for real-time audio and video interactions, enhanced spatial understanding, native image and controllable text-to-speech generation (with watermarking), and integrated tool use, including Google Search. [42] It also introduces improved agentic capabilities, a new Google Gen AI SDK [43] , and "Jules," an experimental AI coding agent for GitHub. Additionally, Google Colab is integrating Gemini 2.0 to generate data science notebooks from natural language. Gemini 2.0 is available through the Gemini chat interface for all users as "Gemini 2.0 Flash experimental". Gemini 2.0 Flash is currently in an experimental phase, with a wider release expected in early 2025.
The first generation of Gemini ("Gemini 1") has three models, with the same architecture. They are decoder-only transformers, with modifications to allow efficient training and inference on TPUs. They have a context length of 32,768 tokens, with multi-query attention. Two versions of Gemini Nano, Nano-1 (1.8 billion parameters) and Nano-2 (3.25 billion parameters), are distilled from larger Gemini models, designed for use by edge devices such as smartphones. As Gemini is multimodal, each context window can contain multiple forms of input. The different modes can be interleaved and do not have to be presented in a fixed order, allowing for a multimodal conversation. For example, the user might open the conversation with a mix of text, picture, video, and audio, presented in any order, and Gemini might reply with the same free ordering. Input images may be of different resolutions, while video is inputted as a sequence of images. Audio is sampled at 16 kHz and then converted into a sequence of tokens by the Universal Speech Model. Gemini's dataset is multimodal and multilingual, consisting of "web documents, books, and code, and includ[ing] image, audio, and video data". [44]
The second generation of Gemini ("Gemini 1.5") has two models. Gemini 1.5 Pro is a multimodal sparse mixture-of-experts, with a context length in the millions, while Gemini 1.5 Flash is distilled from Gemini 1.5 Pro, with a context length above 2 million. [45]
Gemma 2 27B is trained on web documents, code, science articles. Gemma 2 9B was distilled from 27B. Gemma 2 2B was distilled from a 7B model that remained unreleased. [46]
As of 2024 August, the models released include [47]
Gemini's launch was preluded by months of intense speculation and anticipation, which MIT Technology Review described as "peak AI hype". [49] [20] In August 2023, Dylan Patel and Daniel Nishball of research firm SemiAnalysis penned a blog post declaring that the release of Gemini would "eat the world" and outclass GPT-4, prompting OpenAI CEO Sam Altman to ridicule the duo on X (formerly Twitter). [50] [51] Business magnate Elon Musk, who co-founded OpenAI, weighed in, asking, "Are the numbers wrong?" [52] Hugh Langley of Business Insider remarked that Gemini would be a make-or-break moment for Google, writing: "If Gemini dazzles, it will help Google change the narrative that it was blindsided by Microsoft and OpenAI. If it disappoints, it will embolden critics who say Google has fallen behind." [53]
Reacting to its unveiling in December 2023, University of Washington professor emeritus Oren Etzioni predicted a "tit-for-tat arms race" between Google and OpenAI. Professor Alexei Efros of the University of California, Berkeley praised the potential of Gemini's multimodal approach, [19] while scientist Melanie Mitchell of the Santa Fe Institute called Gemini "very sophisticated". Professor Chirag Shah of the University of Washington was less impressed, likening Gemini's launch to the routineness of Apple's annual introduction of a new iPhone. Similarly, Stanford University's Percy Liang, the University of Washington's Emily Bender, and the University of Galway's Michael Madden cautioned that it was difficult to interpret benchmark scores without insight into the training data used. [49] [54] Writing for Fast Company , Mark Sullivan opined that Google had the opportunity to challenge the iPhone's dominant market share, believing that Apple was unlikely to have the capacity to develop functionality similar to Gemini with its Siri virtual assistant. [55] Google shares spiked by 5.3 percent the day after Gemini's launch. [56] [57]
Google faced criticism for a demonstrative video of Gemini, which was not conducted in real time. [58]
DeepMind Technologies Limited, trading as Google DeepMind or simply DeepMind, is a British-American artificial intelligence research laboratory which serves as a subsidiary of Alphabet Inc.. Founded in the UK in 2010, it was acquired by Google in 2014 and merged with Google AI's Google Brain division to become Google DeepMind in April 2023. The company is based in London, with research centres in Canada, France, Germany, and the United States.
Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval, text-to-image generation, aesthetic ranking, and image captioning.
OpenAI is an American artificial intelligence (AI) research organization founded in December 2015 and headquartered in San Francisco, California. Its stated mission is to develop "safe and beneficial" artificial general intelligence (AGI), which it defines as "highly autonomous systems that outperform humans at most economically valuable work". As a leading organization in the ongoing AI boom, OpenAI is known for the GPT family of large language models, the DALL-E series of text-to-image models, and a text-to-video model named Sora. Its release of ChatGPT in November 2022 has been credited with catalyzing widespread interest in generative AI.
Google AI is a division of Google dedicated to artificial intelligence. It was announced at Google I/O 2017 by CEO Sundar Pichai.
You.com is an AI assistant that began as a personalization-focused search engine. While still offering web search capabilities, You.com has evolved to prioritize a chat-first AI assistant.
LaMDA is a family of conversational large language models developed by Google. Originally developed and introduced as Meena in 2020, the first-generation LaMDA was announced during the 2021 Google I/O keynote, while the second generation was announced the following year.
ChatGPT is a generative artificial intelligence chatbot developed by OpenAI and launched in 2022. It is currently based on the GPT-4o large language model (LLM). ChatGPT can generate human-like conversational responses and enables users to refine and steer a conversation towards a desired length, format, style, level of detail, and language. It is credited with accelerating the AI boom, which has led to ongoing rapid investment in and public attention to the field of artificial intelligence (AI). Some observers have raised concern about the potential of ChatGPT and similar programs to displace human intelligence, enable plagiarism, or fuel misinformation.
In the field of artificial intelligence (AI), a hallucination or artificial hallucination is a response generated by AI that contains false or misleading information presented as fact. This term draws a loose analogy with human psychology, where hallucination typically involves false percepts. However, there is a key difference: AI hallucination is associated with erroneous responses rather than perceptual experiences.
Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model trained and created by OpenAI and the fourth in its series of GPT foundation models. It was launched on March 14, 2023, and made publicly available via the paid chatbot product ChatGPT Plus, via OpenAI's API, and via the free chatbot Microsoft Copilot. As a transformer-based model, GPT-4 uses a paradigm where pre-training using both public data and "data licensed from third-party providers" is used to predict the next token. After this step, the model was then fine-tuned with reinforcement learning feedback from humans and AI for human alignment and policy compliance.
A generative pre-trained transformer (GPT) is a type of large language model (LLM) and a prominent framework for generative artificial intelligence. It is an artificial neural network that is used in natural language processing by machines. It is based on the transformer deep learning architecture, pre-trained on large data sets of unlabeled text, and able to generate novel human-like content. As of 2023, most LLMs had these characteristics and are sometimes referred to broadly as GPTs.
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.
The AI boom, or AI spring, is an ongoing period of rapid progress in the field of artificial intelligence (AI) that started in the late 2010s before gaining international prominence in the early 2020s. Examples include protein folding prediction led by Google DeepMind as well as large language models and generative AI applications developed by OpenAI.
PaLM is a 540 billion-parameter transformer-based large language model (LLM) developed by Google AI. Researchers also trained smaller versions of PaLM to test the effects of model scale.
Microsoft Copilot is a generative artificial intelligence chatbot developed by Microsoft. Based on the GPT-4 series of large language models, it was launched in 2023 as Microsoft's primary replacement for the discontinued Cortana.
Gemini, formerly known as Bard, is a generative artificial intelligence chatbot developed by Google. Based on the large language model (LLM) of the same name, it was launched in 2023 after being developed as a direct response to the rise of OpenAI's ChatGPT. It was previously based on PaLM, and initially the LaMDA family of large language models.
Mistral AI, headquartered in Paris, France specializes in artificial intelligence (AI) products and focuses on open-weight large language models, (LLMs). Founded in April 2023 by former engineers from Google DeepMind and Meta Platforms, the company has gained prominence as an alternative to proprietary AI systems. Named after the mistral – a powerful, cold wind in southern France – the company emphasized openness and innovation in the AI field. Mistral AI positions itself as an alternative to proprietary models.
Huawei PanGu, PanGu, PanGu-Σ or PanGu-π is a multimodal large language model developed by Huawei. It was announced on July 7, 2023.
Apple Intelligence is an artificial intelligence system developed by Apple Inc. Relying on a combination of on-device and server processing, it was announced on June 10, 2024, at WWDC 2024, as a built-in feature of Apple's iOS 18, iPadOS 18, and macOS Sequoia, which were announced alongside Apple Intelligence. Apple Intelligence is free for all users with supported devices. It launched for developers and testers on July 29, 2024, in U.S. English, with the iOS 18.1, macOS 15.1, and iPadOS 18.1 developer betas, released partially in October 2024, and will fully launch by 2025. UK, Australia, Canada, New Zealand, and South African localized versions of English gained support on December 11, 2024, while Chinese, English (India), English (Singapore), French, German, Italian, Japanese, Korean, Portuguese, Spanish, and Vietnamese will be added over the course of 2025. Apple Intelligence is also set to start rolling out in the European Union in April 2025.
{{cite web}}
: CS1 maint: multiple names: authors list (link)