Whisper (speech recognition system)

Original author(s) OpenAI [1]
Initial release September 21, 2022
Repository https://github.com/openai/whisper

Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022. [2]

It can transcribe speech in English and several other languages, [3] and can also translate several non-English languages into English. OpenAI claims that the combination of different training data used in its development has led to improved recognition of accents, background noise, and jargon compared to previous approaches. [4]
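As an illustration, the reference implementation in the repository above exposes both tasks through its Python API. The following is a minimal sketch, assuming the openai-whisper package and ffmpeg are installed and using a placeholder input file audio.mp3; exact defaults may differ between versions:

    import whisper

    # Load a pretrained checkpoint ("base" here; larger checkpoints trade speed for accuracy).
    model = whisper.load_model("base")

    # Transcribe speech in its original language.
    result = model.transcribe("audio.mp3")
    print(result["text"])

    # Translate non-English speech into English instead of transcribing it.
    translated = model.transcribe("audio.mp3", task="translate")
    print(translated["text"])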

Whisper is a weakly supervised deep learning acoustic model, built on an encoder-decoder transformer architecture. [5]

Whisper V2 was released on December 8, 2022. [6] Whisper V3 was released in November 2023 at OpenAI's DevDay. [7]

Background

Speech recognition has a long history in research; the first approaches used statistical methods such as dynamic time warping and, later, hidden Markov models. Around the 2010s, deep neural network approaches became more common for speech recognition models, enabled by the availability of large datasets ("big data") and increased computational performance. [8] Early deep learning approaches to speech recognition included convolutional neural networks, which were limited by their inability to capture sequential data; this led to the development of sequence-to-sequence (Seq2seq) approaches, including recurrent neural networks that made use of long short-term memory. [9]

Transformers, introduced in 2017 by Google, displaced many prior state-of-the-art approaches in machine learning and became the core neural architecture in fields such as language modeling and computer vision; [10] in the early 2020s, weakly supervised approaches to training acoustic models were recognized as promising for speech recognition with deep neural networks. [11]

According to a New York Times report, in 2021 OpenAI believed it had exhausted its sources of higher-quality data for training its large language models and decided to complement scraped web text with transcriptions of YouTube videos and podcasts; Whisper was developed to perform this transcription. [12]

Training and capabilities

Whisper was trained using weak supervision on 680,000 hours of multilingual and multitask data, of which 117,000 hours (about 17%) were non-English audio. Whisper does not outperform models that specialize in the LibriSpeech dataset, but when tested across many diverse datasets it is more robust, making 50% fewer errors than those models. [13]

Whisper's error rate varies across languages, with higher word error rates for languages that are less well represented in the training data. [14]
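Word error rate, the metric used in such comparisons, is the standard measure of transcription quality: the minimum number of word substitutions (S), deletions (D), and insertions (I) needed to turn the system output into the reference transcript, divided by the number of words (N) in the reference:

    WER = (S + D + I) / N

For example, a 10-word reference transcribed with one substituted and one deleted word gives a WER of (1 + 1 + 0) / 10 = 0.2, or 20%.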

The model has been used as the base for a unified model for speech recognition and more general sound recognition. [15]

Architecture

The Whisper architecture is based on an encoder-decoder transformer. Input audio is split into 30-second chunks, each converted into a log-Mel spectrogram and passed to the encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform several tasks, such as phrase-level timestamps. [13]
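These preprocessing and decoding stages are visible in the lower-level Python API of the open-source implementation. The sketch below follows the usage example in the repository's README, again assuming a placeholder audio.mp3 and the "base" checkpoint; function names may vary across versions:

    import whisper

    # Load a pretrained encoder-decoder checkpoint.
    model = whisper.load_model("base")

    # Load the audio and pad or trim it to the fixed 30-second window.
    audio = whisper.load_audio("audio.mp3")
    audio = whisper.pad_or_trim(audio)

    # Convert the waveform into a log-Mel spectrogram on the model's device.
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # Identify the spoken language from the encoded audio.
    _, probs = model.detect_language(mel)
    print("Detected language:", max(probs, key=probs.get))

    # Decode: the decoder predicts the caption, steered by special task tokens
    # (language, transcribe/translate, timestamps) set via DecodingOptions.
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    print(result.text)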

See also

Natural language processing
Speech recognition
Language model
Long short-term memory
Deep learning
Feature learning
Google Brain
Adversarial machine learning
Multimodal learning
Neural machine translation
OpenAI
Transformer (deep learning architecture)
GPT-3
Audio deepfake
Attention (machine learning)
Self-supervised learning
GPT-1
Generative pre-trained transformer
Large language model
Fine-tuning (deep learning)

References

  1. Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022-12-06). "Robust Speech Recognition via Large-Scale Weak Supervision". arXiv: 2212.04356 [eess.AS].
  2. Golla, Ramsri Goutham (2023-03-06). "Here Are Six Practical Use Cases for the New Whisper API". Slator. Archived from the original on 2023-03-25. Retrieved 2023-08-12.
  3. Dickson, Ben (2022-10-03). "How will OpenAI's Whisper model impact AI applications?". VentureBeat. Archived from the original on 2023-03-15. Retrieved 2023-08-12.
  4. Wiggers, Kyle (September 21, 2022). "OpenAI open-sources Whisper, a multilingual speech recognition system". TechCrunch. Archived from the original on February 12, 2023. Retrieved February 12, 2023.
  5. Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022-12-06). "Robust Speech Recognition via Large-Scale Weak Supervision". p. 3. arXiv: 2212.04356 [eess.AS].
  6. "Announcing the large-v2 model · openai/whisper · Discussion #661". GitHub. Retrieved 2024-01-08.
  7. OpenAI DevDay: Opening Keynote. Retrieved 2024-01-08.
  8. Yu, Dong; Deng, Li (2014). Automatic speech recognition: a deep learning approach. Signals and communication technology (2015 ed.). London; Heidelberg: Springer. p. 9. ISBN 978-1-4471-5778-6.
  9. Latif, Siddique; Zaidi, Aun; Cuayahuitl, Heriberto; Shamshad, Fahad; Shoukat, Moazzam; Qadir, Junaid (2023). "Transformers in Speech Processing: A Survey". arXiv: 2303.11607v1 [cs.CL].
  10. Kamath, Uday; Graham, Kenneth L.; Emara, Wael (2022). Transformers for machine learning: a deep dive. Chapman & Hall/CRC machine learning & pattern recognition (First ed.). Boca Raton; London; New York: CRC Press, Taylor & Francis Group. p. xix. ISBN 978-0-367-76734-1.
  11. Paaß, Gerhard; Giesselbach, Sven (2023-02-16). "Foundation Models for Speech, Images, Videos, and Control". Foundation Models for Natural Language Processing. Artificial Intelligence: Foundations, Theory, and Algorithms. pp. 313–382. arXiv: 2302.08575 . doi:10.1007/978-3-031-23190-2_7. ISBN   978-3-031-23189-6. S2CID   257019816.
  12. Davis, Wes (2024-04-06). "OpenAI transcribed over a million hours of YouTube videos to train GPT-4". The Verge. Retrieved 2024-04-20.
  13. "Introducing Whisper". openai.com. 2022-09-21. Archived from the original on 2023-08-20. Retrieved 2023-08-21.
  14. Wiggers, Kyle (2023-03-01). "OpenAI debuts Whisper API for speech-to-text transcription and translation". TechCrunch. Archived from the original on 2023-07-18. Retrieved 2023-08-21.
  15. Yuan, Gong; Khurana, Sameer; Karlinsky, Leonid; Glass, James (2023). "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers". Interspeech 2023. pp. 2798–2802. arXiv: 2307.03183 . doi:10.21437/Interspeech.2023-2193.