Whisper (speech recognition system)

Original author(s) OpenAI [1]
Initial release September 21, 2022
Repository https://github.com/openai/whisper

Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022. [2]

It can transcribe speech in English and several other languages, [3] and can also translate speech in several non-English languages into English. OpenAI claims that the combination of different training data used in its development has led to improved recognition of accents, background noise and jargon compared to previous approaches. [4]

Whisper is a weakly-supervised deep learning acoustic model, made using an encoder-decoder transformer architecture. [5]

Whisper V2 (the large-v2 model) was released on December 8, 2022. [6] Whisper V3 was released in November 2023, at OpenAI's DevDay. [7]

Background

Speech recognition has a long research history. The first approaches used statistical methods such as dynamic time warping, and later hidden Markov models. Around the 2010s, deep neural network approaches became more common for speech recognition, enabled by big data and increased computational performance. [8] Early deep learning approaches to speech recognition included convolutional neural networks, which were limited by their inability to capture sequential data. This led to the development of Seq2seq approaches, including recurrent neural networks that made use of long short-term memory. [9]

Transformers, introduced in 2017 by Google, displaced many prior state-of-the-art approaches to problems in machine learning and became the core neural architecture in fields such as language modeling and computer vision. [10] In the early 2020s, weakly supervised approaches to training acoustic models were recognized as promising for speech recognition using deep neural networks. [11]

Training and capabilities

Whisper was trained using weak supervision on 680,000 hours of multilingual and multitask data, of which about 17% (117,000 hours) was non-English audio. Whisper does not outperform models that specialize in the LibriSpeech dataset, but when tested across many datasets it is more robust and makes 50% fewer errors than those models. [12]
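
The proportion of non-English audio follows directly from the two figures quoted above. The short check below is only an arithmetic sketch; the constant and function names are illustrative and do not come from any Whisper code.

```python
# Figures quoted in the text above (hours of training audio).
TOTAL_HOURS = 680_000
NON_ENGLISH_HOURS = 117_000


def share(part: float, whole: float) -> float:
    """Return part / whole as a fraction between 0 and 1."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


non_english_share = share(NON_ENGLISH_HOURS, TOTAL_HOURS)
print(f"Non-English share: {non_english_share:.1%}")  # prints "Non-English share: 17.2%"
```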

Whisper's error rate varies by language, with a higher word error rate for languages that are not well represented in the training data. [13]
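
Word error rate is conventionally defined as the word-level edit distance between a reference transcript and the system output, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, assuming simple whitespace tokenization. It is not the evaluation code used by OpenAI, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}")  # prints "WER: 0.333"
```

Because the denominator is the reference length, the value can exceed 1.0 when the output contains many inserted words.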

The model has been used as the base for a unified model for speech recognition and more general sound recognition. [14]

Architecture

The Whisper architecture is based on an encoder-decoder transformer. Input audio is split into 30-second chunks, each of which is converted into a log-Mel spectrogram and passed to an encoder. A decoder is trained to predict the corresponding text caption. Special tokens direct the single model to perform several tasks, such as producing phrase-level timestamps. [12]
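
The chunking step and the use of special tokens described above can be sketched in a few lines. This is an illustration rather than the reference implementation. The 16 kHz sampling rate is an assumption not stated in this article, the token strings follow the naming used in the Whisper paper, and both function names are hypothetical. Spectrogram computation is omitted.

```python
from typing import List

SAMPLE_RATE = 16_000  # assumed sampling rate in Hz (not stated in the text)
CHUNK_SECONDS = 30    # chunk length given in the text


def split_into_chunks(
    samples: List[float],
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: int = CHUNK_SECONDS,
) -> List[List[float]]:
    """Split audio samples into fixed-length chunks, zero-padding the last one."""
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        # Pad with silence so every chunk has exactly chunk_len samples.
        chunk = chunk + [0.0] * (chunk_len - len(chunk))
        chunks.append(chunk)
    return chunks


def build_decoder_prompt(
    language: str, task: str = "transcribe", timestamps: bool = True
) -> List[str]:
    """Build the special-token prefix that tells the decoder which task to run."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt


# 70 seconds of dummy audio yields three 30-second chunks (the last one padded).
audio = [0.1] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print(len(chunks))  # prints 3
print(build_decoder_prompt("fr", task="translate", timestamps=False))
```

Encoding the task in the token prefix is what lets a single set of weights serve transcription, translation and timestamp prediction without separate task-specific heads.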

References

  1. Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022-12-06). "Robust Speech Recognition via Large-Scale Weak Supervision". arXiv:2212.04356 [eess.AS].
  2. Golla, Ramsri Goutham (2023-03-06). "Here Are Six Practical Use Cases for the New Whisper API". Slator. Archived from the original on 2023-03-25. Retrieved 2023-08-12.
  3. Dickson, Ben (2022-10-03). "How will OpenAI's Whisper model impact AI applications?". VentureBeat. Archived from the original on 2023-03-15. Retrieved 2023-08-12.
  4. Wiggers, Kyle (2022-09-21). "OpenAI open-sources Whisper, a multilingual speech recognition system". TechCrunch. Archived from the original on 2023-02-12. Retrieved 2023-02-12.
  5. Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022-12-06). "Robust Speech Recognition via Large-Scale Weak Supervision". p. 3. arXiv:2212.04356 [eess.AS].
  6. "Announcing the large-v2 model · openai/whisper · Discussion #661". GitHub. Retrieved 2024-01-08.
  7. "OpenAI DevDay: Opening Keynote". Retrieved 2024-01-08.
  8. Yu, Dong; Deng, Li (2014). Automatic speech recognition: a deep learning approach. Signals and communication technology (2015th ed.). London Heidelberg: Springer. p. 9. ISBN 978-1-4471-5778-6.
  9. Siddique, Latif; Zaidi, Aun; Cuayahuitl, Heriberto; Shamshad, Fahad; Shoukat, Moazzam; Qadir, Junaid (2023). "Transformers in Speech Processing: A Survey". arXiv:2303.11607v1 [cs.CL].
  10. Kamath, Uday; Graham, Kenneth L.; Emara, Wael (2022). Transformers for machine learning: a deep dive. Chapman & Hall/CRC machine learning & pattern recognition (First ed.). Boca Raton London New York: CRC Press, Taylor & Francis Group. pp. xix. ISBN 978-0-367-76734-1.
  11. Paaß, Gerhard; Giesselbach, Sven (2023-02-16). "Foundation Models for Speech, Images, Videos, and Control". Foundation Models for Natural Language Processing. Artificial Intelligence: Foundations, Theory, and Algorithms. pp. 313–382. arXiv:2302.08575. doi:10.1007/978-3-031-23190-2_7. ISBN 978-3-031-23189-6. S2CID 257019816.
  12. "Introducing Whisper". openai.com. 2022-09-21. Archived from the original on 2023-08-20. Retrieved 2023-08-21.
  13. Wiggers, Kyle (2023-03-01). "OpenAI debuts Whisper API for speech-to-text transcription and translation". TechCrunch. Archived from the original on 2023-07-18. Retrieved 2023-08-21.
  14. Yuan, Gong; Khurana, Sameer; Karlinsky, Leonid; Glass, James (2023). "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers". Interspeech 2023. pp. 2798–2802. arXiv:2307.03183. doi:10.21437/Interspeech.2023-2193.