| Company type | Privately held company |
| --- | --- |
| Industry | Speech recognition |
| Founded | 2006 |
| Founder | Tony Robinson |
| Headquarters | Cambridge, UK |
| Number of locations | Cambridge, UK; London, UK; Chennai, India; Brno, Czech Republic |
| Area served | Global |
| Key people | Katy Wigdahl (CEO) |
| Products | Automatic Speech Recognition (ASR), Cloud-based ASR, Speech-to-text, Autonomous Speech Recognition |
| Revenue | €11,342,008 (2021) |
| Number of employees | 100–250 |
| Website | www.speechmatics.com |
Speechmatics is a technology company based in Cambridge, England, that develops automatic speech recognition (ASR) software based on recurrent neural networks and statistical language modelling. The company was founded in 2006 as Cantab Research Ltd by speech recognition specialist Dr. Tony Robinson. [1] [2]
Speechmatics licenses its speech recognition engine to solution and service providers for integration into their own technology stacks, across industries and use cases. [3] Businesses use Speechmatics to transcribe human speech into text across genders and demographic groups. The technology can be deployed on-premises or in public and private clouds. [4] [5]
Speechmatics was founded in 2006 by Tony Robinson, who pioneered the application of recurrent neural networks to speech recognition. [6] [7] [8] He was among the first researchers to demonstrate the practical capabilities of deep neural networks for speech recognition. [9]
In 2014, the company led the development of a billion-word text corpus for measuring progress in statistical language modelling and placed the corpus into the public domain to help accelerate the development of speech recognition technology. [10]
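Benchmark corpora of this kind are conventionally scored by perplexity: a measure of how well a language model predicts held-out text, with lower values indicating a better model. The following is a minimal illustrative sketch (not taken from Speechmatics or the benchmark's tooling) showing how perplexity is computed for a toy unigram model with add-one smoothing:

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model on held-out text.

    Lower perplexity means the model assigns higher probability to the
    test corpus -- the standard yardstick benchmark corpora are used for.
    """
    counts = Counter(train_tokens)
    total = len(train_tokens)
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    log_prob = 0.0
    for tok in test_tokens:
        p = (counts[tok] + 1) / (total + vocab)  # add-one smoothing
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(test_tokens))

train = "the cat sat on the mat".split()
test = "the dog sat on the mat".split()
print(f"perplexity: {unigram_perplexity(train, test):.1f}")
```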
In 2017, the company announced it had developed a new computational method for building language models quickly. [11] Around the same time, Speechmatics announced a partnership with the Qatar Computing Research Institute (QCRI) to develop advanced Arabic speech-to-text services. [12]
In 2018, Speechmatics became the first ASR provider to develop a Global English language pack, which incorporates the dialects and accents of English into a single model. [13]
In 2019, the company raised £6.35 million in a Series A funding round from Albion Venture Capital, IQ Capital, and Amadeus Capital Partners, enabling it to scale as a fast-growing technology start-up. [14] In the same year, the company won a Queen's Award for Enterprise for Innovation. [15] [16]
In 2020, Speechmatics expanded geographically beyond its product development base, opening offices in Brno, Czech Republic; Denver, USA; and Chennai, India. [17] [18]
In March 2021, Speechmatics launched on the Microsoft Azure Marketplace, making its speech recognition engine available directly within the Microsoft Azure technology stack so that businesses can adopt it without additional integration work. [19]
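In broad terms, cloud ASR services of this kind expose a REST API: a client uploads audio and then polls for, or streams back, a transcript. The sketch below illustrates only the general pattern; the endpoint URL, field names, and response shape are hypothetical placeholders, not Speechmatics' documented API:

```python
# Illustrative only: the endpoint, auth scheme, and response shape are
# hypothetical stand-ins for how a cloud ASR job submission might look.
import requests

API_KEY = "YOUR_API_KEY"                   # assumption: token-based auth
ENDPOINT = "https://example.com/v2/jobs"   # placeholder URL, not a real service

def transcribe(audio_path: str, language: str = "en") -> str:
    """Submit an audio file to a cloud ASR service and return the job ID."""
    with open(audio_path, "rb") as audio:
        response = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"data_file": audio},
            data={"language": language},
            timeout=30,
        )
    response.raise_for_status()
    return response.json()["id"]  # assumed response field

job_id = transcribe("meeting.wav")
print(f"submitted transcription job: {job_id}")
```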
In December 2021, Speechmatics and consumer AI startup Personal.ai announced a partnership to offer individuals a personal AI that records and recalls their conversations, spoken notes, reminders, and meeting remarks, regardless of the speaker's English dialect or accent. [20]
In March 2023, Speechmatics released Ursa, a speech-to-text engine that the company says sets a new benchmark in transcription accuracy. Ursa was trained on millions of hours of audio data and is designed to capture spoken words in noisy and challenging environments. [21]
In July 2024, Speechmatics introduced Flow, an API that lets businesses build inclusive, seamless, and responsive speech interactions into their products. [22]
In February 2018, Speechmatics launched Global English, a single English language pack supporting all major English accents for speech-to-text transcription. Global English (GE) was trained on spoken data from users in 40 countries and billions of words drawn from global sources, making it one of the more comprehensive and accurate accent-agnostic transcription solutions on the market. [23] [24]
In November 2020, the company launched the first Global Spanish language pack on the market to support all major Spanish accents. Global Spanish (GS) is a single Spanish language pack trained on data drawn from a wide range of sources, particularly from Latin America, which the company describes as the most accurate and comprehensive accent-independent Spanish pack for speech-to-text. [25]
In October 2021, Speechmatics launched its 'Autonomous Speech Recognition' software. [26] [27] Built with deep learning techniques and the company's self-supervised models, the software outperformed offerings from Amazon, Apple, Google, and Microsoft in the company's own benchmarks, a step towards its stated mission to understand all voices. [28] [29]
Speechmatics was named in the FT 1000: Europe's Fastest Growing Companies list for four consecutive years, from 2019 to 2022. [30] [31]
In 2018, the company won the SME National Business Award for High Growth Business of the Year. [32]
In 2019, Speechmatics won the Queen's Award for Enterprise in the Innovation category. [33] [34]