This article needs additional citations for verification .(January 2012) |
Speaker diarisation (or diarization) is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. [1] It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. [2] It is used to answer the question "who spoke when?" [3] Speaker diarisation is a combination of speaker segmentation and speaker clustering. The first aims at finding speaker change points in an audio stream. The second aims at grouping together speech segments on the basis of speaker characteristics.
With the increasing number of broadcasts, meeting recordings and voice mail collected every year, speaker diarisation has received much attention by the speech community, as is manifested by the specific evaluations devoted to it under the auspices of the National Institute of Standards and Technology for telephone speech, broadcast news and meetings. [4]
In speaker diarisation, one of the most popular methods is to use a Gaussian mixture model to model each of the speakers, and assign the corresponding frames for each speaker with the help of a Hidden Markov Model. There are two main kinds of clustering scenario. The first one is by far the most popular and is called Bottom-Up. The algorithm starts in splitting the full audio content in a succession of clusters and progressively tries to merge the redundant clusters in order to reach a situation where each cluster corresponds to a real speaker. The second clustering strategy is called top-down and starts with one single cluster for all the audio data and tries to split it iteratively until reaching a number of clusters equal to the number of speakers. A 2010 review can be found at .
More recently, speaker diarisation is performed via neural networks leveraging large-scale GPU computing and methodological developments in deep learning. [5]
There are some open source initiatives for speaker diarisation (in alphabetical order):
Speech processing is the study of speech signals and the processing methods of signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied to speech signals. Aspects of speech processing includes the acquisition, manipulation, storage, transfer and output of speech signals. Different speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement, speaker recognition, etc.
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.
Voice activity detection (VAD), also known as speech activity detection or speech detection, is the detection of the presence or absence of human speech, used in speech processing. The main uses of VAD are in speaker diarization, speech coding and speech recognition. It can facilitate speech processing, and can also be used to deactivate some processes during non-speech section of an audio session: it can avoid unnecessary coding/transmission of silence packets in Voice over Internet Protocol (VoIP) applications, saving on computation and on network bandwidth.
eric is a free integrated development environment (IDE) used for computer programming. Since it is a full featured IDE, it provides by default all necessary tools needed for the writing of code and for the professional management of a software project.
Dragon NaturallySpeaking is a speech recognition software package developed by Dragon Systems of Newton, Massachusetts, which was acquired in turn by Lernout & Hauspie Speech Products, Nuance Communications, and Microsoft. It runs on Windows personal computers. Version 15, which supports 32-bit and 64-bit editions of Windows 7, 8 and 10, was released in August 2016.
Fuzzy clustering is a form of clustering in which each data point can belong to more than one cluster.
ITK is a cross-platform, open-source application development framework widely used for the development of image segmentation and image registration programs. Segmentation is the process of identifying and classifying data found in a digitally sampled representation. Typically the sampled representation is an image acquired from such medical instrumentation as CT or MRI scanners. Registration is the task of aligning or developing correspondences between data. For example, in the medical environment, a CT scan may be aligned with an MRI scan in order to combine the information contained in both.
Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language processing.
As of the early 2000s, several speech recognition (SR) software packages exist for Linux. Some of them are free and open-source software and others are proprietary software. Speech recognition usually refers to software that attempts to distinguish thousands of words in a human language. Voice control may refer to software used for communicating operational commands to a computer.
In computer science, Orfeo Toolbox (OTB) is a software library for processing images from Earth observation satellites.
Time delay neural network (TDNN) is a multilayer artificial neural network architecture whose purpose is to 1) classify patterns with shift-invariance, and 2) model context at each layer of the network.
Speech enhancement aims to improve speech quality by using various algorithms. The objective of enhancement is improvement in intelligibility and/or overall perceptual quality of degraded speech signal using audio signal processing techniques.
Semantic audio is the extraction of meaning from audio signals. The field of semantic audio is primarily based around the analysis of audio to create some meaningful metadata, which can then be used in a variety of different ways.
The Point Cloud Library (PCL) is an open-source library of algorithms for point cloud processing tasks and 3D geometry processing, such as occur in three-dimensional computer vision. The library contains algorithms for filtering, feature estimation, surface reconstruction, 3D registration, model fitting, object recognition, and segmentation. Each module is implemented as a smaller library that can be compiled separately. PCL has its own data format for storing point clouds - PCD, but also allows datasets to be loaded and saved in many other formats. It is written in C++ and released under the BSD license.
In signal processing, Feature space Maximum Likelihood Linear Regression (fMLLR) is a global feature transform that are typically applied in a speaker adaptive way, where fMLLR transforms acoustic features to speaker adapted features by a multiplication operation with a transformation matrix. In some literature, fMLLR is also known as the Constrained Maximum Likelihood Linear Regression (cMLLR).
Stephen John Young is a British researcher, Professor of Information Engineering at the University of Cambridge and an entrepreneur. He is one of the pioneers of automated speech recognition and statistical spoken dialogue systems. He served as the Senior Pro-Vice-Chancellor of the University of Cambridge from 2009 to 2015, responsible for planning and resources. From 2015 to 2019, he held a joint appointment between his professorship at Cambridge and Apple, where he was a senior member of the Siri development team.
openSMILE is source-available software for automatic extraction of features from audio signals and for classification of speech and music signals. "SMILE" stands for "Speech & Music Interpretation by Large-space Extraction". The software is mainly applied in the area of automatic emotion recognition and is widely used in the affective computing research community. The openSMILE project exists since 2008 and is maintained by the German company audEERING GmbH since 2013. openSMILE is provided free of charge for research purposes and personal use under a source-available license. For commercial use of the tool, the company audEERING offers custom license options.
Voice computing is the discipline that develops hardware or software to process voice inputs.
OpenVINO toolkit is a free toolkit facilitating the optimization of a deep learning model from a framework and deployment using an inference engine onto Intel hardware. The toolkit has two versions: OpenVINO toolkit, which is supported by open source community and Intel Distribution of OpenVINO toolkit, which is supported by Intel. OpenVINO was developed by Intel. The toolkit is cross-platform and free for use under Apache License version 2.0. The toolkit enables a write-once, deploy-anywhere approach to deep learning deployments on Intel platforms, including CPU, integrated GPU, Intel Movidius VPU, and FPGAs.
An audio deepfake is a type of artificial intelligence used to create convincing speech sentences that sound like specific people saying things they did not say. This technology was initially developed for various applications to improve human life. For example, it can be used to produce audiobooks, and also to help people who have lost their voices to get them back. Commercially, it has opened the door to several opportunities. This technology can also create more personalized digital assistants and natural-sounding text-to-speech as well as speech translation services.