Voice analysis

Voice analysis is the study of speech sounds for purposes other than extracting linguistic content, as is done in speech recognition. Such studies mostly involve medical analysis of the voice (phoniatrics), but also speaker identification.[1] More controversially, some believe that the truthfulness or emotional state of speakers can be determined using voice stress analysis or layered voice analysis.

Analysis methods

Voice problems that require voice analysis most commonly originate from the vocal folds or the laryngeal musculature that controls them, since the folds are subject to collision forces with each vibratory cycle and to drying from the air being forced through the small gap between them, and the laryngeal musculature is intensely active during speech or singing and is subject to tiring. However, dynamic analysis of the vocal folds and their movement is physically difficult. The location of the vocal folds effectively prohibits direct, invasive measurement of their movement. Less invasive imaging methods such as X-rays or ultrasound do not work because the vocal folds are surrounded by cartilage, which distorts image quality. Vocal fold movements are also rapid: fundamental frequencies are usually between 80 and 300 Hz, which rules out ordinary video. Stroboscopic and high-speed videos provide an option, but to see the vocal folds a fiberoptic probe leading to the camera must be positioned in the throat, which makes speaking difficult. Placing objects in the pharynx also usually triggers a gag reflex that stops voicing and closes the larynx. Furthermore, stroboscopic imaging is only useful when the vocal fold vibratory pattern is closely periodic.
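
To see why ordinary video falls short, consider how many frames land within one vibratory cycle. The sketch below compares an ordinary 30 frames-per-second camera with a high-speed camera; the 4,000 fps figure is an illustrative value, not a fixed standard.

```python
# Back-of-the-envelope check: how many video frames fall within one vocal fold
# vibratory cycle at typical fundamental frequencies. Ordinary video captures
# only a fraction of a frame per cycle, so the motion cannot be resolved.
for f0 in (80, 150, 300):          # typical fundamental frequencies in Hz
    for fps in (30, 4000):         # ordinary video vs. an illustrative high-speed camera
        print(f"f0 = {f0:>3} Hz at {fps:>4} fps -> {fps / f0:5.1f} frames per cycle")
```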

The most important[according to whom?] indirect methods are currently inverse filtering of either microphone or oral airflow recordings, and electroglottography (EGG).[citation needed] In inverse filtering, the speech sound (the radiated acoustic pressure waveform, as obtained from a microphone) or the oral airflow waveform from a circumferentially vented (CV) mask is recorded outside the mouth and then filtered by a mathematical method to remove the effects of the vocal tract. The method thus estimates the glottal input to voice production by recording the output and using a computational model to invert the effects of the vocal tract. The other kind of noninvasive indirect indication of vocal fold motion is electroglottography, in which electrodes placed on either side of the subject's throat at the level of the vocal folds record changes in the electrical conductivity of the throat according to how large a portion of the vocal folds is in contact. It thus yields one-dimensional information about the contact area. Neither inverse filtering nor EGG is sufficient to completely describe the complex three-dimensional pattern of vocal fold movement, but both can provide useful indirect evidence of that movement.
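
As a rough illustration of the computational side of inverse filtering, the sketch below estimates an all-pole vocal tract model with linear predictive coding (LPC) and applies its inverse to a voiced frame. The file name, frame position, and LPC order are illustrative assumptions; clinical inverse filtering of airflow recordings is considerably more careful.

```python
# A simplified sketch of LPC-based inverse filtering of a voiced speech frame.
import librosa
import scipy.signal

def inverse_filter(frame, sr, lpc_order=None):
    """Estimate the glottal excitation residual of one voiced frame."""
    if lpc_order is None:
        lpc_order = 2 + sr // 1000                 # common rule of thumb for LPC order
    a = librosa.lpc(frame, order=lpc_order)        # all-pole estimate of the vocal tract
    # Filtering the speech with the inverse (FIR) filter removes the estimated
    # vocal-tract resonances, leaving an approximation of the glottal source.
    return scipy.signal.lfilter(a, [1.0], frame)

# Hypothetical usage on a sustained vowel recording:
y, sr = librosa.load("sustained_vowel.wav", sr=None, mono=True)
residual = inverse_filter(y[2000:2000 + 1024], sr)
```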

Another way to conduct voice analysis is to look at voice characteristics such as phonation, pitch, loudness, and rate. These characteristics can be used to evaluate a person's voice and can aid in the voice analysis process. Phonation is typically tested by examining different types of speech material collected from a person, such as words with long vowels, words with many phonemes, or simply typical speech. A person's pitch can be evaluated by having the person produce the highest and lowest sounds they can, as well as sounds in between; a keyboard can be used to aid in this process. Loudness is valuable to examine because, for certain people, loudness affects the way they produce certain sounds: some people need to speak louder for certain phonemes than for others just to produce them.[citation needed] This can be tested by asking the person to use the same loudness while singing a scale. Rate is also important because it captures how quickly or slowly a person speaks.[2]
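
As a rough sketch of how pitch and loudness might be summarized from a recording (the file name and frequency limits below are illustrative assumptions, not a clinical protocol):

```python
# Summarize pitch range and loudness from a recording.
import numpy as np
import librosa

y, sr = librosa.load("patient_reading.wav", sr=None, mono=True)   # hypothetical file

# Pitch: track the fundamental frequency with pYIN and summarize its range.
f0, voiced_flag, _ = librosa.pyin(y, fmin=60.0, fmax=500.0, sr=sr)
f0_voiced = f0[~np.isnan(f0)]                                      # keep voiced frames only
print(f"F0 range: {f0_voiced.min():.0f}-{f0_voiced.max():.0f} Hz "
      f"(median {np.median(f0_voiced):.0f} Hz)")

# Loudness: short-term RMS energy in decibels as a crude loudness proxy.
rms = librosa.feature.rms(y=y)[0]
print(f"Mean level: {20 * np.log10(rms.mean() + 1e-12):.1f} dBFS")
```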

Use in medicine

A medical study of the voice can be, for instance, an analysis of the voices of patients who have had a polyp surgically removed from their vocal folds. Computerized methods can be used to assess such cases in an objective manner.[3] An experienced voice therapist can evaluate a voice quite reliably, but this requires extensive training and remains subjective.
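
One example of such an objective, computerized measure is jitter, the cycle-to-cycle variation of the glottal period. The sketch below derives a simplified local-jitter figure from a pitch track; the file name is hypothetical, and clinical tools measure periods directly from the waveform.

```python
# A simplified local-jitter estimate from a pYIN pitch track of a sustained vowel.
import numpy as np
import librosa

y, sr = librosa.load("sustained_a_vowel.wav", sr=None, mono=True)
f0, _, _ = librosa.pyin(y, fmin=60.0, fmax=400.0, sr=sr)

periods = 1.0 / f0[~np.isnan(f0)]                     # seconds per glottal cycle
jitter_local = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
print(f"Local jitter: {100 * jitter_local:.2f} %")
```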

Another active research topic in medical voice analysis is vocal loading evaluation. The vocal folds of a person who speaks for an extended time suffer from tiring; that is, the process of speaking exerts a load on the vocal folds and tires the tissue. Among professional voice users (e.g., teachers, salespeople) this tiring can cause voice failures and sick leave. Voice analysis has been studied as an objective means to evaluate such problems.[4]

Voice analysis has also been an important factor in the study of vocal fold paralysis, which affects different functions of the vocal folds, from speech to breathing. Voice analysis is used to study the effectiveness of thyroplasty (medialization thyroplasty) in improving vocal fold function after surgery. Traditional voice recordings are made pre-operatively for selected patients and compared with post-operative recordings, along with more complex recordings using electroglottography, photoglottography,[5] and videokymography. Medical professionals are able to read and interpret the results of these complex recordings, but the knowledge of a voice professional is needed in such studies for accurate results. Because of their trained ears, voice experts were important in tying the physical examination of the vocal folds to the neurological examination to ensure the success of the surgery. Perceptual evaluation of the voice relies heavily on voice quality, a factor best assessed by voice specialists (speech therapists). A professional voice analyst has a trained ear and can disregard extraneous variation that could distort the results.[6]

Use in forensics

Voice analysis is used in a branch of forensic science called audio forensics. These analyses are generally performed on evidence for the purposes of evaluating the authenticity of the audio in question, enhancing features of the audio that may be hidden beneath distracting background noise, interpreting the audio from the perspective of a forensic expert,[7] or, in some cases, identifying the speaker.[8]

An expert will employ a variety of techniques in their analysis. At minimum, the procedures include "critical listening, waveform analysis, and spectral analysis".[9] Critical listening involves a thorough breakdown of both foreground and background sounds through repeated listening.[9] Waveform analysis visualizes the audio so the examiner can spot any irregularities. Spectral analysis visualizes the frequency content of the audio so an examiner can pick out features of interest.[9]
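
A minimal sketch of the spectral-analysis step is shown below: rendering a spectrogram so that frequency content over time can be inspected visually. The file name and window parameters are illustrative assumptions.

```python
# Render a spectrogram of a recording for visual inspection.
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

sr, audio = wavfile.read("evidence_call.wav")          # hypothetical evidence copy
if audio.ndim > 1:
    audio = audio.mean(axis=1)                         # fold stereo down to mono

freqs, times, sxx = spectrogram(audio, fs=sr, nperseg=1024, noverlap=768)
plt.pcolormesh(times, freqs, 10 * np.log10(sxx + 1e-12), shading="auto")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram of the recording")
plt.show()
```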

One case in which audio evidence played a significant role is the Trayvon Martin case, in which a recording of a call made to the police was analyzed to determine whether background screams came from George Zimmerman or from Martin.

Forensic voice analysis

Experts in forensic voice analysis examine transmitted and stored speech, enhancing and decoding it for criminal investigations, court trials, and federal agencies.

To use audio recordings in court, a forensic phonetician must authenticate the recording to detect tampering, enhance the audio, and interpret the speech. Their first job is to ensure that the speech in the recording is comprehensible. Oftentimes, samples have poor sound quality due to environmental factors such as wind or movement; other times the sound degradation is due to technological issues within the recording device. Investigative work on speaker identification cannot be done until the recording is of adequate quality. Poor comprehensibility is typically addressed with computer programs that allow the user to filter out and eliminate noise, as sketched below. Computer software can also convert the speech into spectra and waveforms, which is useful for the forensic phonetician. Any such processing, however, should be performed only after a copy of the original recording has been made.
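
The sketch below shows one common clean-up step under assumed conditions: a high-pass filter that attenuates low-frequency rumble (e.g., wind or handling noise) below the speech band. The cutoff frequency and file names are illustrative choices, not a forensic standard.

```python
# High-pass filter a working copy of a recording to attenuate low-frequency rumble.
from scipy.io import wavfile
from scipy.signal import butter, filtfilt

sr, audio = wavfile.read("working_copy.wav")       # hypothetical copy, never the original
if audio.ndim > 1:
    audio = audio.mean(axis=1)                     # fold stereo to mono
audio = audio.astype(float)

cutoff_hz = 80.0                                   # most speech energy lies above this
b, a = butter(N=4, Wn=cutoff_hz / (sr / 2), btype="highpass")
cleaned = filtfilt(b, a, audio)                    # zero-phase filtering adds no delay

wavfile.write("working_copy_highpassed.wav", sr, cleaned.astype("int16"))
```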

A main part of the forensic phonetician's job is speaker identification. The interpretation process might include piecing together a timeline, transcribing the dialog, and identifying unknown or unintelligible sounds in the audio recording. In court, the expert ultimately serves to explain the facts surrounding the audio evidence, providing an explanation of the relevant acoustical and physical principles behind what the recording shows. Reports are written to include detailed information: whether any section of the recording is incomprehensible or inaudible, an explanation of what was happening in the recording, and a description of anything that is missing from it.

Speaker identification

Voice analysis has a role in speaker identification: the identity of a speaker is unknown and has to be identified from among other voices or suspects in a criminal investigation or court trial. Proper identification of speakers, particularly in criminal cases, depends on a number of factors, such as familiarity, exposure, delay, tone of voice, voice disguise, and accent. Familiarity with a speaker increases the chances of properly identifying and distinguishing a voice. The amount of exposure to a voice also aids in correctly identifying it, even if it is an unfamiliar one: a hearer who listened to a longer utterance, or was exposed to a voice more often, is better at recognizing it than someone who was only able to hear a single word. A delay between the time of hearing a voice and the time of identifying the speaker decreases the prospect of identifying the correct speaker. The tone of voice also affects the ability to identify the right speaker; if the tone does not match that of the speaker at the time of comparison, the voice will be more difficult to analyze. Disguise of the voice, for example when a speaker is whispering, likewise hinders the ability to accurately match and identify the speaker. In some cases, individuals who speak the same language as the speaker whose voice is being analyzed will have an easier time identifying them because of the accent and stress of the voice. Speaker identification is additionally complicated by distortions introduced by the recording method and by speaker-based issues, such as emotional states or other motives, that cause a discrepancy between the speaker's voice and that in a recording.

The methods of speaker identification in forensics include the use of earwitnesses who identify voices they have heard, the aural-perceptual approach conducted by a specialist examining the suprasegmentals of an individual's speech, and computer-based approaches, as sketched below.
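
A toy illustration of a computer-based comparison, assuming two hypothetical recordings: each is summarized by its mean mel-frequency cepstral coefficients (MFCCs) and the two summaries are compared with cosine similarity. Real forensic speaker comparison relies on far richer models and likelihood-ratio reporting.

```python
# Compare two recordings by their mean-MFCC signatures (a deliberately crude sketch).
import numpy as np
import librosa

def mfcc_signature(path, n_mfcc=20):
    """Return a single MFCC-mean vector describing one recording."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

known = mfcc_signature("suspect_interview.wav")     # voice of a known speaker
unknown = mfcc_signature("intercepted_call.wav")    # questioned recording
print(f"Cosine similarity: {cosine_similarity(known, unknown):.3f}")
```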

Related Research Articles

Phonetics: study of the sounds of human language

Phonetics is a branch of linguistics that studies how humans produce and perceive sounds, or in the case of sign languages, the equivalent aspects of sign. Linguists who specialize in studying the physical properties of speech are phoneticians. The field of phonetics is traditionally divided into three sub-disciplines based on the research questions involved such as how humans plan and execute movements to produce speech, how various movements affect the properties of the resulting sound, or how humans convert sound waves to linguistic information. Traditionally, the minimal linguistic unit of phonetics is the phone—a speech sound in a language which differs from the phonological unit of phoneme; the phoneme is an abstract categorization of phones, and it is also defined as the smallest unit that discerns meaning between sounds in any given language.

The term phonation has slightly different meanings depending on the subfield of phonetics. Among some phoneticians, phonation is the process by which the vocal folds produce certain sounds through quasi-periodic vibration. This is the definition used among those who study laryngeal anatomy and physiology and speech production in general. Phoneticians in other subfields, such as linguistic phonetics, call this process voicing, and use the term phonation to refer to any oscillatory state of any part of the larynx that modifies the airstream, of which voicing is just one example. Voiceless and supra-glottal phonations are included under this definition.

Vocal loading is the stress inflicted on the speech organs when speaking for long periods.

Vocoder: voice encryption, transformation, and synthesis device

A vocoder is a category of speech coding that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption or voice transformation.

Linear predictive coding (LPC) is a method used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal of speech in compressed form, using the information of a linear predictive model.

Human voice: sound made by a human being using the vocal tract

The human voice consists of sound made by a human being using the vocal tract, including talking, singing, laughing, crying, screaming, shouting, humming or yelling. The human voice frequency is specifically a part of human sound production in which the vocal folds are the primary sound source.

Time stretching is the process of changing the speed or duration of an audio signal without affecting its pitch. Pitch scaling is the opposite: the process of changing the pitch without affecting the speed. Pitch shift is pitch scaling implemented in an effects unit and intended for live performance. Pitch control is a simpler process which affects pitch and speed simultaneously by slowing down or speeding up a recording.

This is a glossary of medical terms related to communication disorders, which are psychological or medical conditions that could affect the ways in which individuals hear, listen, understand, speak and respond to others.

Falsetto is the vocal register occupying the frequency range just above the modal voice register and overlapping with it by approximately one octave.

Spectrogram: visual representation of the spectrum of frequencies of a signal as it varies with time

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. When applied to an audio signal, spectrograms are sometimes called sonographs, voiceprints, or voicegrams. When the data are represented in a 3D plot they may be called waterfall displays.

In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

Acoustic phonetics is a subfield of phonetics, which deals with acoustic aspects of speech sounds. Acoustic phonetics investigates time domain features such as the mean squared amplitude of a waveform, its duration, its fundamental frequency, or frequency domain features such as the frequency spectrum, or even combined spectrotemporal features and the relationship of these properties to other branches of phonetics, and to abstract linguistic concepts such as phonemes, phrases, or utterances.

Bogart–Bacall syndrome: voice disorder caused by abuse or overuse of the vocal cords

Bogart–Bacall syndrome (BBS) is a voice disorder that is caused by abuse or overuse of the vocal cords.

Puberphonia is a functional voice disorder that is characterized by the habitual use of a high-pitched voice after puberty, which is why many refer to the disorder as producing a 'falsetto' voice. The voice may also be heard as breathy, rough, and lacking in power. The onset of puberphonia usually occurs in adolescence, between the ages of 11 and 15 years, at the same time as changes related to puberty are occurring. This disorder usually occurs in the absence of other communication disorders.

Voice therapy: used to aid voice disorders or alter the quality of the voice

Voice therapy consists of techniques and procedures that target vocal parameters, such as vocal fold closure, pitch, volume, and quality. This therapy is provided by speech-language pathologists and is primarily used to aid in the management of voice disorders, or for altering the overall quality of voice, as in the case of transgender voice therapy. Vocal pedagogy is a related field concerned with altering the voice for the purpose of singing. Voice therapy may also serve to teach preventive measures such as vocal hygiene and other safe speaking or singing practices.

Semantic audio is the extraction of meaning from audio signals. The field of semantic audio is primarily based around the analysis of audio to create some meaningful metadata, which can then be used in a variety of different ways.

Audio forensics

Audio forensics is the field of forensic science relating to the acquisition, analysis, and evaluation of sound recordings that may ultimately be presented as admissible evidence in a court of law or some other official venue.

WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis still was less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.

Oral skills

Oral skills are speech enhancers that are used to produce clear sentences that are intelligible to an audience. Oral skills are used to enhance the clarity of speech for effective communication. Communication is the transmission of messages and the correct interpretation of information between people. The production of speech is driven by the respiration of air from the lungs, which initiates the vibration of the vocal cords. The cartilages in the larynx adjust the shape, position and tension of the vocal cords. Speech enhancers are used to improve the clarity and pronunciation of speech for correct interpretation. The articulation of the voice enhances the resonance of speech and enables people to speak intelligibly. Speaking at a moderate pace and using clear pronunciation improves the phonation of sounds. The term "phonation" refers to the process of producing intelligible sounds for the correct interpretation of speech. Speaking in a moderate tone enables the audience to process the information word for word.

An audio deepfake is a type of artificial intelligence used to create convincing speech sentences that sound like specific people saying things they did not say. This technology was initially developed for various applications to improve human life. For example, it can be used to produce audiobooks, and also to help people who have lost their voices regain them. Commercially, it has opened the door to several opportunities. This technology can also create more personalized digital assistants and natural-sounding text-to-speech as well as speech translation services.

References

  1. Sarangi, Susanta; Sahidullah, Md; Saha, Goutam (September 2020). "Optimization of data-driven filterbank for automatic speaker verification". Digital Signal Processing. 104: 102795. arXiv:2007.10729. doi:10.1016/j.dsp.2020.102795. S2CID 220665533.
  2. Hapner, Edie; Stemple, Joseph (2014). Voice Therapy: Clinical Case Studies. Plural Publishing.
  3. Toran, SiKC; Lal, B. K. (2010). "Objective voice analysis for vocal polyps following microlaryngeal phonosurgery". Kathmandu University Medical Journal. 8 (2): 185–189. doi:10.3126/kumj.v8i2.3555. ISSN 1812-2078. PMID 21209532.
  4. Stemple, Joseph C.; Stanley, Jennifer; Lee, Linda (1995). "Objective measures of voice production in normal subjects following prolonged voice use". Journal of Voice. 9 (2): 127–133. doi:10.1016/s0892-1997(05)80245-0. ISSN 0892-1997. PMID 7620534.
  5. Gerratt, Bruce R.; Hanson, David G.; Berke, Gerald S.; Precoda, Kristin (1991). "Photoglottography: A clinical synopsis". Journal of Voice. 5 (2): 98–105. doi:10.1016/S0892-1997(05)80173-0. Retrieved 2020-12-16.
  6. Chowdhury, Kanishka; Saha, Somnath; Saha, Vedula Padmini; Pal, Sudipta; Chatterjee, Indranil (2013). "Pre and Post Operative Voice Analysis After Medialization Thyroplasty in Cases of Unilateral Vocal Fold Paralysis". Indian Journal of Otolaryngology and Head & Neck Surgery. 65 (4): 354–357. doi:10.1007/s12070-013-0649-3. ISSN 2231-3796. PMC 3851511. PMID 24427598.
  7. Maher, Robert C. (2018). Principles of Forensic Audio Analysis. Modern Acoustics and Signal Processing. Cham: Springer International Publishing. pp. 1–2. doi:10.1007/978-3-319-99453-6. ISBN 978-3-319-99452-9.
  8. Solan, Lawrence M.; Tiersma, Peter M. (2004). Speaking of Crime. University of Chicago Press. doi:10.7208/chicago/9780226767871.001.0001. ISBN 978-0-226-76793-2.
  9. Maher, Robert C. (2018). Principles of Forensic Audio Analysis. Modern Acoustics and Signal Processing. Cham: Springer International Publishing. pp. 48–49. doi:10.1007/978-3-319-99453-6. ISBN 978-3-319-99452-9.