Time-compressed speech

Time-compressed speech refers to an audio recording of verbal text in which the text is presented in a much shorter time interval than it would occupy in normally paced, real-time speech. [1] The basic purpose is to make recorded speech contain more words in a given time, yet still be understandable. For example, a paragraph that might normally be expected to take 20 seconds to read might instead be presented in 15 seconds, which would represent a time-compression of 25% (5 seconds out of 20).
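The arithmetic in the example above can be sketched in a few lines of Python. The function names are illustrative and not taken from any cited source. The sketch also shows that a 25% reduction in duration corresponds to a playback rate about 1.33 times normal, not 1.25 times.

```python
def compression_percent(original_s: float, compressed_s: float) -> float:
    """Share of the original duration that was removed, in percent."""
    return 100.0 * (original_s - compressed_s) / original_s


def speedup_factor(original_s: float, compressed_s: float) -> float:
    """How many times faster the words arrive after compression."""
    return original_s / compressed_s


# The 20-second paragraph presented in 15 seconds.
print(compression_percent(20, 15))       # 25.0
print(round(speedup_factor(20, 15), 3))  # 1.333
```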

The term "time-compressed speech" should not be confused with "speech compression", which controls the volume range of a sound, but does not alter its time envelope.

Methods

While some voice talents are capable of speaking at rates significantly in excess of general norms, [2] [3] the term "time-compressed speech" most usually refers to examples in which the time-reduction has been accomplished through some form of electronic processing of the recorded speech. [4] [5]

In general, recorded speech can be electronically time-compressed in three ways: by increasing its speed (linear compression), by removing silences (selective editing), or by a combination of the two (non-linear compression). [5] Increasing the speed of a recording presents the material at a faster rate (and hence in a shorter amount of time), but it has the undesirable side effect of raising the frequency of the whole passage, and with it the pitch of the voices, which can reduce intelligibility.
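The pitch side effect of linear compression can be illustrated with a minimal sketch. It assumes a synthetic 100 Hz tone in place of a voice, a hypothetical helper `linear_compress` that resamples by linear interpolation, and a rough zero-crossing frequency estimate. None of these come from the cited sources.

```python
import math


def linear_compress(samples, duration_ratio):
    """Resample so the clip lasts duration_ratio times as long (0 < ratio <= 1)."""
    n_out = round(len(samples) * duration_ratio)
    step = len(samples) / n_out
    out = []
    for i in range(n_out):
        pos = i * step
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append((1 - frac) * samples[j] + frac * nxt)
    return out


def estimate_hz(samples, rate):
    """Rough frequency estimate from the number of upward zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * rate / len(samples)


RATE = 8000  # samples per second (assumed)
tone = [math.sin(2 * math.pi * 100 * n / RATE + 0.5) for n in range(RATE)]
fast = linear_compress(tone, 0.75)  # 25% time-compression

print(len(fast) / RATE)                               # new duration in seconds
print(estimate_hz(tone, RATE), estimate_hz(fast, RATE))
```

Played back at the same sample rate, the compressed tone lasts three quarters as long and its estimated frequency rises by the same factor of about 1.33, which is the pitch shift described above.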

There are normally silences between words and sentences, and even small silences within certain words. Both kinds can be reduced or removed ("edited out"), which also reduces the amount of time occupied by the full speech recording. However, this can have the effect of removing verbal "punctuation" from the speech, causing words and sentences to run together unnaturally, again reducing intelligibility.
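A minimal sketch of selective editing follows. It assumes a simple peak-amplitude threshold for detecting silence. The function name, threshold, and frame sizes are hypothetical choices, not values from the cited sources. A short pause is kept from each silent stretch so that some of the verbal "punctuation" survives.

```python
def shorten_silences(samples, rate, threshold=0.02, frame_ms=10, keep_ms=50):
    """Trim each silent stretch down to at most keep_ms milliseconds."""
    frame = max(1, rate * frame_ms // 1000)  # samples per analysis frame
    keep_frames = keep_ms // frame_ms        # silent frames allowed per pause
    out = []
    silent_run = 0
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        if max(abs(x) for x in chunk) < threshold:
            silent_run += 1
            if silent_run > keep_frames:
                continue  # drop silence beyond the kept pause
        else:
            silent_run = 0
        out.extend(chunk)
    return out


# Toy signal at 1000 samples/s: 0.2 s of "speech", 0.3 s of silence, 0.2 s of "speech".
rate = 1000
signal = [0.5] * 200 + [0.0] * 300 + [0.5] * 200
trimmed = shorten_silences(signal, rate)
print(len(signal) / rate, len(trimmed) / rate)
```

Setting `keep_ms` to zero removes pauses entirely, which is the case where words and sentences start to run together.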

Vowels are typically held for a minimum of 20 milliseconds, spanning many cycles of the fundamental pitch. Digital signal processing (DSP) systems can detect the beginning and end of each cycle and then skip over some fraction of those cycles, causing the material to be presented at a faster rate without changing the pitch, maintaining a "normal" tone of voice. [6]
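The cycle-skipping idea can be sketched on a synthetic tone. This is a simplified illustration that assumes cycle boundaries can be found from upward zero crossings, which holds for a clean sine wave. Real speech needs a more robust pitch detector. The helper names are hypothetical.

```python
import math


def cycle_starts(samples):
    """Indices where the waveform crosses zero going upward (start of a cycle)."""
    return [i + 1 for i, (a, b) in enumerate(zip(samples, samples[1:])) if a < 0 <= b]


def drop_cycles(samples, drop_every=4):
    """Remove every drop_every-th complete cycle, leaving the others untouched."""
    starts = cycle_starts(samples)
    if len(starts) < 2:
        return list(samples)
    out = list(samples[:starts[0]])           # lead-in before the first full cycle
    for n, (begin, end) in enumerate(zip(starts, starts[1:]), start=1):
        if n % drop_every != 0:
            out.extend(samples[begin:end])    # keep this cycle
    out.extend(samples[starts[-1]:])          # tail after the last full cycle
    return out


RATE = 8000  # samples per second (assumed)
tone = [math.sin(2 * math.pi * 100 * n / RATE + 0.5) for n in range(RATE)]
shorter = drop_cycles(tone, drop_every=4)

print(len(tone) / RATE, len(shorter) / RATE)  # duration before and after
gaps = cycle_starts(shorter)
print({b - a for a, b in zip(gaps, gaps[1:])})  # cycle lengths in samples
```

Because whole cycles are removed and the remaining ones are left as they were, the clip gets shorter while each cycle keeps its original length, so the pitch is unchanged.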

The currently preferred method of time-compression is called "non-linear compression". It combines three steps: selectively removing silences; speeding up the speech so that the reduced silences sound normally proportioned to the text; and finally applying signal-processing algorithms to bring the speech back down to the proper pitch. [5] This produces a more acceptable result than either of the two earlier techniques; however, if unrestrained, removing the silences and increasing the speed can make a selection of speech sound more insistent, possibly to the point of unpleasantness. [7]
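One way to see why combining the techniques helps is to work out how much speed-up is still needed once some silence has been trimmed. The sketch below is a simple time budget, not an implementation of any cited algorithm. The function name and the example figures (4 seconds of silence in a 20-second clip, half of it kept) are assumptions chosen for illustration.

```python
def required_speedup(total_s, silence_s, silence_keep, target_s):
    """Speed-up factor needed after trimming silences to reach target_s.

    silence_keep is the fraction of silence left in place (0.0 to 1.0).
    """
    speech_s = total_s - silence_s
    after_trim = speech_s + silence_s * silence_keep
    return after_trim / target_s


# Hypothetical 20 s clip with 4 s of pauses, compressed to 15 s.
print(round(required_speedup(20, 4, 1.0, 15), 3))  # no trimming: 1.333
print(round(required_speedup(20, 4, 0.5, 15), 3))  # half the silence kept: 1.2
```

Under these assumed figures, trimming half of the pauses lowers the speed-up needed for the same 25% compression from about 1.33 to 1.2, so less pitch correction is required afterwards.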

Applications

Advertising

Time-compressed speech is frequently used in television and radio advertising. The advantage of time-compressed speech is that the same number of words can be compressed into a smaller amount of time, reducing advertising costs, and/or allowing more information to be included in a given radio or TV advertisement. It is usually most noticeable in the information-dense caveats and disclaimers presented (usually by legal requirement) at the end of commercials—the aural equivalent of the "fine print" in a printed contract. [8] This practice, however, is not new: before electronic methods were developed, spokespeople who could talk extremely quickly and still be understood were widely used as voice talents for radio and TV advertisements, and especially for recording such disclaimers.

Education

Time-compressed speech has educational applications, such as increasing the information density of training materials and serving as a study aid. A number of studies have demonstrated that the average person is capable of relatively easily comprehending speech delivered at higher-than-normal rates, with the peak occurring at around 25% compression (that is, 25% faster than normal); this facility has been demonstrated in several languages. [9] Conversational speech (in English) takes place at a rate of around 150 wpm (words per minute), but the average person is able to comprehend speech presented at rates of up to 200-250 wpm without undue difficulty. [10] [11] Blind and severely visually impaired subjects scored similar comprehension levels at even higher rates, up to 300-350 wpm. [12] Blind people have been found to use time-compressed speech extensively, for example when reviewing recorded lectures from high school and college classes, or professional training sessions. Comprehension rates in older blind subjects have been found to be as good as, or in some cases better than, those found in younger sighted subjects. [13]

Other studies have determined that the ability to comprehend highly time-compressed speech tends to fall off with increased age, [14] and is also reduced when the language of the time-compressed speech is not the listener's native language. [15] Non-native speakers can, however, improve their comprehension level of time-compressed speech with multiday training. [16]
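The words-per-minute figures above can be related to compression percentages with a short calculation. This sketch assumes compression is measured as the share of playing time removed, as in the 20-second example in the introduction, and takes 150 wpm as the conversational baseline. The function names are illustrative.

```python
def rate_after_compression(base_wpm, compression_percent):
    """Speech rate once the given share of playing time has been removed."""
    return base_wpm / (1 - compression_percent / 100)


def compression_for_rate(base_wpm, target_wpm):
    """Share of playing time (percent) that must be removed to reach target_wpm."""
    return 100 * (1 - base_wpm / target_wpm)


BASE = 150  # conversational English, words per minute

print(rate_after_compression(BASE, 25))        # rate at 25% compression
print(BASE * 1.25)                             # rate if "25% faster" is meant literally
print(round(compression_for_rate(BASE, 250)))  # compression needed for 250 wpm
print(round(compression_for_rate(BASE, 350)))  # compression needed for 350 wpm
```

On this reading, removing 25% of the playing time gives 200 wpm, while a literal 25% increase in rate gives 187.5 wpm. The 250 wpm and 350 wpm figures correspond to removing roughly 40% and 57% of the playing time.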

Voice mail

Voice mail systems have employed time-compressed speech since as far back as the 1970s. In this application, the technology enables the rapid review of messages in high-traffic systems, by a relatively small number of people. [17]

Streaming multimedia

Time-compressed speech has been explored as one of a variety of interrelated factors which may be manipulated to increase the efficiency of streaming multimedia presentations, by significantly reducing the latency times involved in the transfer of large digitally encoded media files. [18]

References

  1. N., Pam M.S., "TIME-COMPRESSED SPEECH," in PsychologyDictionary.org, April 29, 2013, https://psychologydictionary.org/time-compressed-speech/ (accessed February 20, 2019).
  2. "A Very Brief History on the Fast-Talking Style". thevoe.com. 4 December 2014.
  3. "Understanding the Auctioneer's Chant". rmfarm.tripod.com.
  4. "Compressed Speech". reference.com.
  5. "Time compression dictionary definition - time compression defined". www.yourdictionary.com.
  6. Timothy D. Green. "Embedded systems programming with the PIC16F877". 2008. p. 159.
  7. "Advertising Tactics That Bug Americans the Most - Consumer Reports". www.consumerreports.org.
  8. "Techniques, Perception, and Applications of Time-Compressed Speech" (PDF). mit.edu.
  9. Pallier, Christophe; Sebastian-Gallés, Nuria; Dupoux, Emmanuel; Christophe, Anne; Mehler, Jacques (1 July 1998). "Perceptual adjustment to time-compressed speech: A cross-linguistic study". Memory & Cognition. 26 (4): 844–851. doi:10.3758/BF03211403. PMID 9701975.
  10. Barabasz, A. F.; A study of recall and retention of accelerated lecture presentation; Journal of Communication; 18(3), 1968: p.283–287.
  11. Benz, C.R.; Effects of Time Compressed Speech Upon the Comprehension of A Visual Oriented Television Lecture (1971); cited in Handbook of Research on Educational Communications and Technology; by David H. Jonassen; Association of Educational Communications and Technology (AECT); Bloomington, IN: 2004.
  12. "Comprehension of Ultra-Fast Speech – Blind vs. "Normally Hearing" Persons (2007)" (PDF). icphs2007.de.
  13. Gordon-Salant, S and Friedman, S. A.; Recognition of Rapid Speech By Blind and Sighted Adults; Journal of Speech, Language, and Hearing Research; 54(2), April 2011: p.622-631
  14. Gordon-Salant, S. and Fitzgibbons, P.J.; Sources of age-related recognition difficulty for time-compressed speech; Journal of Speech, Language, and Hearing Research; 44(4), August 2001: p.709-19
  15. Zhao, Y.; The effects of listeners' control of speech rate on second language comprehension; Applied Linguistics; 18(1), March 1997: p.49-68
  16. Banai, K. and Lavner, Y.; Perceptual Learning of Time-Compressed Speech: More than Rapid Adaptation; PLoS One; National Institute of Health; Bethesda, Maryland: 7(10), October 2012
  17. Arons, B. “Techniques, Perception, and Applications of Time-Compressed Speech.” In Proceedings of 1992 Conference, American Voice I/O Society, Sep. 1992, pp. 169-177.
  18. Omoigui, N., He, L., Gupta, A., Grudin, J., and Sanocki, E.; Time-Compression: Systems Concerns, Usage, and Benefit; Microsoft Research; Redmond, Washington: 1999.

Further reading

Time-compression algorithms

  • M. Covell, M. Withgott, and M. Slaney, “Mach1: Nonuniform time-scale modification of speech,” in Proc. ICASSP, vol. 1. Seattle, USA: IEEE, May 1998, pp. 349–352.
  • M. Demol, W. Verhelst, K. Struyve, and P. Verhoeve, “Efficient non-uniform time-scaling of speech with WSOLA,” in Proceedings of SPECOM, Patras, Greece, Oct. 2005, pp. 163–166.
