Speex

Filename extension: .spx
Internet media type: audio/x-speex, audio/speex, audio/ogg
Developed by: Xiph.Org Foundation, Jean-Marc Valin
Type of format: Lossy audio
Contained by: Ogg
Standard: RFC 5574
Open format?: Yes [1]
Website: www.speex.org

libspeex
Developer(s): Xiph.Org Foundation, Jean-Marc Valin [2]
Initial release: 1.0 / March 2003
Stable release: 1.2.1 [3] / June 16, 2022
Operating system: Cross-platform
Type: Audio codec, reference implementation
License: BSD-style license [4] [5]
Website: Xiph.org downloads

Speex is an audio compression codec specifically tuned for the reproduction of human speech. It is a free software speech codec that may be used in voice over IP applications and podcasts. [6] It is based on the code excited linear prediction speech coding algorithm. [7] Its creators claim Speex to be free of any patent restrictions, and it is licensed under the revised (3-clause) BSD license. It may be used with the Ogg container format or transmitted directly over UDP/RTP. It may also be used with the FLV container format. [8]

The Speex designers see their project as complementary to the Vorbis general-purpose audio compression project.

Speex is a lossy format, i.e. quality is permanently degraded to reduce file size.

The Speex project was created on February 13, 2002. [9] The first development versions of Speex were released under the LGPL, but as of version 1.0 beta 1, Speex is released under Xiph's version of the (revised) BSD license. [10] Speex 1.0 was announced on March 24, 2003, after a year of development. [11] The latest stable version of the Speex encoder and decoder is 1.2.1. [3]

Xiph.Org now considers Speex obsolete; its successor is the more modern Opus codec, which uses the SILK format under license from Microsoft and surpasses its performance in most areas except at the lowest sample rates. [12]

Description

Speex is targeted at voice over IP (VoIP) and file-based compression. The design goal was a codec optimized for high-quality speech at a low bit rate. To achieve this, the codec supports multiple bit rates as well as ultra-wideband (32 kHz sampling rate), wideband (16 kHz sampling rate) and narrowband (telephone quality, 8 kHz sampling rate) operation. Since Speex was designed for VoIP instead of cell phone use, the codec must be robust to lost packets, but not to corrupted ones. All this led to the choice of code excited linear prediction (CELP) as the encoding technique for Speex. [7] One of the main reasons is that CELP has long proven that it can do the job and scale well to both low bit rates (as evidenced by DoD CELP at 4.8 kbit/s) and high bit rates (as with G.728 at 16 kbit/s). The main characteristics can be summarized as follows:

Features

Sampling rate
Speex is mainly designed for three different sampling rates: 8 kHz (the same sampling rate used to transmit telephone calls), 16 kHz, and 32 kHz. These are respectively referred to as narrowband, wideband and ultra-wideband.
Quality
Speex encoding is controlled most of the time by a quality parameter that ranges from 0 to 10. In constant bit-rate (CBR) operation, the quality parameter is an integer, while for variable bit-rate (VBR), the parameter is a real (floating point) number.
Complexity (variable)
With Speex, it is possible to vary the complexity allowed for the encoder. This is done by controlling how the search is performed with an integer ranging from 1 to 10, in a way similar to the -1 to -9 options of gzip and similar compression utilities. For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at complexity 10, but the CPU requirements for complexity 10 are about five times higher than for complexity 1. In practice, the best trade-off is between complexity 2 and 4, [13] though higher settings are often useful when encoding non-speech sounds like DTMF tones, or if encoding is not in real time.
Variable bit-rate (VBR)
Variable bit-rate (VBR) allows a codec to change its bit rate dynamically to adapt to the "difficulty" of the audio being encoded. In the case of Speex, sounds like vowels and high-energy transients require a higher bit rate to achieve good quality, while fricatives (e.g. s and f sounds) can be coded adequately with fewer bits. For this reason, VBR can achieve a lower bit rate for the same quality, or a better quality for a given bit rate. Despite its advantages, VBR has three main drawbacks. First, by only specifying quality, there is no guarantee about the final average bit rate. Second, for some real-time applications like voice over IP (VoIP), what counts is the maximum bit rate, which must be low enough for the communication channel. Third, encryption of VBR-encoded speech may not ensure complete privacy, as phrases can still be identified by analysing the pattern of variation of the bit rate, at least in a controlled setting with a small dictionary of phrases. [14]
Average bit-rate (ABR)
Average bit-rate solves one of the problems of VBR, as it dynamically adjusts VBR quality in order to meet a specific target bit-rate. Because the quality/bit-rate is adjusted in real-time (open-loop), the global quality will be slightly lower than that obtained by encoding in VBR with exactly the right quality setting to meet the target average bitrate.
Voice Activity Detection (VAD)
When enabled, voice activity detection detects whether the audio being encoded is speech or silence/background noise. VAD is always implicitly activated when encoding in VBR, so the option is only useful in non-VBR operation. In this case, Speex detects non-speech periods and encodes them with just enough bits to reproduce the background noise. This is called "comfort noise generation" (CNG). The last version in which VAD worked properly was 1.1.12; since version 1.2 it has been replaced with a simple Any Activity Detection.
Discontinuous transmission (DTX)
Discontinuous transmission is an addition to VAD/VBR operation that allows transmission to stop completely when the background noise is stationary. In a file, 5 bits are used for each missing frame (corresponding to 250 bit/s).
Perceptual enhancement
Perceptual enhancement is a part of the decoder which, when turned on, tries to reduce (the perception of) the noise produced by the coding/decoding process. In most cases, perceptual enhancement makes the sound further from the original objectively (signal-to-noise ratio), but in the end it still sounds better (subjective improvement).
Algorithmic delay
Every codec introduces a delay in the transmission. For Speex, this delay is equal to the frame size, plus some amount of "look-ahead" required to process each frame. In narrowband operation (8 kHz), the delay is 30 ms, while for wideband (16 kHz), the delay is 34 ms. These values do not account for the CPU time it takes to encode or decode the frames.
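
The figures quoted above for discontinuous transmission and algorithmic delay can be cross-checked with a little arithmetic. The following Python sketch is illustrative only: it infers the frame duration from the DTX figures (5 bits per frame at 250 bit/s) rather than taking it from the Speex specification, and all function and variable names are hypothetical.

```python
# Figures quoted in the text above.
DTX_BITS_PER_FRAME = 5          # bits stored for each missing frame
DTX_BIT_RATE = 250              # bit/s
SAMPLE_RATES_HZ = {"narrowband": 8000, "wideband": 16000, "ultra-wideband": 32000}
QUOTED_DELAY_MS = {"narrowband": 30, "wideband": 34}


def frame_duration_ms(bits_per_frame, bit_rate):
    """Infer the frame duration implied by a per-frame bit count and a bit rate."""
    frames_per_second = bit_rate // bits_per_frame
    return 1000 // frames_per_second


def samples_per_frame(rate_hz, frame_ms):
    """Number of audio samples covered by one frame at the given sampling rate."""
    return rate_hz * frame_ms // 1000


def look_ahead_ms(total_delay_ms, frame_ms):
    """Delay = frame size + look-ahead, so look-ahead = delay - frame size."""
    return total_delay_ms - frame_ms


# Assumption: the frame duration is the same in every mode.
frame_ms = frame_duration_ms(DTX_BITS_PER_FRAME, DTX_BIT_RATE)
frame_sizes = {band: samples_per_frame(rate, frame_ms)
               for band, rate in SAMPLE_RATES_HZ.items()}
look_aheads = {band: look_ahead_ms(delay, frame_ms)
               for band, delay in QUOTED_DELAY_MS.items()}

print(f"frame duration: {frame_ms} ms")
for band, size in frame_sizes.items():
    print(f"{band}: {size} samples per frame")
for band, ahead in look_aheads.items():
    print(f"{band}: {ahead} ms look-ahead")
```

Under these assumptions the quoted numbers are mutually consistent: 250 bit/s at 5 bits per frame implies 50 frames per second, i.e. 20 ms frames (160, 320 and 640 samples at the three sampling rates), leaving 10 ms of look-ahead in narrowband and 14 ms in wideband operation.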

Applications

Figure: Comparison of audio codecs for speech.

There is a large base of applications supporting the Speex codec.

Most of these are based on the DirectShow filter or OpenACM codec (e.g. Microsoft NetMeeting) on Microsoft Windows, or Xiph.org's reference implementation, libspeex, on Linux (e.g. Ekiga). There are also plugins for many audio players. See the plugin and software page on the speex.org site for more details. [16]

The media type for Speex is audio/ogg while contained by Ogg, and audio/speex (previously audio/x-speex) when transported through RTP or without container.
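
The rule above can be written as a small lookup. This is only a sketch of the mapping as described in this article: the function name and its `container` argument are hypothetical, and containers the text does not cover (such as FLV) are rejected rather than guessed.

```python
LEGACY_MEDIA_TYPE = "audio/x-speex"  # older name for audio/speex


def speex_media_type(container=None):
    """Return the media type for a Speex stream.

    container: "ogg" for Speex in Ogg, or None for RTP / no container.
    """
    if container is None:
        return "audio/speex"
    if container.lower() == "ogg":
        return "audio/ogg"
    # The text gives no media type for other containers.
    raise ValueError(f"no Speex media type known for container {container!r}")


print(speex_media_type("Ogg"))
print(speex_media_type())
```

A receiver that must interoperate with older senders might also accept the legacy `audio/x-speex` name as an alias for `audio/speex`.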

The United States Army's Land Warrior system, designed by General Dynamics, also uses Speex for VoIP on an EPLRS radio designed by Raytheon.

The Ear Bible [17] is a single-ear headphone with a built-in Speex player with 1 GB of flash memory, [18] preloaded with a recording of the New American Standard Bible.

ASL Safety & Security's [19] Linux-based VIPA OS software, [20] which is used in long-line public address systems and voice alarm systems at major international air transport hubs and rail networks, also uses Speex.

The Rockbox project uses Speex for its voice interface. It can also play Speex files on supported players, such as the Apple iPod or the iRiver H10.

The Vernier LabQuest [21] handheld data acquisition device for science education uses Speex for voice annotations created by students and teachers using either the built-in or an external microphone.

The Google Mobile App for iPhone currently incorporates Speex. [22] It has also been suggested that the new Google voice search iPhone app is using Speex to transmit voice to Google servers for interpretation. [23]

Adobe Flash Player supports Speex starting with Flash Player 10.0.12.36, released in October 2008. [24] Because of some bugs in Flash Player, Speex support is recommended only from version 10.0.22.87 onward. Speex in Flash Player can be used for both kinds of communication, through Flash Media Server or P2P. Unlike Nellymoser audio, which was the only speech format in previous versions of Flash Player, Speex can be decoded or converted to any format. [25] [26] Speex can also be used in the Flash Video container format (.flv), starting with version 10 of the Video File Format Specification (published in November 2008). [27]

The JavaSonics ListenUp [28] voice recorder uses Speex to compress voice messages that are recorded in a browser and then uploaded to a web server. Primary applications are language training, transcription and social networking.

Speex is used as the voice compression algorithm in the Siri voice assistant on the iPhone 4S. [29] Since speech recognition occurs on Apple's servers, the Speex codec is used to minimize network bandwidth.

Sources

This article uses material from the Speex Codec Manual which is copyright © Jean-Marc Valin and licensed under the terms of the GFDL.

References

  1. "PlayOgg! - FSF - Free Software Foundation". 2010-03-17. Retrieved 2013-10-01.
  2. Jean-Marc Valin (2009). "people.xiph.org - personal webspace of the xiphs - Jean-Marc Valin". Xiph.Org. Retrieved 2009-09-11.
  3. "Speex News". Xiph.Org Foundation. Retrieved 2023-04-13.
  4. "The Speex Codec Manual - Speex License". Xiph.Org Foundation. Retrieved 2009-09-01.
  5. "Sample Xiph.Org Variant of the BSD License". Xiph.Org Foundation. Retrieved 2009-08-29.
  6. Xiph.Org Speex: A Free Codec For Free Speech, Retrieved 2009-09-01
  7. Xiph.Org Introduction to CELP Coding, Retrieved 2009-09-01
  8. Adobe FLV format specification, retrieved 2016-04-18
  9. Xiph.org Speex releases - pre-1.0 - NEWS and ChangeLog in speex-0.0.1.tar.gz, Retrieved 2009-09-01
  10. Xiph.Org Speex FAQ – Under what license is Speex released?, Retrieved 2009-09-01
  11. Xiph.Org (2003-03-24) Speex reaches 1.0; Xiph.Org now a 501(c)(3) Non-Profit Organization, Retrieved 2009-09-01
  12. Speex homepage, retrieved 2017-04-11
  13. "Codec description". www.speex.org.
  14. "Spot me if you can: Uncovering Spoken Phrases in Encrypted VoIP Conversations (Charles V. Wright Lucas Ballard Scott E. Coull Fabian Monrose Gerald M. Masson)" (PDF).
  15. As announced by Ralph Giles, the Theora codec maintainer, on LugRadio episode 29
  16. "A free codec for free speech". Speex. Retrieved 2012-12-29.
  17. Lascelles, LLC. "The worlds most convenient Audio Bible". Ear Bible. Retrieved 2012-12-29.
  18. Lascelles, LLC. "Support". Ear Bible. Retrieved 2012-12-29.
  19. "PA/VA, PSIM Software and Station Management Systems > ASL Safety & Security". Asl-control.co.uk. Retrieved 2012-12-29.
  20. IPAM 400: IP Based Intelligent Public Address Amplifier Archived 2011-09-04 at the Wayback Machine - User Manual
  21. "LabQuest 2 > Vernier Software & Technology". Vernier.com. 2012-05-23. Retrieved 2012-12-29.
  22. "Legal Notices". Google Inc. Retrieved 2014-12-05.
  23. Baio, Andy (November 18, 2008). "Deconstructing Google Mobile's Voice Search on the iPhone".
  24. Adobe (2008) Flash Player 10 Datasheet, Retrieved 2009-09-01
  25. AskMeFlash.com (2009-05-10) Speex for Flash, Retrieved on 2009-08-12
  26. AskMeFlash.com (2009-05-10) Speex vs Nellymoser Archived 2009-04-15 at the Wayback Machine, Retrieved on 2009-08-12
  27. Adobe Systems Incorporated (November 2008). "Video File Format Specification, Version 10" (PDF). Adobe Systems Incorporated. Archived from the original (PDF) on 2010-09-23. Retrieved 2014-12-05.
  28. Phil Burk. "JavaSonics ListenUp voice recording Applet for Java that uploads messages to a web server". Javasonics.com. Retrieved 2012-12-29.
  29. "Applidium — News". Applidium.com. Archived from the original on 2011-11-16. Retrieved 2012-12-29.