Satin (codec)

Developed by: Microsoft
Initial release: 2020
Type of format: Lossy audio codec
Open format? No

Satin is a lossy speech codec developed by Microsoft. It was designed to supersede the earlier Silk codec in Microsoft's applications, and uses a neural network and novel signal processing to improve performance over its predecessor. [1]

Features

Satin is designed to deliver good sound quality despite limited bandwidth or high packet loss, such as over unreliable Wi-Fi or cellular networks. [2] Satin can produce output bitrates of 6 to 36 kbps, and operates on super-wideband audio (a 32 kHz sampling rate). Sound is encoded by processing a sparse representation of the input, then decoded with the help of a neural network that infers the high frequencies from the low ones. [1] Because neural networks are computationally complex, optimization and vectorization of the network were required to achieve acceptable performance. [3] To improve resilience to packet loss, each packet is encoded independently and the codec has its own packet loss concealment system. [3]
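As a rough illustration of what these figures mean per packet, the following sketch computes the number of input samples and the encoded payload size for a single frame. The 20 ms frame duration and the function names are assumptions chosen for illustration only; the cited sources do not state Satin's actual frame length or packet layout.

```python
SAMPLE_RATE_HZ = 32_000  # super-wideband sampling rate stated in the article
FRAME_MS = 20            # ASSUMED frame duration; not confirmed by the sources


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      frame_ms: int = FRAME_MS) -> int:
    """Number of input audio samples covered by one frame."""
    return sample_rate_hz * frame_ms // 1000


def payload_bytes(bitrate_kbps: float, frame_ms: int = FRAME_MS) -> float:
    """Encoded payload size of one frame, in bytes.

    kbit/s multiplied by ms gives bits; dividing by 8 gives bytes.
    """
    return bitrate_kbps * frame_ms / 8


if __name__ == "__main__":
    print(samples_per_frame())   # samples in one assumed 20 ms frame
    print(payload_bytes(6))      # payload at the lowest stated bitrate
    print(payload_bytes(36))     # payload at the highest stated bitrate
```

Under the assumed 20 ms frame, the stated 6 to 36 kbps range would correspond to payloads of roughly 15 to 90 bytes per frame, each covering 640 input samples.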

History

Silk was developed by Skype and can compress wideband speech at 14 kbps. Satin is considered to be Silk's successor, and was initially announced and implemented for Microsoft Teams in 2020. As of February 2021, it was used for all two-way calls in both Teams and Skype. [2] According to Microsoft, a future release will add support for music in full-band stereo at bitrates of at least 17 kbps. [2]

Quality

Microsoft claims that Satin's quality is significantly better than Silk's, achieving mean opinion scores up to 1.7 points higher in low-bitrate A/B testing. [1] Microsoft also notes that Satin's bitrate savings allow more redundant data to be sent, increasing resistance to packet loss. [4]
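The trade-off between bitrate and redundancy can be sketched with simple arithmetic. The function below is a hypothetical illustration, not Satin's actual redundancy scheme, which the cited sources do not describe: it assumes redundancy means sending whole extra copies of each frame within a fixed bitrate budget. The 14 kbps and 6 kbps figures are taken from elsewhere in this article purely as example inputs.

```python
def redundant_copies(budget_kbps: float, codec_kbps: float) -> int:
    """Extra full copies of each frame that fit in a bitrate budget.

    ASSUMPTION: redundancy is modelled as whole duplicate frames; real
    codecs may use partial or lower-quality redundant data instead.
    """
    if codec_kbps <= 0:
        raise ValueError("codec_kbps must be positive")
    total_copies = int(budget_kbps // codec_kbps)
    # One copy is the primary frame; the rest are redundant.
    return max(total_copies - 1, 0)


if __name__ == "__main__":
    # A codec that needs the whole budget leaves no room for redundancy.
    print(redundant_copies(14, 14))
    # A codec running at 6 kbps within the same 14 kbps budget.
    print(redundant_copies(14, 6))
```

In this simplified model, a codec that uses the whole budget has no room for redundancy, while one running at 6 kbps inside a 14 kbps budget could carry one duplicate of every frame.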

Support

As of February 2021, Skype and Microsoft Teams implemented Satin for all two-person calls, and an expansion to larger Teams meetings was planned. [2]

Related Research Articles

A codec is a device or computer program that encodes or decodes a data stream or signal. Codec is a portmanteau of coder/decoder.

Speech coding is an application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.

Speex is an audio compression codec specifically tuned for the reproduction of human speech and also a free software speech codec that may be used on voice over IP applications and podcasts. It is based on the code excited linear prediction speech coding algorithm. Its creators claim Speex to be free of any patent restrictions and it is licensed under the revised (3-clause) BSD license. It may be used with the Ogg container format or directly transmitted over UDP/RTP. It may also be used with the FLV container format.

Voice over Internet Protocol (VoIP), also called IP telephony, is a method and group of technologies for delivering voice communication sessions over Internet Protocol (IP) networks, such as the Internet.

G.711 is a narrowband audio codec originally designed for use in telephony that provides toll-quality audio at 64 kbit/s. It is an ITU-T standard (Recommendation) for audio encoding, titled Pulse code modulation (PCM) of voice frequencies, released for use in 1972.

Adaptive Multi-Rate Wideband (AMR-WB) is a patented wideband speech audio coding standard developed based on Adaptive Multi-Rate encoding, using a similar methodology to algebraic code-excited linear prediction (ACELP). AMR-WB provides improved speech quality due to a wider speech bandwidth of 50–7000 Hz compared to narrowband speech coders which in general are optimized for POTS wireline quality of 300–3400 Hz. AMR-WB was developed by Nokia and VoiceAge and it was first specified by 3GPP.

G.722 is an ITU-T standard 7 kHz wideband audio codec operating at 48, 56 and 64 kbit/s. It was approved by ITU-T in November 1988. Technology of the codec is based on sub-band ADPCM (SB-ADPCM). The corresponding narrow-band codec based on the same technology is G.726.

Variable-Rate Multimode Wideband (VMR-WB) is a source-controlled variable-rate multimode codec designed for robust encoding/decoding of wideband/narrowband speech. The operation of VMR-WB is controlled by speech signal characteristics and by the traffic conditions of the network. Depending on the traffic conditions and the desired quality of service (QoS), one of the 4 operational modes is used. All operating modes of the existing VMR-WB standard are fully compliant with cdma2000 rate-set II. VMR-WB modes 0, 1, and 2 are cdma2000 native modes with mode 0 providing the highest quality and mode 2 the lowest ADR. VMR-WB mode 3 is the AMR-WB interoperable mode operating at an ADR slightly higher than mode 0 and providing a quality equal to or better than that of AMR-WB at 12.65 kbit/s when in an interoperable interconnection with AMR-WB at 12.65 kbit/s.

Internet Low Bitrate Codec (iLBC) is a royalty-free narrowband speech audio coding format and an open-source reference implementation (codec), developed by Global IP Solutions (GIPS), formerly Global IP Sound. It was formerly freeware with limitations on commercial use, but since 2011 it has been available under a free software/open source license as a part of the open source WebRTC project. It is suitable for VoIP applications, streaming audio, archival and messaging. The algorithm is a version of block-independent linear predictive coding, with the choice of data frame lengths of 20 and 30 milliseconds. The encoded blocks have to be encapsulated in a suitable protocol for transport, usually the Real-time Transport Protocol (RTP).

Packet loss concealment (PLC) is a technique to mask the effects of packet loss in voice over IP (VoIP) communications. When the voice signal is sent as VoIP packets on an IP network, the packets may travel different routes. A packet therefore might arrive very late, might be corrupted, or simply might not arrive at all. One example of the last situation is a packet being rejected by a server which has a full buffer and cannot accept any more data. Other cases include network congestion resulting in significant delay. In a VoIP connection, error-control techniques such as automatic repeat request (ARQ) are not feasible and the receiver should be able to cope with packet loss. Packet loss concealment is the inclusion in a design of methodologies for accounting for and compensating for the loss of voice packets.

SVOPC is a compression method for audio which is used by VoIP applications. It is a lossy speech compression codec designed specifically for communication channels suffering from packet loss. It uses more bandwidth than the best bandwidth-optimised codecs, but in exchange it is resistant to packet loss.

Wideband audio, also known as wideband voice or HD voice, is high definition voice quality for telephony audio, contrasted with standard digital telephony "toll quality". It extends the frequency range of audio signals transmitted over telephone lines, resulting in higher quality speech. The range of the human voice extends from 100 Hz to 17 kHz but traditional, voiceband or narrowband telephone calls limit audio frequencies to the range of 300 Hz to 3.4 kHz. Wideband audio relaxes the bandwidth limitation and transmits in the audio frequency range of 50 Hz to 7 kHz. In addition, some wideband codecs may use a higher audio bit depth of 16 bits to encode samples, also resulting in much better voice quality.

Constrained Energy Lapped Transform (CELT) is an open, royalty-free lossy audio compression format and a free software codec with especially low algorithmic delay for use in low-latency audio communication. The algorithms are openly documented and may be used free of software patent restrictions. Development of the format was maintained by the Xiph.Org Foundation and later coordinated by the Opus working group of the Internet Engineering Task Force (IETF).

SILK is an audio compression format and audio codec developed by Skype Limited, now a Microsoft subsidiary. It was developed for use in Skype, as a replacement for the SVOPC codec. Since being licensed out, it has also been used by others. It has been extended to the Internet standard Opus codec.

Opus is a lossy audio coding format developed by the Xiph.Org Foundation and standardized by the Internet Engineering Task Force, designed to efficiently code speech and general audio in a single format, while remaining low-latency enough for real-time interactive communication and low-complexity enough for low-end embedded processors. Opus replaces both Vorbis and Speex for new applications, and several blind listening tests have ranked it higher-quality than any other standard audio format at any given bitrate until transparency is reached, including MP3, AAC, and HE-AAC.

Unified Speech and Audio Coding (USAC) is an audio compression format and codec for both music and speech or any mix of speech and audio using very low bit rates between 12 and 64 kbit/s. It was developed by Moving Picture Experts Group (MPEG) and was published as an international standard ISO/IEC 23003-3 and also as an MPEG-4 Audio Object Type in ISO/IEC 14496-3:2009/Amd 3 in 2012.

Enhanced Voice Services (EVS) is a superwideband speech audio coding standard that was developed for VoLTE. It offers up to 20 kHz audio bandwidth and has high robustness to delay jitter and packet losses due to its channel aware coding and improved packet loss concealment. It has been developed in 3GPP and is described in 3GPP TS 26.441. The application areas of EVS consist of improved telephony and teleconferencing, audiovisual conferencing services, and streaming audio. Source code of both decoder and encoder in ANSI C is available as 3GPP TS 26.442 and is being updated regularly. Samsung uses the term HD+ for calls that use EVS.

LC3 is an audio codec specified by the Bluetooth Special Interest Group (SIG) for the LE Audio protocol introduced in Bluetooth 5.2. It was developed by Fraunhofer IIS and Ericsson as the successor of the SBC codec.

Lyra is a lossy audio codec developed by Google that is designed for compressing speech at very low bitrates. Unlike most other audio formats, it compresses data using a machine learning-based algorithm.

References

  1. "Microsoft Satin Audio Codec Uses AI to Outperform Skype Silk". InfoQ. Retrieved 2022-07-22.
  2. "Microsoft Teams: Get ready for clearer sound on your meetings thanks to this audio upgrade". ZDNet. Retrieved 2022-07-22.
  3. "Microsoft details Satin, a new AI powered audio codec that powers Teams". MSPoweruser. 2021-02-15. Retrieved 2022-07-22.
  4. "Satin: Microsoft's latest AI-powered audio codec for real-time communications". Microsoft Tech Community. 2021-02-17. Retrieved 2022-07-22.