Latency refers to a short period of delay (usually measured in milliseconds) between when an audio signal enters a system, and when it emerges. Potential contributors to latency in an audio system include analog-to-digital conversion, buffering, digital signal processing, transmission time, digital-to-analog conversion, and the speed of sound in the transmission medium.
Latency can be a critical performance metric in professional audio including sound reinforcement systems, foldback systems (especially those using in-ear monitors) live radio and television. Excessive audio latency has the potential to degrade call quality in telecommunications applications. Low latency audio in computers is important for interactivity.
In all systems, latency can be said to consist of three elements: codec delay, playout delay and network delay.
Latency in telephone calls is sometimes referred to as mouth-to-ear delay; the telecommunications industry also uses the term quality of experience (QoE). Voice quality is measured according to the ITU E-model; the measurable quality of a call degrades rapidly where mouth-to-ear latency exceeds 200 milliseconds. The mean opinion score (MOS) is comparable in a near-linear fashion with the ITU's quality scale, defined in standards G.107, [1] G.108 [2] and G.109, [3] with a quality factor R ranging from 0 to 100. An MOS of 4 ('Good') corresponds to an R score of 80 or above; an R of 100 requires an MOS exceeding 4.5.
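The near-linear relationship between R and MOS can be sketched with the conversion formula given in ITU-T G.107; the function below is an illustrative implementation of that mapping, not a complete E-model calculation.

```python
def r_to_mos(r: float) -> float:
    """Estimate MOS from an E-model rating factor R (0-100),
    using the R-to-MOS conversion formula from ITU-T G.107."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

# An R of 80 lands right at the MOS 4 ('Good') boundary:
print(round(r_to_mos(80), 2))   # 4.02
print(r_to_mos(100))            # 4.5 is the ceiling of the MOS scale
```

Note how the cubic term makes the mapping nearly linear in the middle of the scale but flatten toward both ends.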
The ITU and 3GPP groups end-user services into classes based on latency sensitivity: [4]
| Classes | Very sensitive to delay | | | Less sensitive to delay |
|---|---|---|---|---|
| Services | Conversational video/voice, realtime video | Voice messaging | Streaming video and voice | Fax |
| | Realtime data | Transactional data | Non realtime data | Background data |
Similarly, the G.114 recommendation regarding mouth-to-ear delay indicates that most users are "very satisfied" as long as latency does not exceed 200 ms, corresponding to an R of 90 or above. Codec choice also plays an important role; the highest-quality (and highest-bandwidth) codecs such as G.711 are usually configured to incur the least encode-decode latency, so on a network with sufficient throughput, latencies under 100 ms can be achieved. G.711 at a bitrate of 64 kbit/s is the encoding method predominantly used on the public switched telephone network.
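The 64 kbit/s figure for G.711 follows directly from the narrowband telephone sampling parameters; the short sketch below works through the arithmetic, with the 20 ms packetization interval being a common VoIP assumption rather than part of G.711 itself.

```python
sample_rate_hz = 8000   # narrowband telephone sampling rate
bits_per_sample = 8     # G.711 companded PCM word size

bitrate = sample_rate_hz * bits_per_sample
print(bitrate)  # 64000 bit/s, i.e. 64 kbit/s

# With a typical 20 ms RTP packetization interval (assumption),
# each packet carries 160 samples = 160 bytes of payload:
frame_ms = 20
samples_per_frame = sample_rate_hz * frame_ms // 1000
payload_bytes = samples_per_frame * bits_per_sample // 8
print(samples_per_frame, payload_bytes)  # 160 160
```

The packetization interval itself is a latency component: a 20 ms frame cannot be sent until 20 ms of audio has been captured.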
The AMR narrowband codec, used in GSM and UMTS networks, introduces latency in the encode and decode processes.
As mobile operators upgrade existing best-effort networks to support concurrent multiple types of service over all-IP networks, mechanisms such as Hierarchical Quality of Service (H-QoS) allow per-user, per-service QoS policies to prioritise time-sensitive traffic such as voice calls and other wireless backhaul traffic. [5] [6] [7]
Another aspect of mobile latency is the inter-network handoff; as a customer on Network A calls a Network B customer, the call must traverse two separate radio access networks, two core networks, and an interlinking Gateway Mobile Switching Centre (GMSC), which performs the physical interconnection between the two providers. [8]
With end-to-end QoS managed and assured rate connections, latency can be reduced to analogue PSTN/POTS levels. On a stable connection with sufficient bandwidth and minimal latency, VoIP systems typically have a minimum of 20 ms inherent latency. Under less ideal network conditions, a 150 ms maximum latency is sought for general consumer use. [9] [10] Many popular videoconferencing systems rely on data buffering and data redundancy to cope with network jitter and packet loss. Measurements have shown mouth-to-ear delays of between 160 and 300 ms over a 500-mile distance under average US network conditions.[ citation needed ] Latency is a larger consideration when an echo is present and systems must perform echo suppression and cancellation. [11]
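Mouth-to-ear latency in VoIP is the sum of several independent components, which is why budgets like the 150 ms target are usually reasoned about additively. The sketch below is purely illustrative; the component values are assumptions chosen for the example, not measurements.

```python
# Illustrative one-way (mouth-to-ear) latency budget for a VoIP call.
# All values are assumed for illustration only.
budget_ms = {
    "codec (encode + decode)": 5,
    "packetization (20 ms frames)": 20,
    "jitter buffer": 40,
    "network propagation + queuing": 60,
}

total = sum(budget_ms.values())
print(total)  # 125 ms, inside the 150 ms consumer target
```

A budget view makes trade-offs explicit: for example, enlarging the jitter buffer absorbs more network jitter, but every added millisecond comes straight out of the remaining headroom.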
Latency can be a particular problem in audio platforms on computers. Supported interface optimizations can reduce the delay to times that are too short for the human ear to detect. Latency can also be reduced by shrinking buffer sizes. [12] A popular optimization solution is Steinberg's ASIO, which bypasses the operating system's audio layer and connects audio signals directly to the sound card's hardware. Many professional and semi-professional audio applications use the ASIO driver, allowing users to work with audio in real time. [13] Pro Tools HD offers a low-latency system similar to ASIO, and Pro Tools 10 and 11 are also compatible with ASIO interface drivers.
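The relationship between buffer size and latency is simple: one buffer's worth of samples must accumulate before it can be processed. The helper below sketches that arithmetic; real drivers typically queue two or more buffers, so actual round-trip latency is a multiple of this figure.

```python
def buffer_latency_ms(buffer_frames: int, sample_rate_hz: int) -> float:
    """Delay contributed by one audio buffer, in milliseconds.
    Real systems often queue 2+ buffers, multiplying this figure."""
    return 1000.0 * buffer_frames / sample_rate_hz

# Halving the buffer halves this component of the latency:
print(round(buffer_latency_ms(512, 48000), 1))  # 10.7 ms
print(round(buffer_latency_ms(64, 48000), 1))   # 1.3 ms
```

The trade-off is CPU scheduling pressure: smaller buffers mean the audio callback must run more often and miss its deadline less, which is exactly why low-latency drivers and realtime scheduling go together.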
The Linux realtime kernel [14] is a modified kernel that alters the standard timer frequency the Linux kernel uses and gives all processes or threads the ability to run at realtime priority. This means that a time-critical process such as an audio stream can take priority over a less critical process such as network activity. Priorities are also configurable per user (for example, the processes of user "tux" could take priority over the processes of user "nobody" or over the processes of several system daemons).
Many modern digital television receivers, set-top boxes and AV receivers use sophisticated audio processing, which can create a delay between the time when the audio signal is received and the time when it is heard on the speakers. Since TVs also introduce delays in processing the video signal this can result in the two signals being sufficiently synchronized to be unnoticeable by the viewer. However, if the difference between the audio and video delay is significant, the effect can be disconcerting. Some systems have a lip sync setting that allows the audio lag to be adjusted to synchronize with the video, and others may have advanced settings where some of the audio processing steps can be turned off.
Audio lag is also a significant detriment in rhythm games, where precise timing is required to succeed. Most of these games have a lag calibration setting whereby the game adjusts its timing windows by a certain number of milliseconds to compensate. In these cases, the notes of a song will be sent to the speakers before the game even receives the required input from the player in order to maintain the illusion of rhythm. Games that rely upon musical improvisation, such as Rock Band drums or DJ Hero, can still suffer tremendously, as the game cannot predict what the player will hit in these cases, and excessive lag will still create a noticeable delay between hitting notes and hearing them play.
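The calibration mechanism described above amounts to shifting the timing comparison by the measured lag. The function below is an illustrative sketch of that idea, not any particular game's scoring logic; the names and values are assumptions.

```python
def hit_error_ms(input_time_ms: float, note_time_ms: float,
                 calibration_ms: float) -> float:
    """Score a hit against a note after subtracting the calibrated
    audio lag, so a hit that *sounds* on time also scores on time."""
    return (input_time_ms - calibration_ms) - note_time_ms

# The player hears the note 45 ms late and hits when they hear it.
# Uncalibrated, the hit would score 45 ms late; calibrated, it is exact:
print(hit_error_ms(1045.0, 1000.0, 0.0))   # 45.0
print(hit_error_ms(1045.0, 1000.0, 45.0))  # 0.0
```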
Audio latency can be experienced in broadcast systems where someone is contributing to a live broadcast over a satellite or similar link with high delay. The person in the main studio has to wait for the contributor at the other end of the link to react to questions. Latency in this context could be between several hundred milliseconds and a few seconds. Dealing with audio latencies as high as this takes special training in order to make the resulting combined audio output reasonably acceptable to the listeners. Wherever practical, it is important to try to keep live production audio latency low in order to keep the reactions and interchange of participants as natural as possible. A latency of 10 milliseconds or better is the target for audio circuits within professional production structures. [15]
Latency in live performance occurs naturally from the speed of sound. It takes sound about 3 milliseconds to travel 1 meter. Small amounts of latency occur between performers depending on how far they are spaced from each other and from stage monitors, if these are used. This creates a practical limit to how far apart the artists in a group can be from one another. Stage monitoring extends that limit, as electrical signals travel at close to the speed of light through the cables that feed stage monitors.
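The "about 3 ms per meter" rule of thumb follows from the speed of sound in air; the sketch below works it out for a couple of illustrative distances.

```python
SPEED_OF_SOUND_M_S = 343.0  # in air at roughly 20 °C

def acoustic_delay_ms(distance_m: float) -> float:
    """Time for sound to travel the given distance through air."""
    return 1000.0 * distance_m / SPEED_OF_SOUND_M_S

print(round(acoustic_delay_ms(1), 1))   # 2.9 ms per metre
print(round(acoustic_delay_ms(10), 1))  # 29.2 ms between players 10 m apart
```

At 10 m the acoustic delay alone approaches the thresholds at which musicians begin to notice latency, which is why spacing on large stages matters.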
Performers, particularly in large spaces, will also hear reverberation, or echo of their music, as the sound that projects from stage bounces off of walls and structures, and returns with latency and distortion. A primary purpose of stage monitoring is to provide artists with more primary sound so that they are not confused by the latency of these reverberations.
While analog audio equipment has no appreciable latency, digital audio equipment has latency associated with two general processes: conversion from one format to another, and digital signal processing (DSP) tasks such as equalization, compression and routing.
Digital conversion processes include analog-to-digital converters (ADC), digital-to-analog converters (DAC), and various changes from one digital format to another, such as from AES3 (which carries low-voltage electrical signals) to ADAT (an optical transport). Any such process takes a small amount of time to accomplish; typical latencies are in the range of 0.2 to 1.5 milliseconds, depending on sampling rate, software design and hardware architecture. [16]
Different audio signal processing operations such as finite impulse response (FIR) and infinite impulse response (IIR) filters take different mathematical approaches to the same end and can have different latencies. In addition, input and output sample buffering add delay. Typical latencies range from 0.5 to 10 milliseconds, with some designs having as much as 30 milliseconds of delay. [17]
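One concrete way FIR and IIR designs differ in latency: a linear-phase FIR filter delays every frequency by a fixed group delay of (N − 1)/2 samples for N taps, whereas a minimum-phase IIR design can respond almost immediately. The sketch below computes that FIR figure; the tap count and sample rate are illustrative assumptions.

```python
def fir_latency_ms(num_taps: int, sample_rate_hz: int) -> float:
    """Group delay of a linear-phase FIR filter: (N - 1) / 2 samples,
    expressed in milliseconds at the given sample rate."""
    return 1000.0 * (num_taps - 1) / 2 / sample_rate_hz

# A 513-tap linear-phase filter at 48 kHz delays the signal by 256 samples:
print(round(fir_latency_ms(513, 48000), 1))  # 5.3 ms
```

Longer filters give sharper frequency responses but proportionally more delay, which is why latency-critical paths often prefer shorter FIR filters or IIR designs.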
Latency in digital audio equipment is most noticeable when a singer's voice is transmitted through their microphone, through digital audio mixing, processing and routing paths, then sent to their own ears via in-ear monitors or headphones. In this case, the singer's vocal sound is conducted to their own ear through the bones of the head, then through the digital pathway to their ears some milliseconds later. In one study, listeners found latency greater than 15 ms to be noticeable. Latency for other musical activities such as playing guitar does not have the same critical concern. Ten milliseconds of latency isn't as noticeable to a listener who is not hearing his or her own voice. [18]
In sound reinforcement for music or speech presentation in large venues, it is optimal to deliver sufficient sound volume to the back of the venue without resorting to excessive sound volumes near the front. One way for audio engineers to achieve this is to use additional loudspeakers placed at a distance from the stage but closer to the rear of the audience. Sound travels through air at the speed of sound (around 343 metres (1,125 ft) per second depending on air temperature and humidity). By measuring or estimating the difference in latency between the loudspeakers near the stage and the loudspeakers nearer the audience, the audio engineer can introduce an appropriate delay in the audio signal going to the latter loudspeakers, so that the wavefronts from near and far loudspeakers arrive at the same time. Because of the Haas effect an additional 15 milliseconds can be added to the delay time of the loudspeakers nearer the audience, so that the stage's wavefront reaches them first, to focus the audience's attention on the stage rather than the local loudspeaker. The slightly later sound from delayed loudspeakers simply increases the perceived sound level without negatively affecting localization.
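The delay applied to the rear loudspeakers is just the acoustic travel time over the extra distance, plus the Haas offset described above. The function below sketches that calculation; the 34.3 m distance is an illustrative assumption.

```python
SPEED_OF_SOUND_M_S = 343.0  # in air, varying with temperature and humidity

def delay_speaker_ms(distance_m: float, haas_ms: float = 15.0) -> float:
    """Delay for a fill loudspeaker placed distance_m downstream of the
    main stacks, plus a Haas offset so the stage wavefront arrives first."""
    return 1000.0 * distance_m / SPEED_OF_SOUND_M_S + haas_ms

# A fill speaker 34.3 m from the stage: 100 ms of travel + 15 ms Haas offset:
print(round(delay_speaker_ms(34.3), 1))  # 115.0 ms
```

With this setting, listeners near the fill speaker hear the stage sound a beat earlier, localize toward the stage, and perceive the delayed speaker only as added level.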
A codec is a device or computer program that encodes or decodes a data stream or signal. Codec is a portmanteau of coder/decoder.
Latency, from a general point of view, is a time delay between the cause and the effect of some physical change in the system being observed. Lag, as it is known in gaming circles, refers to the latency between the input to a simulation and the visual or auditory response, often occurring because of network delay in online games.
Quality of service (QoS) is the description or measurement of the overall performance of a service, such as a telephony or computer network, or a cloud computing service, particularly the performance seen by the users of the network. To quantitatively measure quality of service, several related aspects of the network service are often considered, such as packet loss, bit rate, throughput, transmission delay, availability, jitter, etc.
Speech coding is an application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.
Digital audio is a representation of sound recorded in, or converted into, digital form. In digital audio, the sound wave of the audio signal is typically encoded as numerical samples in a continuous sequence. For example, in CD audio, samples are taken 44,100 times per second, each with 16-bit sample depth. Digital audio is also the name for the entire technology of sound recording and reproduction using audio signals that have been encoded in digital form. Following significant advances in digital audio technology during the 1970s and 1980s, it gradually replaced analog audio technology in many areas of audio engineering, record production and telecommunications in the 1990s and 2000s.
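The CD audio parameters above determine its raw data rate directly; the short sketch below works through the arithmetic.

```python
sample_rate = 44_100  # samples per second per channel
bit_depth = 16        # bits per sample
channels = 2          # stereo

bitrate = sample_rate * bit_depth * channels
print(bitrate)  # 1411200 bit/s of raw stereo CD audio
```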
Voice over Internet Protocol (VoIP), also called IP telephony, is a method and group of technologies for voice calls for the delivery of voice communication sessions over Internet Protocol (IP) networks, such as the Internet.
A mixing console or mixing desk is an electronic device for mixing audio signals, used in sound recording and reproduction and sound reinforcement systems. Inputs to the console include microphones, signals from electric or electronic instruments, or recorded sounds. Mixers may control analog or digital signals. The modified signals are summed to produce the combined output signals, which can then be broadcast, amplified through a sound reinforcement system or recorded.
In analog telephony, a telephone hybrid is the component at the ends of a subscriber line of the public switched telephone network (PSTN) that converts between two-wire and four-wire forms of bidirectional audio paths. When used in broadcast facilities to enable the airing of telephone callers, the broadcast-quality telephone hybrid is known as a broadcast telephone hybrid or telephone balance unit.
G.729 is a royalty-free narrow-band vocoder-based audio data compression algorithm using a frame length of 10 milliseconds. It is officially described as Coding of speech at 8 kbit/s using code-excited linear prediction speech coding (CS-ACELP), and was introduced in 1996. The wide-band extension of G.729 is called G.729.1, which equals G.729 Annex J.
A VoIP phone or IP phone uses voice over IP technologies for placing and transmitting telephone calls over an IP network, such as the Internet. This is in contrast to a standard phone which uses the traditional public switched telephone network (PSTN).
Delay is an audio signal processing technique that records an input signal to a storage medium and then plays it back after a period of time. When the delayed playback is mixed with the live audio, it creates an echo-like effect, whereby the original audio is heard followed by the delayed audio. The delayed signal may be played back multiple times, or fed back into the recording, to create the sound of a repeating, decaying echo.
Perceptual Evaluation of Audio Quality (PEAQ) is a standardized algorithm for objectively measuring perceived audio quality, developed in 1994–1998 by a joint venture of experts within Task Group 6Q of the International Telecommunication Union's Radiocommunication Sector (ITU-R). It was originally released as ITU-R Recommendation BS.1387 in 1998 and last updated in 2023. It utilizes software to simulate perceptual properties of the human ear and then integrates multiple model output variables into a single metric.
Audio-to-video synchronization refers to the relative timing of audio (sound) and video (image) parts during creation, post-production (mixing), transmission, reception and play-back processing. AV synchronization can be an issue in television, videoconferencing, or film.
CobraNet is a combination of software, hardware, and network protocols designed to deliver uncompressed, multi-channel, low-latency digital audio over a standard Ethernet network. Developed in the 1990s, CobraNet is widely regarded as the first commercially successful audio-over-Ethernet implementation.
A software effect processor is a computer program that alters the sound from a digital source through audio signal processing in real time. It is a digital analog of hardware effects processors. It is an integral part of audio editing software, such as Adobe Audition.
G.718 is an ITU-T Recommendation embedded scalable speech and audio codec providing high quality narrowband speech over the lower bit rates and high quality wideband speech over the complete range of bit rates. In addition, G.718 is designed to be highly robust to frame erasures, thereby enhancing the speech quality when used in Internet Protocol (IP) transport applications on fixed, wireless and mobile networks. Despite its embedded nature, the codec also performs well with both narrowband and wideband generic audio signals. The codec has an embedded scalable structure, enabling maximum flexibility in the transport of voice packets through IP networks of today and in future media-aware networks. In addition, the embedded structure of G.718 will easily allow the codec to be extended to provide a superwideband and stereo capability through additional layers which are currently under development in ITU-T Study Group 16. The bitstream may be truncated at the decoder side or by any component of the communication system to instantaneously adjust the bit rate to the desired value without the need for out-of-band signalling. The encoder produces an embedded bitstream structured in five layers corresponding to the five available bit rates: 8, 12, 16, 24 & 32 kbit/s.
Adaptive differential pulse-code modulation (ADPCM) is a variant of differential pulse-code modulation (DPCM) that varies the size of the quantization step, to allow further reduction of the required data bandwidth for a given signal-to-noise ratio.
aptX is a family of proprietary audio codec compression algorithms owned by Qualcomm, with a heavy emphasis on wireless audio applications.
Echo suppression and echo cancellation are methods used in telephony to improve voice quality by preventing echo from being created or removing it after it is already present. In addition to improving subjective audio quality, echo suppression increases the capacity achieved through silence suppression by preventing echo from traveling across a telecommunications network. Echo suppressors were developed in the 1950s in response to the first use of satellites for telecommunications.
Jamulus is open source (GPL) networked music performance software that enables live rehearsing, jamming and performing with musicians located anywhere on the internet. Jamulus is written by Volker Fischer and contributors using C++. The Software is based on the Qt framework and uses the OPUS audio codec. It was known as "llcon" until 2013.