Audio-to-video synchronization

Audio-to-video synchronization (AV synchronization, also known as lip sync, or by the lack of it: lip-sync error, lip flap) refers to the relative timing of audio (sound) and video (image) parts during creation, post-production (mixing), transmission, reception and playback processing. AV synchronization can be an issue in television, videoconferencing, and film.

In industry terminology, the lip-sync error is expressed as the amount of time the audio departs from perfect synchronization with the video where a positive time number indicates the audio leads the video and a negative number indicates the audio lags the video. [1] This terminology and standardization of the numeric lip-sync error is utilized in the professional broadcast industry as evidenced by the various professional papers, [2] standards such as ITU-R BT.1359-1, and other references below.

Digital or analog audio video streams or video files usually contain some sort of synchronization mechanism, either in the form of interleaved video and audio data or by explicit relative timestamping of data.

Sources of error

There are several ways in which audio and video can become incorrectly synchronized.

During creation, internal AV-sync errors arise from differing signal-processing delays between the image and sound paths in the video camera and microphone; this delay is normally fixed. External AV-sync errors can occur when a microphone is placed far from the sound source: because the speed of sound is much lower than the speed of light, the audio arrives out of sync. If the sound source is 340 meters from the microphone, the sound arrives approximately 1 second later than the light, and the delay increases with distance. During mixing of video clips, normally either the audio or the video must be delayed so that the two are synchronized; this delay is static but can vary with the individual clip. Video editing effects can also delay the video, causing it to lag the audio.
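The acoustic delay above is a simple function of distance. A minimal sketch (the speed of sound is taken as roughly 343 m/s in air; the exact value varies with temperature):

```python
# Illustration of the external AV-sync error described above: sound from a
# distant source lags the (effectively instantaneous) light.
SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at 20 degrees C

def acoustic_delay_ms(distance_m: float) -> float:
    """Milliseconds by which the sound lags the image for a source
    distance_m meters from the microphone."""
    return distance_m / SPEED_OF_SOUND_M_S * 1000.0

print(round(acoustic_delay_ms(343.0)))  # a source ~343 m away lags by about 1000 ms
print(round(acoustic_delay_ms(10.0)))   # ~29 ms, already near broadcast tolerances
```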

Transmission (broadcasting), reception and playback can also introduce AV-sync errors. A video camera with built-in microphones or a line-in may not delay the sound and video paths by the same amount. Solid-state image sensors (e.g. charge-coupled device (CCD) and CMOS sensors) can delay the video signal by one or more frames. Television systems contain audio and video signal processing circuitry with significant (and potentially non-constant) delays. Widely used video signal processing circuitry that contributes significant video delay includes frame synchronizers, digital video effects processors, video noise reduction, format converters and compression systems.

Format-conversion and deinterlacing circuitry in video monitors can add one or more frames of video delay. A video monitor with built-in speakers or a line-out may not delay the sound and video paths equally. Some video monitors contain internal user-adjustable audio delays to aid in correcting such errors.

Some transmission protocols, such as RTP, require an out-of-band method for synchronizing media streams. In some RTP systems, each media stream has its own timestamp using an independent clock rate and a per-stream randomized starting value. An RTCP Sender Report (SR) may be needed for each stream in order to synchronize the streams. [3]
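The mechanism can be sketched as follows: an SR pairs a stream's RTP timestamp with an NTP wall-clock time, which lets a receiver map each stream's otherwise-unrelated timestamps onto a common timeline. This is a simplified illustration, not a full RTP implementation (it ignores 32-bit timestamp wraparound, and the values are made up):

```python
# Map an RTP timestamp to wall-clock seconds using the most recent RTCP
# Sender Report for that stream, which pairs an RTP timestamp with NTP time.
def rtp_to_wallclock(rtp_ts: int, sr_rtp_ts: int, sr_ntp_seconds: float,
                     clock_rate: int) -> float:
    elapsed = (rtp_ts - sr_rtp_ts) / clock_rate
    return sr_ntp_seconds + elapsed

# Audio at 48 kHz and video at 90 kHz use independent timestamp origins,
# but both map to the same wall clock once an SR is available for each.
audio_wall = rtp_to_wallclock(rtp_ts=480480, sr_rtp_ts=480000,
                              sr_ntp_seconds=1000.0, clock_rate=48000)
video_wall = rtp_to_wallclock(rtp_ts=90900, sr_rtp_ts=90000,
                              sr_ntp_seconds=1000.0, clock_rate=90000)
skew_ms = (audio_wall - video_wall) * 1000.0
print(skew_ms)  # 0.0: both samples map to the same presentation instant
```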

Effect of no explicit AV-sync timing

When a digital or analog AV stream lacks a synchronization method or mechanism, the stream may fall out of sync. In film projection, these timing errors are most commonly caused by worn film with torn sprocket holes skipping over the projector sprockets. Errors can also be caused by the projectionist misthreading the film in the projector.

Synchronization errors have become a significant problem in the digital television industry because of the use of large amounts of video signal processing in television production, television broadcasting and pixelated television displays such as LCD, DLP and plasma displays. Pixelated displays use complex video signal processing to convert the resolution of the incoming video signal to the native resolution of the display, for example converting standard-definition video for display on a high-definition screen. Synchronization problems commonly arise when significant amounts of video processing are performed on the video part of the television program. Typical sources of significant video delay in the television field include video synchronizers and video compression encoders and decoders. Particularly troublesome are the encoders and decoders used in MPEG compression systems for broadcasting digital television and for storing television programs on consumer and professional recording and playback devices.

In broadcast television, it is not unusual for lip-sync error to vary by over 100 ms (several video frames) from time to time. AV-sync is commonly corrected and maintained with an audio synchronizer. Television industry standards organizations have established acceptable amounts of audio and video timing error and suggested practices related to maintaining acceptable timing. [4] [1] The EBU Recommendation R37 "The relative timing of the sound and vision components of a television signal" states that end-to-end audio/video sync should be within +40 ms and -60 ms (audio before/after video, respectively) and that each stage should be within +5 ms and -15 ms. [5]

Viewer experience of incorrectly synchronized AV-sync

The result typically leaves a filmed or televised character moving his or her mouth when there is no spoken dialog to accompany it, hence the terms lip flap and lip-sync error. The resulting error can annoy the viewer, reduce the viewer's enjoyment or the program's effectiveness, or lead the viewer to perceive the speaker negatively. [6] The potential loss of effectiveness is of particular concern for product commercials and political candidates. Television industry standards organizations, such as the Advanced Television Systems Committee, have become involved in setting standards for audio-video sync errors. [4]

Because of these annoyances, AV-sync error is a concern to the television programming industry, including television stations, networks, advertisers and program production companies. Unfortunately, the advent of high-definition flat-panel display technologies (LCD, DLP and plasma), which can delay video more than audio, has moved the problem into the viewer's home and beyond the control of the television programming industry alone. Consumer product companies now offer audio-delay adjustments to compensate for video-delay changes in TVs and A/V receivers, and several companies manufacture dedicated digital audio delays made exclusively for lip-sync error correction.

Recommendations

For television applications, the Advanced Television Systems Committee recommends that audio should lead video by no more than 15 ms and audio should lag video by no more than 45 ms. [4] However, the ITU performed strictly controlled tests with expert viewers and found that the threshold for detectability is 45 ms lead to 125 ms lag. [1] For film, acceptable lip sync is considered to be no more than 22 milliseconds in either direction. [5] [7]
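Using the industry sign convention described earlier (positive means audio leads the video), the ATSC recommendation can be expressed as a simple bounds check. A minimal sketch:

```python
# ATSC IS-191 recommended limits, using the convention that a positive
# lip-sync error means audio leads the video and negative means it lags.
ATSC_MAX_LEAD_MS = 15    # audio may lead video by at most 15 ms
ATSC_MAX_LAG_MS = -45    # audio may lag video by at most 45 ms

def within_atsc_limits(sync_error_ms: float) -> bool:
    """True if a measured lip-sync error is inside the ATSC recommendation."""
    return ATSC_MAX_LAG_MS <= sync_error_ms <= ATSC_MAX_LEAD_MS

print(within_atsc_limits(10))   # True: 10 ms of audio lead is acceptable
print(within_atsc_limits(-60))  # False: 60 ms of audio lag exceeds the limit
```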

The Consumer Electronics Association has published a set of recommendations for how digital television receivers should implement A/V sync. [8]

SMPTE ST2064

SMPTE standard ST2064, published in 2015, [9] provides technology to reduce or eliminate lip-sync errors in digital television. The standard utilizes audio and video fingerprints taken from a television program. The fingerprints can be recovered and used to correct the accumulated lip-sync error. When fingerprints have been generated for a TV program, and the required technology is incorporated, the viewer's display device has the ability to continuously measure and correct lip-sync errors. [10] [11]

Timestamps

Presentation time stamps (PTS) are embedded in MPEG transport streams to precisely signal when each audio and video segment is to be presented, to avoid AV-sync errors. However, these timestamps are often added after the video undergoes frame synchronization, format conversion and preprocessing, and thus the lip sync errors created by these operations will not be corrected by the addition and use of timestamps. [12] [13] [14] [15]
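PTS values in MPEG systems count ticks of a 90 kHz clock, so comparing the audio and video PTS for the same presentation instant directly yields the timing error those timestamps encode. A minimal sketch, with illustrative values (33-bit PTS wraparound is ignored):

```python
# MPEG PTS values are ticks of a 90 kHz clock; compare audio and video PTS
# intended for the same presentation instant.
PTS_CLOCK_HZ = 90_000

def pts_to_seconds(pts: int) -> float:
    """Convert a 90 kHz PTS tick count to seconds (ignoring wraparound)."""
    return pts / PTS_CLOCK_HZ

video_pts = 900_000  # video frame due at t = 10.0 s
audio_pts = 903_600  # matching audio due at t = 10.04 s
error_ms = (audio_pts - video_pts) / PTS_CLOCK_HZ * 1000.0
print(error_ms)  # 40.0: audio is timed 40 ms after the video frame
```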

The Real-time Transport Protocol clocks media using origination timestamps on an arbitrary timeline. A real-time clock, such as one delivered by the Network Time Protocol and described in the Session Description Protocol [16] associated with the media, may be used to synchronize media. A server may then be used for final synchronization to remove any residual offset. [17]


References

  1. "ITU-R BT.1359-1, Relative Timing of Sound and Vision for Broadcasting" (PDF). ITU. 1998. Retrieved 30 May 2015.
  2. Patrick Waddell; Graham Jones; Adam Goldberg. "Audio/Video Standards and Solutions: A Status Report" (PDF). ATSC. Archived from the original (PDF) on 17 February 2016. Retrieved 4 April 2012.
  3. RFC 3550
  4. IS-191: Relative Timing of Sound and Vision for Broadcast Operations, ATSC, 2003-06-26, archived from the original on 2012-03-21.
  5. "The relative timing of the sound and vision components of a television signal" (PDF). EBU Recommendation R37.
  6. Byron Reeves; David Voelker (October 1993). "Effects of Audio-Video Asynchrony on Viewer's Memory, Evaluation of Content and Detection Ability" (PDF). Archived from the original (PDF) on 2 October 2008. Retrieved 19 October 2008.
  7. Sara Kudrle; et al. (July 2011). "Fingerprinting for Solving A/V Synchronization Issues within Broadcast Environments". Motion Imaging Journal. SMPTE. "Appropriate A/V sync limits have been established and the range that is considered acceptable for film is +/- 22 ms. The range for video, according to the ATSC, is up to 15 ms lead time and about 45 ms lag time."
  8. Consumer Electronics Association. "CEA-CEB20 R-2013: A/V Synchronization Processing Recommended Practice". Archived from the original on 2015-05-30.
  9. ST 2064:2015 - SMPTE Standard - Audio to Video Synchronization Measurement, SMPTE, 2015.
  10. SMPTE Standards Update: The Lip-Sync Challenge, SMPTE, 10 December 2013, archived from the original on 2021-12-15.
  11. SMPTE Standards Update: The Lip-Sync Challenge (PDF), SMPTE, 10 December 2013, archived from the original (PDF) on 2016-08-26, retrieved 2016-06-09.
  12. "MPEG-2 Systems FAQ: 19. Where are the PTSs and DTSs inserted?". Archived from the original on 2008-07-26. Retrieved 2007-12-27.
  13. Arpi (7 May 2003). "MPlayer-G2-dev: mpeg container's timing (PTS values)".
  14. "birds-eye.net: DTS - Decode Time Stamp".
  15. "SVCD2DVD: Author and burn DVDs". www.svcd2dvd.com.
  16. RFC 7273
  17. RFC 7272