Inter frame

Last updated

An inter frame is a frame in a video compression stream which is expressed in terms of one or more neighboring frames. The "inter" part of the term refers to the use of Inter frame prediction. This kind of prediction tries to take advantage from temporal redundancy between neighboring frames enabling higher compression rates.

Contents

Inter frame prediction

An inter coded frame is divided into blocks known as macroblocks. After that, instead of directly encoding the raw pixel values for each block, the encoder will try to find a block similar to the one it is encoding on a previously encoded frame, referred to as a reference frame. This process is done by a block matching algorithm. If the encoder succeeds on its search, the block could be encoded by a vector, known as motion vector, which points to the position of the matching block at the reference frame. The process of motion vector determination is called motion estimation.

In most cases the encoder will succeed, but the block found is likely not an exact match to the block it is encoding. This is why the encoder will compute the differences between them. Those residual values are known as the prediction error and need to be transformed and sent to the decoder.

To sum up, if the encoder succeeds in finding a matching block on a reference frame, it will obtain a motion vector pointing to the matched block and a prediction error. Using both elements, the decoder will be able to recover the raw pixels of the block. The following image shows the whole process graphically:

Inter-frame prediction process. In this case, there has been an illumination change between the block at the reference frame and the block which is being encoded: this difference will be the prediction error to this block. Interframe prediction.png
Inter-frame prediction process. In this case, there has been an illumination change between the block at the reference frame and the block which is being encoded: this difference will be the prediction error to this block.

This kind of prediction has some pros and cons:

Because of these drawbacks, a reliable and time periodic reference frame must be used for this technique to be efficient and useful. That reference frame is known as Intra-frame, which is strictly intra coded, so it can always be decoded without additional information.

In most designs, there are two types of inter frames: P-frames and B-frames. These two kinds of frames and the I-frames (Intra-coded pictures) usually join in a GOP (Group Of Pictures). The I-frame doesn't need additional information to be decoded and it can be used as a reliable reference. This structure also allows to achieve an I-frame periodicity, which is needed for decoder synchronization.

Frame types

The difference between P-frames and B-frames is the reference frame they are allowed to use.

P-frame

P-frame is the term used to define the forward Predicted pictures. The prediction is made from an earlier picture, mainly an I-frame or P-frame, so that require less coding data (≈50% when compared to I-frame size).

The amount of data needed for doing this prediction consist of motion vectors and transform coefficients describing prediction correction. It involves the use of motion compensation.

B-frame

B-frame is the term for bidirectionally predicted pictures. This kind of prediction method occupies less coding data than P-frames generally (≈25% when compared to I-frame size) because the prediction is made from either an earlier frame or a later frame or both them. (B-frames can also be less efficient than P-frames under certain cases, [1] e.g.: lossless encoding)

Similar to P-frames, B-frames are expressed as motion vectors and transform coefficients. In order to avoid a growing propagation error, B-frames are not used as a reference to make further predictions in most encoding standards. However, in newer encoding methods (such as H.264/MPEG-4 AVC and HEVC), B-frames may be used as reference for better exploitation of temporal redundancy. [2] [3]

Typical Group Of Pictures (GOP) structure

Illustration of dependencies of the group of pictures scheme IBBPBB... Time goes from left to right. IBBPBB inter frame group of pictures.svg
Illustration of dependencies of the group of pictures scheme IBBPBB... Time goes from left to right.

The typical Group of pictures (GOP) structure is IBBPBBP... The I-frame is used to predict the first P-frame and these two frames are also used to predict the first and the second B-frames. The second P-frame is predicted also using the first I-frame. Both P-frames join to predict the third and fourth B-frames. The scheme is shown in the next picture:

This structure suggests a problem because the fourth frame (a P-frame) is needed in order to predict the second and the third (B-frames). So we need to transmit the P-frame before the B-frames and it will delay the transmission (it will be necessary to keep the P-frame). This structure has strong points:

But it has weak points:

H.264 inter frame prediction improvements

The most important improvements of the H.264 technique in regard to standards before it (especially MPEG-2) are:

More flexible block partition

Luminance block partition of 16×16 (MPEG-2), 16×8, 8×16, and 8×8. The last case allows the division of the block into new blocks of 4×8, 8×4, or 4×4.

H.264 block division.svg

The frame to be coded is divided into blocks of equal size as shown in the picture above. Each block prediction will be blocks of the same size as the reference pictures, offset by a small displacement.

Resolution of up to ¼ pixel motion compensation

Pixels at half-pixel position are obtained by applying a filter of length 6.

H=[1 -5 20 20 -5 1], i.e. half-pixel "b"=A - 5B + 20C + 20D - 5E + F

Pixels at quarter-pixel position are obtained by bilinear interpolation.

While MPEG-2 allowed a ½ pixel resolution, Inter frame allows up to ¼ pixel resolution. That means that it is possible to search a block in the frame to be coded in other reference frames, or we can interpolate nonexistent pixels to find blocks that are even better suited to the current block. If motion vector is an integer number of units of samples, that means it is possible to find in reference pictures the compensated block in motion. If motion vector is not an integer, the prediction will be obtained from interpolated pixels by an interpolator filter to horizontal and vertical directions.

Subpel interpolation.jpg

Multiple references

Multiple references to motion estimation allows finding the best reference in 2 possible buffers (List 0 to past pictures, List 1 to future pictures) which contain up to 16 frames in total. [4] [5] Block prediction is done by a weighted sum of blocks from the reference picture. It allows enhanced picture quality in scenes where there are changes of plane, zoom, or when new objects are revealed.

Multiple references.jpg

Enhanced Direct/Skip Macroblock

Skip and Direct Mode are very frequently used, especially with B-frames. They significantly reduce the number of bits to be coded. These modes are referred to when a block is coded without sending residual error or motion vectors. The encoder will only record that it is a Skip Macroblock. The decoder will deduce the motion vector of Direct/Skip Mode coded block from other blocks already decoded.

There are two ways to deduce the motion: Direct skip.jpg

Temporal
It uses the block motion vector from List 1 frame, located at the same position to deduce the motion vector. List 1 block uses a List 0 block as reference.
Spatial
It predicts the movement from neighbour macroblocks in same frame. A possible criterion could be to copy the motion vector from a neighboring block. These modes are used in uniform zones of the picture where there is not much movement.

Block partition.jpg

In the figure above, pink blocks are Direct/Skip Mode coded blocks. As we can see, they are used very frequently, mainly in B-frames.

Additional information

Although the use of the term "frame" is common in informal usage, in many cases (such as in international standards for video coding by MPEG and VCEG) a more general concept is applied by using the word "picture" rather than "frame", where a picture can either be a complete frame or a single interlaced field.

Video codecs such as MPEG-2, H.264 or Ogg Theora reduce the amount of data in a stream by following key frames with one or more inter frames. These frames can typically be encoded using a lower bit rate than is needed for key frames because much of the image is ordinarily similar, so only the changing parts need to be coded.

See also

Related Research Articles

MPEG-1 is a standard for lossy compression of video and audio. It is designed to compress VHS-quality raw digital video and CD audio down to about 1.5 Mbit/s without excessive quality loss, making video CDs, digital cable/satellite TV and digital audio broadcasting (DAB) practical.

<span class="mw-page-title-main">Motion compensation</span> Video compression technique, used to efficiently predict and generate video frames

Motion compensation in computing is an algorithmic technique used to predict a frame in a video given the previous and/or future frames by accounting for motion of the camera and/or objects in the video. It is employed in the encoding of video data for video compression, for example in the generation of MPEG-2 files. Motion compensation describes a picture in terms of the transformation of a reference picture to the current picture. The reference picture may be previous in time or even from the future. When images can be accurately synthesized from previously transmitted/stored images, the compression efficiency can be improved.

<span class="mw-page-title-main">Compression artifact</span> Distortion of media caused by lossy data compression

A compression artifact is a noticeable distortion of media caused by the application of lossy compression. Lossy data compression involves discarding some of the media's data so that it becomes small enough to be stored within the desired disk space or transmitted (streamed) within the available bandwidth. If the compressor cannot store enough data in the compressed version, the result is a loss of quality, or introduction of artifacts. The compression algorithm may not be intelligent enough to discriminate between distortions of little subjective importance and those objectionable to the user.

<span class="mw-page-title-main">Advanced Video Coding</span> Most widely used standard for video compression

Advanced Video Coding (AVC), also referred to as H.264 or MPEG-4 Part 10, is a video compression standard based on block-oriented, motion-compensated coding. It is by far the most commonly used format for the recording, compression, and distribution of video content, used by 91% of video industry developers as of September 2019. It supports a maximum resolution of 8K UHD.

Audio Video Coding Standard (AVS) refers to the digital audio and digital video series compression standard formulated by the Audio and Video coding standard workgroup of China. Work began in 2002, and three generations of standards were published.

In the field of video compression a video frame is compressed using different algorithms with different advantages and disadvantages, centered mainly around amount of data compression. These different algorithms for video frames are called picture types or frame types. The three major picture types used in the different video algorithms are I, P and B. They are different in the following characteristics:

H.262 or MPEG-2 Part 2 is a video coding format standardised and jointly maintained by ITU-T Study Group 16 Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG), and developed with the involvement of many companies. It is the second part of the ISO/IEC MPEG-2 standard. The ITU-T Recommendation H.262 and ISO/IEC 13818-2 documents are identical.

x264 is a free and open-source software library and a command-line utility developed by VideoLAN for encoding video streams into the H.264/MPEG-4 AVC video coding format. It is released under the terms of the GNU General Public License.

<span class="mw-page-title-main">Motion estimation</span> Process used in video coding/compression

In computer vision and image processing, motion estimation is the process of determining motion vectors that describe the transformation from one 2D image to another; usually from adjacent frames in a video sequence. It is an ill-posed problem as the motion happens in three dimensions (3D) but the images are a projection of the 3D scene onto a 2D plane. The motion vectors may relate to the whole image or specific parts, such as rectangular blocks, arbitrary shaped patches or even per pixel. The motion vectors may be represented by a translational model or many other models that can approximate the motion of a real video camera, such as rotation and translation in all three dimensions and zoom.

Global motion compensation(GMC) is a motion compensation technique used in video compression to reduce the bitrate required to encode video. It is most commonly used in MPEG-4 ASP, such as with the DivX and Xvid codecs.

<span class="mw-page-title-main">Block-matching algorithm</span> System used in computer graphics applications

A Block Matching Algorithm is a way of locating matching macroblocks in a sequence of digital video frames for the purposes of motion estimation. The underlying supposition behind motion estimation is that the patterns corresponding to objects and background in a frame of video sequence move within the frame to form corresponding objects on the subsequent frame. This can be used to discover temporal redundancy in the video sequence, increasing the effectiveness of inter-frame video compression by defining the contents of a macroblock by reference to the contents of a known macroblock which is minimally different.

The macroblock is a processing unit in image and video compression formats based on linear block transforms, typically the discrete cosine transform (DCT). A macroblock typically consists of 16×16 samples, and is further subdivided into transform blocks, and may be further subdivided into prediction blocks. Formats which are based on macroblocks include JPEG, where they are called MCU blocks, H.261, MPEG-1 Part 2, H.262/MPEG-2 Part 2, H.263, MPEG-4 Part 2, and H.264/MPEG-4 AVC. In H.265/HEVC, the macroblock as a basic processing unit has been replaced by the coding tree unit.

In video coding, a group of pictures, or GOP structure, specifies the order in which intra- and inter-frames are arranged. The GOP is a collection of successive pictures within a coded video stream. Each coded video stream consists of successive GOPs, from which the visible frames are generated. Encountering a new GOP in a compressed video stream means that the decoder doesn't need any previous frames in order to decode the next ones, and allows fast seeking through the video.

<span class="mw-page-title-main">Intra-frame coding</span>

Intra-frame coding is a data compression technique used within a video frame, enabling smaller file sizes and lower bitrates, with little or no loss in quality. Since neighboring pixels within an image are often very similar, rather than storing each pixel independently, the frame image is divided into blocks and the typically minor difference between each pixel can be encoded using fewer bits.

A deblocking filter is a video filter applied to decoded compressed video to improve visual quality and prediction performance by smoothing the sharp edges which can form between macroblocks when block coding techniques are used. The filter aims to improve the appearance of decoded pictures. It is a part of the specification for both the SMPTE VC-1 codec and the ITU H.264 codec.

Flexible Macroblock Ordering or FMO is one of several error resilience tools defined in the Baseline profile of the H.264/MPEG-4 AVC video compression standard.

Reference frames are frames of a compressed video that are used to define future frames. As such, they are only used in inter-frame compression techniques. In older video encoding standards, such as MPEG-2, only one reference frame – the previous frame – was used for P-frames. Two reference frames were used for B-frames.

A video coding format is a content representation format for storage or transmission of digital video content. It typically uses a standardized video compression algorithm, most commonly based on discrete cosine transform (DCT) coding and motion compensation. A specific software, firmware, or hardware implementation capable of compression or decompression to/from a specific video coding format is called a video codec.

<span class="mw-page-title-main">VP9</span> Open and royalty-free video coding format released by Google in 2013

VP9 is an open and royalty-free video coding format developed by Google.

Coding tree unit (CTU) is the basic processing unit of the High Efficiency Video Coding (HEVC) video standard and conceptually corresponds in structure to macroblock units that were used in several previous video standards. CTU is also referred to as largest coding unit (LCU).

References

  1. "Doom9's Forum - View Single Post - x264 Lossless question".
  2. "Hierarchical B-Frames or B-Pyramid - Video Compression".
  3. "X264 Settings - MeWiki". mewiki.project357.com. Archived from the original on 18 November 2014. Retrieved 12 January 2022.
  4. "A rookie question regarding on B frame in AVC - Doom9's Forum".
  5. "X264 Stats Output, the "ref B L1" part". Archived from the original on 2014-11-22.