Flexible Macroblock Ordering

Flexible Macroblock Ordering (FMO) is one of several error-resilience tools defined in the Baseline profile of the H.264/MPEG-4 AVC video compression standard.

Description

One of the characteristics of the H.264/AVC standard is the possibility of dividing an image into regions called slices, each of which contains a sequence of macroblocks and can be decoded independently of the others. Macroblocks are processed in raster-scan order: left to right, beginning at the top of the frame. A frame can consist of a single slice, or of multiple slices used for parallel processing and for error resilience, since errors in one slice propagate only within that slice.

Flexible Macroblock Ordering enhances this by allowing macroblocks to be grouped and sent in any order, and can be used to create shaped and non-contiguous slice groups. [1] In this way, FMO lets the encoder decide more flexibly which slice group each macroblock belongs to, in order to spread out errors [2] and keep errors in one part of the frame from compromising another part. FMO builds on another error-resilience tool, arbitrary slice ordering: each slice group can be sent in any order and can optionally be decoded in order of receipt, instead of in the usual scan order.

Individual slices still have to be contiguous horizontal runs of macroblocks, but within FMO's slice groups, motion compensation can take place across any contiguous macroblocks of the entire group; effectively, each slice group is treated as one or more contiguous, shaped slices for the purposes of motion compensation.

Nearly all video codecs support region-of-interest (RoI) coding, in which specific macroblocks are targeted to receive more or less quality; the canonical example is a newscaster's head being given a larger share of bits than the background. FMO's primary benefit when combined with RoI coding is the ability to keep errors in one region from propagating into another. For example, if a background slice is lost, the background may be corrupted for some time, but the newscaster's face will not be affected, and it becomes simpler to send regular refreshes of the most important slice to make up for any errors there.

Slice groups used with FMO are not static, and can change as circumstances change, such as tracking a moving object. A structure called the MBAmap (macroblock allocation map) assigns each macroblock to a slice group and can be updated at any time, with a few default patterns defined, such as Slice Interleaving (groups alternate every macroblock row) or Scattered Slices (groups alternate every macroblock). [3] With these patterns, FMO allows one to retain a better-localized visual context so that error-concealment algorithms can reconstruct missing content. [3]
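Conceptually, the MBAmap is just a lookup table with one entry per macroblock. The following C sketch illustrates how such a table could be filled for the two patterns named above; the helper names are hypothetical, and only the dispersed assignment formula follows the one given in the H.264 specification (where it is map type 1).

    #include <stdio.h>

    /* Interleaved: slice groups alternate every macroblock row. */
    static void fill_interleaved(int *map, int mb_w, int mb_h, int groups)
    {
        for (int row = 0; row < mb_h; row++)
            for (int col = 0; col < mb_w; col++)
                map[row * mb_w + col] = row % groups;
    }

    /* Dispersed ("scattered"): the spec's type-1 formula; with two
     * slice groups it degenerates to a checkerboard. */
    static void fill_dispersed(int *map, int mb_w, int mb_h, int groups)
    {
        for (int i = 0; i < mb_w * mb_h; i++)
            map[i] = ((i % mb_w) + (((i / mb_w) * groups) / 2)) % groups;
    }

    int main(void)
    {
        enum { W = 8, H = 4, GROUPS = 2 };
        int map[W * H];

        fill_dispersed(map, W, H, GROUPS);
        for (int row = 0; row < H; row++) {
            for (int col = 0; col < W; col++)
                printf("%d", map[row * W + col]);
            printf("\n");
        }
        return 0;
    }

With two groups, the program prints a checkerboard of 0s and 1s, which illustrates why the scattered pattern conceals losses well: every macroblock of a lost group is surrounded by macroblocks from the surviving group.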

Certain advanced encoding techniques can simulate some of FMO's benefits. In H.264/AVC, P (predicted) and B (bi-predicted) frames may contain I (intra) blocks, which are coded without reference to other frames. Rather than creating a slice purely so it can be refreshed periodically with I or IDR frames, an encoder can send I-blocks in any desired pattern while predicted blocks make up the rest of the picture. Although errors will still propagate horizontally, I-blocks can be sent in patterns, such as favoring a region of interest or a scattered checkerboard, that simulate shaped slice refreshes. With bidirectional communication to the client, lost slices can be refreshed as soon as the loss is detected, but this is not feasible for wider broadcast.
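As a concrete illustration of such a pattern, the following hypothetical C snippet (a sketch, not taken from any particular encoder) implements a walking-column intra refresh, in which one column of macroblocks per frame is forced to intra coding so that the whole picture is refreshed once every mb_width frames:

    /* Force macroblock column (frame_num % mb_width) to intra coding,
     * so a full picture refresh completes every mb_width frames; a
     * checkerboard or RoI-weighted test could be substituted here. */
    static int force_intra(int frame_num, int mb_col, int mb_width)
    {
        return mb_col == (frame_num % mb_width);
    }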

Tradeoffs

FMO is only allowed in the Baseline and Extended profiles. The much more common Constrained Baseline, Main, and High profiles do not support it, and software that can create or decode it is rare. Some videoconferencing units use it; otherwise, the JM reference software is the primary implementation. [4]

Using multiple slices per picture always lowers coding efficiency, and FMO can lower it further. The more spread out the slice groups are, the worse the loss becomes, with checkerboard patterns (see Scattered Slices above) being the worst; the goals of spreading out errors and of coding efficiency are directly in conflict. FMO allows inter prediction between immediately neighboring slices in the same group, effectively making a contiguous region act almost like a single slice; in some situations, where slice groups are shaped into a region of interest, this can actually slightly improve efficiency over plain standard slices, but the benefit is rare and small. For these reasons, FMO should only be used where packet losses are common and expected.

Aside from increased complexity in encoding and decoding, and the lower efficiency, in-loop deblocking also creates a problem: slices can be sent in any order, but the deblocking filter requires all of a slice's neighbors to be present before it can filter across their boundaries. Either the deblocker has to run in multiple passes as each additional slice is received, or an entire picture needs to be buffered before deblocking begins, possibly creating additional latency if slices are delayed long enough that the next picture's slices start arriving first. [3]

Implementation details

When using FMO, the image can be divided among slice groups according to several macroblock scan patterns built into the specification, signaled as values 0–5 of the syntax element slice_group_map_type, plus one option, signaled as 6, to transmit an entire explicitly assigned MBAmap. The map type and a new MBAmap can be sent at any time. [5]

[Image: the FMO slice group map types]

(In the above image, "Type 0" shows standard H.264 slices, not interleaved slice groups.)
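In the bitstream, these parameters are carried in the picture parameter set. A simplified C view of the FMO-related syntax elements might look like the following; the field names follow the H.264 specification, but the struct itself is only an illustrative sketch.

    /* FMO fields of the H.264 picture parameter set (simplified). */
    struct pps_fmo {
        unsigned num_slice_groups_minus1;  /* 0 means FMO is not used   */
        unsigned slice_group_map_type;     /* 0..5 built-in, 6 explicit */
        /* map type 6 only: one entry per macroblock, i.e. the
         * explicitly transmitted MBAmap */
        unsigned *slice_group_id;
    };

Map types 0, 2, and 3–5 additionally carry their own parameters (run lengths, rectangle corners, and a change rate, respectively), omitted here for brevity.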

Related Research Articles

H.263 is a video compression standard originally designed as a low-bit-rate compressed format for videotelephony. It was standardized by the ITU-T Video Coding Experts Group (VCEG) in a project ending in 1995/1996. It is a member of the H.26x family of video coding standards in the domain of the ITU-T.

<span class="mw-page-title-main">Compression artifact</span> Distortion of media caused by lossy data compression

A compression artifact is a noticeable distortion of media caused by the application of lossy compression. Lossy data compression involves discarding some of the media's data so that it becomes small enough to be stored within the desired disk space or transmitted (streamed) within the available bandwidth. If the compressor cannot store enough data in the compressed version, the result is a loss of quality, or introduction of artifacts. The compression algorithm may not be intelligent enough to discriminate between distortions of little subjective importance and those objectionable to the user.

<span class="mw-page-title-main">Advanced Video Coding</span> Most widely used standard for video compression

Advanced Video Coding (AVC), also referred to as H.264 or MPEG-4 Part 10, is a video compression standard based on block-oriented, motion-compensated coding. It is by far the most commonly used format for the recording, compression, and distribution of video content, used by 91% of video industry developers as of September 2019. It supports a maximum resolution of 8K UHD.

H.261 is an ITU-T video compression standard, first ratified in November 1988. It is the first member of the H.26x family of video coding standards in the domain of the ITU-T Study Group 16 Video Coding Experts Group. It was the first video coding standard that was useful in practical terms.

In the field of video compression, a video frame is compressed using different algorithms with different advantages and disadvantages, centered mainly on the amount of data compression. These different algorithms for video frames are called picture types or frame types. The three major picture types used in the different video algorithms are I, P, and B, which differ in how much they can be compressed and in whether they reference other frames.

An inter frame is a frame in a video compression stream that is expressed in terms of one or more neighboring frames. The "inter" part of the term refers to the use of inter-frame prediction, which tries to take advantage of temporal redundancy between neighboring frames, enabling higher compression rates.

H.262 or MPEG-2 Part 2 is a video coding format standardised and jointly maintained by ITU-T Study Group 16 Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG), and developed with the involvement of many companies. It is the second part of the ISO/IEC MPEG-2 standard. The ITU-T Recommendation H.262 and ISO/IEC 13818-2 documents are identical.

x264 is a free and open-source software library and a command-line utility developed by VideoLAN for encoding video streams into the H.264/MPEG-4 AVC video coding format. It is released under the terms of the GNU General Public License.

<span class="mw-page-title-main">Motion estimation</span> Process used in video coding/compression

Motion estimation is the process of determining motion vectors that describe the transformation from one 2D image to another; usually from adjacent frames in a video sequence. It is an ill-posed problem as the motion is in three dimensions but the images are a projection of the 3D scene onto a 2D plane. The motion vectors may relate to the whole image or specific parts, such as rectangular blocks, arbitrary shaped patches or even per pixel. The motion vectors may be represented by a translational model or many other models that can approximate the motion of a real video camera, such as rotation and translation in all three dimensions and zoom.

The macroblock is a processing unit in image and video compression formats based on linear block transforms, typically the discrete cosine transform (DCT). A macroblock typically consists of 16×16 samples, and is further subdivided into transform blocks, and may be further subdivided into prediction blocks. Formats which are based on macroblocks include JPEG, where they are called MCU blocks, H.261, MPEG-1 Part 2, H.262/MPEG-2 Part 2, H.263, MPEG-4 Part 2, and H.264/MPEG-4 AVC. In H.265/HEVC, the macroblock as a basic processing unit has been replaced by the coding tree unit.

In video coding, a group of pictures, or GOP structure, specifies the order in which intra- and inter-frames are arranged. The GOP is a collection of successive pictures within a coded video stream. Each coded video stream consists of successive GOPs, from which the visible frames are generated. Encountering a new GOP in a compressed video stream means that the decoder doesn't need any previous frames in order to decode the next ones, and allows fast seeking through the video.

A deblocking filter is a video filter applied to decoded compressed video to improve visual quality and prediction performance by smoothing the sharp edges which can form between macroblocks when block coding techniques are used. The filter aims to improve the appearance of decoded pictures. It is a part of the specification for both the SMPTE VC-1 codec and the ITU H.264 codec.

Reference frames are frames of a compressed video that are used to define future frames. As such, they are only used in inter-frame compression techniques. In older video encoding standards, such as MPEG-2, only one reference frame – the previous frame – was used for P-frames. Two reference frames were used for B-frames.

Video Acceleration API (VA-API) is an open source application programming interface that allows applications such as VLC media player or GStreamer to use hardware video acceleration capabilities, usually provided by the graphics processing unit (GPU). It is implemented by the free and open-source library libva, combined with a hardware-specific driver, usually provided together with the GPU driver.

Video Decode and Presentation API for Unix (VDPAU) is a royalty-free application programming interface (API) as well as its implementation as free and open-source library distributed under the MIT License. VDPAU is also supported by Nvidia.

The Network Abstraction Layer (NAL) is a part of the H.264/AVC and HEVC video coding standards. The main goal of the NAL is the provision of a "network-friendly" video representation addressing "conversational" and "non conversational" applications. NAL has achieved a significant improvement in application flexibility relative to prior video coding standards.

High Efficiency Video Coding (HEVC), also known as H.265 and MPEG-H Part 2, is a video compression standard designed as part of the MPEG-H project as a successor to the widely used Advanced Video Coding. In comparison to AVC, HEVC offers from 25% to 50% better data compression at the same level of video quality, or substantially improved video quality at the same bit rate. It supports resolutions up to 8192×4320, including 8K UHD, and unlike the primarily 8-bit AVC, HEVC's higher fidelity Main 10 profile has been incorporated into nearly all supporting hardware.

Michael J. Horowitz is an American electrical engineer who actively participated in the creation of the H.264/MPEG-4 AVC and H.265/HEVC video coding standards. He is co-inventor of flexible macroblock ordering (FMO) and tiles, essential features in H.264/MPEG-4 AVC and H.265/HEVC, respectively. He is Managing Partner of Applied Video Compression and has served on the Technical Advisory Boards of Vivox, Inc., Vidyo, Inc., and RipCode, Inc.

Arbitrary slice ordering (ASO), in digital video, is a loss-resilience technique that restructures the ordering of the representation of the fundamental regions (macroblocks) in pictures. It avoids the need to wait for a full set of slices to arrive before decoding can begin, and is typically considered an error/loss robustness feature.

Coding tree unit (CTU) is the basic processing unit of the High Efficiency Video Coding (HEVC) video standard and conceptually corresponds in structure to macroblock units that were used in several previous video standards. CTU is also referred to as largest coding unit (LCU).

References

  1. Wenger, Stephan; Horowitz, Michael. "FMO: Flexible Macroblock Ordering".
  2. "Error Resiliency and Concealment in H.264 MPEG-4 Part 10".
  3. Wenger, Stephan; Horowitz, Michael. "FMO 101".
  4. "H.264 Reference Software".
  5. Wiegand, Thomas; Sullivan, Gary. "Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 / ISO/IEC 14496-10 AVC)" (PDF).