How H.264 works – Part I.
As promised, after a couple of introductory posts, let’s take a look at how H.264 works.
H.264, also known as MPEG-4 Part 10 or AVC (Advanced Video Coding), was written by the ITU-T Video Coding Experts Group (VCEG) together with the ISO/IEC Moving Picture Experts Group (MPEG) as the product of a partnership effort known as the Joint Video Team (JVT). The ITU-T H.264 standard and the ISO/IEC MPEG-4 Part 10 standard (formally, ISO/IEC 14496-10) are technically identical. The final drafting work on the first version of the standard was completed in May 2003.
H.264 contains a number of new features that allow it to compress video much more effectively than older H.26x standards. It is the current state of the art in video coding.
New transform design
Unlike older codecs, H.264 uses an exact-match integer 4×4 spatial block transform instead of the well-known 8×8 DCT. It is conceptually similar to the DCT, but the smaller block size produces fewer ringing artifacts, and because it uses only integer arithmetic, encoder and decoder obtain exactly the same result. Transforming a block from the spatial to the frequency domain allows the encoder to apply psycho-visual models to reduce the details that are less important from a perceptual point of view. There is also an 8×8 integer transform, available in the High profiles, for less detailed areas.
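To make the idea concrete, here is a minimal Python sketch of the 4×4 forward core transform. The matrix CF is the well-known integer approximation of the DCT used by H.264; the helper names (matmul, transpose, forward_transform_4x4) are my own, and the scaling that the standard folds into the quantization stage is deliberately left out.

```python
# Core 4x4 forward transform matrix (integer approximation of the DCT).
# The normalisation factors are NOT applied here: in H.264 they are
# merged into the quantization step.
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def matmul(a, b):
    """Multiply two small matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]

def forward_transform_4x4(block):
    """Compute CF * block * CF^T using integer arithmetic only."""
    return matmul(matmul(CF, block), transpose(CF))

# A perfectly flat block has energy only in the "DC" coefficient.
flat = [[10] * 4 for _ in range(4)]
coeffs = forward_transform_4x4(flat)
print(coeffs[0][0])  # 160
```

Since every operation is an integer addition or multiplication, there is no rounding, which is what allows encoder and decoder to stay perfectly in sync.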
A secondary Hadamard transform (2×2 on the chroma DC coefficients, and 4×4 on the luma DC coefficients of 16×16 intra-predicted macroblocks) is performed on the “DC” coefficients to obtain even more compression in smooth regions.
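As an illustration, the 2×2 Hadamard transform on four chroma DC coefficients can be sketched in a few lines of Python. The function name hadamard_2x2 is hypothetical, and the scaling and quantization that follow in a real encoder are omitted.

```python
def hadamard_2x2(dc):
    """Apply H * dc * H with H = [[1, 1], [1, -1]] to a 2x2 block of DC values."""
    a, b = dc[0]
    c, d = dc[1]
    return [
        [a + b + c + d, a - b + c - d],
        [a + b - c - d, a - b - c + d],
    ]

# Four equal DC values (a smooth region) collapse into a single coefficient.
print(hadamard_2x2([[5, 5], [5, 5]]))  # [[20, 0], [0, 0]]
```

When neighbouring blocks have similar DC values, as they do in smooth regions, most of the output is zero and costs almost nothing to encode.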
There is also an optimized quantization stage, and two possible scan patterns (a zig-zag scan for frames and an alternate scan for fields) are used to order the transformed coefficients for run-length coding.
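The following Python sketch shows the 4×4 zig-zag scan followed by a simple (run, level) coding of the result. The scan table is the frame zig-zag order expressed as raster indices; the run-level pairing is a simplified illustration of mine, not the actual entropy coding defined by the standard, and the function names are hypothetical.

```python
# Zig-zag scan order for a 4x4 block, as raster indices (row * 4 + column).
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]

def zigzag_scan(block):
    """Reorder a 4x4 block into a 16-element list, low frequencies first."""
    flat = [value for row in block for value in row]
    return [flat[i] for i in ZIGZAG_4X4]

def run_level_pairs(scanned):
    """Return (zeros_before, value) for each non-zero coefficient.

    Trailing zeros are not emitted: they are implied by the end of the block.
    """
    pairs, run = [], 0
    for value in scanned:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs

quantized = [
    [7, 3, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(run_level_pairs(zigzag_scan(quantized)))
# [(0, 7), (0, 3), (0, 2), (1, 1)]
```

The scan groups the non-zero low-frequency coefficients at the start of the list, so the long tail of zeros can be dropped entirely.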
H.264 introduces complex spatial prediction for intra-frame compression.
Rather than the “DC”-only prediction found in MPEG-2 and the transform coefficient prediction found in H.263+, H.264 defines 9 prediction modes (8 directional modes plus DC) to predict spatial information from neighbouring blocks when the 4×4 transform is used. The encoder tries to predict each block by extrapolating the pixel values of adjacent, already encoded blocks, so only the difference (residual) signal has to be encoded.
There are also 4 prediction modes for smooth areas (16×16 blocks). The residual data are coded with 4×4 transforms, and a further 4×4 Hadamard transform is applied to the DC coefficients.
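Here is a minimal Python sketch of intra prediction for a 4×4 block. It implements only three of the available modes (vertical, horizontal and DC) and picks the one with the lowest sum of absolute differences (SAD). The function names and the SAD-based choice are assumptions made for illustration; a real encoder is free to use any decision strategy.

```python
def predict_vertical(top, left):
    """Copy the 4 pixels above the block down every row."""
    return [list(top) for _ in range(4)]

def predict_horizontal(top, left):
    """Copy the 4 pixels to the left of the block across every column."""
    return [[left[r]] * 4 for r in range(4)]

def predict_dc(top, left):
    """Fill the block with the rounded mean of the 8 neighbouring pixels."""
    dc = (sum(top) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]

MODES = {
    "vertical": predict_vertical,
    "horizontal": predict_horizontal,
    "dc": predict_dc,
}

def sad(a, b):
    """Sum of absolute differences between two 4x4 blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def best_mode(block, top, left):
    """Return (mode_name, residual) for the cheapest of the sketched modes."""
    name = min(MODES, key=lambda m: sad(block, MODES[m](top, left)))
    pred = MODES[name](top, left)
    residual = [[block[r][c] - pred[r][c] for c in range(4)] for r in range(4)]
    return name, residual

# A block of vertical stripes that continues the row of pixels above it.
top = [10, 20, 30, 40]
left = [10, 10, 10, 10]
block = [[10, 20, 30, 40] for _ in range(4)]
mode, residual = best_mode(block, top, left)
print(mode)  # vertical
```

In this example the prediction is perfect, so the residual is all zeros and only the chosen mode has to be signalled.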
A new logarithmic quantization step is used: the step size grows by about 12% for each increment of the quantization parameter (QP), so it doubles every 6 QP values. It’s also possible to use frequency-dependent quantization scaling matrices, selected by the encoder, for perceptually optimized quantization.
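The relationship between QP and step size is easy to sketch in Python. The six base values below are the commonly tabulated step sizes for QP 0 to 5; the function name qstep is my own.

```python
# Step sizes for QP 0..5; every further 6 QP values double the step.
BASE_QSTEP = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp):
    """Return the quantization step size for a QP in the range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be between 0 and 51")
    return BASE_QSTEP[qp % 6] * (1 << (qp // 6))

print(qstep(4))    # 1.0
print(qstep(10))   # 2.0
print(qstep(51))   # 224.0

# Doubling every 6 steps means a growth factor of 2^(1/6) per step,
# which is where the figure of roughly 12% comes from.
print(round(2 ** (1 / 6), 4))  # 1.1225
```

The wide range (from 0.625 to 224) lets a single parameter cover everything from near-lossless to very low bit rate encoding.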
Multiple Reference Frames
H.264 uses previously encoded pictures as references in a much more flexible way than past standards, allowing up to 16 reference pictures to be used (in prior standards the limit was typically one or, in the case of conventional B-frames, two). In certain scenarios, for example scenes with rapid repetitive flashing, back-and-forth scene cuts or uncovered background areas, this allows a very significant reduction in bit rate.
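A toy Python sketch shows why this helps with flashing scenes. It compares the current block only against the co-located block in each reference picture (no motion search, to keep things short) and picks the cheapest one by SAD. The function names, the SAD criterion and the sample values are all assumptions made for illustration.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def pick_reference(current, references, max_refs=16):
    """Return (index, cost) of the best-matching reference block.

    references[0] is the most recently encoded picture; at most
    max_refs candidates are examined.
    """
    candidates = references[:max_refs]
    costs = [sad(current, ref) for ref in candidates]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]

# A flashing scene: the previous picture is dark, the one before it is bright.
dark = [[16] * 4 for _ in range(4)]
bright = [[200] * 4 for _ in range(4)]
current = [[198] * 4 for _ in range(4)]

print(pick_reference(current, [dark, bright]))  # (1, 32)
```

With a single reference the encoder would be stuck with the dark picture and a large residual; with two or more it can reach back to the older bright picture, which matches almost perfectly.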
TO BE CONTINUED…