Defeat Banding – Part II

Recently, banding has finally become a hot topic in encoding optimization. As discussed in this previous post, it is nowadays one of the worst enemies of an encoding expert, especially when trying to fine-tune content-aware encoding techniques.

Banding emerges when compression removes too many high frequencies locally in a frame, splitting a gradient into individual bands of flat color. Those bands are easily visible and reduce perceptual quality.

For years I’ve underlined that even a useful metric like VMAF is not able to efficiently identify banding, and that we needed something more specific, or at least a VMAF-like metric more sensitive to artifacts in dark or flat parts of the picture, and ideally a no-reference metric, so it could be used to assess source files as well as compressed ones.

FIG. 1 – Lack of correlation between VMAF and MOS for sequences with banding (Source: Netflix)

As anticipated in the previous post, in 2020 I started experimenting with a PoC for a metric to measure banding, and the following year, working for one of my clients, I validated the logic in a “bandingIndex” metric. I’ll call it bIndex for the sake of simplicity.

Significantly, Netflix was also working on banding and presented (Oct 2021) their banding detection metric Cambi. Cambi is a consistent no-reference banding detector based on pixel analysis and thresholding, plus many optimizations to achieve solid and accurate banding identification.

The logic I’ve used is very different from Cambi’s and can be used to identify not only banding but many types of impairments, using what I call the “auto-similarity” principle.

The logic of source-impaired similarity

The logic I explored is illustrated in the picture below:

FIG. 2 – Auto-Similarity principle

A source video is impaired to introduce an artifact such as blocking, banding, ringing, excessive quantization or similar.

If the impaired version of a video is still similar to its non-impaired self, this means that the original video already has a certain degree of that impairment, and the higher the similarity index, the higher that degree.

I call it “source-impaired similarity” or sometimes “auto-similarity” because a video is compared to itself plus an injected, controlled and known impairment. The impairment needs to be one-off, not cumulative. Let me explain:

By one-off impairment I mean a modification that produces its effect only the first time it is applied. For example, a color-to-gray filter has that characteristic: if you apply it a second time, the result doesn’t change anymore.
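A tiny sketch of this property (plain NumPy, with a Rec. 601 luma conversion standing in for a generic color-to-gray filter):

```python
import numpy as np

def to_gray(rgb: np.ndarray) -> np.ndarray:
    """Color-to-gray filter: Rec. 601 luma replicated on the three channels."""
    y = rgb @ np.array([0.299, 0.587, 0.114])
    return np.repeat(y[..., None], 3, axis=2)

rgb = np.random.default_rng(1).random((4, 4, 3)) * 255
once = to_gray(rgb)
twice = to_gray(once)
print(np.allclose(once, twice))   # True: the second application changes nothing
```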

Now we have two things to choose: the impairing filter and the similarity metric.

So let’s suppose we want to find out whether a portion of video has banding or excessive quantization artifacts. In this case we can use, as impairment, a quantization in the frequency domain. This form of impairment has the characteristic described above: when applied multiple times, only the first application produces a distortion; the following ones do not modify a picture that is already quantized with a known quantization level.
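A minimal sketch of such an impairment, assuming 8-bit luma planes stored as NumPy arrays with dimensions that are multiples of the block size (the quantization step and block size below are illustrative values, not the ones calibrated for bIndex):

```python
import numpy as np
from scipy.fft import dctn, idctn

def quantize_freq(img: np.ndarray, q: float = 32.0, block: int = 8) -> np.ndarray:
    """One-off impairment: coarse, block-wise quantization in the DCT domain."""
    out = img.astype(np.float64).copy()
    for y in range(0, img.shape[0] - block + 1, block):
        for x in range(0, img.shape[1] - block + 1, block):
            coeffs = dctn(out[y:y + block, x:x + block], norm="ortho")
            coeffs = np.round(coeffs / q) * q        # known, fixed quantization step
            out[y:y + block, x:x + block] = idctn(coeffs, norm="ortho")
    return out

# Only the first application distorts the picture; further applications are
# no-ops because the coefficients are already multiples of q.
patch = np.random.default_rng(0).random((64, 64)) * 255
once = quantize_freq(patch)
twice = quantize_freq(once)
print(np.max(np.abs(once - twice)))   # ~0, up to floating-point error
```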

The most widely used similarity metric is SSIM. It reaches 1 when the videos are identical and goes below 1 when dissimilarities arise. It is more perceptually aware than PSNR and less sensitive to small deltas as long as statistical indicators like mean, variance and covariance stay similar.
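Putting the pieces together, here is a minimal sketch of the principle, reusing quantize_freq from above (the synthetic ramps and all names are mine, for illustration only, not the actual bIndex code):

```python
from skimage.metrics import structural_similarity

def auto_similarity(frame: np.ndarray, impair) -> float:
    """SSIM between a luma plane and its impaired self: the closer to 1,
    the more of that impairment the frame already contains."""
    return structural_similarity(frame.astype(np.float64), impair(frame),
                                 data_range=255)

# A smooth ramp (no banding) vs. a posterized ramp (banding already present).
ramp = np.tile(np.linspace(0.0, 255.0, 256), (256, 1))
banded = np.floor(ramp / 32.0) * 32.0            # crude posterization: visible bands

for name, img in (("smooth", ramp), ("banded", banded)):
    print(name, round(auto_similarity(img, quantize_freq), 4))
# The already-banded ramp should stay more similar to its impaired self,
# i.e. it should yield the higher of the two SSIM values.
```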

It’s very important to analyze the video divided into small portions rather than as a whole, especially during metric fine-tuning, to better understand how to set the thresholds and to verify that the artifact is correctly identified. It is then also possible to calculate an “area coverage percentage” that provides interesting information about the amount of frame area impacted by the artifact under test (banding or other).
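A possible implementation of the per-area analysis and of the coverage figure, building on the sketches above (the tile size and threshold are placeholder values, not the calibrated ones):

```python
def tile_similarity_map(frame: np.ndarray, impair, tile: int = 64) -> np.ndarray:
    """Auto-similarity computed independently on each tile of the frame."""
    rows = []
    for y in range(0, frame.shape[0] - tile + 1, tile):
        row = []
        for x in range(0, frame.shape[1] - tile + 1, tile):
            row.append(auto_similarity(frame[y:y + tile, x:x + tile], impair))
        rows.append(row)
    return np.array(rows)

def area_coverage(sim_map: np.ndarray, threshold: float = 0.98) -> float:
    """Percentage of analyzed tiles whose similarity exceeds the banding threshold."""
    return 100.0 * float(np.mean(sim_map >= threshold))
```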

The high-level schema below illustrates the metric calculation. Fine-tuning the metric requires further processing steps: pre-conditioning (which may be useful to accentuate the artifact), appropriate processing of the SSIM values to keep only the desired information (non-linear mapping and thresholding), and a final aggregation (pooling) of the data into a significant index for each frame.

FIG. 3 – Extraction of bIndex
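Building on the sketches above, a hypothetical per-frame pipeline could look like the following (the pre-conditioning, mapping and pooling here are simple placeholders, not the actual ones used in bIndex):

```python
def frame_bindex(luma: np.ndarray, tile: int = 64, threshold: float = 0.98) -> float:
    """Illustrative per-frame index: pre-condition, measure per-tile
    auto-similarity, keep only values above the threshold, then pool."""
    pre = luma.astype(np.float64)                          # pre-conditioning placeholder
    sim_map = tile_similarity_map(pre, quantize_freq, tile)
    kept = np.where(sim_map >= threshold, sim_map, 0.0)    # non-linear mapping / thresholding
    return float(kept.mean())                              # pooling into one index per frame
```

In this sketch the same sim_map also feeds the area_coverage function above, so a single pass per frame yields both the per-frame index and the coverage percentage.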

Conclusions

To develop, verify and fine-tune the bIndex metric, I extended a custom player I had developed in the past for frame-by-frame and side-by-side comparisons. In the pictures below you can see the indexes for each frame area: they are green when banding is not visible and red when banding is visible and annoying. The first picture also shows an overlaid, seekable timeline that plots the banding likelihood for each picture area, together with the threshold that separates irrelevant from visible/annoying banding. This makes it possible to quickly seek to frame sequences that contain banding and evaluate the correctness of the detection.

This approach could be extended to many types of artifacts and used to assess various types of video (sources, mezzanines, compressed video) with different thresholds. Statistical indicators such as the frame coverage percentage are also useful for decisions like rejecting a source or re-encoding content with specific profiles to fix the problem. Note that the current thresholds have been identified using the perception of small panels of golden eyes on big screens, but in the future more complex modeling could be used to correlate the objective numbers with perception and to introduce other improvements such as temporal masking and context-aware banding estimation.
