Defeat Banding – Part II

Recently, banding has finally become a hot topic in encoding optimization. As discussed in this previous post, it is nowadays one of the worst enemies of an encoding expert, especially when fine-tuning content-aware encoding techniques.

Banding emerges when compression removes too many high frequencies locally in a frame, splitting a gradient into individual bands of flat color. Those bands are easily visible and reduce perceptual quality.

For years I’ve underlined that even a useful metric like VMAF is not able to efficiently identify banding, and that we needed something more specific: either a dedicated metric, or one like VMAF but more sensitive to artifacts in dark or flat parts of the picture, and ideally a no-reference metric, so that it could be used to assess source files as well as compressed ones.

FIG.1 – Lack of correlation between VMAF and MOS in case of sequences with banding (Source: Netflix)

As anticipated in the previous post, in 2020 I started experimenting with some PoCs for a metric to measure banding, and the following year, working for one of my clients, I validated the logic in a “bandingIndex” metric. I’ll call it bIndex for the sake of simplicity.

Significantly, Netflix was also working on banding and presented (Oct 2021) their banding detection metric, Cambi. Cambi is a consistent no-reference banding detector based on pixel analysis and thresholding, plus many optimizations to achieve solid and accurate banding identification.

The logic I’ve used is very different from Cambi’s and can be used to identify not only banding but many types of impairments, using what I call the “auto-similarity” principle.

The logic of source-impaired similarity

The logic I explored is illustrated in the picture below:

FIG 2 – Auto-Similarity principle

A source video is impaired to introduce an artifact such as blocking, banding, ringing, excessive quantization and the like.

If the impaired version of a video is still similar to its non-impaired self, it means that the original video already has a certain degree of that impairment. That degree is inversely proportional to the similarity index.

I call it “source-impaired similarity” or sometimes “auto-similarity” because a video is compared to itself plus an injected, controlled and known impairment. The impairment needs to be one-off and not cumulative. Let me explain better:

By one-off impairment I mean a modification that produces its effect only the first time it is applied. For example, a color-to-gray filter has that characteristic: if you apply it a second time, the result doesn’t change anymore.

Now we have two things to choose: the impairing filter and the similarity metric.

So let’s suppose we want to find out whether a portion of video has banding, or excessive quantization artifacts. In this case we can use as impairment a quantization in the frequency domain. This form of impairment has the characteristic described above: when applied multiple times, only the first application produces a distortion; the following ones do not modify a picture that is already quantized at a known quantization level.
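As a minimal sketch of this idea (my own illustration with arbitrary parameters, not the exact filter used in the real metric), the snippet below quantizes the 8×8 DCT coefficients of each block with a fixed step; applying it a second time changes essentially nothing, which is exactly the one-off property we need.

import numpy as np
from scipy.fft import dctn, idctn

def quantize_frequency(plane, step=16.0, block=8):
    """Quantize each block in the DCT domain with a fixed step (one-off impairment)."""
    h, w = plane.shape
    out = plane.astype(np.float64).copy()
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            b = out[y:y + block, x:x + block]
            coeffs = dctn(b, norm='ortho')
            coeffs = np.round(coeffs / step) * step      # uniform quantization of the coefficients
            out[y:y + block, x:x + block] = idctn(coeffs, norm='ortho')
    return out

# Idempotency check: the second application changes (almost) nothing.
frame = np.random.rand(64, 64) * 255
once = quantize_frequency(frame)
twice = quantize_frequency(once)
print(np.max(np.abs(once - twice)))   # ~0, up to floating-point error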

The most commonly used similarity metric is SSIM. It reaches 1 when the videos are identical and goes below 1 when dissimilarities arise. It is more perceptually aware than PSNR and less sensitive to small deltas as long as statistical indicators like mean, variance and covariance are similar.

It’s very important to analyze the video divided into small portions and not as a whole, especially during metric fine-tuning, to better understand how to set the thresholds and to verify the correct identification of the artifact. It is then also possible to calculate an “area coverage percentage” that provides interesting information about the amount of frame area impacted by the artifact under test (banding or other).

The high-level schema below illustrates the metric calculation. Fine-tuning the metric requires other processing steps such as pre-conditioning (which may be useful to accentuate the artifact), appropriate elaboration of the SSIM values to keep only the desired information (non-linear mapping and thresholding), and a final aggregation (pooling) of the data to summarize a significant index for each frame.

FIG. 3 – Extraction of bIndex
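For the sake of illustration, here is a simplified sketch of that pipeline; the block size, the SSIM implementation from scikit-image, the 0.985 threshold and the pooling rule are my own assumptions, not the production code, and it reuses the quantize_frequency helper from the previous snippet. Each block of the frame is compared with its impaired version, the SSIM values are thresholded, and a per-frame index plus a coverage percentage are pooled:

import numpy as np
from skimage.metrics import structural_similarity as ssim

def banding_index(frame, block=64, ssim_thr=0.985):
    """Return (bIndex, coverage %) for one luma plane.

    A block whose impaired copy is still almost identical to the original
    (SSIM above ssim_thr) is counted as already containing the impairment.
    """
    impaired = quantize_frequency(frame)   # one-off impairment from the previous sketch
    scores = []
    h, w = frame.shape
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            a = frame[y:y + block, x:x + block]
            b = impaired[y:y + block, x:x + block]
            scores.append(ssim(a, b, data_range=255))
    scores = np.array(scores)
    flagged = scores > ssim_thr                      # high similarity => banding-prone area
    coverage = 100.0 * flagged.mean()                # % of frame area impacted
    b_index = float(scores[flagged].mean()) if flagged.any() else 0.0
    return b_index, coverage

# A frame that is already coarsely quantized is barely changed by the impairment,
# so its blocks score a very high SSIM and get flagged (coverage close to 100%).
banded = quantize_frequency(np.tile(np.linspace(0, 255, 256), (256, 1)))
print(banding_index(banded))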

Conclusions

To develop, verify and fine-tune the bIndex metric, I extended a custom player I had developed in the past for frame-by-frame and side-by-side comparison. In the pictures below you can see the indexes for each frame area: green when banding is not visible, red when banding is visible and annoying. The first picture also shows an overlayed, seekable timeline that plots the banding likelihood for each picture area and the threshold that separates irrelevant from visible/annoying banding. In this way it’s possible to quickly seek to frame sequences that contain banding and evaluate the correctness of the detection.

This approach could be extended to many types of artifacts and used to assess various types of video (sources, mezzanines, compressed video) with different thresholds. Having statistical indicators such as the frame coverage percentage is also useful for making decisions like source rejection or content re-encoding with specific profiles to fix the problem. Note that currently the thresholds have been identified using the perception of small panels of golden-eye viewers on big screens, but in the future more complex modeling could be used to correlate the objective numbers with perception and to introduce other improvements like temporal masking and context-aware banding estimation.

15 years of blogging about Internet Video

15 years ago I started this blog to share my experiments and points of view on video streaming, playback and encoding. It has provided important opportunities for my professional career and extended my circle of contacts in the world of video streaming professionals, and for that I’m grateful…

Unfortunately (or fortunately, depending on the point of view) I’ve not always had the time a blog deserves, especially in the last 5 years… but after more than a hundred articles and almost 2 million contacts I can say that the objective has nevertheless been achieved.

In the meanwhile, the trends of technical communication have changed profoundly. We’ve seen the rise and transformation of social media platforms like Facebook and Twitter, the increasing role of LinkedIn in presenting and sharing ideas in a professional environment, and the role of YouTube as a one-stop shop for presentations and conferences. I think, however, that a blog can still be a useful place to consolidate, share and persist ideas and to contribute to the community.

For the future, I’m trying to reorganize my activities to find more spare time to disseminate the knowledge and experience gathered especially in the last 10 years, writing more posts and participating more in web conferences (hoping to restart live participation as soon as possible).

It could be interesting to completely refresh my series FFmpeg – The Swiss Army Knife of Video Internet (there are so many things to say about it and about ways to use it more productively), or to analyze technically state-of-the-art codecs like AV1 and VVC as I did for H.264 and H.265 in the past, or again to continue analyzing optimization trends and new challenges, especially those related to video processing architectures.

I’m rolling up my sleeves, stay tuned…

Let’s rediscover the good old PSNR

In the last few years, I’ve been involved in interesting projects around how to measure and/or estimate perceptual quality in video encoding. Measuring quality is useful in itself during the development and monitoring of encoding optimizations, to assess the benefits of an optimized pipeline. But it’s even more interesting to estimate the quality you can achieve with specific parameters before encoding, so as to be able to implement advanced logic in Content-Aware Encoding.

It’s a complex topic, but in this post I’d like to focus on the role that PSNR can still have in measuring quality.

Is PSNR poorly correlated with perception?

PSNR is well known to be poorly correlated with perception. SSIM is a bit better, but both appear scarcely correlated from a general point of view, or at least so it seems:

FIG – PSNR vs MOS (scatter plot)

The scatter plot shows a number of samples encoded at different resolutions and levels of impairment, plotting PSNR vs MOS (perceptual quality in standard TV-like viewing conditions). You can see that the relationship between PSNR and MOS is not very linear: for example, a value of 40dB corresponds to MOS values ranging from 1.5 to 5, depending on the content.

It is clear that we cannot use PSNR as an indicator of the absolute quality of an encoding. Only at very high values, say 48dB, does the measurement become really significant.

It is because of this scarce correlation with MOS (an optimal metric would stay on the green line in the graph above) that Netflix defined a metric like VMAF.

VMAF uses a different approach: it calculates multiple elementary metrics and, using machine learning, builds a “blended” estimator that works much better than the individual elementary metrics.
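Conceptually, the fusion step looks like the toy sketch below: compute a few elementary metrics per clip and let a regressor trained on subjective scores blend them into a MOS-like prediction. VMAF does use a support vector regressor for this, but the feature set, parameters and training data shown here are placeholders of mine, not Netflix’s:

import numpy as np
from sklearn.svm import SVR

# Each row: elementary metrics for one clip (e.g. [VIF, detail-loss, motion]); y: subjective MOS.
# The values below are random placeholders, not real training data.
X_train = np.random.rand(200, 3)
y_train = 1 + 4 * np.random.rand(200)

fusion = SVR(kernel='rbf', C=4.0, epsilon=0.1)   # same idea as VMAF's SVR-based fusion
fusion.fit(X_train, y_train)

X_new = np.random.rand(5, 3)                     # elementary metrics of new encodes
print(fusion.predict(X_new))                     # blended quality estimates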

I have worked on a different, but conceptually similar, perception-aware metric in the past, so I know that such metrics can have a problem: they are a bit expensive to calculate. This is not because of the ML, which is very fast at inference time, but because you need accurate and slow elementary metrics (or a higher number of faster ones) to obtain good estimations.

Can PSNR still play a role?

In common experience, PSNR still communicates something to compressionists. Professionals like Jan Ozer continue to advocate PSNR for certain types of assessment, and I agree that it is valuable especially in relative comparisons, probably thanks to its “linearity” inside specific testing frames. Knowing how ML-based metrics work, I admit that PSNR is much more linear and monotonic “locally”, while this is not guaranteed for ML-based estimators (it depends heavily on the ML algorithm).

FIG – PSNR vs MOS (points grouped by source)

So, let’s take a look at this scatter plot. The cloud of points provides little information, but if we connect the points related to the same video source, a structure starts to emerge.

The relationship between PSNR and MOS for the same source can be linearized in the most important part of the chart, with an error that in some projects can be negligible.

So, what are we lacking to use PSNR as a perceptual quality estimator?

We need some other information, extracted from the source, that provides us with a starting point and a slope. With a fixed point on the chart (e.g., the PSNR needed to reach 4.5 MOS for a given source) and a slope (the angular coefficient of the approximate linearization of the PSNR-MOS relationship for that specific content), we would be able to use PSNR as an absolute perceptual quality estimator, simply by projecting the PSNR onto the line and reading back the corresponding MOS.

FIG – PSNR vs MOS with per-source linearization
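Just to make the idea concrete, here is a minimal sketch where the anchor point, slope and sample values are invented for illustration: given the PSNR at which a scene reaches MOS 4.5 and the slope of its PSNR-MOS linearization, the projection is a simple affine mapping clipped to the MOS scale.

def psnr_to_mos(psnr_db, psnr_at_45, slope):
    """Project a PSNR value onto the per-source linearized PSNR-MOS line.

    psnr_at_45 : PSNR (dB) at which this source reaches MOS 4.5 (the anchor point)
    slope      : MOS gained per additional dB of PSNR for this source
    """
    mos = 4.5 + slope * (psnr_db - psnr_at_45)
    return max(1.0, min(5.0, mos))          # clamp to the MOS scale

# Hypothetical example: a source that reaches MOS 4.5 at 42 dB, gaining 0.25 MOS per dB.
print(psnr_to_mos(38.0, psnr_at_45=42.0, slope=0.25))   # -> 3.5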

Now I’m experimenting precisely with how to quickly derive a starting point and a slope from the characteristics of the specific source (scene by scene, of course). The objective is to find a quick method to estimate those parameters (point and slope), so as to be able to measure absolute perceptual quality an order of magnitude more quickly than with VMAF (probably with lower accuracy, but still with good local linearity and monotonic behavior).

This may be very useful in advanced Content-Aware Encoding logic to measure/estimate the final MOS of the current encoding and, for example, adjust parameters to achieve the desired level (CAE in live encoding is the typical use case).


Be quick! 5 promo codes to save $200 on a MAX 2011 full conference pass

I’m happy to offer 5 of my readers the opportunity to obtain a $200 discount on a full conference pass for Adobe MAX 2011. The first 5 of you who use the promo code ESSONNATI while registering for the conference (max.adobe.com) will obtain the discounted rate of $1,095 instead of $1,295. Be quick! MAX is approaching fast!

If you get in, remember to attend my presentation: “Encoding for performance on multiple devices”
http://bit.ly/qvKjP0

Mobile development with AIR and Flex 4.5.1

Recently I had a very pleasant experience developing a set of mobile applications for the BlackBerry PlayBook using Adobe Flex 4.5.1.

In the past I have been critical of Adobe because I believed that Flex for Mobile was not sufficiently smooth on devices and that the workflow was not efficient, but after this project I had to think again. The main application is not very complex, but it has given me the opportunity to evaluate in a real scenario the efficiency of the framework and the performance level on multiple devices.

The final impression is that Adobe is doing really well and, after a year of tests and improvements, Flex is becoming an efficient and powerful cross-device development framework. There are still some points to improve and some features to implement/enhance, but I’m not so critical anymore.

The application I developed is a classic multimedia app, built for a media client (Virgin Radio Italy) to offer multimedia content for the entertainment of their mobile users. The app offers:

– A selection of thematic web radios plus the live broadcast of the main radio channel (Virgin Radio Italy)
– A selection of podcasts (MP3) from the main programs of the radio
– The charts/playlists created by Virgin’s sound designers or voted by the users
– A multi-touch photo gallery
– A selection of VOD content such as video clips, interviews and concerts

The application is now under approval and should be available in the PlayBook’s App World in a few days. In the meantime you can take a look at the UX in this preview video:

More information about Flash Player 11, AIR 2.7 and above

In a recent post I summarized the information available on the Internet about the features that will be implemented in the next releases of Flash Player (not necessarily 11, maybe 10.x) and AIR.

At Flash Camp Brazil, Arno Gourdol (Adobe Flash Runtime Team) provided a lot of additional information about what we will be able to see in the near future. Many improvements are conceived from a mobile perspective, and this is obvious because mobile clients are expected to outnumber desktop clients by 2013.

Take a look at the list of future improvements:

1. Faster Garbage Collection
An incremental GC will be implemented to avoid clustered GC pauses, with more control and reduced allocation cost.

2. Faster ActionScript
Type-based optimizations and the new “float” type (32-bit) will be introduced to enhance JIT compiler performance.

3. Multi Threading
A worker-thread model will be implemented to allow multiple ActionScript tasks to run without blocking the UI and to leverage multi-core CPUs.

4. Threaded video pipeline
Video decoding optimizations for the standard Video and StageVideo objects, leveraging multi-core CPUs and the GPU in parallel.

5. HW Renderer
Use the GPU as much as possible for graphics, vectors and 3D. Stage3D. Hardware compositing.

I think that especially the last point is very important for mobile if Adobe wants to close the gap between native mobile applications and AIR/Flex applications. Take a look at the picture below to understand what I mean: with HW acceleration, AIR 2.7 will be able to improve performance significantly, especially on iOS devices and even for Flex applications. Let’s hope to see it very soon.

BBC iPlayer and Flash Player: 3 years of love story

In late December 2007 BBC released the Flash-based iPlayer, creating the concept of “catch-up TV” and starting one of the most important success stories in video distribution over the Internet. The service has had a very rapid adoption rate and is now incredibly popular in the UK, with over 140 million video requests recorded in November 2010. Being a long-format video site, the total amount of streamed video is very high: approximately 25 million hours/month. This makes iPlayer one of the top “video streaming sources” in the world. For example, Yahoo! and VEVO, which are the second and third video sites in the U.S. by number of users, deliver “only” around 15 million hours/month.

The success of the service can be attributed to the very fluid user experience and good video quality. The use of Flash Player has been the main catalyst of this process, and even now that iPlayer is available on multiple devices and platforms, the desktop version is still the most used. Only 4% of video is consumed on mobile devices, 7% on PS3 and Wii (both using Flash), 16% on Virgin TV STBs and the remaining 73% on desktop PCs. Part of the mobile traffic is related to the Android version of iPlayer, which is Flash-based too.

Overall, more than 80% of the video streaming is consumed using the Flash Player version of iPlayer. This percentage is similar to the market share of Flash for Internet video delivery at a worldwide level. Today BBC is using features that only Adobe Flash can offer: streaming protection, dynamic buffering, bitrate switching, DRM-protected download (with Adobe Access and an AIR client), and very high market penetration (near 100% for desktops, plus Android, PS3 and Wii). So it’s no mystery why the service has become so popular. And BBC is not alone: all the top 10 video sites offer video using Flash as the main infrastructure.

To know more statistics about iPlayer, read the latest report published by the BBC.

Testing StageVideo in Flash Player 10.2

Adobe has launched the public beta of Flash Player 10.2. This minor update offers a limited but very important set of improvements:

  • Internet Explorer 9 hardware-accelerated rendering – Flash Player 10.2 exploits GPU-accelerated graphics (vector rendering, compositing, etc.) in Internet Explorer 9.
  • Stage Video hardware acceleration – H.264 decoding, scaling and compositing are performed entirely by the GPU.
  • Native custom mouse cursors – Developers can define custom native mouse cursors, enabling user experience enhancements and improving performance.
  • Support for full-screen mode with multiple monitors – Full-screen content will remain in full screen on secondary monitors, which is especially useful during video playback.

From my personal point of view, the most important improvement is the new StageVideo API. I introduced the technology in this post, talking about the AIR for TV runtime, but now it is becoming a reality for the desktop too.

StageVideo technology allows direct use of the video acceleration features of the underlying hardware. When using a StageVideo object instead of the classic Video object you have some limitations (essentially because it is not part of the display list but an external video plane composited by the GPU with the Flash stage), but you get excellent performance: zero dropped frames and high rendering quality.

The aim of StageVideo is to optimize decoding performance even in low-power CPU scenarios (netbooks, set-top boxes, smartphones and tablets) where it is very important to exploit dedicated HW acceleration features instead of using the CPU.

Flash Player 10.1 (Mobile or not) already exploited HW acceleration of H.264 decoding and scaling but, especially on Mac and mobile, some steps were still performed by the CPU. For example, compositing into the display list was very CPU-intensive on both Mac and mobile. With Stage Video this is going to change.

For the moment Flash Player 10.2 is only available for the desktop, but AIR for TV already supports StageVideo. I hope to see the mobile version of Flash Player 10.2 very soon, because this kind of platform is the one that can benefit the most from StageVideo (perfect performance, higher quality, lower battery consumption).

The performance

If you want to compare the performance of a video decoded with StageVideo and with the classic method, install the player and go to this test page. You can also go to YouTube, which has already started to support StageVideo. Below you find my results:

Note: the video is 1920×1080 25p at a very high bitrate, so I suggest you start playback, then pause it (with SPACE) and let it buffer for a while. Press O to switch from StageVideo to the standard Video object and monitor the CPU usage.

Laptop Core 2 Duo 2.1 GHz – Windows Vista – IE8 – GeForce 8400M

With standard Video Object: 45-50% (H.264 decoding is accelerated on win 7/vista but not the full path to screen)

With StageVideo : 3-5%

Desktop Quad Core 2.4 GHz – Windows XP sp3 – IE7 – ATI Radeon 3400

With standard Video Object: 30-35% (H.264 is decoded by software here on 4 cores)

With StageVideo: 15-20% (this GPU is not very powerful, but still impressive considering that VLC requires 20-25%)

Desktop Core i7 2.8 GHz – Windows 7 x64 -FF 3.6 – ATI Radeon 5750

With standard Video Object: 10% (H.264 decoding is accelerated on win7 /vista but not the full path to screen)

With StageVideo: 0% ! (Yes, you read that right: zero percent)

Mac iBook Core 2 Duo 2.2 GHz – OSX – Safari – GeForce 8400M

With standard Video Object: >40% (H.264 decoding “should” be accelerated on nvidia cards and latest OSX and Safari)

With StageVideo: <20%

Very very impressive

To know more details about how to support StageVideo in your players read this article by Thibault Imbert.

H.264 for image compression

Recently Google presented the WebP initiative, which proposes a substitute for JPEG image compression using the intra-frame compression techniques of the WebM project (the VP8 codec).
Not everybody knows it, but H.264 is also very good at encoding still pictures and, differently from WebM or WebP, 97% of PCs are already capable of decoding it (someone said ‘Flash Player’?).
The intra-frame compression techniques implemented in H.264 are very efficient, much more advanced than the old JPG and superior to WebP too. So let’s take a look at how to produce such a file and at the advantages of using it inside Flash Player.

JPG image compression

JPG is an international standard approved in 1992-94. It has been one of the most important technologies for the web, because without an efficient way to compress still pictures the web would not be what it is today. JPEG is usually capable of compressing an image by around 1:10. The encoder performs these steps:

1. Color space conversion from RGB to YCbCr
2. Chroma sub-sampling, usually to 4:2:0 (4:2:2 and 4:4:4 are also supported)
3. Discrete Cosine Transform of 8×8 blocks
4. Quantization
5. Entropy coding (zig-zag RLE and Huffman)

The algorithm is well known and robust and is used in almost every electronic device with a color display, but obviously in the last 15 years scientists have developed more advanced algorithms to encode still pictures. One of these is JPEG 2000, which leverages wavelets to encode the picture. But the problem of improving intra-frame compression is very important in video encoding too, because this is the kind of compression used for keyframes. So H.263 first, and H.264 later, proposed more optimized ways to encode a single picture.

H.264 intra frame compression

H.264 contains a number of new features that allow it to compress images much more efficiently than JPG.

New transform design

Differently from JPG, an exact-match integer 4×4 spatial block transform is used instead of the well-known 8×8 DCT. It is conceptually similar to the DCT but produces fewer ringing artifacts. There is also an 8×8 spatial block transform for less detailed areas and chroma.

A secondary Hadamard transform (2×2 on chroma and 4×4 on luma) can be performed on the “DC” coefficients to obtain even more compression in smooth regions.

There is also an optimized quantization scheme and two possible zig-zag patterns for the run-length encoding of the transformed coefficients.
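To give an idea of how light this transform is, here is an illustrative numpy sketch of the 4×4 forward core transform and of the 4×4 Hadamard transform used on the luma DC coefficients; the scaling and quantization stages are omitted, so this is not a bit-exact H.264 implementation.

import numpy as np

# H.264 4x4 forward core transform matrix (integer approximation of the DCT)
CF = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

# 4x4 Hadamard matrix, applied to the 16 luma DC coefficients in Intra_16x16 mode
H4 = np.array([[1,  1,  1,  1],
               [1,  1, -1, -1],
               [1, -1, -1,  1],
               [1, -1,  1, -1]])

def core_transform_4x4(block):
    """Forward 4x4 integer transform of a residual block (scaling/quantization omitted)."""
    return CF @ block @ CF.T

def hadamard_4x4(dc_block):
    """Secondary Hadamard transform of the 4x4 matrix of DC coefficients."""
    return H4 @ dc_block @ H4.T

# Example on a synthetic residual block
residual = np.arange(16).reshape(4, 4) - 8
print(core_transform_4x4(residual))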

Intra-frame compression

H.264 introduces complex spatial prediction for intra-frame compression.
Rather than the “DC”-only prediction found in MPEG-2 and the transform-coefficient prediction found in H.263+, H.264 defines 9 prediction modes (8 directional plus DC) to predict spatial information from neighbouring blocks when encoding with the 4×4 transform. The encoder tries to predict the block by extrapolating the pixel values of adjacent blocks, so only the delta (residual) signal is transmitted.

There are also 4 prediction modes for smooth color areas (16×16 blocks). The residual data is coded with 4×4 transforms, and a further 4×4 Hadamard transform is applied to the DC coefficients.
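As an illustration of the principle (simplified on purpose: real H.264 prediction works on reconstructed neighbours and has more modes), the sketch below predicts a 4×4 block from the pixels above and to the left using the vertical, horizontal and DC modes, and keeps the mode with the smallest residual energy.

import numpy as np

def predict_4x4(above, left):
    """Build three simple intra predictors for a 4x4 block from its neighbours."""
    return {
        "vertical":   np.tile(above, (4, 1)),                 # copy the row above downwards
        "horizontal": np.tile(left.reshape(4, 1), (1, 4)),    # copy the left column rightwards
        "dc":         np.full((4, 4), (above.mean() + left.mean()) / 2),
    }

def best_mode(block, above, left):
    """Choose the mode with the lowest SAD; the residual is what would be transformed and coded."""
    preds = predict_4x4(above, left)
    mode = min(preds, key=lambda m: np.abs(block - preds[m]).sum())
    return mode, block - preds[mode]

# Hypothetical block whose rows repeat the samples above: vertical prediction wins.
above = np.array([10, 20, 30, 40])
left  = np.array([10, 10, 10, 10])
block = np.tile(above, (4, 1)) + np.random.randint(-2, 3, (4, 4))
print(best_mode(block, above, left))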

Improved quantization

A new logarithmic quantization scale is used: the step size grows at a compound rate of about 12% per QP, doubling every 6 QP. It’s also possible to use frequency-customized quantization scaling matrices, selected by the encoder, for perceptual quantization optimization.
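As a quick numerical illustration, using the commonly cited base step of 0.625 at QP 0, the quantization step grows roughly as follows (the real standard uses a small table repeated with doubling, so these values are approximate):

# Approximate H.264 quantization step: ~0.625 at QP 0, doubling every 6 QP (~12% per step).
def qstep(qp):
    return 0.625 * 2 ** (qp / 6)

for qp in (0, 6, 12, 24, 40):
    print(qp, round(qstep(qp), 3))   # -> 0.625, 1.25, 2.5, 10.0, ~63.5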

In-loop deblocking filter

An adaptive deblocking filter is applied to reduce possible blocking artifacts at high compression ratios.

Advanced Entropy Coding

H.264 can use the state of the art in entropy coding: Context-Adaptive Binary Arithmetic Coding (CABAC), which is much more efficient than the standard Huffman coding used in JPG.

JPEG vs H.264

The techniques used in H.264 roughly double the efficiency of the compression; that is, you can achieve the same quality at half the size. The efficiency gain is even higher at very high compression ratios (1:25 and beyond), where JPG introduces so many artifacts that it becomes completely unusable.

This is a detail of a 1024×576 image compressed to around 50 KB both in JPG (using PaintShop Pro) and H.264 (using FFmpeg). The picture is shown at 2x zoom to better show the artifacts:

I have estimated a size reduction of around 40-50% at the same perceived quality, especially at high compression ratios.

WebP vs H.264

WebP is based on the intra-frame compression technique of the VP8 codec. I compared H.264 with VP8 in this article. VP8 is a good codec and its intra-frame compression is very similar to H.264’s. The difference is that VP8 does not support the 8×8 block transform (which is a feature of the H.264 High profile) and can only encode in 4:2:0 (H.264 supports up to 4:4:4). So both should have approximately the same performance in the common case (4:2:0). The problem with WebP is support, which is currently almost zero, while H.264 can be decoded by Flash (97% of desktops, plus Android and RIM) and also by iOS devices (via HTML5).

How to encode and display on a page

Now let’s start encoding pictures in H.264. The container can be .mp4 or .flv. FLV is lighter than .mp4, but .mp4 has far more support outside Flash. This is the command line to use with FFmpeg:

ffmpeg.exe -i INPUT.jpg -an -vcodec libx264 -coder 1 -flags +loop -cmp +chroma -subq 10 -qcomp 0.6 -qmin 10 -qmax 51 -qdiff 4 -flags2 +dct8x8 -trellis 2 -partitions +parti8x8+parti4x4 -crf 24 -threads 0 -r 25 -g 25 -y OUTPUT.mp4

The -crf parameter changes the quality level. Try values from 15 to 30 (the final effect depends on the frame size). You can also resize the frame prior to encoding using the parameter -s WxH (e.g., -s 640x360).

To display the picture encoded in H.264 you can use this simple AS3 code:

var nc:NetConnection = new NetConnection();
nc.connect(null);                      // null = progressive download over HTTP, no media server needed
var ns:NetStream = new NetStream(nc);
video.attachNetStream(ns);
video.smoothing = true;                // smooth the picture when it is scaled
nc.client = this;                      // handler object for connection-level callbacks
ns.client = this;                      // handler object for onMetaData and other stream callbacks
ns.play("OUTPUT.mp4");                 // the H.264 "picture" produced by the command above
stage.scaleMode = "noBorder";

The advantage of using Flash to serve pictures in H.264

The main advantage of using H.264 for pictures is the superior compression ratio. But it is not practical, in an everyday scenario, to substitute every instance of the common <img> tag with an SWF.
However, there is a kind of application that can benefit enormously from this approach: the display of big, high-quality copyrighted pictures. Instead of giving access to low-quality, watermarked JPGs, it would be possible to serve such big, high-quality pictures as H.264 streams from a Flash Media Server and protect the delivery using the RTMPE protocol and SWF authentication. On top of that, for bullet-proof protection, you could even encrypt the H.264 payload with a robust DRM like Adobe Access 2.0. Not bad.

MAX 2010 : H.264 Encoding Strategies for All Screens

This is the title of my presentation at MAX 2010. For the third time I have been invited to speak at MAX about video encoding. This year I’ll focus my presentation on encoding best practices for mobile (Android) and HTTP Dynamic Streaming (formerly Project Zeri).

The abstract : “Learn how to create amazing H.264 video that performs well on large and small screens from one of the industry masters of encoding for H.264 video. The session will begin by discussing the fundamentals of encoding H.264 for Flash and cover encoding profiles, buffering techniques, hardware acceleration, and optimizing H.264 for mobile screens. The session will review Adobe’s recommendations for video encoding for HTTP Dynamic Streaming and how you can make your video look great.”

“One of the industry masters”? Wow, I didn’t know that 😉
I’m sure it will be a great experience as always, so if you want to know more about these topics, join us at MAX 2010 in Los Angeles, 23-27 October. My presentation is on 27 October at 8:00 am (damn, I’ll have to wake up early in the morning :-[ ).