Content-Aware Encoding (CAE) and Context-Aware Delivery (CAD) represent the state of the art in video streaming today, regardless of the codec used. The industry has taken its time to absorb these concepts, but they are now firmly mainstream:
Every piece of content is different and needs to be encoded differently. Viewing contexts are different and need to be served differently. Optimizing a streaming service requires both CAE and CAD strategies.
I’ve discussed this logic and the need for CAE and CAD strategies several times, and I’ve implemented various optimizations for my clients over the years.
Speaking of Content-Aware Encoding, at the beginning we used empirical rules to relate the characteristics of the source (possibly classified into categories) to the encoding parameters needed to achieve a satisfactory level of quality at the minimum possible bitrate. The “quality metric” used to tune the algorithms was usually the compressionist’s perception (or that of a small team of Golden Eyes) or, more rarely, a full-featured panel for subjective quality assessment. Following a classical optimization approach (read some thoughts here), we subdivided a complex “domain” like video streaming into subdomains, recursively trying to optimize them individually (and then jointly, if possible) and using human perception tests to guide the decisions.
More recently, the introduction of metrics with high correlation with human perception, like VMAF, has helped greatly in designing more accurate CAE models as well as in verifying the actual quality delivered to clients. But are all problems solved? Can we now completely replace the expert’s eye and subjective tests with inexpensive and fast objective metrics correlated with human perception? The answer is not simple. In my experience, yes and no. It depends on many factors, and one of them is accuracy…
A matter of accuracy
In my career I’ve had the fortune and privilege to work with open-minded managers, executives and partners who dared to leave their comfort zone to promote experiments, trials and bold ideas for the sake of quality, optimization and innovation. So in the last decade I’ve had the opportunity to work on a number of innovative and stimulating projects like:
various CAE deployments, studies on human perception to tune video encoding optimizations and filtering, the definition of metrics similar to VMAF to train ML algorithms in the most advanced CAE implementations, and many others. In the rest of this post I’d like to discuss some problems encountered in this never-ending pursuit of an optimal encoding pipeline.
When VMAF was released, back in 2016, I was intrigued and excited to use it to improve an existing CAE deployment for one of my main clients. If you can replace an expensive and time-consuming subjective panel with a scalable video quality tool, you can multiply the experiments around encoding optimization, video processing, new codecs or other creative ideas about video streaming. A repeatable quality measurement is also useful to “sell” a new idea, because you can demonstrate the benefits it can produce (especially if the metric is developed by Netflix, which brings immediate credibility).
However, from the beginning VMAF showed some sub-optimal behaviors in my experiments, at least in some scenarios. In particular, what I still recognize as the Achilles’ heel of VMAF is the drop in accuracy when estimating perceptual quality in dark and/or flat scenes.
In CAE we try to use the minimum possible number of bits to achieve a desired minimum level of quality. As a side effect, this leads to very low bitrates in low-complexity, flat scenes. On the other hand, any error in estimating the quantization level or target bitrate in such scenes may cause a significant deterioration of quality; in particular, it may introduce a noticeable amount of “banding” artifacts. Suddenly, a strength of CAE becomes a potential weakness, because a standard CBR encoding could avoid banding in the same situation (albeit with a waste of bitrate).
Therefore, an accurate metric is necessary to cope with that problem. Banding is a plague for 8-bit AVC/HEVC encoding, but it can also appear in 10-bit HEVC video, especially when the energy of the source is low (maybe because of multiple processing passes) and a wrong quantization level can completely eliminate the delicate, higher residual frequencies and cause banding.
If we use a metric like VMAF to tune a CAE algorithm, we need to be careful in such situations and apply “margins”, or re-train VMAF to increase its sensitivity in such problematic cases (there are also other problematic cases, like very grainy noise, but in those I see an underestimation of subjective quality, which is much less problematic to handle).
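To make the idea of “margins” concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the SceneStats fields, the thresholds and the margin size are hypothetical values chosen for the example, not numbers from any of the deployments mentioned in this post.

```python
from dataclasses import dataclass


@dataclass
class SceneStats:
    mean_luma: float    # average luma of the scene on an 8-bit scale (0-255)
    luma_stddev: float  # spatial standard deviation of luma, a crude flatness proxy


def target_vmaf_with_margin(stats: SceneStats,
                            base_target: float = 90.0,
                            dark_luma: float = 50.0,
                            flat_stddev: float = 12.0,
                            margin: float = 4.0) -> float:
    """Raise the VMAF target where the metric tends to overestimate quality.

    All thresholds and the margin are hypothetical and would need tuning
    against subjective data for a real service.
    """
    target = base_target
    if stats.mean_luma < dark_luma:
        target += margin  # dark scene: ask for extra headroom
    if stats.luma_stddev < flat_stddev:
        target += margin  # flat scene: banding risk, ask for more headroom
    return min(target, 100.0)  # VMAF scores are capped at 100
```

A CAE loop would then look for the lowest bitrate whose predicted VMAF reaches this adjusted target instead of the base one, spending a few more bits only on the scenes where the metric is least trustworthy.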
I’m in good company in saying that VMAF might not be the right choice for all scenarios, because even YouTube, at the Big Apple 2019 Conference, pointed out that VMAF often fails to properly recognize the presence of banding.
I could hypothesize that this behavior is due to the way quality was assessed for VMAF; for example, the viewing distance of 2.5xH could reduce sensitivity in those situations. But the problem is still present in VMAF 4K, where the distance is 1.5xH, so maybe it is a weakness of the elementary metrics.
A case in 4K
Let’s analyze a specific case. Recently I conducted a subjective quality test on 4K content, both SDR and HDR/HLG. VMAF 4K is not tuned for HDR, so I’ll limit my considerations to the SDR case. The subjective panel was run to tune a custom quality metric with support for HDR content, which was then used to train an ML-based CAE deployment for 4K SDR/HDR streaming.
The picture below shows a dark scene used in the panel. On the left you have the original source, on the right you have a strongly compressed version.
In Figure 3 you can easily see that the image is badly damaged. It’s full of banding, and motion (obviously not visible here) is also affected, with a “marmalade” artifact. However, VMAF reports an average score of 81.8 out of 100, equivalent to 4 on a 1-to-5 MOS scale, which overestimates the subjective quality.
The panel (60 people in total, 9000+ scores, viewing at 1.5xH from a 50” 4K display, DSIS methodology) reports a MOS of 3.2, which is still high in my opinion, while a small team of Golden Eyes reported a more severe 2.3.
From our study, we find that the variance of the opinion scores increases considerably for this type of artifact, maybe because of differences in individual visual acuity and in cultural background (viewers not trained to recognize specific artifacts). But a Golden Eye immediately recognizes the poor quality, and so a significant percentage of the audience (in our case 58% of the scores were 3 or below) will also consider the quality insufficient, especially given the expectations for 4K.
This is the classic problem of relying on the mean when the variance is high. VMAF also provides a confidence interval, which is useful for making better decisions, but for the example above the “center” of the prediction is still overestimated and sits at least 2 JND away from the Mean Opinion Score (not to mention the Golden Eyes’ score).
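The sketch below, in Python, illustrates both points: summarizing a set of opinion scores by more than the mean, and expressing the gap between a VMAF prediction and a MOS in JND units. It rests on two assumptions that are not part of the study: a linear VMAF-to-MOS mapping (VMAF divided by 20, so that 80 corresponds to 4) and the commonly quoted rule of thumb of about 6 VMAF points per JND. The example scores are invented for illustration and are not the panel data.

```python
from statistics import mean, pstdev


def summarize_opinion_scores(scores, threshold=3):
    """Return (mean, standard deviation, share of scores <= threshold)."""
    share_low = sum(1 for s in scores if s <= threshold) / len(scores)
    return mean(scores), pstdev(scores), share_low


def gap_in_jnd(vmaf_predicted, mos, vmaf_per_mos_point=20.0, vmaf_per_jnd=6.0):
    """Distance between a VMAF prediction and a MOS, expressed in JND.

    Assumes a linear MOS-to-VMAF mapping and a fixed number of VMAF points
    per JND; both are rough rules of thumb, not exact conversions.
    """
    vmaf_equivalent_of_mos = mos * vmaf_per_mos_point
    return (vmaf_predicted - vmaf_equivalent_of_mos) / vmaf_per_jnd


# Invented scores with the same mean as the panel (3.2), for illustration only.
example_scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
mos, spread, share_low = summarize_opinion_scores(example_scores)
print(f"MOS={mos:.1f} stddev={spread:.2f} share<=3: {share_low:.0%}")
print(f"gap: {gap_in_jnd(81.8, 3.2):.1f} JND")
```

Under these assumptions, the 81.8 predicted by VMAF is roughly 3 JND above the panel MOS of 3.2, which is consistent with the “at least 2 JND” stated above.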
Anyway, below we can see the correlation between VMAF 4K and the subjective evaluation for a subset of the SDR sequences. The points below the area delimited by the red lines represent content for which VMAF overestimates the quality. Any decision based on such an estimate may turn out to be wrong and lead to some sort of artefact.
Still a long journey ahead
VMAF is not a perfect tool, at least not yet. However, it has paved the way toward handy estimation of perceptual quality in a variety of scenarios. What we should probably do is consider it for what it is: an important “step” in a still very long journey toward accurate and comprehensive quality estimation.
For now, if VMAF is not accurate in your specific scenario, or if you need a different kind of sensitivity, you can re-train VMAF with other data, change or supplement the elementary metrics, or make your own metric that focuses on specific requirements (maybe less universal but more accurate in your specific scenario). You could also use an ensemble-like approach, mixing various estimators to mitigate their individual weaknesses.
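As an illustration of the ensemble idea, here is a minimal Python sketch that blends several quality estimators and leans toward the most pessimistic one. The estimator names, the weights and the “pessimism” parameter are hypothetical, and all scores are assumed to be on the same 0-100 scale.

```python
from typing import Dict, Optional


def ensemble_quality(estimates: Dict[str, float],
                     weights: Optional[Dict[str, float]] = None,
                     pessimism: float = 0.5) -> float:
    """Blend the weighted mean of the estimators with the lowest estimate.

    pessimism=0 returns the plain weighted mean, pessimism=1 returns the
    minimum, so one estimator that spots an artifact can pull the score down.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    if not 0.0 <= pessimism <= 1.0:
        raise ValueError("pessimism must be between 0 and 1")
    if weights is None:
        weights = {name: 1.0 for name in estimates}  # equal weights by default
    total_weight = sum(weights[name] for name in estimates)
    weighted_mean = sum(estimates[name] * weights[name] for name in estimates) / total_weight
    worst = min(estimates.values())
    return (1.0 - pessimism) * weighted_mean + pessimism * worst


# Hypothetical scores: a generic metric and a banding-aware one that disagree.
score = ensemble_quality({"vmaf": 80.0, "banding_aware": 50.0})
```

The design choice here is deliberately conservative: in a CAE context, overestimating quality costs visible artifacts, while underestimating it only costs some bitrate.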
I see also other open points to address in the future:
– better temporal masking
– a different approach to pooling scores in both the temporal and spatial domains
– extrapolation of quality in different viewing conditions
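To show why temporal pooling matters, the Python sketch below compares three ways of collapsing per-frame scores into a single number: the arithmetic mean, a harmonic mean, and a low percentile. The function name and the example scores are hypothetical, and the +1 offset in the harmonic mean is just a guard against division by zero for scores in the 0-100 range.

```python
import math


def pool_scores(frame_scores, percentile=5.0):
    """Pool per-frame quality scores (0-100) with three different strategies."""
    n = len(frame_scores)
    if n == 0:
        raise ValueError("frame_scores must not be empty")
    arithmetic = sum(frame_scores) / n
    # Harmonic mean penalizes low-scoring frames more than the arithmetic mean.
    harmonic = n / sum(1.0 / (s + 1.0) for s in frame_scores) - 1.0
    # Nearest-rank percentile: the score below which the worst frames fall.
    ordered = sorted(frame_scores)
    rank = max(1, math.ceil(percentile / 100.0 * n))
    low_percentile = ordered[rank - 1]
    return {"mean": arithmetic, "harmonic": harmonic, "low_percentile": low_percentile}


# Hypothetical sequence: mostly good frames plus one badly damaged frame.
pooled = pool_scores([90.0] * 9 + [30.0])
```

With a sequence like this, the arithmetic mean barely registers the damaged frame, while the harmonic mean and the low percentile expose it, which is closer to how a viewer remembers a short but visible drop in quality.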
As a final consideration, I find YouTube’s approach very interesting. They are using no-reference metrics to estimate the quality of source and encoded videos. No-reference metrics do not measure the perceptual degradation between a source and its compressed version; they are designed to estimate the “absolute” quality of the compressed video alone, without access to the source.
I think they are interesting not only for estimating quality when the source is not accessible (or is costly to retrieve and use), as in the monitoring of existing live services, but also as an internal metric for CAE algorithms.
In fact, modern encoding pipelines often try to trade fidelity to the source for “perceptual pleasantness” if this can save bandwidth. Using a no-reference metric instead of a full-reference metric could reinforce this behaviour, similarly to what happened in super resolution when moving from a more traditional cost function in DNN training to an “adversarial-style” cost function in GANs.
But this is another story…