HYPER: a decade of challenges and achievements.

Those who have followed me since the years of Flash Player and Adobe Media Server know that I’ve spent the last 15 years developing encoders, players and, more generally, software architectures to enable, enhance and optimize video streaming at scale. I’ve achieved many professional successes working on innovative projects for companies like NTT Data, Sky, Intel Media, VEVO and many others. In those contexts I’ve had the opportunity to meet inspiring people: managers, engineers, colleagues and ultimately friends who helped me grow as an engineer and as a video streaming architect.

In particular, in 2021 I celebrate the 10th anniversary of Hyper, one of those achievements. But let’s start from the beginning:

Conception

In 2008 I started collaborating with Value Team, a leading system integrator in Italy (later acquired by the global innovator NTT Data). The BBC’s iPlayer had just been released and media clients started asking for something similar, so NTT Data contacted me to design a high-performance platform (encoder and player) for the nascent market of catch-up TV and OTT services. The product, VTenc, powered the launch in 2009 of the first catch-up TV in Italy (La7.tv, owned by Telecom Italia).

The most innovative feature of VTenc was the ability to encode a single video in parallel, splitting it into segments that were then distributed across a computing grid. The idea emerged after a discussion with Antony Rose and his team (the creators of BBC’s iPlayer), who underlined that one of the main problems in encoding for a catch-up TV was the long processing time, which delayed the distribution of the encoded stream after the conclusion of the show on TV.

A few months and many technical challenges later, the feature was ready, and the idea of parallel encoding was successfully applied to La7.tv: we received the live program in “parts”, emitted every time there was an advertisement slot. Each part, usually 20-25 minutes long, was divided into smaller chunks, encoded in parallel and then reassembled and packaged with a map-reduce-style paradigm. Also thanks to a client-side playlist, the final result was ready for streaming just 10 minutes after the end of the live show.
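To make the map-reduce idea more concrete, here is a minimal sketch of a chunked parallel encode. It is not the VTenc implementation: the chunk length, the ffmpeg commands and the process pool are my own illustrative assumptions (a real pipeline must split on key frames or scene cuts and handle audio separately to allow a clean reassembly).

```python
# Sketch only: "map" step encodes fixed-length chunks in parallel,
# "reduce" step concatenates the encoded chunks back into one asset.
import subprocess
from concurrent.futures import ProcessPoolExecutor

CHUNK_SECONDS = 60  # hypothetical chunk length

def encode_chunk(args):
    src, start, out = args
    # Encode only [start, start + CHUNK_SECONDS) of the source.
    subprocess.run([
        "ffmpeg", "-y", "-ss", str(start), "-t", str(CHUNK_SECONDS), "-i", src,
        "-c:v", "libx264", "-b:v", "3M", "-an", out
    ], check=True)
    return out

def parallel_encode(src: str, duration_s: int, workers: int = 8) -> str:
    jobs = [(src, start, f"chunk_{i:04d}.mp4")
            for i, start in enumerate(range(0, duration_s, CHUNK_SECONDS))]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        encoded = list(pool.map(encode_chunk, jobs))
    # Reassemble the encoded chunks (concat demuxer, no re-encoding).
    with open("chunks.txt", "w") as f:
        f.writelines(f"file '{name}'\n" for name in encoded)
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-i", "chunks.txt",
                    "-c", "copy", "final.mp4"], check=True)
    return "final.mp4"
```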

An incredible result for that time, because commercial encoders required many hours for an accurate 2-pass encoding of assets 2-3 hours long. It was also one of the very first implementations of adaptive bitrate streaming in Flash (a custom implementation in Flash Player 9 + AMS, while Adobe officially introduced adaptive bitrate only in Flash Player 10).

The birth of Hyper

VTenc was improved over the following years until, in 2011, NTT Data signed a deal with Sky Italy to provide the encoder for the broadcaster’s new, incoming OTT services.
We implemented new features, and they chose VTenc for a variety of key points:

– flexible queue management system
– rapid customizability
– high video quality
– high density and scalability
– short time to market.

VTenc evolved into something more complex: NTT Data Hyper was born. The model was different from buying a commercial encoder, which usually had long evolution cycles and well-defined but inflexible roadmaps. Hyper has been something closer to a focused, tailor-made and optimized encoding engine like those built by Netflix or Amazon, and often also a sandbox in which to conceive and test new technologies and ideas.

In 2021 we celebrate the first 10 years of Hyper.

A decade of innovative achievements and milestones

Since then we have reached many achievements and milestones while facing the challenges of the last decade. Looking back, it has been an exciting journey, professionally intense and enriching. Some achievements deserve to be mentioned:

– A long history of content-aware encoding approaches, from the first empirical versions (2013) to the use of ML to implement peculiar “targetMOS” and “targetDevice” encoding modes (2017+). I’ve always been a fan of “contextual optimizations”: I started my experiments in the years of Flash Video, presenting some initial ideas at Adobe Max 2009-11, and I’ve since had the opportunity to implement those paradigms in the industry in various “flavors”.

– An internal caching logic for elementary streams that allows quick repurposing of previously encoded assets without executing a new, expensive encoding. With this logic we have been able, many times, to repurpose entire libraries (tens of thousands of assets) in a matter of days. In this way, adding a new audio format, changing idents, parental controls or other elements in the content playlist, or adding a new packaging format (e.g. a new version of HLS) has always been quick and inexpensive.

– In 2016 Hyper evolved from a grid computing to a hybrid cloud paradigm, with on-prem resources coping with the baseline workload and cloud resources satisfying the peaks. Having designed the software from the beginning around agnostic services and flexible work queues, the hybridization was a natural step. Resources can be partitioned to obtain maximum throughput and cost efficiency on some queues and minimum time-to-output on others.

– In this last context my team designed a 2-step technique to generate on the fly a smart mezzanine with controlled perceptual quality, so that very large, high-quality sources can be moved to the cloud quickly and conveniently for parallelized encoding (with a bandwidth reduction of up to an order of magnitude).

– Full cloud deployments on AWS and GCloud that quickly and dynamically scale from just 2 on-demand instances to thousands of spot instances (“elastic texture”), with optimized scaling logic that minimizes infrastructure costs and provides higher reactivity than standard scaling systems like autoscaling groups (see the sketch below).
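As a purely illustrative toy (not Hyper’s actual logic; the constants and the scale-down factor are assumptions), this is the kind of queue-driven scaling rule the previous point alludes to: scale up aggressively on a backlog, keep a small on-demand baseline, and shrink spot capacity gradually to avoid thrashing.

```python
# Toy scaling rule driven by job-queue depth.
BASELINE_INSTANCES = 2     # always-on on-demand instances
JOBS_PER_INSTANCE = 4      # assumed throughput per instance
MAX_INSTANCES = 2000       # upper bound for spot capacity

def desired_instances(queued_jobs: int, running_instances: int) -> int:
    target = BASELINE_INSTANCES + (queued_jobs + JOBS_PER_INSTANCE - 1) // JOBS_PER_INSTANCE
    target = min(target, MAX_INSTANCES)
    if target > running_instances:
        return target  # ramp up immediately when a backlog appears
    # Ramp down gently (10% per cycle) so short queue oscillations don't cause churn.
    return max(target, int(running_instances * 0.9), BASELINE_INSTANCES)

print(desired_instances(queued_jobs=500, running_instances=10))  # aggressive ramp-up
print(desired_instances(queued_jobs=0, running_instances=100))   # gradual ramp-down
```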

Now that Hyper has turned 10 and, after several million encoding jobs, in 2021 we are going to finalize Hyper v2 and tackle new challenges (VVC, AV1, a complete refactoring, perceptual-aware delivery, an agnostic architecture to apply massively parallelized processing to other contexts). But that’s material for an entirely new story… for now, let’s celebrate:

happy 10th birthday Hyper!  


Defeat Banding – Part I

In my last post I discussed what I consider to be the current arch-enemy of video encoding: “banding”.

Banding can be the consequence of quantization in various scenarios today, particularly when the source is a gradient or a low-power textured area and your CAE (Content-Aware Encoding) algorithm uses an excessive QP.

Banding is more frequent in 8-bit encoding, but it is possible also in 10-bit encoding, and it is also frequent in high-quality source files, or mezzanines, when they have been subject to many encoding generations.

Modern block-based codecs are all prone to banding. Indeed, I find H.265, VP9 and AV1 to be even more prone to banding than H.264 because of their wider block transforms (and that has contributed to an increase of banding in YouTube and Netflix videos in recent times).

As discussed in the previous post, it is easy to incur banding also because it is subtle and not easy to measure. Metrics like PSNR and SSIM, but even VMAF, are not sensitive to banding, even though it is easy for an average viewer to spot it, at least in optimal viewing conditions.

This is an example of banding:

Picture with banding on the wall

The background shows a considerable amount of banding, especially in motion, when the “edges” of the bands move coherently and form a perceptually significant and annoying pattern. Below, the picture is shown with boosted gamma to better reveal the banding.

Zoomed picture with boosted gamma

Seek to prevent

To prevent banding, it is first of all necessary to be able to identify it. This by itself is a complex problem.
Recently I’ve tried to find a way (there are many different approaches) to estimate the likelihood of having perceptually significant banding in a specific portion of a video.

I’m using an auto-correlation approach that is giving interesting preliminary results. This “banding metric” analyzes only the final picture, without reference to source files (which, in the case of mezzanines or sources, you obviously do not have anyway).

For example, here we have a short video sequence. When you watch it in optimal viewing conditions, you can spot some banding in flat areas. The content is quite dark (maybe you can spot someone familiar in the background 😉 ) so, as usual, in what follows I’ll mostly show frames with boosted gamma.

The algorithm produces the following frame-by-frame report, where an index of banding is expressed for each quadrant of the picture (Q1 = top-left quadrant, Q2 = top-right quadrant, Q3 = bottom-left quadrant, Q4 = bottom-right quadrant).
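To give a feel for the kind of per-quadrant index described here, below is a minimal, toy sketch loosely inspired by the auto-correlation idea. It is not my production metric: the sub-sampling, normalization and everything else are illustrative assumptions; only the “values close to 1 flag smooth, banding-prone areas” intuition is the point.

```python
# Toy per-quadrant "banding likeliness" index based on lag-1 autocorrelation of luma rows.
import numpy as np

def quadrant_banding_index(luma: np.ndarray) -> float:
    """Return a 0..1 index: smooth, step-like areas (banding-prone) score close to 1."""
    rows = luma.astype(np.float64)
    rows = rows - rows.mean(axis=1, keepdims=True)   # remove per-row brightness
    num, den = 0.0, 0.0
    for r in rows[::8]:                              # subsample rows for speed
        var = np.dot(r, r)
        if var < 1e-6:                               # perfectly flat row: skip
            continue
        num += np.dot(r[:-1], r[1:])                 # lag-1 autocorrelation
        den += var
    return num / den if den > 0 else 0.0

def frame_report(luma: np.ndarray) -> dict:
    h, w = luma.shape
    quads = {
        "Q1": luma[: h // 2, : w // 2],   # top-left
        "Q2": luma[: h // 2, w // 2 :],   # top-right
        "Q3": luma[h // 2 :, : w // 2],   # bottom-left
        "Q4": luma[h // 2 :, w // 2 :],   # bottom-right
    }
    return {name: round(quadrant_banding_index(q), 4) for name, q in quads.items()}
```

Feeding the decoded Y plane of each frame to frame_report would yield one index per quadrant per frame, which is the shape of the report discussed below.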

Below you can see Frame 1 with boosted gamma. From the graph above, we see that the quadrant with the highest banding likelihood is Q2. For the moment I have not yet calculated the most appropriate threshold for perceptually visible banding, but empirically it is near 0.98 (horizontal red line). So in this frame the likelihood of banding is low, with only a minor probability for Q2.

FRAME 1

In the frame below we have an increasing amount of banding, especially in Q1 but also in Q2 (on the tree and sky). The graph above shows an increasing probability of perceptually visible banding in quadrants Q1 and Q2, and in fact they are above the threshold, while Q3 and Q4 are below it.

FRAME 173

Then there’s a scene change, and for the new scene the graph reports a high probability of banding for quadrants Q1 and Q3 (click on the image below to zoom), an oscillating behaviour for Q2 (the hands are moving and the dark parts exhibit banding in some parts of the scene), while the Q4 quadrant is completely free of banding.

FRAME 225

Preliminary conclusions

As discussed, it’s very important to start from the identification and measurement of banding, because if you can find it, you can correct encoding algorithms to better retain detail and avoid introducing this annoying artifact. It’s also useful to analyze sources and reject them when banding is found, otherwise any subsequent encoding will only worsen the problem. The journey to defeat banding is only at the beginning… wish me good luck 😉

Thoughts around VMAF, ContentAwareEncoding and no-ref metrics

 

Introduction

Content-Aware Encoding (CAE) and Context-Aware Delivery (CAD) represent the state of the art in video streaming today, independently of the codec used. The industry has taken its time to metabolize these concepts, but now they are definitely mainstream:

Every piece of content is different and needs to be encoded differently. Viewing contexts are different and need to be served differently. Optimizing a streaming service requires CAE and CAD strategies.

I’ve discussed these logics and the need for CAE and CAD strategies several times, and I’ve implemented different optimizations for my clients over the years.

Speaking about Content-Aware Encoding: at the beginning we used empirical rules to determine a relationship between the characteristics of the source (possibly classified) and the encoding parameterization, in order to achieve a satisfying level of quality at the minimum possible bitrate. The “quality metric” used to tune the algorithms was usually the compressionist’s perception (or that of a small team of Golden Eyes) or, more rarely, a full-featured panel for subjective quality assessment. Following a classical optimization approach (read some thoughts here), we subdivided a complex “domain” like video streaming into subdomains, recursively trying to optimize them individually (and then jointly, if possible) and using human perception tests to guide the decisions.

More recently, the introduction of metrics with a high correlation to human perception, like VMAF, has helped greatly in designing more accurate CAE models as well as in verifying the actual quality delivered to clients. But are all problems solved? Can we now completely replace the expert’s eye and subjective tests with inexpensive and fast objective metrics correlated to human perception? The answer is not simple. In my experience: yes and no. It depends on many factors, and one of them is accuracy…

 

A matter of accuracy

In my career I’ve had the fortune and privilege to work with open-minded managers, executives and partners who dared to exit their comfort zone to promote experiments, trials and bold ideas for the sake of quality, optimization and innovation. So in the last decade I’ve had the opportunity to work on a number of innovative and stimulating projects:
various CAE deployments, studies on human perception to tune video encoding optimizations and filtering, the definition of metrics similar to VMAF to train ML algorithms in the most advanced CAE implementations, and many others. In the rest of this post I’d like to discuss some problems encountered in this never-ending pursuit of an optimal encoding pipeline.

When VMAF was released, back in 2016, I was intrigued and excited to use it to improve an existing CAE deployment for one of my main clients. If you can substitute an expensive and time-consuming subjective panel with a scalable video quality tool, you can multiply the experiments around encoding optimization, video processing, new codecs or other creative ideas about video streaming. A repeatable quality measurement is also useful to “sell” a new idea, because you can demonstrate the benefits it can produce (especially if the metric is developed by Netflix, which brings immediate credit).

However, since the beginning VMAF has shown some sub-optimal behaviors in my experiments, at least in some scenarios. In particular, what I can even now recognize as VMAF’s Achilles’ heel is the drop in accuracy when estimating perceptual quality in dark and/or flat scenes.

In CAE we try to use the minimum possible number of bits to achieve a desired minimum level of quality. This naturally leads to very low bitrates in low-complexity, flat scenes. On the other hand, any error in estimating the level of quantization, or the target bitrate, in such scenes may produce an important deterioration of quality; in particular, it may introduce a significant amount of “banding” artifact. Suddenly, a point of strength of CAE becomes a potential point of weakness, because a standard CBR encoding could avoid banding in the same situation (albeit with a waste of bitrate).

Therefore, an accurate metric is necessary to cope with that problem. Banding is a plague for 8-bit AVC/HEVC encoding, but it can appear also in 10-bit HEVC video, especially when the energy of the source is low (perhaps because of multiple processing generations) and a wrong quantization level can completely eliminate the higher, delicate residual frequencies and cause banding.

If we use a metric like VMAF to tune a CAE algorithm, we need to be careful in such situations and apply “margins”, or re-train VMAF to increase its sensitivity in these problematic cases (there are other problematic cases too, like very grainy noise, but in those I see an underestimation of subjective quality, which is much less problematic to handle).

I’m in good company in saying that VMAF might not be the right choice for all scenarios, because even YouTube, at the Big Apple 2019 conference, pointed out that VMAF is often not able to properly recognize the presence of banding.

Figure 1. VMAF overestimates quality on dark, flat scenes

I could hypothesize that this behavior is due to the way quality was assessed when training VMAF; for example, the viewing distance of 2.5×H could reduce sensitivity in those situations. But the problem is still present in VMAF 4K, where the distance is 1.5×H, so maybe it is a weakness of the elementary metrics.

 

A case in 4K

Let’s analyze a specific case. Recently I conducted a subjective quality test on 4K content, both SDR and HDR/HLG. VMAF 4K is not tuned for HDR, so I’ll limit my considerations to the SDR case. The subjective panel was performed to tune a custom quality metric with support for HDR content, which was then used to train an ML-based CAE deployment for 4K SDR/HDR streaming.

The picture below shows a dark scene used in the panel. On the left you have the original source; on the right you have a strongly compressed version (click on the picture to enlarge).

Figure 2. Source (left) vs Compressed (right). Click to Enlarge
Figure 3. Boosted gamma to show artifacts on the encoded version. Click to Enlarge

In Figure 3 you can easily see that the image is very damaged. It’s full of banding, and motion (obviously not visible here) is affected as well, with a “marmalade” artifact. However, VMAF reports an average score of 81.8 out of 100, equivalent to 4 on a 1-to-5 MOS scale, which overestimates the subjective quality.

The panel (60 people overall, 9000+ scores, 1.5×H viewing distance from a 50” 4K display, DSIS methodology) reports a MOS of 3.2, which is still high in my opinion, while a small team of Golden Eyes reported a more severe 2.3.

From our study, we found that the variance of the opinion scores for this type of artifact increases considerably, maybe because of differences in individual visual acuity and cultural aspects (viewers not trained to recognize specific artifacts). But a Golden Eye immediately recognizes the poor quality, and a significant percentage of the audience (in our case 58% of the scores were 3 or below) will consider the quality insufficient, especially given the expectations of 4K.

This is the classic problem of relying on the mean when the variance is high. VMAF also provides a confidence interval, which is useful to take better decisions, but the prediction still has an overestimated “center” for the example above, at least 2 JNDs away from the Mean Opinion Score (not to mention the Golden Eyes’ score).

Anyway, below we can see the correlation between VMAF 4K and the subjective evaluation on a subset of the SDR sequences. The points below the area delimited by the red lines represent content for which the quality predicted by VMAF is overestimated. Any decision taken using such an estimate may be wrong and lead to some sort of artifact.

Figure 4. MOS vs VMAF 4K

 

Still a long journey ahead

VMAF is not a perfect tool, at least not yet. However, it has paved the way toward handy estimation of perceptual quality in a variety of scenarios. What we should probably do is consider it for what it is: an important step in a still very long journey toward accurate and all-encompassing quality estimation.

For now, if VMAF is not accurate in your specific scenario, or if you need a different kind of sensitivity, you can re-train VMAF with other data, change or integrate the elementary metrics, or build your own metric focused on specific requirements (maybe less universal but more accurate in your specific scenario). You could also use an ensemble-like approach, mixing various estimators to mitigate their individual weaknesses.
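A minimal sketch of that ensemble-like idea follows. The metric names, thresholds and weights are purely illustrative assumptions; the only point is that a full-reference score (here VMAF) can be blended with a no-reference indicator (like the banding index sketched earlier) so that the weakness of one is compensated by the other.

```python
# Toy ensemble: penalize the VMAF score on dark frames flagged as banding-prone.
from dataclasses import dataclass

@dataclass
class FrameScores:
    vmaf: float           # 0..100, full-reference
    banding_index: float  # 0..1, no-reference index (e.g. the sketch above)
    avg_luma: float       # 0..255, used to detect dark scenes

def ensemble_quality(s: FrameScores) -> float:
    score = s.vmaf
    if s.avg_luma < 60 and s.banding_index > 0.98:   # dark + likely banding
        # Penalty grows with how far the banding index exceeds the threshold.
        score -= min(30.0, 1000.0 * (s.banding_index - 0.98))
    return max(0.0, score)

# Example: a dark frame scored 82 by VMAF but flagged as banding-prone.
print(ensemble_quality(FrameScores(vmaf=82.0, banding_index=0.995, avg_luma=35.0)))
```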

I also see other open points to address in the future:
– better temporal masking
– different approaches to pooling scores in both the time and spatial domains
– extrapolation of quality to different viewing conditions

As a final consideration, I find YouTube’s approach very interesting. They are using no-reference metrics to estimate the quality of source and encoded videos. No-reference metrics are not meant to measure the perceptual degradation between a source/compressed pair of videos; they are designed to estimate the “absolute” quality of the compressed video alone, without access to the source.

I think they are not only interesting for estimating quality when the source is not accessible (or is costly to retrieve and use), as in the monitoring of existing live services, but they will also be useful as internal metrics for CAE algorithms.

In fact, modern encoding pipelines often try to trade fidelity to the source for “perceptual pleasantness” if this can save bandwidth. Using a no-reference metric instead of a full-reference metric could reinforce this behaviour, similarly to what happened in super resolution when moving from a traditional cost function in DNN training to an “adversarial-style” cost function in GANs.

But this is another story…

“Time Machine” – my talk at Demuxed 2018

I’ve just returned from a wonderful experience at Demuxed 2018.

I have had the honor to participate as a speaker alongside professionals from Twitter, Netflix, YouTube, Twitch, Comcast, Intel, Mux, Bitmovin, Akamai… and in general, the experience as both attendee and speaker has been amazing.

The event was streamed live on Twitch, but today the individual VoD recordings have also been released (sessions list), including mine.

My session is:

“Time Machine” – how to reconstruct perceptually, during playback, part of the detail lost in encoding.


In recent years, I’ve focused my efforts on the “joint” optimization of various elements of the streaming pipeline. Evolving from an intra-domain to an inter-domain optimization approach, it is possible to squeeze out much more efficiency.

I’ve worked on joint optimizations of encoders and players, for example, sometimes throwing protocol “augmentations” into the mix as well. If the player knows how the encoder is optimized, it’s possible to develop improved heuristics, and vice versa, with a synergic effect. I’ve already discussed this trend a bit in this previous post.

In this scenario, at Demuxed I discussed another unusual possibility for joint optimization:

Reconstruct perceptually part of the detail lost in encoding, using in the player a GPU-based reconstruction model that relies on information extracted by the encoder, or on ML, to estimate the best parameters.

It’s an old idea I’ve been insisting on for years as a way to ultra-optimize the streaming pipeline, with different tunings for high-quality and high-efficiency cases (e.g. mobile).
I proposed a proof of concept based on Flash in a 2010 trilogy of posts and also spoke about it at Adobe Max 2010 in Los Angeles.

After the decline of Flash, I waited for WebGL to become more widely available in browsers and devices to let the idea evolve. Now WebGL is very powerful, and filtering even high-resolution content with complex pixel shaders is not a problem.

I’ll elaborate more on the logic in a future post. For now, take a look at the recorded talk and/or at the PDF presentation: Presentation-Demuxed2018-FabioSonnati.

I’ve been very satisfied with the level and quality of the feedback on the topic, and in general Demuxed has been a wonderful occasion to meet and chat with high-level professionals of the streaming business.

Artificial Intelligence in video encoding optimization


Without doubt, A.I. is the buzzword of the moment. We find it used everywhere, ranging from image classification/recognition to language translation, from sentiment analysis to market prediction, not to mention autonomous driving, fitness bands, the latest CPUs/GPUs, smartphones and so on. A.I. prophets promise a new era of “intelligent” computing that will disrupt the way we live and use technology.

Fig 1. Google Trends for “Artificial Intelligence”

Is it all that glorious? Even if all that glitters is not gold and many of the expectations are over-inflated, I think that A.I. (or, more correctly for most applications, Machine Learning) is already truly capable of empowering engineers with new tools and ways to solve problems, make accurate predictions and design complex systems.

As such, why don’t we apply it also to the field of encoding and streaming optimization?

but let’s start from the beginning…

 

Artificial Intelligence vs Machine Learning

 

From now on I’ll speak about Machine Learning, not Artificial Intelligence. AI is more a marketing slogan than an accurate term for current achievements (read this maybe oversimplified yet effective comparison). In fact, many of the applications often branded as AI-driven are simply based on ML algorithms.
Not to mention that, now that AI is at its peak of inflated expectations in Gartner’s hype cycle, a lot of more traditional technologies are conveniently rebranded with the new, bold term just to ride the wave.

ML is not new, indeed. It is rooted in the late ’50s and ’60s, when scientists started to study algorithms that can “learn” from data and make predictions based on it: algorithms capable of modeling complex systems from sample inputs and making data-driven predictions or classifications without explicit modeling by engineers.

ML is based on, or is adjacent to, other well-known disciplines like computational statistics, mathematical optimization, operations research and linear programming; all popular university courses in not-so-ancient times.

ML has been widely and successfully used in industry for years. Every time you use your credit card, an ML-based algorithm estimates the probability of fraud thanks to classification algorithms trained on a huge number of transactions (someone said Big Data?). Digit recognition, OCR, speech recognition and spam detection are other consolidated applications. More recently, you find ML-based algorithms in fitness bands to recognize/classify the user’s activity. Netflix has created a famous recommendation engine using ML. Google uses ML extensively for speech recognition, search ranking, form completion and translation. Apple uses it for Siri, among other things, and any image classification application is based on deep learning and CNNs, which are at the cutting edge of ML.

So it’s true that ML is powerful, but it’s nothing exotic. It is essentially a discipline that provides algorithms, methods and best practices that help engineers create complex models without necessarily analyzing the underlying phenomena.

Indeed, modeling is something engineers already do often in their daily work. But sometimes analyzing and modeling a complex phenomenon is not easy at all. I have already talked about optimization approaches and complex modeling in this post. In the end, instead of studying a complex system by inferring the rules of its subsystems (a classic way to proceed), ML provides engineers with a set of tools to create much more accurate models starting from a large number of observations and data.

There are many algorithms, techniques, procedures and approaches in ML. A broad distinction is made between supervised learning, unsupervised learning and reinforcement learning. Within supervised ML we can mention algorithms like linear regression/classification, Support Vector Machines, Random Forests, Decision Trees, Ensemble Methods, Gradient Boosting, AdaBoost and so on, and then continue with the Neural Network family: deep learning, convolutional NNs, recurrent NNs, LSTM RNNs, etc…

Wow, it’s a wide and complex landscape, where the choice of algorithms, the fitting and the optimization of the entire system are not simple or immediate.

There are important points to consider:

1. ML is a tool-set, but it is then up to the engineers to use it in a creative and efficient manner. ML doesn’t work by itself!

2. Many ML algorithms behave like a black box, and it is not easy to extract knowledge of the underlying phenomena from that black box. Sometimes a simpler algorithm is preferable to a more complex (and more efficient) one when you want to better understand the system under study.

3. Overfitting is everywhere! It’s the worst enemy and requires a lot of attention, especially to avoid creating models that in reality perform worse than empirical approximations.

 

Machine Learning as a tool to optimize video encoding 

 

In this post I compared optimization to function approximation/estimation. It’s easy to see the parallel between function approximation and ML-based regression techniques. Using ML it is possible to create a model that “predicts” with good accuracy the behavior of a system for unknown inputs, using only a number of known sample points to train/fit a chosen ML algorithm and minimize the associated cost function.

A mix of ML algorithms can be very useful every time you have to “optimize” something.
Minimizing a cost function means, in fact, optimizing, and we have already said that ML is based on mathematical optimization, operations research and linear programming, disciplines strictly correlated with the concept of “optimization”.

So even video encoding is a fertile field for ML-driven optimization. In video encoding we have many independent variables (metrics that describe the features of the video, resolution, target quality, etc…), and the final objective could be (but is not limited to) minimizing the bitrate needed for a given quality by choosing the right encoding parameterizations.

In recent years YouTube and Netflix have used ML to optimize specific objectives in video encoding. In the case of YouTube, they have used NNs to predict the quantization levels that produce the desired target bitrate, so as to obtain the performance of a dual-pass encoding in a single pass. This is an example of optimization of the quality/speed ratio, because in YouTube’s scenario the huge amount of input video makes encoding costly, and this approach tries to reduce that cost.

Netflix has instead used ML (an SVM in this specific case) to fuse the outputs of elementary objective metrics into a single, reliable subjective quality estimate (the VMAF metric, Video Multi-Method Assessment Fusion). VMAF has then been used as an enabling technology for other optimization processes.

 

Content-Aware to the next level: Perception-based encoding

 

In the last year, I’ve been involved in an extensive, on-going NTT Data project that uses ML to optimize encoding. The objective of the project is to take Content-Aware encoding to the next level and be able to encode with a target perceptual quality on screens of different sizes. I already introduced this as an emerging trend in a previous blog post.

Instead of specifying a resolution and bitrate, as in a traditional encoding, now we can specify only the target perceptual quality (e.g. a MOS rating from 1 to 5) and the maximum size of the screen on which the video will be watched. The ML-driven algorithm determines the encoding parameterization for each scene of the video to achieve the desired perceptual quality when watched on that target screen size. A high-complexity scene will require a higher average bitrate, while a low-complexity scene will require a lower bitrate. But the actual values and many parameters will depend on the input content metrics, the target MOS and the target screen size.
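To make the idea tangible, here is a minimal sketch of the kind of regression involved. It is not the actual pipeline described above: feature names, the model choice and the tiny training set are hypothetical, and a real system learns from thousands of subjective ratings and predicts many parameters, not just a bitrate.

```python
# Toy regressor: (scene metrics, screen size, target MOS) -> bitrate in kbps.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training rows: spatial_cx, temporal_cx, avg_luma, screen_inches, target_mos
X_train = np.array([
    [0.2, 0.1, 120, 6,  4.2],
    [0.8, 0.7,  90, 6,  4.2],
    [0.2, 0.1, 120, 55, 4.2],
    [0.8, 0.7,  90, 55, 4.2],
])
# Bitrates (kbps) that reached the target MOS in a (hypothetical) subjective panel.
y_train = np.array([250, 600, 1400, 3200])

model = GradientBoostingRegressor().fit(X_train, y_train)

def predict_scene_bitrate(spatial, temporal, avg_luma, screen_inches, target_mos):
    features = np.array([[spatial, temporal, avg_luma, screen_inches, target_mos]])
    return float(model.predict(features)[0])

# E.g. a mid-complexity scene targeted at MOS 4.2 on a tablet-sized screen:
print(predict_scene_bitrate(0.5, 0.4, 100, 10, 4.2))
```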

Such a level of optimization provides a way to minimize bandwidth consumption, using only the amount of bits necessary to achieve the desired level of quality across different screen sizes. At the same time, using advanced player heuristics it is possible to exploit the resulting VBR output to also increase the QoE during streaming, delivering on average a higher quality compared to traditional types of encoding (e.g. CBR or capped VBR with a target average bitrate).

The project has required a massive subjective quality assessment campaign performed on screens of various sizes. More than 14,000 quality ratings related to human perception have been analyzed, enriched and used to train an ensemble of ML algorithms. A variable set of elementary metrics (from 4 to 12) is used at different points of the project to characterize sources, encoded videos and codec performance, and to form the input feature vector for the predictors.

The first working version of this system is going to be used by an important broadcaster in Europe, and the results are very promising. For example, thanks to training with perceptual ratings collected selectively on TVs/tablets/smartphones, the average bitrate of a typical TV series like Game of Thrones with a target MOS of ~4.2 (good on a 1-5 scale) is just 350 Kbps on smartphones, 900 Kbps on tablets and 2.1 Mbps on TVs, down -64%, -50% and -30% respectively from the bitrates of the previous static profile.

 

Conclusions

 

ML is really a precious ally when developing optimizations in a wide range of scenarios. Previously I used empirical approximations that worked well, but in a sub-optimal way. Now ML allows a better fit, even if it may require a considerable amount of data to work properly.

The next steps are to increase the accuracy and performance of the pipeline, but I’m also exploring the use of ML on the player side of the equation, to further optimize ABR heuristics and player logic.

 

 

Video Streaming Optimization Trends

Those who follow my blog know that I specialize in the optimization of video streaming. It’s creative and challenging work, because efficiency in streaming is not only a matter of choosing the best codecs and protocols (you will probably have a very limited set to choose from anyway); it is more important to have an open mind, a genuine passion for research and devotion to quality.

In short: the tool you use is less important than the optimized methods, expertise and vision that guide that tool.

With a bit of research and original approaches, it is possible to achieve greater benefits from optimizing existing codecs than from adopting new ones, especially when you optimize codecs, streaming protocols and playback strategies synergistically.

This is even more true today, because we are experiencing a kind of “stall” in the adoption of HEVC, the current state-of-the-art codec, because of uncertainties linked to the ambiguity of multiple patent pools and licensing costs. If HEVC continues to be hampered, the alternative will be AV1, but it is still under development and will require many years to spread across a fragmented market. I’m sure this scenario will change, sooner or later, but that’s not an alibi to wait and continue to deliver unoptimized H.264 (or VP9) video, also because the benchmark of the market, Netflix, is anything but motionless!

A further benefit of this kind of synergic optimization is that it can be applied in large part to different codecs, so when a new, efficient codec becomes available, the same techniques can be adapted to the newcomer, and maybe new specific strategies will be invented.

Synergic Optimizations

I’ve already spoken about “adaptive encoding” logic in this blog post. Here I only want to summarize interesting trending techniques to optimize video streaming services that are gaining traction, that I’ve applied in recent years and that I keep trying to improve in my consultancies.

Complexity-Aware Encoding

It has many names. Netflix calls it “Per-Title Encoding”, others call it “Adaptive Encoding” or “Content-Aware Encoding”. I’ve discussed it extensively here. This approach to optimization has gained traction after Netflix’s article about Per-Title Encoding, but it’s a technique yet to be fully exploited.

Streaming services are quite different from each other, and there are various ways to set up a complexity-aware encoding pipeline. There are simple ways to estimate complexity, but also more sophisticated metrics that take into account multiple variables and models of the HVS (Human Visual System). With such a refined approach it is possible to control the level of quality delivered by the encoder and therefore optimize the encoding for a specific purpose.

Such an approach can be implemented inside a codec (in-loop optimization), or applied externally, performing accurate analysis ahead of the final encoding (usable only in VOD encoding).

The most complex implementation is probably the one that uses Machine Learning to predict the optimal parameterization to achieve the desired result (I’ve worked on this for a client over the last 6 months; more on this in coming posts…).

Complexity-Aware Delivery

This is a variation of the former. The logic is essentially the same, but it is not applied to the encoding (a “standard” encoder can be used); it is applied at the streaming protocol level. The manifests can be manipulated to obtain specific behaviors according to an analysis of the “cross” qualities you have across the entire ABR set.

For example, if a segment in an HLS set has too high a quality (measurable with traditional objective metrics or with ML-guided metrics), it is possible (but not exactly easy) to manipulate the manifest so as to alter how the player navigates across the renditions.

This approach requires an accurate setup of the encoder to produce the desired range of qualities, and the delivery is then optimized at the protocol/manifest level. Not as straightforward as the former, but still interesting.
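As a toy illustration of the manifest-level idea (my own simplification, not a technique from the post: it works per variant rather than per segment, and the quality scores are assumed to come from a separate analysis step), a master playlist could be pruned so the player is steered away from renditions whose measured quality already exceeds the target.

```python
# Sketch: drop over-quality variants from an HLS master playlist.
def prune_master_playlist(master_m3u8: str, variant_quality: dict, max_quality: float) -> str:
    """variant_quality maps a variant URI to its measured quality score (e.g. VMAF)."""
    out, lines = [], master_m3u8.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("#EXT-X-STREAM-INF"):
            uri = lines[i + 1] if i + 1 < len(lines) else ""
            if variant_quality.get(uri, 0) > max_quality:
                i += 2          # skip both the tag and the variant URI
                continue
        out.append(line)
        i += 1
    return "\n".join(out)
```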

Perception-Aware Streaming

I’m not sure, but maybe I’ve coined a neologism: Perception-Aware Streaming. This refined optimization technique is something I’ve played with for a while. The “streaming” in the name indicates that the technique involves both encoding and delivery. “Perception-aware” indicates that encoding and delivery are performed taking into account perceptual phenomena, and in particular the angular resolution of the HVS.


Again, I’ve already introduced the concept in a previous post. Essentially, with this technique we create a super-set of renditions: a sub-set calibrated for big screens, another sub-set for tablets/laptops, and a final sub-set calibrated for small-screen devices.

Leveraging a simplified model of the HVS, its angular resolution and a known minimum distance from the screen, it is possible to conceal artifacts and provide a higher sense of detail, while at the same time reducing the average bitrate, especially on smaller screens. We can mix this with a variation of complexity-aware encoding to obtain a highly optimized encoding pipeline.
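To make the angular-resolution argument concrete, here is a small worked example (my own numbers and device geometries, not taken from any deployment): it estimates how many encoded pixels fall within one degree of visual angle. Where this value approaches or exceeds the roughly 60 px/degree usually associated with 20/20 acuity (about 1 arcminute), additional spatial detail is largely imperceptible and bitrate can be saved.

```python
# Pixels subtended by 1 degree of visual angle at a given viewing distance.
import math

def pixels_per_degree(screen_width_in: float, horiz_pixels: int, distance_in: float) -> float:
    pixels_per_inch = horiz_pixels / screen_width_in
    inches_per_degree = 2 * distance_in * math.tan(math.radians(0.5))
    return pixels_per_inch * inches_per_degree

# Assumed geometries: phone at 12", tablet at 16", 50" TV at 100" viewing distance.
for name, width_in, px, dist_in in [("phone", 4.8, 1080, 12),
                                    ("tablet", 8.5, 1920, 16),
                                    ("TV", 43.6, 1920, 100)]:
    print(f"{name}: {pixels_per_degree(width_in, px, dist_in):.0f} px/deg")
```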

It’s an interesting topic, and I’d like to write more about it if I find some time. I don’t exclude that in the coming months I could write whitepapers on this topic for a couple of my clients, since I’m going to apply this technique again, but in a more evolved form.

Optimized Heuristic

Optimizing encoding with the aforementioned techniques very often leads to VBR encoding (capped, controlled in more sophisticated ways, or sometimes even unconstrained). Such files require dedicated heuristics to properly execute adaptive bitrate streaming.

In recent times, more optimized and efficient heuristics have emerged compared to the traditional bandwidth-based ones. Buffer-based or hybrid heuristics allow a much better exploitation of the bandwidth, are more resilient to bandwidth fluctuations and can cope easily with VBR renditions. Stay tuned to know more about it.
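As a minimal sketch of the buffer-based idea (the ladder, thresholds and linear mapping are illustrative assumptions, not a production heuristic), the rendition choice can be driven by buffer occupancy rather than by instantaneous bandwidth estimates:

```python
# Toy buffer-based rate selection: low buffer -> conservative rung, full buffer -> top rung.
BITRATE_LADDER_KBPS = [400, 800, 1600, 3000, 6000]   # hypothetical ABR ladder

def select_rendition(buffer_s: float, min_buf_s: float = 5.0, max_buf_s: float = 30.0) -> int:
    if buffer_s <= min_buf_s:
        return BITRATE_LADDER_KBPS[0]
    if buffer_s >= max_buf_s:
        return BITRATE_LADDER_KBPS[-1]
    # Linear mapping of the buffer "reservoir" onto the ladder indices.
    fraction = (buffer_s - min_buf_s) / (max_buf_s - min_buf_s)
    return BITRATE_LADDER_KBPS[int(fraction * (len(BITRATE_LADDER_KBPS) - 1))]

print(select_rendition(8.0))    # low buffer  -> conservative rendition
print(select_rendition(25.0))   # full buffer -> aggressive rendition
```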

Conclusion

Taken alone, each of these optimization schemes provides interesting benefits, but the real gain comes when you can optimize them synergistically. These strategies are strictly correlated, can strengthen each other and enable new levels of efficiency.

For example, complexity-aware encoding and perception-aware streaming are more efficient when you can encode in VBR, but VBR encoding requires a player with custom, optimized heuristics; in return, a custom heuristic can not only cope with VBR but also implement more efficient ABR handling, contributing greatly to the maximization of QoE.

Does VP9 deserve attention – Part II

In the previous post of this two-part series, I analyzed the technical features of the VP9 codec and concluded that, technically speaking, VP9 has the basis to compete with HEVC in terms of encoding efficiency.

But, you know, theory is one thing and reality another, and in video encoding a big part of the final efficiency lies in the encoder implementation more than in the codec specification. In this regard VP9 is not an exception, and what I see from my tests is that vpxenc (the open-source, command-line encoder provided by Google) is not yet fully mature and optimized for every scenario. I’ll discuss this distinction further below.

Video Quality

The VP9 specification has many features that can be used to enhance perceptually-aware encoding (like “segmentation”, which modulates quantization and filters inside frames according to the perceptual importance of different areas of each frame). But those features are not yet used in vpxenc, and this is clearly visible in the results.

At the beginning of 2015 I evaluated the performance of several H.265 encoders for my clients and published a quick summary of the advantages and problems I found in the HEVC encoders of the time compared to optimized H.264. The main problem that emerged in that evaluation was the inefficiency of “Adaptive Quantization” and other psychovisual techniques implemented in the encoders under test. The situation has partially changed for HEVC encoders during the last year (thanks to better psychovisual encoding, especially for x265), but grain and noise retention, especially in dark areas, is always a challenge for codecs exploiting big “transforms” like H.265 and, indeed, VP9.

VP9 today shows the same inefficiencies HEVC showed a year and a half ago. It is quite good at handling motion-related complexity, thanks to advanced motion estimation and compensation, and it reconstructs low and medium spatial frequencies with high fidelity, but it has difficulties retaining very high frequencies. Fine film grain disappears even at medium bitrates, and the “banding” artifact is very visible in flat areas, gradients and dark areas even at high bitrates. In this regard H.264 is still much better, at least at medium-high bitrates. These kinds of artifacts are quite common on YouTube, because they now use VP9 every time they can, so try for yourself a 1080p or 2160p video on Chrome and take a look at gradients and shadows.

The sad thing is that common quality metrics like PSNR and SSIM (but also the more sophisticated VQM) are happier with a flat encoding than with a psycho-visually pleasant, but not exact, encoding. In the end, VP9 may be superior to H.264/H.265 in PSNR or SSIM even in a comparison like that of Picture 2 below, where the banding or “posterization” effect is very evident.

Picture 1. H.265 vs VP9 vs H.264 – 1080p @ 2 Mbps – click to enlarge

Picture 2. VP9 vs H.264 – 1080p @ 2 Mbps – click to enlarge

VP9 profile 2 – 10 bits per component

Until now I’ve spoken about traditional 8-bit-per-component encoding in H.264, H.265 and VP9. But vpxenc also supports 10-bit-per-component encoding, known as VP9 profile 2.

Even if your content is 8-bit and everything remains BT.709 compliant, several studies have demonstrated that 10-bit encoding is always capable of better quality/bitrate ratios thanks to higher internal accuracy. In particular, the benefits are clearly visible in the accuracy of gradients and dark areas. See this example of VP9 8-bit vs 10-bit:

Picture 3. VP9 (8-bit) 1080p @ 2 Mbps vs VP9 (10-bit) 1080p @ 1 Mbps – click to enlarge

In the picture above we can see the better rendering of soft gradients when encoding at 10 bits, even if the source is 8-bit. Grain (a high-frequency, low-power signal) is still not retained compared to the source, but banding is greatly reduced. Note also that with VP9 profile 0 we need to increase the bitrate well above 3 Mbps to have a good encoding of gradients (at 1080p), while at only 1 Mbps the result is sufficient in this case when using profile 2.
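For reference, this is roughly how a profile 2 (10-bit) encode can be launched with vpxenc. Treat it as a sketch rather than a verified command line: the exact flags and defaults depend on the vpxenc build (high-bit-depth support must be compiled in), and the source is assumed to be a Y4M file.

```python
# Illustrative launcher for a VP9 profile 2 (10-bit) encode via vpxenc.
import subprocess

def encode_vp9_10bit(src_y4m: str, out_webm: str, kbps: int = 1000) -> None:
    cmd = [
        "vpxenc",
        "--codec=vp9",
        "--profile=2",             # VP9 profile 2: 10/12-bit, 4:2:0
        "--bit-depth=10",
        f"--target-bitrate={kbps}",
        "--passes=2",
        "-o", out_webm,
        src_y4m,
    ]
    subprocess.run(cmd, check=True)

# encode_vp9_10bit("source_1080p.y4m", "out_profile2.webm", kbps=1000)
```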

The superiority of 10-bit encoding has always been valid for H.264 too (High 10 profile), so why has 10-bit started to gain momentum only with HDR and not before?

The answer is the lack of decoders on consumer devices. Let’s remember that H.264 became the standard for Internet video relatively early only because Adobe decided to embed (at its own expense) a decoder inside Flash Player 9 (2007). This enabled a billion desktops to play back the Baseline, Main and High AVC profiles. Few know that it was originally supposed to support High 10 as well, but a bug ruined the opportunity to actually use this capability.

Apart from this missed opportunity, H.264 decoders on modern browsers, mobile devices, TVs and STBs are not capable of decoding the H.264 High 10 profile, and the same is true for VP9.

Where is VP9 available now ?

Today VP9 is supported in the latest Chrome, Firefox and Opera browsers (and in Edge as a preview) on desktop (PC and Mac), and it is supported on Android from version 4.4 onwards (software or hardware decoding depending on the device). It is also available on an increasing number of connected TVs, but all the current (significant) decoders support only VP9 profile 0, so 8-bit.

The same problem holds for H.265. On the mobile devices that support it, you can only deliver 8-bit H.265, but in this case it is also true that the large majority of 4K TVs support the HEVC Main 10 profile.

So, when is it convenient to use VP9?

The “banding artifact” problem is directly proportional to the size of the display. It is irrelevant on small displays like those of smartphones and tablets; on laptops it starts to become visible, and it is pretty bad on big TVs.

So, concluding, I think that today VP9 is an interesting option for everyone who wants:

– The maximum quality/bitrate ratio on desktop, even with some compromises in terms of quality. HEVC decoding will probably not appear on desktop for a long time, so VP9 is the only viable improvement over H.264. The live streaming use case can better tolerate the compromises.

– High efficiency on Android with a wide support base (Android > 4.4). On an old, $100 Android phone I have, VP9 decoding works and HEVC does not. It is an interesting option for developing markets where bandwidth is scarce and Android has a bigger base than iOS.

If the current situation doesn’t change, I doubt that players like Netflix will deliver high-quality content on desktop or TV using VP9 profile 0, especially for 4K. In fact, David Ronca of Netflix has said that they are evaluating VP9 especially to lower the access threshold for mobile devices (they already use HEVC for HDR-10).

But fortunately the scenario is probably about to change quickly, if it’s true that YouTube is planning to deliver HDR (i.e. 10-bit) with VP9 during the summer. This means that TVs with VP9 profile 2 decoding capabilities are becoming a reality, and this should open the way for profile 2 on desktop browsers too. In this case (and I’m optimistic), VP9 has a really good chance of definitively becoming the successor of H.264, at least for Internet video on desktop and Android.

It remains to be seen what Apple will decide to do. In the meantime I’m starting to push VP9 in my strategies, because I actually think their choices are irrelevant: if we want to optimize a video delivery service, it is increasingly clear that we will have to optimize for all three codecs.

Does VP9 deserve attention ? – Part I

A technical primer

VP9 is a modern video codec developed by Google as the successor of VP8. While VP8 aimed to offer an open alternative to AVC (aka H.264), VP9 challenges the newer HEVC (aka H.265). With VP9, Google follows the same “open” codec model used for VP8 (whether it is really open and free from patent-related threats is still a subject of debate), and this theoretically makes VP9 an interesting alternative to HEVC, which is burdened by unclear and unsettled claims from multiple patent holders and patent pools.

The VP9 specification was frozen in June 2013, but only recently has it started to attract the attention of players that want to optimize video distribution (YouTube has been the only big adopter over the last year, but now Netflix is also evaluating it). This is because the VP9 and HEVC ecosystems have finally reached a minimum level of maturity, and it is now possible to do evaluations and comparisons with a sufficient level of confidence.

In this short series of blog posts I analyze VP9 and try to understand whether it really deserves attention, and why. In this first part we will take a look at the technical specification compared to HEVC (analyzed in this previous post), and in the second part I’ll analyze the actual performance, the limits and the contexts in which it is possible to use VP9 as a valid alternative to AVC or HEVC.

Picture Partitioning

VP9 subdivides the picture into “super blocks”. Similarly to HEVC, in VP9 super blocks can be recursively divided into smaller blocks, down to 4×4. Differently from HEVC, which can subdivide only into square sub-partitions (32×32, 16×16, 8×8), VP9 can also use non-square partitions like 32×16, 8×16 and so on (the use of a rectangular partition stops the subdivision along that quad-tree branch). Most decisions are taken at the 8×8 level (“skip” signaling, for example) and 4×4 is a special case of 8×8. Prediction mode, reference frame, MVs and transform types are specified at block level.


Entropy coding

Like VP8, VP9 uses an 8-bit arithmetic coding engine known as the bool-coder. It uses a static per-frame statistical model, as opposed to an adaptive statistical model like the CABAC used in AVC/HEVC. For each frame, the most convenient statistical model is chosen from a pool of four.

Residual coding

Similarly to H.265, VP9 uses 4 transform sizes: 32×32, 16×16, 8×8 and 4×4. The transforms are integer approximations of the DCT (Discrete Cosine Transform) or DST (Discrete Sine Transform); a mix of the two is used depending on the type of frame and the transform size. Coefficients are scanned with particular patterns (different from the zig-zag patterns of the H.26x codecs, but with the same logic).

Quantization

VP9 uses 4 scaling factors: a pair for luma DC and AC coefficients, and a pair for chroma DC/AC. The set of quantizers is fixed at frame level, so there is no block-level QP adjustment, contrary to AVC/HEVC (but the optional “segmentation” feature should be able to achieve the same effect as adaptive quantization).

VP9 also supports a special lossless mode that uses only a Walsh transform on 4×4 blocks.

Intra-prediction

Intra prediction is a bit less complex than what HEVC offers. It acts on transform blocks, and there are 8 directional prediction modes and 2 non-directional ones, compared to the 35 modes of HEVC.


Inter-prediction

VP9 uses 1/8th-pel motion compensation (double the precision of AVC). A novel feature is the possibility to use a normal, smooth or sharp 1/8th-pel interpolation filter (plus bilinear). The filter variant can be changed at block level.

Because of patents, VP9 doesn’t use bidirectional motion estimation and compensation, so each block normally has only a single forward motion vector. However, VP9 offers “compound prediction”, where there are two motion vectors and the two predictions are averaged together. To avoid patents, compound prediction is enabled only on non-visible frames (commonly referred to as “AltRef”). AltRef frames can be “constructed” during decoding; they are not displayed but can be used later as references. Since it’s possible to anticipate a future frame in an AltRef and use it as a reference in compound mode, VP9 officially has no B-frames but in fact has something completely equivalent.

Motion vectors in a frame can point to one of three possible reference frames, usually named Last, Golden and AltRef. The reference frame to be used is signaled at 8×8 granularity. The decoder holds a list of 8 reference frames (slots), from which the Last, Golden and AltRef references are chosen at frame level. After decoding, the current frame can (optionally) replace one of the 8 slots in the pool. An interesting feature of VP9 is the possibility to scale frames down during encoding (not on I-frames); inter predictors and reference frames are scaled accordingly.

Motion vector prediction is similar in complexity to HEVC. A 2-entry list of predictors is built during encoding and decoding. The first predictor is based on surrounding blocks, the second on the previous frame. If the list is empty, a [0,0] vector is used. So for each block the bitstream can signal one of the following options (see the sketch after this list):

– the first predictor plus a delta
– the first predictor as is
– the second predictor as is
– simply the motion vector [0,0]
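For illustration, here is a highly simplified sketch of how those four options could be resolved against the 2-entry predictor list. The mode names follow the usual VP9 naming (NEWMV, NEARESTMV, NEARMV, ZEROMV), but the code is a didactic simplification, not the normative decoding process.

```python
# Toy resolution of the signaled MV mode against the 2-entry predictor list.
from enum import Enum, auto

class MvMode(Enum):
    NEWMV = auto()       # first predictor plus a transmitted delta
    NEARESTMV = auto()   # first predictor as is
    NEARMV = auto()      # second predictor as is
    ZEROMV = auto()      # motion vector [0, 0]

def resolve_mv(mode: MvMode, predictors: list, delta=(0, 0)):
    """predictors: up to 2 (x, y) candidates; missing entries default to (0, 0)."""
    first = predictors[0] if len(predictors) > 0 else (0, 0)
    second = predictors[1] if len(predictors) > 1 else (0, 0)
    if mode is MvMode.NEWMV:
        return (first[0] + delta[0], first[1] + delta[1])
    if mode is MvMode.NEARESTMV:
        return first
    if mode is MvMode.NEARMV:
        return second
    return (0, 0)  # ZEROMV
```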

Loop Filter

There are 3 possible filters of different strengths. VP9 performs a flatness test at the boundaries of blocks and, if the result is above a threshold, one of the filters is applied to conceal blockiness.

Segmentation

Segmentation groups together blocks with similar characteristics. It is possible to change some encoding tools at group level. This feature is meant to enable encoding optimizations (including psychovisual optimizations) and requires active support in the encoder.

Profile

Standard VP9 (profile 0) supports only an 8-bit 4:2:0 color mode, while profile 1 (optional for hardware) also supports 4:2:2 / 4:4:4 and an optional alpha channel. In August 2015 Google released a new version of the reference encoder supporting the new profile 2 (10/12-bit, 4:2:0) and profile 3 (10/12-bit, 4:2:2 / 4:4:4 + alpha). Profile 2 is aimed at supporting HDR video on YouTube (expected for summer 2016).

VP9 compared to HEVC

From a technical point of view, VP9 appears to be very close to HEVC in potential efficiency. The actual performance depends on the efficiency of the real encoders, but VP9 has all the potential to (almost, see below) reach the same performance as HEVC.

VP9 is a bit sub-par in terms of intra-frame prediction (fewer modes) and entropy coding (static tables vs adaptive). HEVC also appears to have a higher number of modes and small strategies to reduce the cost of syntax, signaling and residuals; on the other hand, VP9 has some interesting potential in psychovisual optimization and rate control thanks to segmentation and adaptive frame resolution.

In the next post we will see the level of efficiency reached so far by the VP9 encoder compared to AVC and HEVC, and the level of maturity of the respective ecosystems.

 

Video Optimization – A matter of Adaptivity


Online Video: infancy, youth and maturity

Over the last decade the consumption of online video has undergone exponential growth, but online video is as old as the Internet itself. Recently Dan Rayburn published a blog post about the early history of the streaming media industry, an “era” (1995-2005) in which pioneers started experimenting with codecs, products and models for the distribution of video over the Internet.

But it’s only with the launch of YouTube (2005-2006) that online video started a really tumultuous growth, to become the preeminent portion of global IP traffic. The rise of online video has been so intense that today the traffic generated by video is more than 70% of total Internet traffic, orders of magnitude higher than 10 years ago (and still growing…).

We can say that nowadays online video has entered a phase of maturity. It is a multi-billion-dollar business run not only by giant tech companies like YouTube, Netflix, Facebook, Amazon, Hulu, Apple and Vevo, but also by a multitude of traditional broadcasters (BBC, HBO and Sky, just to name a few) with their regional OTT services.

The pressure of competition is now really high, and this will bring many benefits to end users on many fronts, including QoE optimization.


Why optimize video streaming ?

In fact, until very recently, no one really cared about video optimization. Like any business in its early stages, it was more important to place the right product on the market (and then find a viable business model before running out of money) than anything else, including the optimization of QoE. Simply put: if it worked, it was enough.

But now things have changed. It cannot simply “work”: user expectations are constantly growing and it’s increasingly harder to engage users (see the graphs below). In this scenario, the optimization of streaming is becoming a key technological factor to differentiate a service from competitors, increase satisfaction and retention, and reduce costs.

Source: Conviva CSR 2015 – How Consumers Judge Their Viewing Experience

How to optimize ?

If the reasons to invest in streaming optimization are clear, on the other hand it’s not so easy to find the right way(s) to accomplish it. Users push the play button and just want to watch their favorite video flawlessly. But we know that behind the scenes there’s a lot of work to do to maximize that user experience. It’s a tangle of codecs, streaming protocols, multiple DRMs and CDNs, advertising, interaction flows, personalized experiences and so on.

At the end of the story, users want the maximum possible quality throughout the video, a fast start and zero rebuffering, on every screen. It’s up to us to untangle the skein and fulfill those expectations.

The points to be optimized are many but, in my opinion, the three most important are:

1. Video encoding optimizations (Quality)
2. ABR streaming optimizations (Robustness of distribution)
3. Playback optimizations (Reliability of streaming, start time, other aspect of QoE)

I have touched on those points many times in the last 8 years in several projects (optimization of encoding pipelines and/or codecs, optimization of streaming protocols and servers, optimization of players) and during conferences (see Adobe Max 2009 / 2010 / 2011), and I’ve made “online video optimization” one of my distinctive competencies.

In general, the matter is complex, the variables are multiple and there are also many boundary conditions, so there’s no single recipe. Maximizing QoE requires the coordination of “optimization campaigns” in each of the aforementioned areas.

This requires flexible instead of static approaches, open-mindedness instead of dogmas, a desire for excellence (both for consultant and customer, paradoxically not so common to find in the latter), but also a mix of scientific approach and inspiration, always remembering that success is in the details.

Creating coordinated optimization strategies across encoding, the delivery chain and players is very complex, so in this article I want to talk mainly about encoding optimization. This topic has become hot recently because of this post on Netflix’s blog. They call it “Per-Title Encode”, I call it “(Content) Adaptive Encoding”.

I have worked on this topic for many companies, for example NTT Data, Sky Italy, Intel Media (acquired by Verizon), EvntLive (acquired by Yahoo!) and lately Vevo. I recently co-authored this article on Vevo’s tech blog about how we optimized the encoding of 200,000+ videos at Vevo during 2015. I suggest reading that article for a high-level introduction to the next topic: Content Adaptive Encoding.


Adaptive Encoding

“All fixed set patterns are incapable of adaptability or pliability. The truth is outside of all fixed patterns” – Bruce Lee

Encoding video is a very complex process. There’s often the temptation to over-simplify complex things, and encoding is no exception. So usually everyone encodes video with a predefined set of parameters that satisfies some requirements (usually quality and/or target bitrate). But why should we use a single set of parameters (resolution, bitrate, encoding profiles) when we have very different kinds of video and/or playback conditions?

Static solutions to complex problems are rarely capable of producing the best results. If we have mutable conditions and mutable data, we need to adapt to them if we want to get closer to the optimal solution.

To exemplify the concept, let’s draw a parallel with the problem of “function approximation”. If we need to approximate an arbitrary function (see picture below), how can we hope to have a useful solution using a single zero-order approximation (red line on the left)? It is too coarse, and the error we get using it is very high (at least in some situations, e.g. for x -> 0). It’s clear that a first-order approximation would be better (green line on the left), but still sub-optimal. As in many other situations, it’s even more useful to partition the problem into smaller (simpler) ones: in this case even a set of simple zero-order approximations (red lines on the right) would be considerably better at estimating the function than the original, ultra-simplified approach, not to mention a “set” of first-order approximations (green lines on the right).

The partitioning of the problem’s domain helps to avoid over-simplifications.

Drawing a parallel between this problem and encoding, approximating with a zero-order estimator is similar to encoding everything with the same resolution-bitrate “mix” (a.k.a. ABR ladder).
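To make the analogy tangible, here is a tiny numeric sketch (illustrative only: the function and the number of sub-domains are arbitrary choices of mine) comparing the error of a single zero-order fit with that of a piecewise zero-order fit:

```python
import numpy as np

# Arbitrary "complex" function to approximate (purely illustrative choice)
x = np.linspace(0.1, 10, 1000)
f = 1.0 / x

# Single zero-order approximation: one constant for the whole domain
global_fit = np.full_like(f, f.mean())

# Piecewise zero-order approximation: one constant per sub-domain
segments = np.array_split(np.arange(len(x)), 8)
piecewise_fit = np.concatenate([np.full(len(s), f[s].mean()) for s in segments])

mae = lambda approx: np.abs(f - approx).mean()
print(f"single constant  MAE: {mae(global_fit):.3f}")
print(f"8-piece constant MAE: {mae(piecewise_fit):.3f}")
```

The piecewise fit is dramatically closer to the function; partitioning the encoding problem by content and viewing context works in the same spirit.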

The one-size-fits-all solution is simple, but far from optimal. We must be “adaptive” in the sense of elaborating dynamic strategies to optimize the system.

There are many ways to optimize encoding but my favorite is, as said above, to partition this multi-dimensional problem into sub-domains or clusters. We don’t necessarily have to apply rigorous math; it’s often more a matter of common sense. If we have a complex problem, let’s try to break it down into simpler pieces that are easier to solve.

For example, in the case of encoding for ABR, we commonly have videos with different complexities (a first variable to analyze) and we watch video on different devices (a second variable to take into account). A static ladder (for ABR streaming) is usually designed for the worst case and, like a zero-order approximation, provides sub-optimal performance.
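To fix ideas, a static ladder is just one list of resolution/bitrate pairs applied to every title regardless of its content; the values below are purely hypothetical and only illustrate the shape of such a ladder:

```python
# A hypothetical one-size-fits-all ABR ladder (illustrative values, not a recommendation)
STATIC_LADDER = [
    # (width, height, video bitrate in kbps)
    (1920, 1080, 5800),
    (1280,  720, 3500),
    ( 960,  540, 2400),
    ( 640,  360, 1200),
    ( 416,  234,  400),
]
```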


Complexity-Aware encoding

We know that low-complexity videos (like talking heads or fixed-camera videos) are much easier to encode than complex videos (like sports or action movies). This is inherently related to the way modern codecs compress video data: they exploit temporal and spatial redundancies. Simple motion can be predicted from past frames, and high spatial frequencies are stripped away by quantization.

A low-complexity content can be compressed much more than a complex one, and with approximately the same perceptual quality.

This is a first partition we can apply to the problem. Let’s classify the content according to its complexity and apply specific encoding setups to optimize the overall performance toward the desired goals.

Do you want to save bandwidth globally? Why not encode content at different bitrates according to its complexity? You will have consistent perceptual quality but savings in bandwidth consumption, globally.

Do you want higher average quality? In this case, let’s encode simpler content at higher resolutions compared to the resolution we would use with a single, static setup that’s usually calibrated on the worst case (which is high complexity).

Medium Complexity: 540p @ 2.4 Mbps (left) vs 720p @ 2.0 Mbps (right)
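One pragmatic way to implement this kind of classification (a rough sketch of the general idea, not the exact pipeline used in the projects mentioned above; the thresholds and the `pick_ladder` helper are my assumptions) is to run a fast, fixed-quality probe encode and use the bitrate it produces as a complexity proxy:

```python
import os
import subprocess
import tempfile

def probe_complexity_kbps(src: str, seconds: int = 60) -> float:
    """Fast fixed-quality probe encode; the bitrate it needs is a rough
    proxy of the content complexity (illustrative helper)."""
    tmp = os.path.join(tempfile.mkdtemp(), "probe.mp4")
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-t", str(seconds), "-an",
         "-c:v", "libx264", "-preset", "veryfast", "-crf", "23", tmp],
        check=True, capture_output=True)
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=bit_rate",
         "-of", "default=noprint_wrappers=1:nokey=1", tmp],
        check=True, capture_output=True, text=True)
    return float(out.stdout.strip()) / 1000.0

def pick_ladder(complexity_kbps: float) -> str:
    # Illustrative thresholds: low-complexity titles get a lighter ladder,
    # high-complexity titles keep the "worst case" ladder.
    if complexity_kbps < 1500:
        return "low_complexity_ladder"
    if complexity_kbps < 4000:
        return "medium_complexity_ladder"
    return "high_complexity_ladder"
```

A low-complexity title would then be routed to a lighter ladder (lower bitrates and/or higher resolutions), while high-complexity titles keep the worst-case ladder.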

Finding the right recipe is not easy, because things may get more complex as we go deeper into this process. For example, complexity is not a scalar property of a video but a local attribute (complexity can change frame by frame, or at least scene by scene). If we combine this with the fact that we may have constraints set by other elements of the pipeline, the logic with which we try to approximate the optimal solution may become complex.

Just to make an example, in ABR streaming we are usually forced to encode video in capped VBR (if not CBR) because of players’ heuristics (this is why I said before that the “final” optimization would be to set coordinated optimization strategies for encoding, distribution and playback: you usually need an optimized player to handle VBR encodings).
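For reference, a capped VBR encode is typically obtained by constraining the VBV buffer; a minimal ffmpeg/x264 sketch (the bitrate, cap and buffer values are illustrative, and how tight the cap must be depends on the player’s heuristics):

```python
import subprocess

# Capped VBR: ~3 Mbps average, peaks capped at ~1.3x with a 2x VBV buffer
# (illustrative numbers only)
subprocess.run([
    "ffmpeg", "-y", "-i", "input.mp4",
    "-vf", "scale=-2:720",
    "-c:v", "libx264", "-preset", "slow",
    "-b:v", "3000k", "-maxrate", "4000k", "-bufsize", "6000k",
    "-c:a", "copy", "output_720p.mp4",
], check=True)
```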

So, to improve the level of optimization, we may need to consider not only the average complexity but also the maximum complexity throughout the video, and apply dynamic parameterizations accordingly. Furthermore, complexity may be spatial (high frequencies in the image due to a sharp picture or noise) or temporal (a high level of motion, more difficult to encode for traditional codecs based on motion estimation and compensation). Different complexities deserve different weights inside our “optimization function” and specific parameterizations.
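As a toy illustration of how average and peak, spatial and temporal complexity could be folded into a single score, here is a sketch based on simple gradient and frame-difference proxies (the weights and the proxies themselves are arbitrary assumptions, not a validated metric):

```python
import numpy as np

def complexity_score(frames: np.ndarray,
                     w_spatial: float = 0.4, w_temporal: float = 0.6,
                     w_peak: float = 0.3) -> float:
    """frames: array of shape (N, H, W), grayscale.
    Spatial proxy: mean gradient magnitude; temporal proxy: mean absolute
    frame difference. Purely illustrative."""
    frames = frames.astype(np.float32)
    gy, gx = np.gradient(frames, axis=(1, 2))
    spatial = np.sqrt(gx ** 2 + gy ** 2).mean(axis=(1, 2))        # per frame
    temporal = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    temporal = np.concatenate([[0.0], temporal])                   # align length
    per_frame = w_spatial * spatial + w_temporal * temporal
    # Blend average and peak so that short, very complex scenes still count
    return (1 - w_peak) * per_frame.mean() + w_peak * per_frame.max()
```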


Viewing Context-aware encoding

Another variable is represented by the viewing conditions. Why apply the same resolution and parameterization for the same level of bandwidth when the video is watched on quite different screens? The human eye has a specific angular resolution, so small defects in picture quality are not visible at high DPI (like that of a smartphone), while the same is not true for low-DPI screens like that of a TV. Mix that with the variable viewing distance and we have another set of variables we can optimize encoding for.

Example of the eye’s different sensitivity: the pictures simulate playback of the same video at different screen sizes, approximately a smartphone screen for the upper image and a tablet (double the diagonal) for the lower, cropped image. The picture is the same, simply enlarged. Note that the encoding artifacts are very visible in the lower image, but much less so in the upper one.

Considering the eye’s different sensitivity to artifacts at different DPI, we can optimize the ABR ladder with resolutions, bitrates and parameterizations specifically chosen to conceal artifacts in specific viewing conditions.
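The underlying geometry is simple: from screen width, resolution and viewing distance you can estimate how many pixels per degree the eye actually receives, and therefore which encoded resolution is already visually “good enough” for that context. A back-of-the-envelope sketch (the 60 px/degree acuity limit is a common rule of thumb, and the screen sizes/distances are assumptions):

```python
import math

ACUITY_PX_PER_DEG = 60.0  # rule-of-thumb limit for ~20/20 vision (assumption)

def max_useful_width(screen_width_m: float, viewing_distance_m: float) -> int:
    """Encoded width (px) beyond which extra resolution is barely visible
    on this screen at this distance."""
    screen_width_deg = 2 * math.degrees(
        math.atan(screen_width_m / (2 * viewing_distance_m)))
    return int(ACUITY_PX_PER_DEG * screen_width_deg)

# Hypothetical viewing contexts:
print(max_useful_width(0.12, 0.35))   # 5.5" phone at 35 cm  -> ~1170 px
print(max_useful_width(1.11, 2.50))   # 50" TV at 2.5 m      -> ~1500 px
```

Numbers like these can guide which rungs of the ladder are actually worth serving to each class of device and viewing distance.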

Closing Notes

There are other interesting aspects that enter into the mix of strategies you can use.

I have no time to analyze them here, but they are worth a mention:

– Multi-codec encoding: leverage the best codec available on each platform, e.g. VP9 on Android / Chrome / Firefox, HEVC on 4K TVs and H.264 everywhere else (see the sketch after this list).

– VBR vs CBR: use VBR whenever possible. This requires a custom player, so for example it is feasible today with DASH on Android and in the browser, but not with HLS on iOS. It will require multiple encodes but may be worth the effort.

– Another interesting topic is the spacing between renditions and the number of renditions inside an ABR ladder. Different network conditions (e.g. mobile vs broadband) may require different setups.

– Special renditions: sometimes I have defined special renditions for special cases, with specific goals and characteristics (e.g. special renditions to speed up initial buffering).
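As a trivial sketch of the multi-codec point above (the mapping follows the platforms listed; the function and its names are hypothetical):

```python
def pick_codec(platform: str, is_4k_capable: bool = False) -> str:
    """Illustrative codec selection per playback platform (mapping from the list above)."""
    if is_4k_capable:                      # e.g. 4K smart TVs
        return "hevc"
    if platform in {"android", "chrome", "firefox"}:
        return "vp9"
    return "h264"                          # safe default everywhere else
```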

Concluding, if we mix the various strategies, the improvement in QoE and bandwidth consumption may be considerable. Consider that optimizing the quality/bitrate ratio always generates an increase in QoE, both directly and indirectly. In fact, with giants like Netflix monopolizing the bandwidth (40+% of Internet traffic in the USA at peak times), services that are not optimized will start to suffer (or probably are already suffering). ABR streaming can no longer be used as an “alibi” for un-optimized encoding; it’s no longer sufficient to be in the market, you have to master the technology, smooth the edges and give your maximum to be competitive. It’s time to optimize.

 

 

 

Future of video: 4K, DASH, HEVC

I must admit, I’m feeling very guilty. This is the only new post in more than a year. 2013 has been wonderful from a professional point of view and I have had very few moments, if any, to dedicate to the blog. But in 2014 there are too many interesting trends that I can’t neglect any longer, so I want to return to speaking about video encoding, streaming and OTT technologies.

In fact, you know that there are three magic “words” that are outlining the future of video: 4K, HEVC and DASH.

So, as a 2014 new year resolution, I’m planning to speak about ideas and optimizations related to the “magic trio”.

4K or not 4K?

The first trend is rapidly gaining momentum. “4K” is on every insider’s lips, and the effort of YouTube, Netflix and others to quickly offer 4K content is also opening new opportunities for selling 4K TVs and monitors.
I’m focusing part of my research on finding specific optimizations for H.264 encoding of 4K content. In fact, I think that, marketing buzz aside, 4K will be served first using the well-known H.264.

There are several optimizations to explore for 4K: for example custom quantization matrices, a bias toward the use of the 8×8 transform, and changes in psychovisual optimizations, to name a few. 4K also pushes the limits of H.264 for motion estimation and compensation (motion vectors become too long), creating several efficiency problems. But if it is useful to optimize an HD or Full HD stream, it is much more crucial to super-optimize a 4K stream, because the bitrates we are talking about are difficult to obtain on the Internet, or at least to obtain consistently.
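To give an idea of the kind of knobs involved, here is a hedged example of passing such tweaks to x264 through ffmpeg; the values are generic starting points chosen for illustration, not the specific 4K optimizations referred to above:

```python
import subprocess

# Illustrative 4K H.264 encode: 8x8 transform enabled, JVT quantization
# matrices, some psy-RD, wider motion estimation range. Example values only.
subprocess.run([
    "ffmpeg", "-y", "-i", "input_2160p.mp4",
    "-c:v", "libx264", "-preset", "slow", "-profile:v", "high",
    "-x264-params", "8x8dct=1:cqm=jvt:psy-rd=1.0,0.15:merange=48",
    "-b:v", "16000k", "-maxrate", "20000k", "-bufsize", "32000k",
    "-an", "output_2160p.mp4",
], check=True)
```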

ABR streaming can help here, but not as usual. Who can accept watching a 2.5 Mbit/s 720p rendition on an 80” 4K display because of low bandwidth at peak times? (It is the same experience as watching a 360p video on a 40” screen from 1.5 m away; try it and tell me.) Whoever buys a 4K TV wants 4K, no compromise. Furthermore, as Dan Rayburn underlined, there are few economic reasons to offer 4K because 4K delivery costs 3-4 times Full HD. This is why I think that optimization is now more important than ever.

HEVC

HEVC has finally been ratified. Like in 2003, when H.264 was ratified, the encoders are now very raw and inefficient and a lot of work remains to be done, but the potential is all there. Theoretically HEVC is said to be 30 to 50% more efficient than H.264 (with higher efficiency at higher resolutions), so it is no mystery that 4K and H.265 are seen as the winning couple. But the increase in pixels to be processed (8x, passing from 1080p25/30 to 2160p50/60) and the complexity of the new codec (approx. 10x during encoding compared to H.264) do not paint a simple scenario, with increases in required processing power of up to a factor of 80. But hey, we are now like in 2003: we have maybe 10 years ahead of us to squeeze the max out of H.265, and this is very exciting. In the meanwhile, H.264 still has some room for improvement and for at least a couple of years will continue to be the king of the hill.

I have started to play with HEVC, and the amount of time I’ll dedicate to experiments will probably increase steadily during 2014. So far I have collected interesting results. The bigger block transforms (not only 4×4 and 8×8 like in H.264, but also 16×16 and 32×32), plus some advanced deblocking and adaptive filtering, are able to produce a much “smoother degradation” of quality when decreasing the bitrate, especially for high-complexity scenes. On the other hand, the different handling of fine details currently produces less detail retention than H.264, and new approaches to psychovisual optimizations have yet to be invented.

And VP9? Interesting technology, good potential. Will it be successful? Hard to tell; until then I will continue to keep it under observation.

DASH

Last but not least, there’s the new MPEG standard for ABR streaming: MPEG-DASH (Dynamic Adaptive Streaming over HTTP). HLS is spreading across various devices, but at the same time the implementations are frequently buggy and offer little control. DASH, on the other hand, provides plenty of control and makes it possible to change the heuristics. This is very important to achieve the highest possible QoE (or QoS), a key factor in a future where CDNs’ cost per GB is flattening while the number of viewers and the stream size/quality keep increasing.
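As an example of the control DASH makes possible, here is a minimal buffer- and throughput-based rendition selection heuristic of the kind a custom player could implement (the thresholds and the safety margin are assumptions, not any specific player’s algorithm):

```python
from typing import List

def choose_rendition(bitrates_kbps: List[int], buffer_s: float,
                     throughput_kbps: float) -> int:
    """Return the index of the rendition to request next.
    Simple hybrid rule: be conservative when the buffer is short,
    otherwise pick the highest bitrate sustainable at ~80% of the
    measured throughput. Illustrative only."""
    ladder = sorted(bitrates_kbps)
    if buffer_s < 5.0:                        # panic zone: lowest rendition
        return bitrates_kbps.index(ladder[0])
    safe = 0.8 * throughput_kbps
    candidates = [b for b in ladder if b <= safe] or [ladder[0]]
    return bitrates_kbps.index(candidates[-1])

# Example: 4 Mbps measured, 12 s of buffer, ladder in kbps
print(choose_rendition([400, 1200, 2400, 3500, 5800], 12.0, 4000.0))  # -> index of 2400
```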

So stay tuned.