HYPER: a decade of challenges and achievements.

Those who have followed me since the years of Flash Player and Adobe Media Server know that I’ve spent the last 15 years developing encoders, players and, in general, software architectures to enable, enhance and optimize video streaming at scale. I’ve achieved many professional successes working on innovative projects for companies like NTT Data, Sky, Intel Media, VEVO and many others. In such contexts I’ve had the opportunity to meet inspiring people: managers, engineers, colleagues and ultimately friends who have helped me grow as an engineer and as a video streaming architect.

In particular, in 2021 I celebrate the 10th anniversary of Hyper, one of those achievements. But let’s start from the beginning.

Conception

In 2008 I started collaborating with Value Team, a leading system integrator in Italy (later acquired by the global innovator NTT Data). The BBC’s iPlayer had just been released and media clients were starting to ask for something similar, so NTT Data contacted me to design a high-performance platform (encoder and player) for the nascent market of catch-up TV and OTT services. The product, VTenc, powered the launch in 2009 of the first catch-up TV in Italy (La7.tv, owned by Telecom Italia).

The most innovative feature of VTenc was the ability to encode a single video in parallel, splitting it into segments that were then distributed on a computing grid for parallel encoding. The idea emerged after a discussion with Anthony Rose and his team (the creators of BBC’s iPlayer), who underlined that one of the main problems in encoding for a catch-up TV was the long processing time, which delayed the distribution of the encoded stream after the conclusion of the show on TV.

A few months and many technical challenges later, the feature was ready, and the idea of parallel encoding was successfully applied to La7.tv: we received the live program in “parts”, emitted every time there was an advertisement slot. Each part, usually 20-25 minutes long, was divided into smaller chunks, encoded in parallel and then reassembled and packetized with a map-reduce-style paradigm. Also thanks to a client-side playlist, the final result was ready for streaming just 10 minutes after the conclusion of the live show.
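To give an idea of the mechanism (this is not VTenc’s actual implementation), here is a minimal sketch of the chunked, map-reduce-style workflow using ffmpeg: the source is split into self-contained chunks with stream copy, the chunks are encoded in parallel (locally here, on a grid in the real system), and the results are concatenated. File names, chunk duration and encoding settings are purely illustrative.

```python
import glob
import os
import subprocess
from concurrent.futures import ProcessPoolExecutor

def split_source(src: str, chunk_dir: str, chunk_sec: int = 60):
    """Split the source into closed chunks (stream copy, no re-encoding) - the 'map' input."""
    os.makedirs(chunk_dir, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", src, "-c", "copy", "-map", "0",
        "-f", "segment", "-segment_time", str(chunk_sec),
        "-reset_timestamps", "1",
        os.path.join(chunk_dir, "chunk_%04d.ts")
    ], check=True)
    return sorted(glob.glob(os.path.join(chunk_dir, "chunk_*.ts")))

def encode_chunk(chunk: str) -> str:
    """Encode one chunk; in the real system this would run on a remote grid worker."""
    out = chunk.replace(".ts", "_enc.ts")
    subprocess.run([
        "ffmpeg", "-y", "-i", chunk, "-c:v", "libx264", "-preset", "medium",
        "-b:v", "3000k", "-c:a", "aac", "-b:a", "128k", out
    ], check=True)
    return out

def concat_chunks(encoded, out_file: str):
    """Reassemble the encoded chunks (the 'reduce' step) with the concat demuxer."""
    with open("list.txt", "w") as f:
        for c in encoded:
            f.write(f"file '{c}'\n")
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", "list.txt", "-c", "copy", out_file], check=True)

if __name__ == "__main__":
    chunks = split_source("part_01.mp4", "chunks", chunk_sec=60)
    with ProcessPoolExecutor(max_workers=8) as pool:  # stand-in for the computing grid
        encoded = list(pool.map(encode_chunk, chunks))
    concat_chunks(encoded, "part_01_encoded.ts")
```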

An incredible result for the time, because commercial encoders required many hours for an accurate 2-pass encoding of assets 2-3 hours long. It was also one of the very first implementations of adaptive bitrate streaming in Flash (a custom implementation on Flash Player 9 + AMS, while Adobe officially introduced adaptive bitrate only in Flash Player 10).

The birth of Hyper

VTenc was improved in the following years until, in 2011, NTT Data signed a deal with Sky Italy to provide the encoder for the broadcaster’s new incoming OTT services.
We implemented new features, and they chose VTenc for a number of key reasons:

– flexible queue management system
– rapid customizability
– high video quality
– high density and scalability
– short time to market.

VTenc evolved into something more complex: NTT Data Hyper was born. The model was different from buying a commercial encoder, which usually had long evolution cycles and well-defined but inflexible road maps. Hyper has been something closer to a focused, tailor-made and optimized encoding engine like those built by Netflix or Amazon, and often also a sandbox in which to conceive and test new technologies and ideas.

In 2021 we celebrate the first 10 years of Hyper.

A decade of innovative achievements and milestones

Since then we have reached many milestones while facing the challenges of the last decade. Looking back, it has been an exciting journey, professionally intense and enriching. Some achievements deserve a mention:

– A long history of content-aware encoding approaches, from the first empirical versions (2013) to the use of ML to implement peculiar “targetMOS” and “targetDevice” encoding modes (2017+). I’ve always been a fan of “contextual optimizations”: I started my experiments in the years of Flash Video, presenting some initial ideas at Adobe MAX 2009-11, and then had the opportunity to implement those paradigms in the industry in various “flavors”.

– An internal caching logic for elementary streams that allows the quick repurposing of previously encoded assets without executing a new, expensive encoding. With this logic we have been able, many times, to repurpose entire libraries (tens of thousands of assets) in a matter of days. In this way, adding a new audio format, changing idents, parental indications or other elements in the content playlist, or adding a new packaging format (e.g. a new version of HLS) has always been quick and inexpensive (see the sketch after this list).

– In 2016 Hyper evolved from a grid-computing to a hybrid-cloud paradigm, with on-prem resources that cope with the baseline workload and cloud resources that satisfy peaks. Having designed the software from the beginning around agnostic services and flexible work queues, the hybridization was a natural step. Resources can be partitioned to have maximum throughput and cost efficiency on some queues as well as minimum time-to-output on others.

– In this last context my team designed a 2-step technique to generate on the fly a smart mezzanine with controlled perceptual quality, to quickly and conveniently move very big, high-quality sources to the cloud for parallelized encoding (with a bandwidth reduction of up to an order of magnitude).

– Full cloud deployments on AWS and GCloud that quickly and dynamically scale from just 2 on-demand instances to thousands of spot instances (“elastic texture”), with optimized scaling logics to minimize infrastructure costs and provide higher reactivity than standard scaling systems like autoscaling groups.
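Going back to the elementary-stream caching mentioned above, the sketch below illustrates the general idea (not Hyper’s actual code): encoded streams are keyed by source and encoding recipe, so a repurposing request that only changes the packaging finds a cache hit and is reduced to a remux. Paths, recipe fields and formats are hypothetical.

```python
import hashlib
import os
import subprocess

CACHE_ROOT = "cache/elementary"  # hypothetical cache location

def stream_key(source_id: str, recipe: dict) -> str:
    """Deterministic key: same source + same encoding recipe -> same cached stream."""
    blob = source_id + repr(sorted(recipe.items()))
    return hashlib.sha256(blob.encode()).hexdigest()

def get_or_encode(source_path: str, source_id: str, recipe: dict) -> str:
    """Return the cached video stream if present, otherwise encode it once and cache it.
    (Stored here as video-only MP4 for simplicity; a real system would keep proper
    elementary streams plus metadata.)"""
    path = os.path.join(CACHE_ROOT, stream_key(source_id, recipe) + ".mp4")
    if os.path.exists(path):
        return path  # cache hit: no new, expensive encode
    os.makedirs(CACHE_ROOT, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-y", "-i", source_path, "-an",
        "-c:v", recipe["codec"], "-b:v", recipe["bitrate"],
        "-vf", f"scale=-2:{recipe['height']}", path
    ], check=True)
    return path

def repackage_hls(video_path: str, audio_path: str, out_dir: str) -> None:
    """Repurposing step: mux already-encoded streams into a new HLS rendition.
    Adding a new packaging format touches only this step, never the encoder."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path, "-i", audio_path, "-c", "copy",
        "-f", "hls", "-hls_time", "6", "-hls_playlist_type", "vod",
        os.path.join(out_dir, "index.m3u8")
    ], check=True)

# e.g. moving a whole library to a new HLS flavor becomes a remux of cached streams:
# video = get_or_encode("ep01_mezz.mov", "ep01",
#                       {"codec": "libx264", "bitrate": "3000k", "height": 1080})
# repackage_hls(video, "ep01_audio_aac.mp4", "out/ep01_hls")
```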

Now that Hyper has turned 10, and after several million encoding jobs, in 2021 we are going to finalize Hyper v2 and tackle new challenges (VVC, AV1, a complete refactoring, perceptual-aware delivery, an agnostic architecture to apply massively parallelized processing to other contexts). But that’s material for an entirely new story… for now, let’s celebrate:

happy 10th birthday Hyper!  


Thoughts on VMAF, Content-Aware Encoding and no-reference metrics

 

Introduction

Content-Aware Encoding (CAE) and Context-Aware Delivery (CAD) represent the state of the art in video streaming today, independently of the codec used. The industry has taken its time to metabolize these concepts, but now they are definitely mainstream:

Every piece of content is different and needs to be encoded differently. Viewing contexts are different and need to be served differently. Optimizing a streaming service requires CAE and CAD strategies.

I’ve discussed these logics and the need for CAE and CAD strategies several times, and I’ve implemented different optimizations for my clients over the years.

Speaking about Content-Aware Encoding: at the beginning we used empirical rules to determine a relationship between the characteristics of the source (possibly classified) and the encoding parameterization, so as to achieve a satisfying level of quality at the minimum possible bitrate. The “quality metric” used to tune the algorithms was usually the compressionist’s perception (or that of a small team of Golden Eyes), or, more rarely, a full-featured panel for subjective quality assessment. Following a classical optimization approach (read some thoughts here), we subdivided a complex “domain” like video streaming into subdomains, recursively trying to optimize them individually (and then jointly, if possible) and using human perception tests to guide the decisions.
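Just to make the “empirical rules” idea concrete, here is a toy sketch: the source is bucketed by simple complexity measurements (SI/TI-style) and a hand-tuned parameterization is picked per bucket. The thresholds and the table are invented for illustration; at the time they were tuned against human judgment rather than against an objective metric.

```python
# Toy illustration of early, empirical content-aware rules: classify the source
# by crude complexity measurements and pick hand-tuned encoding parameters.
# Thresholds and values are invented for illustration only.

def classify(spatial_info: float, temporal_info: float) -> str:
    """Bucket the source using SI/TI-style measurements."""
    if temporal_info > 40 or spatial_info > 90:
        return "high"     # sport, action, heavy grain
    if temporal_info > 15 or spatial_info > 50:
        return "medium"   # typical drama / studio content
    return "low"          # flat graphics, talking heads, dark scenes

# Hand-tuned 1080p parameterizations per class, refined by golden-eye feedback.
EMPIRICAL_RULES = {
    "high":   {"bitrate_kbps": 6000, "bframes": 3, "aq_strength": 1.0},
    "medium": {"bitrate_kbps": 3500, "bframes": 5, "aq_strength": 0.8},
    "low":    {"bitrate_kbps": 1800, "bframes": 8, "aq_strength": 0.6},
}

def encoding_params(spatial_info: float, temporal_info: float) -> dict:
    return EMPIRICAL_RULES[classify(spatial_info, temporal_info)]

print(encoding_params(spatial_info=72.0, temporal_info=22.5))
# {'bitrate_kbps': 3500, 'bframes': 5, 'aq_strength': 0.8}
```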

More recently, the introduction of metrics with high correlation to human perception, like VMAF, has helped greatly in designing more accurate CAE models, as well as in verifying the actual quality delivered to clients. But are all problems solved? Can we now completely substitute the expert’s eye and subjective tests with inexpensive and fast objective metrics correlated with human perception? The answer is not simple. In my experience, yes and no. It depends on many factors, and one of them is accuracy…

 

A matter of accuracy

In my career I’ve had the fortune and privilege to work with open-minded managers, executives and partners who dared to exit their comfort zone to promote experiments, trials and bold ideas for the sake of quality, optimization and innovation. So in the last decade I’ve had the opportunity to work on a number of innovative and stimulating projects: various CAE deployments, studies on human perception to tune video encoding optimizations and filtering, the definition of VMAF-like metrics to train ML algorithms in the most advanced CAE implementations, and many others. In the rest of the post I’d like to discuss some of the problems encountered in this never-ending pursuit of an optimal encoding pipeline.

When VMAF was released, back in 2016, I was intrigued and excited to use it to improve an existing CAE deployment for one of my main clients. If you can substitute an expensive and time-consuming subjective panel with a scalable video quality tool, you can multiply the experiments around encoding optimization, video processing, new codecs or other creative ideas about video streaming. A repeatable quality measurement is also useful to “sell” a new idea, because you can demonstrate the benefits it can produce (especially if the metric is developed by Netflix, which brings immediate credit).
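For readers who want to reproduce this kind of workflow, the snippet below shows one common way to script VMAF at scale, through ffmpeg’s libvmaf filter. It assumes an ffmpeg build with libvmaf and a reasonably recent libvmaf JSON log format; file names are placeholders.

```python
import json
import subprocess

def vmaf_score(distorted: str, reference: str, log_path: str = "vmaf.json") -> float:
    """Compute VMAF with ffmpeg's libvmaf filter and return the pooled mean.
    The two inputs must have the same resolution/framerate (scale or fps-convert
    the distorted file beforehand if needed). Distorted goes first, reference second."""
    subprocess.run([
        "ffmpeg", "-i", distorted, "-i", reference,
        "-lavfi", f"libvmaf=log_fmt=json:log_path={log_path}",
        "-f", "null", "-"
    ], check=True)
    with open(log_path) as f:
        data = json.load(f)
    # JSON layout of recent libvmaf versions; older builds report the score differently.
    return data["pooled_metrics"]["vmaf"]["mean"]

# e.g. score every rendition of a ladder against its mezzanine:
# for r in ("enc_1200k.mp4", "enc_2500k.mp4", "enc_4500k.mp4"):
#     print(r, vmaf_score(r, "mezzanine.mp4"))
```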

However, from the beginning VMAF showed some sub-optimal behaviors in my experiments, at least in some scenarios. In particular, what I can still recognize as VMAF’s Achilles’ heel is the drop in accuracy when estimating perceptual quality in dark and/or flat scenes.

In CAE we try to use the minimum possible amount of bits to achieve a desired minimum level of quality. This incidentally leads to very low bitrates in low-complexity, flat scenes. On the other hand, any error in estimating the level of quantization, or the target bitrate, in such scenes may produce an important deterioration of quality; in particular, it may introduce a visible amount of “banding” artifacts. Suddenly, a strength of CAE becomes a potential weakness, because a standard CBR encoding could avoid banding in the same situation (albeit with a waste of bitrate).

Therefore, an accurate metric is necessary to cope with that problem. Banding is a plague for 8-bit AVC/HEVC encoding, but it can also appear in 10-bit HEVC video, especially when the energy of the source is low (perhaps because of multiple processing generations) and a wrong quantization level can completely eliminate the higher, delicate, residual frequencies and cause banding.

If we use a metric like VMAF to tune a CAE algorithm, we need to be careful in such situations and apply “margins”, or re-train VMAF to increase its sensitivity in these problematic cases (there are other problematic cases too, like very grainy noise, but there I see an underestimation of subjective quality, which is much less problematic to handle).
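A minimal sketch of the “margin” idea follows, with thresholds that are purely illustrative: dark or flat segments are detected from simple luma statistics, and the VMAF target used by the CAE decision is raised accordingly, trading a little bitrate for safety against banding.

```python
import numpy as np

def is_dark_or_flat(luma: np.ndarray, dark_thr: float = 60.0, flat_thr: float = 12.0) -> bool:
    """Very rough detector: low mean luma or low spatial activity.
    Thresholds are illustrative and must be tuned on real content."""
    return luma.mean() < dark_thr or luma.std() < flat_thr

def effective_vmaf_target(base_target: float, scene_luma_frames) -> float:
    """Raise the quality target (i.e. apply a safety margin) on scenes where
    VMAF tends to overestimate quality, such as dark and flat ones."""
    risky_ratio = sum(is_dark_or_flat(f) for f in scene_luma_frames) / len(scene_luma_frames)
    margin = 6.0 * risky_ratio  # up to +6 VMAF points, an arbitrary example value
    return min(100.0, base_target + margin)

# A CAE loop would then pick the lowest bitrate whose measured VMAF exceeds
# effective_vmaf_target(93, frames) instead of the plain base target.
```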

I’m in good company in saying that VMAF might not be the right choice for all scenarios, because even YouTube, at the Big Apple 2019 conference, pointed out that VMAF is often not able to properly recognize the presence of banding.

Figure 1. VMAF overestimates quality on dark, flat scenes

I could hypothesize that this behavior is due to the way quality was assessed during VMAF’s development; for example, the viewing distance of 2.5xH could reduce sensitivity in those situations. But the problem is still present in VMAF 4K, where the distance is 1.5xH, so maybe it is a weakness of the elementary metrics.

 

A case in 4K

Let’s analyze a specific case. Recently I conducted a subjective quality test on 4K content, both SDR and HDR/HLG. VMAF 4K is not tuned for HDR, so I’ll limit my considerations to the SDR case. The subjective panel was performed to tune a custom quality metric with support for HDR content, which has then been used to train an ML-based CAE deployment for 4K SDR/HDR streaming.

The picture below shows a dark scene used in the panel. On the left is the original source, on the right a strongly compressed version.

Figure 2. Source (left) vs compressed (right)
Figure 3. Gamma boosted to reveal the artifacts in the encoded version

In Figure 3 you can easily see that the image is badly damaged. It’s full of banding, and motion too (obviously not visible here) is affected, with a “marmalade” artifact. However, VMAF reports an average score of 81.8 out of 100, equivalent to about 4 on a 1-to-5 MOS scale, which overestimates the subjective quality.

The panel (globally 60 people, 9000+ scores, 1.5xH from a 50” 4K display, DSIS methodology) reports a MOS of 3.2, which is still high in my opinion, while a small team of Golden Eyes reported a more severe 2.3.

From our study, we found that the variance in the opinion scores for this type of artifact increases considerably, maybe because of differences in individual visual acuity and cultural aspects (viewers not trained to recognize specific artifacts). But a Golden Eye immediately recognizes the poor quality, and a significant percentage of the audience (in our case 58% of the scores were 3 or below) will consider the quality insufficient, especially given the expectations set by 4K.

This is the classic problem of relying on the mean when variance is high. VMAF also provides a confidence interval, which is useful for taking better decisions, but the prediction still has an overestimated “center” for the example above, at least 2 JNDs away from the Mean Opinion Score (not to mention the Golden Eyes’ score).

Anyway, below we can see the correlation between VMAF 4K and the subjective evaluation on a subset of the SDR sequences. The points below the area delimited by the red lines represent content for which the quality predicted by VMAF is overestimated. Any decision taken using such an estimation may lead to a wrong choice and some sort of artifact.

Figure 4. MOS vs VMAF 4K

 

Still a long journey ahead

VMAF is not a perfect tool, at least not yet. However, it has paved the way toward handy estimation of perceptual quality in a variety of scenarios. What we should probably do is consider it for what it is: an important “step” in a still very long journey toward accurate and comprehensive quality estimation.

For now, if VMAF is not accurate in your specific scenario, or if you need a different kind of sensitivity, you can re-train VMAF with other data, change/integrate the elementary metrics, or build your own metric focused on specific requirements (maybe less universal, but more accurate in your specific scenario). You could also use an ensemble-like approach, mixing various estimators to mitigate the individual points of weakness.
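As a simple illustration of the ensemble idea (the coefficients and the auxiliary detectors are hypothetical), a VMAF score can be corrected with the output of a dedicated banding detector and, when several estimators are available, pooled conservatively:

```python
def ensemble_quality(vmaf: float, banding_index: float, grain_index: float) -> float:
    """Illustrative ensemble: start from VMAF (0-100), penalize segments where a
    separate banding detector fires, soften the known underestimation on grain.
    banding_index and grain_index are assumed to be in [0, 1]; all coefficients
    are invented for illustration."""
    score = vmaf - 25.0 * banding_index + 5.0 * grain_index
    return max(0.0, min(100.0, score))

def conservative_pool(estimates: list) -> float:
    """Alternative pooling: take the most pessimistic estimator so that no
    single metric's blind spot lets a bad segment through."""
    return min(estimates)

print(ensemble_quality(vmaf=81.8, banding_index=0.7, grain_index=0.0))
# ~64 for the example discussed above (purely illustrative numbers)
```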

I also see other open points to address in the future:
– better temporal masking
– a different approach to pooling scores, both in the time and spatial domains
– extrapolation of quality under different viewing conditions

As a final consideration, I find YouTube’s approach very interesting. They are using no-reference metrics to estimate the quality of source and encoded videos. No-reference metrics are not bound to measure the perceptual degradation between a source/compressed pair of videos; they are designed to estimate the “absolute” quality of the compressed video alone, without access to the source.

I think they are interesting not only for estimating quality when the source is not accessible (or is costly to retrieve and use), as in the monitoring of existing live services, but they will also be useful as internal metrics for CAE algorithms.
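To make the monitoring use case concrete: with a no-reference metric, a live rendition can be scored with no access to the source at all. The sketch below uses a deliberately naive stand-in scorer (real no-reference models, based on banding/blockiness/sharpness features or ML, are far richer); everything here is hypothetical.

```python
import numpy as np

def no_reference_score(luma: np.ndarray) -> float:
    """Hypothetical stand-in for a no-reference model: it only looks at spatial
    activity as a crude proxy, mapping it to a 0-100 'absolute quality' scale."""
    return float(np.clip(luma.std() / 40.0, 0.0, 1.0)) * 100.0

def monitor_segment(decoded_luma_frames, alert_threshold: float = 40.0) -> float:
    """Score a decoded live segment without any mezzanine and raise an alert
    if the estimated absolute quality drops below the threshold."""
    score = float(np.mean([no_reference_score(f) for f in decoded_luma_frames]))
    if score < alert_threshold:
        print(f"quality alert: estimated score {score:.1f}")
    return score
```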

In fact, modern encoding pipelines often try to trade fidelity to the source for “perceptual pleasantness” if this can save bandwidth. Using a no-reference metric instead of a full-reference metric could reinforce this behaviour, similarly to what happened in super-resolution when passing from a more traditional cost function in DNN training to an “adversarial-style” cost function in GANs.

But this is another story…

“Time Machine” – my talk at Demuxed 2018

I’ve just returned from a wonderful experience at Demuxed 2018.

I have had the honor to participate as a speaker alongside professionals from Twitter, Netflix, YouTube, Twitch, Comcast, Intel, Mux, Bitmovin, Akamai… and in general, the experience as both attendee and speaker has been amazing.

The event was streamed live on Twitch, but today the individual VoD recordings have also been released (sessions list), including mine.

My session is:

“Time Machine” – how to reconstruct perceptually, during playback, part of the detail lost in encoding.


In recent years, I’ve focused my efforts on the “joint” optimization of various elements of the streaming pipeline. Evolving from an intra-domain to an inter-domain optimization approach, it is possible to squeeze out much more efficiency.

I’ve worked on joint optimizations of encoding and players, for example, sometimes also throwing protocol “augmentations” into the mix. If the player knows how the encoder is optimized, it’s possible to develop improved heuristics, and vice versa, with a synergistic effect. I’ve already discussed this trend a bit in this previous post.

In this scenario, during Demuxed I discussed another unusual possibility of joint optimization:

Perceptually reconstructing, in the player, part of the detail lost in encoding, using a GPU-based reconstruction model that relies on information extracted by the encoder, or on ML, to estimate the best parameters.

It’s an old idea I’ve been insisting on for years as a way to ultra-optimize the streaming pipeline, with different tunings for high-quality and high-efficiency cases (e.g. mobile).
I proposed a proof of concept based on Flash in a 2010 trilogy of posts and spoke about it also at Adobe MAX 2010 in Los Angeles.

After the decline of Flash, I waited for WebGL to become more generally available in browsers and devices to evolve the idea. Now WebGL is very powerful, and filtering even high-resolution content with complex pixel shaders is not a problem.
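The real implementation lives in a WebGL pixel shader inside the player, but the core of the idea can be sketched on the CPU with NumPy/SciPy: re-amplify the high-frequency band that aggressive quantization flattens, with a strength driven per scene by encoder-side hints or by an ML estimator. The parameters here are illustrative, not the ones used in the talk.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def reconstruct_detail(luma: np.ndarray, strength: float, radius: float = 1.2) -> np.ndarray:
    """CPU stand-in for the per-pixel work a shader would do on each frame:
    split the image into a low-pass base and a high-frequency band, then
    re-amplify the band that encoding tends to flatten. `strength` would be
    provided per scene by encoder-side metadata or estimated by ML."""
    base = gaussian_filter(luma.astype(np.float32), sigma=radius)
    detail = luma.astype(np.float32) - base
    return np.clip(base + (1.0 + strength) * detail, 0, 255).astype(np.uint8)

# e.g. a stronger reconstruction for a low-bitrate mobile rendition (illustrative):
# restored = reconstruct_detail(decoded_frame_luma, strength=0.35)
```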

I’ll elaborate more on the logic in a future post. For now, take a look at the recorded talk and/or at the PDF presentation: Presentation-Demuxed2018-FabioSonnati.

I’ve been very satisfied with the level and quality of the feedback on the topic, and in general Demuxed has been a wonderful occasion to meet and chat with high-level professionals of the streaming business.

Artificial Intelligence in video encoding optimization


Without doubt, A.I. is the buzzword of the moment. We find it used everywhere, from image classification/recognition to language translation, from sentiment analysis to market predictions, not to mention autonomous driving, fitness bands, the latest CPUs/GPUs, smartphones and so on. A.I. prophets promise a new era of “intelligent” computing that will disrupt the way we live and use technology.

Fig 1. Google Trends for “Artificial Intelligence”

Is it all that glorious? Even if all that glitters is not gold and many of the expectations are over-inflated, I think that A.I. (or, more correctly for most applications, Machine Learning) is already truly capable of empowering engineers with new tools and ways to solve problems, make accurate predictions and design complex systems.

So, why don’t we apply it in the field of encoding and streaming optimization as well?

but let’s start from the beginning…

 

Artificial Intelligence vs Machine Learning

 

From now on I’ll speak about Machine Learning rather than Artificial Intelligence. AI is more a marketing slogan than an accurate term to describe current achievements (read this maybe oversimplified yet effective comparison). In fact, many of the applications often branded as AI-driven are really based on simpler ML algorithms.
Not to mention that, now that AI is at its peak of inflated expectations in Gartner’s hype cycle, a lot of more traditional technologies are being conveniently rebranded with the new, bold term just to ride the wave.

ML is not new, indeed. It is rooted in the late ’50s and ’60s, when scientists started to study algorithms that can “learn” from data and make predictions based on that data: algorithms capable of modeling complex systems from sample inputs and making data-driven predictions or classifications without explicit modeling by engineers.

ML is based on, or is adjacent to, other well-known disciplines like computational statistics, mathematical optimization, operations research and linear programming; all popular university courses in not-so-ancient times.

ML has been widely and successfully used in the industry for years. Every time you use your credit card, an ML-based algorithm estimates the probability of fraud thanks to classification algorithms trained on a huge amount of transactions (did someone say Big Data?). Recognition of digits, OCR, speech recognition and spam detection are other consolidated applications. More recently you find ML-based algorithms in fitness bands to recognize/classify the activity performed by users. Netflix has created a famous recommendation engine using ML. Google uses ML extensively for speech recognition, search ranking, form completion and translations. Apple uses it for Siri, among other things, and virtually any image classification application is based on deep learning and CNNs, which are at the cutting edge of ML.

So it’s true that ML is powerful, but it’s nothing exotic. It is essentially a discipline that provides algorithms, methods and best practices that help engineers create complex models without necessarily analyzing the underlying phenomena.

Indeed, modeling is something engineers already do often in their daily work. But sometimes analyzing and modeling a complex phenomenon is not easy at all. I have already talked about optimization approaches and complex modeling in this post. In the end, instead of studying a complex system by inferring the rules of its subsystems (the classic way to proceed), ML provides engineers with a set of tools to create much more accurate models starting from a large number of observations and data points.

There are many algorithms, techniques, procedures and approaches in ML. A broad distinction is made between supervised learning, unsupervised learning and reinforcement learning. Within supervised ML we can mention algorithms like linear regression/classification, Support Vector Machines, Random Forests, Decision Trees, Ensemble Methods, Gradient Boosting, AdaBoost and so on, and then continue with the Neural Network family: deep learning, Convolutional NNs, Recurrent NNs, LSTM RNNs, etc.

It’s a wide and complex landscape, where choosing the algorithms and fitting and optimizing the entire system is neither simple nor immediate.

There are some important points to consider:

1. ML is a tool-set, but it is up to engineers to use it in a creative and efficient manner. ML doesn’t work by itself!

2. Many ML algorithms behave like a black box, and it is not easy to extract knowledge of the underlying phenomena from that black box. Sometimes a simpler algorithm is preferable to a more complex (and better-performing) one when you want to better understand the system under study.

3. Overfitting is everywhere! It’s the worst enemy and requires much attention, especially to avoid creating models that in reality perform worse than empirical approximations.

 

Machine Learning as a tool to optimize video encoding 

 

In this post I compared optimization to function approximation/estimation. It’s easy to see the parallel between function approximation and ML-based regression techniques. Using ML it is possible to create a model that “predicts” with good accuracy the behavior of a system for unknown inputs, using only a number of known sample points to train/fit a chosen ML algorithm and minimize the associated cost function.
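A minimal example of this “function approximation” view, with synthetic data standing in for real measured samples: a regressor learns to predict, from a few content features and a target quality, the bitrate that would be needed, and can then be queried for unseen inputs.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# features per sample: [spatial_info, temporal_info, resolution_height, target_quality]
X = rng.uniform([10, 5, 360, 60], [120, 60, 2160, 95], size=(2000, 4))
# synthetic "ground truth" bitrate in kbps; a real dataset would come from actual encodes
y = 0.04 * X[:, 0] * X[:, 1] + 1.5 * X[:, 2] + 20 * (X[:, 3] - 60) + rng.normal(0, 150, 2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, y_tr)

print("R^2 on held-out samples:", round(model.score(X_te, y_te), 3))
print("predicted kbps for an unseen scene:", int(model.predict([[70, 25, 1080, 90]])[0]))
```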

A mix of ML algorithms can be very useful every time you have to “optimize” something.
Minimizing a cost function means, in fact, optimizing, and we have already said that ML is based on mathematical optimization, operations research and linear programming, disciplines strictly related to the concept of “optimization”.

So even video encoding is a fertile field for ML-driven optimization. In video encoding we have many independent variables (metrics that describe the features of the video, resolution, target quality, etc.) and the final objective could be, among others, to maximize the quality/bitrate ratio by using the right encoding parameterizations.

In recent years YouTube and Netflix have used ML to achieve specific optimization objectives in video encoding. In the case of YouTube, they have used NNs to predict the quantization levels that produce the desired target bitrate, so as to obtain the performance of a two-pass encoding in a single pass. This is an example of optimizing the quality/speed ratio, because in YouTube’s scenario the huge amount of input video makes encoding very costly, and this approach tries to reduce that cost.

Netflix has instead used ML (an SVM in this specific case) to fuse elementary objective metrics into a single, reliable estimation of subjective quality (the VMAF metric, Video Multi-Method Assessment Fusion). VMAF has then been used as an enabling technology for other optimization processes.
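The fusion idea can be illustrated in a few lines (this is not Netflix’s actual implementation or feature set): a Support Vector Regressor maps a vector of elementary objective metrics onto subjective scores, and the trained model is then used as a single quality estimator. The data below is synthetic.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
# illustrative elementary metrics per clip: [vif, detail_loss, psnr, motion]
X = rng.uniform([0.3, 0.2, 25, 0], [1.0, 1.0, 50, 40], size=(500, 4))
# synthetic opinion scores; a real training set would pair per-clip metrics with MOS
mos = (1 + 4 * (0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * (X[:, 2] - 25) / 25)
       - 0.01 * X[:, 3] + rng.normal(0, 0.15, 500))

# the "fusion": scale the features and regress them onto subjective scores
fusion = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
fusion.fit(X, np.clip(mos, 1, 5))

print(fusion.predict([[0.8, 0.7, 42, 12]]))  # predicted MOS for one unseen clip
```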

 

Content-Aware to the next level: Perception-based encoding

 

Over the last year, I’ve been involved in an extensive, ongoing NTT Data project that uses ML to optimize encoding. The objective of the project is to take Content-Aware Encoding to the next level and be able to encode with a target perceptual quality on screens of different sizes. I already introduced this as a new emerging trend in a previous blog post.

Instead of specifying a resolution and a bitrate, as in a traditional encoding, now we can specify only the target perceptual quality (e.g. a MOS rating from 1 to 5) and the maximum size of the screen on which the video has to be watched. The ML-driven algorithm will determine the encoding parameterization for each scene of the video to achieve the desired perceptual quality when watched on that target screen size. A high-complexity scene will require a higher average bitrate, while a low-complexity scene will require a lower bitrate. But the actual values and many other parameters will depend on the input content metrics, the target MOS and the target screen size.
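The mode can be pictured as an inversion of the quality model, as in the hypothetical sketch below: `predict_mos` stands in for the trained ML ensemble, and a simple search finds, per scene and per device class, the minimum bitrate that reaches the requested MOS. The numbers, and the model itself, are invented for illustration.

```python
def predict_mos(scene_complexity: float, bitrate_kbps: float, device: str) -> float:
    """Hypothetical quality model: bigger screens need more bits for the same MOS.
    A real deployment would use the trained ML ensemble here."""
    screen_factor = {"smartphone": 0.4, "tablet": 0.7, "tv": 1.0}[device]
    effective = bitrate_kbps / (scene_complexity * screen_factor * 1000.0)
    return min(5.0, 1.0 + 4.0 * min(1.0, effective))

def bitrate_for_target_mos(scene_complexity: float, target_mos: float, device: str,
                           lo: float = 150, hi: float = 20000, tol: float = 10) -> int:
    """Binary-search the per-scene bitrate (kbps) that first reaches target_mos."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if predict_mos(scene_complexity, mid, device) >= target_mos:
            hi = mid
        else:
            lo = mid
    return int(hi)

for dev in ("smartphone", "tablet", "tv"):
    print(dev, bitrate_for_target_mos(scene_complexity=2.0, target_mos=4.2, device=dev))
```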

Such a level of optimization provides a way to minimize bandwidth consumption, using only the amount of bits necessary to achieve the desired level of quality across different screen sizes. At the same time, using advanced player heuristics, it is possible to exploit the VBR output to also increase the QoE during streaming, delivering on average a higher quality compared to traditional types of encoding (e.g. CBR or capped VBR with a target average bitrate).

The project has required a massive campaign of subjective quality assessments performed on screens of various sizes. More than 14,000 quality ratings related to human perception have been analyzed, enriched and used to train an ensemble of ML algorithms. A variable set of elementary metrics (from 4 to 12) is used at different points of the project to characterize sources, encoded videos and codec performance, and forms the vector of input features for the predictors.

The first working version of this system is going to be used by an important broadcaster in Europe, and the results are very promising. For example, thanks to the training with perceptual ratings collected selectively on TVs/tablets/smartphones, the average bitrate of a typical TV series like Game of Thrones with a target MOS of ~4.2 (good on a 1-5 scale) is just 350 Kbps on smartphones, 900 Kbps on tablets and 2.1 Mbps on TVs, down 64%, 50% and 30% respectively from the bitrates of the previous static profiles.

 

Conclusions

 

ML is really a precious ally when developing optimizations in a wide range of scenarios. Previously, I used empirical approximations that worked well but in a sub-optimal way. Now ML allows a better fit, even if it may require a considerable amount of data to work properly.

The next steps are to increase the accuracy and performance of the pipeline, but I’m also exploring the use of ML on the player side of the equation, to further optimize ABR heuristics and player logic.