Engagement vs Churn in Video Streaming: How to Keep Viewers Hooked

[Note: Article originally published on linkedin]

In the hyper-competitive world of video streaming, keeping subscribers engaged isn’t just a nice-to-have; it’s crucial. Engagement is inversely correlated with churn: the more viewers interact with your content and platform, the less likely they are to cancel their subscriptions. But what exactly does that mean in practical terms? And how can streaming services measure and enhance engagement to reduce churn?

Let’s engage

Engagement refers to the degree to which users interact meaningfully with a streaming platform. It includes both quantity (how much time they spend) and quality (how immersive or satisfying that time is); higher engagement generally means higher user satisfaction.

Common metrics used to measure engagement include total watch time per month, frequency of use, average session length, completion rate, early-abandon rate and, more generally, viewing behavior. The indicators of a satisfying, “engaging” experience are many and not always straightforward to interpret; the same is true of the indicators of an unsatisfying experience, which are worth recognizing because they can anticipate abandonment of the service.

To deepen the understanding, engagement can also be broken down into active and passive dimensions. Active engagement includes user-initiated actions such as searching for content, creating watchlists, rating titles, and interacting with UI features like skip buttons or content previews. Passive engagement refers to behaviors like binge-watching a series, letting autoplay continue, or simply consuming a lot of content every day. Both are valuable: active engagement suggests intentionality and platform loyalty, while passive engagement may indicate frictionless consumption.

Moreover, it’s important to track engagement across devices and other dimensions of user behavior. For instance, mobile sessions might be shorter but more frequent, while smart TV sessions are longer and more immersive. Cross-device continuity and resumed playback are also strong signals of user engagement and satisfaction with the service.

What is Churn?

Churn refers to the rate at which subscribers cancel their streaming service subscriptions over a given period, typically a month.

The reasons why a user abandons a streaming service are many, but we can group them into three categories: QoE-related (subpar quality, technical issues, bugs, instability, a less-than-flawless experience), content-related (no interest in the content, content not updated, difficulty in content discovery, etc.) and other reasons (free trial expiring, price perceived as too high, temporary cancellations between content releases).

Data analysis can help in interpreting user engagement and the likelihood of churn. Advanced platforms deploy machine learning to predict churn weeks in advance using behavioral signals such as reduced session frequency and duration over time, longer browsing times without playbacks, or low variability in the content consumed.
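
To make the idea concrete, here is a minimal sketch of how such behavioral signals could feed a churn classifier. The feature names and the synthetic data are purely illustrative assumptions, not the schema of any real platform.

```python
# Minimal sketch of a behavioral churn model.
# Features, labels and data are synthetic and illustrative, not a real platform schema.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "sessions_last_30d": rng.poisson(8, n),
    "avg_session_minutes": rng.normal(40, 15, n).clip(1),
    "browse_minutes_without_play": rng.normal(5, 3, n).clip(0),
})
# Synthetic label: low usage and long fruitless browsing raise churn probability
logit = -0.25 * df["sessions_last_30d"] + 0.2 * df["browse_minutes_without_play"] + 0.5
df["churned"] = rng.random(n) < 1 / (1 + np.exp(-logit))

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="churned"), df["churned"], test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
risk = model.predict_proba(X_test)[:, 1]   # ranked churn risk can drive retention actions
print("AUC:", round(roc_auc_score(y_test, risk), 3))
```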

These behaviors serve as early warnings and help platforms trigger retention strategies, such as targeted offers or personalized push notifications; more simply, they should trigger internal alarms to analyze the phenomenon, find the root cause and fix it, especially when it affects a significant part of the audience.

The Inverse Correlation: Why Engagement Reduces Churn

Multiple studies and real-world cases show that higher engagement directly lowers churn. When users find value and entertainment in a platform, they are less likely to leave.

For example, a 2024 study from Owl&co, Streamonomics: Engagement vs Churn, Quantified, confirms the huge impact that engagement can have on churn (and on streaming companies’ balance sheets).

Inverse correlation between Churn and Engagement

It’s clear that, if engagement reduces churn, it is crucial to measure engagement and work to increase it as much as possible: eliminating the elements that reduce it, and introducing new elements, or improving existing ones, that increase it.

What Drives Engagement in a Streaming Service?

Improving engagement isn’t just about having more content; it’s also about designing the experience to keep users coming back. This means not only a well-designed streaming app but also high-quality, flawless streaming and, ultimately, a high QoE in video streaming.

Obviously content is king, so everything that revolves around the content is essential to increasing engagement. New and exclusive high-quality content, presented in a catchy way and contextualized with the viewer’s preferences, is the key factor.

However, a bad UX with navigation problems, bugs, convoluted logic or invasive advertising can hamper the experience and blunt the potential of high-quality content, especially in the long run, when compared to a better UX and a flawless experience.

At the same time, and even more importantly, low video quality, inconsistent streaming performance and rebuffering can have a huge impact on engagement. The best streaming platforms have accustomed us to a very refined QoE, and any worsening is difficult for the end user to tolerate.

Competition is fierce: any problem in streaming risks increasing churn.

How to Measure and Analyze Engagement Effectively

Analytics platforms like Mux, Conviva and NPAW’s Youbora, as well as in-house data teams, often use dashboards that combine raw logs with advanced metrics.

Event-driven telemetry is also key. Platforms instrument players and apps to emit real-time events (e.g., play, pause, seek, stall, buffer ratio, dropped frames) that feed into data lakes for batch or streaming analysis. These events are often correlated with user retention metrics using cohort analysis and regression modeling.
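
As an illustration, a playback event emitted by an instrumented player might look like the sketch below. The field names are assumptions for the example, not a standard schema.

```python
# Illustrative playback telemetry event (field names are assumptions, not a standard schema).
import json
import time
import uuid
from dataclasses import asdict, dataclass

@dataclass
class PlaybackEvent:
    session_id: str
    event: str          # e.g. "play", "pause", "seek", "stall"
    position_s: float   # playhead position in seconds
    buffer_s: float     # seconds of buffered media ahead of the playhead
    dropped_frames: int
    timestamp: float

def emit(event: PlaybackEvent) -> str:
    # In production this would be shipped to a collector / data lake; here we just serialize it.
    return json.dumps(asdict(event))

print(emit(PlaybackEvent(str(uuid.uuid4()), "stall", 812.4, 0.0, 3, time.time())))
```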

Engagement scoring models also integrate heatmaps of UI interactions, time-to-first-play (TTFP), and UI responsiveness to detect friction points. Machine learning models can segment users based on engagement profiles, enabling targeted actions like in-app tips, onboarding flows, or churn-prevention incentives.
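
A toy sketch of such segmentation is shown below: a handful of illustrative engagement features clustered into profiles. The features and cluster count are assumptions, not a production model.

```python
# Toy segmentation of users by engagement profile (features and clusters are illustrative).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row: [sessions/week, avg session minutes, time-to-first-play s, UI interactions/session]
profiles = np.array([
    [12, 45, 2.1, 9],
    [ 2, 15, 6.5, 1],
    [ 7, 90, 1.8, 4],
    [ 1, 10, 8.0, 0],
])

X = StandardScaler().fit_transform(profiles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. engaged vs. at-risk segments, used to drive tips, onboarding or offers
```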

Once an effective practice for measuring engagement is defined, it is of crucial importance to have a well-implemented A/B testing practice to measure the real impact that changes or new features have on the final users.
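
For instance, a retention-focused A/B test might end with a simple significance check like the one sketched below. The numbers are invented; a real analysis would also look at effect size and guardrail metrics.

```python
# Sketch: is the 30-day retention difference between control (A) and variant (B) significant?
# Numbers are invented for the example.
from statsmodels.stats.proportion import proportions_ztest

retained = [4120, 4310]   # users still active after 30 days in A and B
exposed = [10000, 10000]  # users exposed in each arm

stat, p_value = proportions_ztest(retained, exposed)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
```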

Finally, many companies are exploring synthetic monitoring (simulated sessions on real devices) and stress testing to detect engagement-affecting bugs or performance drops before actual users are impacted. Even more insightful is a benchmarking service like NTT Data’s OTT Observatory, where multiple competing streaming services are analyzed, compared and benchmarked with objective KPIs to estimate the user’s point of view.

Strategic Excellence: The Power of Technology in Shaping the QoE

Today, in a crowded and commoditized market, technological excellence is no longer just a differentiator; it’s a strategic necessity. A well-engineered streaming infrastructure can subtly but powerfully shape user perception, reduce friction, maximize Quality of Experience (QoE), and ultimately solidify long-term customer loyalty and engagement.

Great content may draw attention, but it’s the execution that slowly earns or destroys trust.

True excellence lies in mastering the complexity of the user-platform relationship. This means not only ensuring reasonably fast, buffer-free video playback and quick UI response, but also pampering users with refined, gorgeous and rock-solid video quality.

But how to dominate such complexity?

To dominate complexity, a platform must continuously collect, correlate, and act on data from multiple fronts: QoE measurements, Behavioral metrics and Engagement signals.

But metrics alone aren’t enough. The real advantage comes when this data feeds into a culture of testing, learning, and adapting. A/B testing should become a permanent practice, not just for content layout or recommendations, but for technical changes: new encoders and players, UI transitions, bitrate ladders, AI-driven ABR heuristics, even audio codec swaps.

When technology, data, and culture align, platforms can proactively refine, not just react. This mindset of continuous optimization allows services not only to detect pain points but to resolve them before users churn. It is a holistic approach where engagement, retention, QoE, and UX are treated as parts of the same system, not separate silos.

And yet, all this leads to one inevitable frontier: understanding and measuring Quality of Experience in real time.

As discussed, video quality is a front-line component of engagement, and when measured precisely, moment by moment, in each streaming session, together with other KPIs it becomes a compass for every strategic decision, from encoding strategy to UI prioritization.

Measuring Video QoE at the Core

At the heart of any effort to optimize engagement lies a deceptively complex question: how good is the user’s experience, in this moment, for that user?

To answer this, modern streaming platforms are moving beyond traditional quality KPIs like resolution and bitrate, embracing per-scene video quality assessment. The process begins with extracting frame-level/scene-level perceptual quality scores and content complexity indicators, producing a rich stream of metadata that can be interpreted in the context of each specific viewing session.
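
As a rough sketch of the first step, frame-level scores can be extracted with ffmpeg’s libvmaf filter and then aggregated per scene. Filter options vary across ffmpeg/libvmaf builds, and the file names and scene boundaries below are assumptions.

```python
# Sketch: frame-level VMAF via ffmpeg's libvmaf filter, aggregated per scene.
# Requires an ffmpeg build with libvmaf; option names and JSON layout may vary by version.
import json
import subprocess

subprocess.run([
    "ffmpeg", "-i", "encoded.mp4", "-i", "reference.mp4",
    "-lavfi", "[0:v][1:v]libvmaf=log_fmt=json:log_path=vmaf.json",
    "-f", "null", "-",
], check=True)

with open("vmaf.json") as f:
    frames = json.load(f)["frames"]
scores = [fr["metrics"]["vmaf"] for fr in frames]

scene_boundaries = [0, 240, 480, len(scores)]  # hypothetical scene cuts (frame indices)
for start, end in zip(scene_boundaries, scene_boundaries[1:]):
    segment = scores[start:end]
    if segment:
        print(f"frames {start}-{end}: mean VMAF {sum(segment) / len(segment):.1f}")
```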

These data points are then correlated with external variables such as:

  • Display size and device class (e.g., smartphone vs. 4K TV),
  • Viewing context (on the move, in the living room),
  • Content type (e.g., animation, dark scenes, high-motion action),

Once collected, this quality metadata is cross-referenced with session-level analytics and CDN delivery logs, including classical KPIs like:

  • Video Startup Time
  • Stall events and buffering,
  • Network conditions and adaptation behavior.

The outcome is a composite view of the session, a granular, data-driven map of how the stream was delivered and perceived. This lays the foundation for highly targeted A/B testing practices, where QoE predictions and session outcomes can be validated experimentally.

Did lowering the peak bitrate on low-complexity scenes actually preserve perceived quality on tablets? Does a more aggressive encoder preset cause subtle artifacts in high-contrast HDR scenes? Can adjusting buffer targets during scene transitions reduce rebuffering without hurting responsiveness?

These are no longer “rhetorical” questions; they become testable hypotheses in A/B tests. In the end, the effect of those technical improvements can also be cross-tested to evaluate their impact on engagement KPIs.


A Multi-Metric approach for Quality

A robust video quality framework should rely on multiple complementary metrics, each covering different aspects of visual fidelity and viewer perception. Among the most effective today we can mention:

VMAF (Video Multi-Method Assessment Fusion): the industry workhorse, effective for general perceptual quality measurement but with some well-known limits.

CAMBI: excels at detecting banding artifacts, particularly in gradients and low-light scenes.

IMAX XVS® Suite: a professional-grade toolkit focused on frame-by-frame perceptual video quality under strict visual norms.

While these tools are powerful, relying on a single score (especially VMAF alone) can be misleading. Over-optimization is always a risk and can create a false sense of control. In my measurement and optimization practice I profitably mix VMAF with the IMAX XVS Suite and other complexity indicators (e.g. NTT Data HHPower and bIndex) to get a more comprehensive and accurate multi-metric point of view. I have found the IMAX XVS® Suite, and in particular the NR-XVS® metric, to be significantly complementary to VMAF, VMAF NEG and banding metrics (like CAMBI and bIndex): it compensates for the limits of VMAF and the risk of overfitting that can emerge when optimizing video streaming to maximize a single metric.

VMAF provides an estimation of the fidelity of the encoded video, while NR-XVS® provides a more absolute score of the pleasantness of the video sequence. This difference can be exploited to gain deeper insight into the quality actually perceived by users in different viewing conditions, and to reach different, more aggressive levels of optimization while mitigating the risk of overfitting.

A multi-metric approach is not just more accurate and less error prone; it’s also more actionable (see the sketch after this list). When metrics remain individually exposed, they become input channels for predictive models capable of:

  • Better estimating real QoE in various streaming conditions and devices
  • Recognizing perception thresholds/corner cases based on content and context,
  • And enabling CAE (Context-Aware Encoding), which dynamically adapts renditions, encoding parameterizations and delivery strategies to match the complexity and relevance of what is actually being watched.
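
The sketch below shows the general shape of such a model: individually exposed metrics fused by a simple learned regressor against subjective scores. The metric names, feature values and MOS targets are invented for illustration; this is not a production recipe.

```python
# Illustrative fusion of individually exposed metrics into a QoE predictor.
# Feature names, values and MOS targets are made up for the example.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Columns: per-scene VMAF, banding score, no-reference score, mean luma, spatial complexity
X = np.array([
    [95.0, 0.1, 88.0, 120.0, 0.7],
    [93.5, 0.8, 72.0, 35.0, 0.2],
    [80.0, 0.2, 75.0, 90.0, 0.9],
    [97.0, 0.9, 70.0, 30.0, 0.1],
])
y = np.array([4.5, 3.1, 3.8, 2.9])  # subjective MOS collected on the same scenes (invented)

qoe_model = GradientBoostingRegressor(random_state=0).fit(X, y)
print(qoe_model.predict(X[:1]))  # predicted MOS for a new scene with the same feature layout
```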

In short, the final frontier is no longer just measuring quality. It’s about understanding it at scale, in real time with the objective to “master” it.

Because when you can see the quality of experience as clearly as the bitrate or frame rate, you unlock the ultimate lever of competitive advantage: the ability to shape, not just serve, viewer expectations.

Conclusion: Engagement as a Strategic Monopolizer of User Time

While high engagement is a clear sign of user satisfaction (and a powerful antidote to churn) it also triggers a second, often underestimated dynamic: the monopolization of user attention. As engagement crosses certain thresholds, the user’s available time becomes saturated by the platform itself, leaving little or no opportunity for competitors to intervene.

The more time a user spends within a service, the less cognitive and emotional bandwidth remains for rival platforms, whether they’re other streaming services, social media, or even gaming ecosystems.

Netflix has clearly embraced this logic: Its expansion into cloud gaming, mobile content, and future social-like features reflects an ambition to anchor itself deeper into the daily attention economy of the household. After all, it’s not just Prime Video or Disney+ that compete for screen time, it’s also TikTok, YouTube Shorts, and Instagram.

By capturing more moments in the user’s day, Netflix isn’t just increasing satisfaction. It’s preemptively reducing the window of opportunity for any other platform to insert itself. In the modern digital landscape, where time is the ultimate finite resource, owning the user’s attention equals owning the market.

Challenges of New Encoding Scenarios: Reflections on Measuring Perceived Quality

[Note: this article was originally published on Streaming Media]

In the ever-evolving landscape of video technology, new encoding scenarios present a set of new challenges in accurately measuring perceived quality. Accurate quality measurement is essential for assessment and even more so for optimization, enabling us to fully exploit the potential of these advanced scenarios.

The evolution of modern codecs and the integration of artificial intelligence (AI) have paved the way for significant advancements in video compression and quality enhancement. However, traditional metrics used today for assessing quality, such as VMAF, may be inadequate for those innovative approaches.

The Rise of Film Grain Synthesis and AI in Modern and Future Codecs

Modern codecs like AV1, VVC, and LCEVC are at the forefront of the current technological shift in video compression. One of their most notable features is the native support for film grain synthesis (FGS). Film grain and sensor noise, elements that add a sort of natural and cinematic feeling to video content, are ubiquitous but have historically been challenging to compress. Traditional methods struggle to maintain the quality of film grain without significantly increasing the bitrate. This is particularly challenging with codecs like H.264 and H.265, where Film Grain Synthesis (FGS) support is not standard. In these cases, grain and noise must be treated as high-frequency details and processed using motion estimation, compensation, and block coding—just like any other moving or static part of the image. Handling these elements effectively is a complex task.

The innovative approach behind FGS is to generate film grain algorithmically during playback instead of trying to compress it. The logic is to measure and remove grain during compression, to later re-introduce it algorithmically during playback, transmitting only inexpensive parameters for the reconstruction.

This method drastically reduces the amount of information needed, making it possible to achieve high-quality film grain with minimal data transfer. This leap in compression efficiency “would be” a game-changer for the new codecs and one of the reasons that may push their adoption. I say “would be” because it is currently not widely used, owing to the aforementioned problems in quality assessment.
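
For context, this is roughly how grain-aware encoding can be exercised with ffmpeg’s libaom-av1 wrapper: the encoder denoises the source and carries a film grain table for synthesis at playback. The option name and sensible levels depend on the ffmpeg/libaom build, and the file names are placeholders.

```python
# Sketch: encode with denoising + film grain synthesis parameters instead of spending bits
# on the grain itself. Option availability and behavior depend on the ffmpeg/libaom build.
import subprocess

subprocess.run([
    "ffmpeg", "-i", "grainy_source.mov",
    "-c:v", "libaom-av1", "-crf", "30", "-b:v", "0",
    "-denoise-noise-level", "15",   # denoise strength used to estimate the grain table
    "grain_synthesis_test.mkv",
], check=True)
```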

The reconstruction of the grain is, perceptually speaking, very pleasant and hardly discernible from the original, but pixel by pixel the high-frequency signal is very different from the original, causing an underestimation of quality in full-reference metrics like PSNR, SSIM, and also, to a lesser extent, VMAF.

In fact, VMAF is not fully reliable when grain retention comes into play, so it’s currently difficult to guide optimization efforts during codec or encoding-pipeline tuning without extensive and slow subjective assessments.

In many years of experience with this metric, some points of weakness emerged, in particular, an insensitivity to banding and to grain retention.

In Figure 1, you can see an example of excessive compression produced by Netflix with AV1. Notice that we are speaking of base AV1 without FGS, which Netflix has not yet used. Both AV1 and HEVC in the picture have a target VMAF of ~94, but the final AV1 result is plagued by banding, with a substantial elimination of high frequencies and grain. Globally, the AV1 encodings have a plastic feeling: sharp edges but zero grain and emerging banding. If VMAF fails at assessing proper grain retention, how could it assess the case of FGS? We definitely need a more reliable way to assess these scenarios.

AI-Powered Video Quality Enhancement and AI-based Codecs

Another transformative development is the use of AI models to enhance video quality. These AI models work by correcting artifacts, adding intricate details, and providing an apparent increase in resolution or a full, super-resolution upscale. This not only improves the visual experience but may eventually be tuned to enhance encoding efficiency. The ability to increase perceived quality while maintaining or even reducing data requirements is a significant breakthrough in video technology.

However, again, these advancements introduce complexity in quality assessment. Current full-reference metrics like VMAF are designed to measure quality relative to the original source video. When AI-enhanced techniques introduce new details, correct artifacts (like banding), or modulate grain, these metrics may not accurately reflect the improved visual quality. In fact, they might even suggest a decrease in quality because of the introduced “distance” from the original source. For example, if AI reduces banding that may be present in the source or enhances a vanishing texture, it introduces something “new” that traditional full-reference metrics identify as an encoding degradation/distortion, even if, perceptually speaking, it’s an improvement.

In Figure 2, you can see on the left the original picture (A) and on the right the enhanced one (B). After encoding both at a given bitrate, we will have a higher quality result when starting from B compared to A, but if we measure the quality with a full-reference metric as a measure of the degradation of quality compared to the source, we may end up with quite different conclusions.

Figure 2. The original image is on the left, the AI-enhanced image on the right.

In Figure 3, there are other examples of AI enhancements and deep neural network (DNN) super-resolution that introduce or better “evolve” details and increase the apparent resolution and sense of detail. But again, to assess reliably those improvements, we need more “absolute” metrics capable of estimating the quality on an absolute scale and not as a degree of degradation from the source.

Figure 3. More AI-enhanced and DNN super-resolution images

An even more complex scenario, that of future hybrid or purely AI codecs, poses new challenges in quality assessment and codec fine-tuning, because those codecs will have completely new types of artifacts and distortions and may use generative techniques to create realistic textures and features at very low bitrates, sacrificing fidelity in favor of pleasantness and plausibility.

The Need for New Quality Metrics

VMAF is a popular and widely used metric today. It is based on four elementary metrics correlated to a predicted quality score using a Support Vector Regressor (SVR) trained using subjective quality scores collected with Absolute Score Rating (ASR) methodology. This last detail suggests a certain degree of flexibility in predicting quality, but while it has been successful in many applications, it does have certain limitations that can impact its effectiveness in specific scenarios.

Here is a summary of some of VMAF’s key limitations:

  • It is a full-reference metric, meaning it requires access to the original, uncompressed source video to compare against the encoded video. This may be a limit in some scenarios.
  • It has been assessed with subjective data collected in “standard” viewing conditions; it lacks a more sensitive/demanding model to intercept subtle or emerging artifacts.
  • VMAF, while sophisticated, may not always accurately capture all types of video artifacts. For example, it has struggled with detecting specific issues like banding, blocking, or noise, which might not significantly impact the VMAF score but are perceptually noticeable to viewers.
  • It is designed to assess traditional compression artifacts and may not be well-suited for modern encoding techniques that introduce new types of visual changes. For instance, with the advent of AI-enhanced video quality improvements and film grain synthesis, VMAF may not correctly assess these enhancements and might even penalize them as degradations.
  • VMAF generally measures fidelity to the source video, which is not always synonymous with perceptual quality. Modern video enhancement techniques, like AI-driven super-resolution, can increase perceptual quality by adding detail or correcting artifacts, yet they may reduce fidelity to the original source. VMAF might not appropriately reflect these improvements, sometimes even indicating a lower quality score despite perceptual enhancements.

To address those issues, there is a growing need for objective metrics that can provide a more “absolute” quality score. Such metrics should evaluate the overall perceptual pleasantness of the video, rather than just fidelity to the source. This would allow for a more accurate assessment of the quality improvements brought by modern codecs, FGS, and AI enhancements.

One promising tool that I’m experimenting with is IMAX NR-XVS (formerly SSIMwave SVS, now part of the IMAX StreamAware | ON-DEMAND suite). NR-XVS is a no-reference metric that estimates the perceptual quality of a video sequence without needing access to the source video. It utilizes a DNN to correlate video features with subjective scores on a frame-by-frame basis over an absolute 0–100 quality scale.

In practice, XVS has demonstrated good sensitivity and linearity, making it a reliable tool for assessing video quality in scenarios where traditional metrics fall short. Before using XVS to assess video quality in no-reference scenarios, I studied the response and linearity of the metric by measuring clips encoded with x264 and x265 at various resolutions and bitrates or constant rate factors (CRFs). The statistical distribution is illustrated in Figure 4.
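
A study of this kind can be scripted roughly as below: sweep CRF values with x264/x265, then score each output with the metric under test and inspect monotonicity and linearity. This is only the shape of such a procedure, with placeholder file names, not the exact setup used.

```python
# Sketch of a CRF sweep used to probe a metric's response and linearity.
# File names are placeholders; the scoring step is left as a comment.
import subprocess

source = "clip_source.mp4"
for codec, lib in [("h264", "libx264"), ("h265", "libx265")]:
    for crf in (18, 23, 28, 33, 38):
        out = f"clip_{codec}_crf{crf}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-i", source, "-c:v", lib, "-crf", str(crf),
             "-preset", "medium", "-an", out],
            check=True)
        # Each output is then scored with the metric under test (e.g. NR-XVS or VMAF)
        # and plotted against CRF/bitrate to check monotonicity and linearity.
```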


Figure 4. SVS statistical distribution

The metric returned results that were quite linear and proportional to increases in bitrate and CRF, consistent with subjective scores. When applied to the cases of Figure 1, XVS is able to identify the lack of grain and the plastic “feeling” of Netflix’s AV1 (and HEVC, to some extent) compared, for example, to the same content compressed by Amazon Prime (HEVC), showing a difference of more than 3 XVS points, which is close to one JND (Just Noticeable Difference).

In general, the metric is sensitive to the presence of banding, to a poor high-frequency domain, and to proper edge and motion reconstruction. It’s not perfect, but it’s very promising, and the underlying model will in the future be able to estimate quality for many device types.

Exploring Hybrid Quality Assessment Approaches

While XVS and similar no-reference metrics show great potential, there is also a need for hybrid approaches that combine no-reference and full-reference metrics. This could provide a more comprehensive quality assessment, balancing perceptual pleasantness with fidelity to the source. For instance, a weighted score that considers both absolute and relative quality could offer a more nuanced understanding of video quality.

Projects like YouTube’s UGC no-reference quality assessment metric have attempted to address these challenges, but they often lack the accuracy and linearity required for the high-quality demands of OTT streaming services. Therefore, the development and adoption of reliable no-reference metrics, or hybrid systems, are crucial for optimizing new codecs, especially when film grain synthesis and AI enhancements are involved.

IMAX proposes the FR XVS metric with a similar hybrid approach. It is a full-reference metric, but in contrast to the “legacy” EPS, which is similar to VMAF in logic, FR XVS considers both source quality (NR XVS) and encoder performance (EPS) and accounts for information loss during encoding. This delivers a combination of source quality, full-reference quality EPS, and psychovisual effects in the model.

In the upcoming months, I plan to assess FR XVS to better understand if it is the solution to at least some of the problems we have discussed here.

Conclusion

The landscape of video encoding is rapidly evolving, driven by advancements in modern codecs and AI technology. These innovations offer tremendous opportunities for improving compression efficiency and visual quality. However, they also present new challenges in measuring perceived quality.

While VMAF has been a valuable tool for assessing video quality, its limitations highlight the need for complementary metrics or the development of new assessment methods. These new metrics should address VMAF’s shortcomings, especially in the context of modern video encoding scenarios that incorporate AI and other advanced techniques. For the best outcomes, a combination of VMAF and other metrics, both full-reference and no-reference, might be necessary to achieve a comprehensive and accurate assessment of video quality in various applications.

My presentation at Demuxed 2024

Six years after my last talk, I had the great pleasure and honor of speaking again at Demuxed 2024. In the presentation, I shared a real story from my work, including a phone call I got in the middle of the night about an issue that was affecting hundreds of millions of viewers around the world. Was the infamous Game of Thrones “dark night” episode turning into the “darkest night” of my professional career? 🙂

The answer? Luckily, no. By digging into old notes, past experiments, and over 20 years of experience in the streaming world, we managed to solve the emergency within a few hours.

If you’re curious, you can find the video of the talk right below.

Celebrating 20 Years of H.264: the foundation of modern Internet Streaming

Exactly 20 years ago, in May 2003, the Joint Video Team (VCEG and MPEG) approved the first version of the video codec known as H.264/AVC, a groundbreaking standard that would forever change the world of video. H.264 not only revolutionized video compression but also gave birth to and propelled the era of Internet streaming. It has enabled billions of computers, mobile devices and TV sets to record and play back videos with increasing capabilities over the years, adapting to the progressive increase in connection speeds and video resolutions. It has democratized video creation and consumption and has been crucial in making video ubiquitous.
Today, H.264 stands as one of the most successful international standards in the history of computer science, and its resilience over time demonstrates the exceptional work done by those brilliant researchers and scientists 20 years ago.

Standards play a crucial role in ensuring interoperability and preventing the concentration of technology control. H.264’s success as an open standard has empowered a diverse range of manufacturers, content creators, and streaming platforms to embrace and adopt it. This has fueled innovation, competition, and collaboration, benefiting end-users with a rich multimedia experience across devices and applications.

I’d also like to mention two other key events that helped H.264 become the de-facto standard for video streaming: the debut of an efficient H.264 decoder in the Flash Player (2007 – Flash Player 9) and the birth of the OSS encoder x264 (2004). The first event empowered almost 1 billion desktop computers with the capability to decode H.264 videos inside a browser, without switching to an external application. Several years later H.264 was adopted natively by every browser with HTML5 Video, but in those early years H.264 in Flash Player enabled and enhanced the experience provided by foundational services like YouTube or the BBC’s iPlayer.

The second event is no less important. x264 contributed an efficient implementation of H.264 encoding. The work started by Laurent Aimar and masterfully continued by, among others, Loren Merritt and Fiona Glaser has been truly exceptional and foundational as well.

Looking ahead, the question arises: can future codecs seamlessly carry the torch from H.264 and guide us through the next 20 years with the same virtuosity? The technology landscape is constantly evolving, and advancements are being made in video encoding and streaming. Subsequent codecs, such as H.265 (HEVC) and the more recent AV1 and VVC, have emerged, promising improved compression efficiency and enhanced visual quality.

Furthermore, future-generation codecs aim to address the growing demand for higher resolutions, immersive experiences, and bandwidth optimization. They will leverage cutting-edge techniques, including machine learning and artificial intelligence, to further refine video compression algorithms. However, they will face the challenge of not only surpassing H.264’s technical capabilities, which after 20 years is an easy task, but also gaining widespread adoption and compatibility across a vast ecosystem of devices and platforms. A real challenge indeed: earning the trust and the enthusiasm of the entire video ecosystem, the enthusiasm that made H.264 so crucial and so ubiquitous in our industry.

As we celebrate the 20th anniversary of the standardization of H.264, let us acknowledge its immense contributions to the world of video and thank everyone who contributed to this revolution. Happy birthday, H.264.

FCS and RTMP – Streaming Technologies from the future

I clearly remember the enthusiasm and excitement when I took my first steps in streaming and real-time communication. I was already fond of video compression and the interactive web and was working actively with the Flash community to create interactive experiences, but my career took a turn when Macromedia released Flash Communication Server 1.0. It was September 2002, and after 20 years the RTMP protocol, one of its foundational technologies, is still with us!

Pritham Shetty (helped by Jonathan Gay, the father of Flash) was the ingenious main author of this milestone in the history of video streaming. Pritham already had extensive expertise in real-time communication for the web: in 1996, for example, he developed for NTT a Java-based web client for connecting multiple users in a synchronized experience. Also in 1996, a personalization server he developed was even used by Netflix (when it was still very distant from the company we know today)!

FCS was an exceptional server capable of enabling real-time communication and live and on-demand video streaming features in Flash Player 6.0. The architecture of FCS was really ahead of its time: when I started working with it I had only a 640Kbps down/128Kbps up ADSL connection and a 64Kbps GPRS phone, and nonetheless it was possible to communicate in real time with other users over such connections and create futuristic interactive applications.

As we all know, 18 years after this exceptional release, the entire world came to depend on real-time communication technologies like Microsoft Teams, Google Meet or Zoom because of the Covid-19 pandemic. Think of FMS as a playground where Flash developers could easily develop video conferencing applications similar to those, with multiple audio-video-data streams produced in the browser by Flash, transported via the RTMP protocol, orchestrated server-side by FMS and consumed again on a Flash client.

I think the main advantage of this stack was simplicity and elegance, and I’ve always used the lessons learned with FCS in my career as an architect of media solutions. At FCS’s foundation there was a non-blocking I/O stack scriptable in ActionScript 1.0 (essentially JavaScript). Every user connection, application startup, disconnection and, in general, every interaction raised an event, and the code responded with actions or async I/O operations to connect RTMP streams in publish mode to RTMP streams in subscription mode and to orchestrate via script many other interactions and data sharing. (Curiously, the architecture is very similar to Node.js: when Flash was abandoned I easily moved to an early Node and FFmpeg to replace many of the use cases I used to serve with FCS.)

The simplicity and high efficiency of RTMP is also the main reason why it is still used today. RTMP allows streaming of interleaved audio, video and data tags over TCP, SSL or tunneled in HTTP(S), and it’s possible to pass transparently from real-time (a few ms) to live and VOD use cases, with RPC calls interleaved in the stream and easily recordable for interactive reproduction of communication sessions.



When working with this stack you had literally infinite possibilities a decade before WebRTC was conceived, and I ended up becoming an expert in FCS (later known also as Flash Media Server / Adobe Media Server), developing many advanced applications over the next 10 years (for example, thanks to the flexibility of the Flash + FCS duo I was able to design one of the first implementations of adaptive bitrate streaming for the first catch-up TV in Italy in 2008).

Unfortunately FCS/FMS/AMS never had the success and widespread diffusion it deserved, because of an absurd and limiting pricing model by Adobe. Nonetheless it has made an undeniable contribution to Internet streaming.

Happy 20th Birthday RTMP and kudos to the great FCS and its authors.

Defeat Banding – Part II

Recently banding has finally become a hot topic in encoding optimization. As discussed in this previous post, it is nowadays one of the worst enemies of an encoding expert, especially when trying to fine-tune content-aware encoding techniques.

Banding emerges when compression removes too many high frequencies locally in a frame, splitting a gradient into individual bands of flat color. Those bands are easily visible and reduce perceptual quality.

For years I’ve underlined that even a useful metric like VMAF was not able to efficiently identify banding and that we needed something more specific, or a metric like VMAF but more sensitive to artifacts in dark or flat parts of pictures, and hopefully a no-reference metric usable for assessing source files as well as compressed ones.

FIG.1 – Lack of correlation between VMAF and MOS in case of sequences with banding (Source: Netflix)

As anticipated in the previous post, in 2020 I started experimenting with some PoCs for a metric to measure banding, and the next year, working for one of my clients, I validated the logic in a “bandingIndex” metric. I’ll call it bIndex for the sake of simplicity.

Significantly, even Netflix was working on banding and presented (Oct 2021) their banding detection metric Cambi. Cambi is a consistent no-reference banding detector based on pixel analysis and thresholding, plus many optimizations to achieve solid and accurate banding identification.

The logic I’ve used is very different from Cambi and can be used to identify not only banding but many types of impairments using what I call the “auto-similarity” principle.

The logic of source-impaired similarity

The logic I explored is illustrated in the picture below:

FIG 2 – Auto-Similarity principle

A source video is impaired to introduce an artifact like blocking, banding, ringing, excessive quantization and similar.

If the impaired version of a video is still similar to its unimpaired self, this means that the original video already has a certain degree of that impairment, and that degree grows with the similarity index (the more similar the two versions are, the more of the impairment was already there).

I call it “source-impaired similarity” or sometimes “auto-similarity” because a video is compared to itself plus an injected, controlled and known impairment. The impairment needs to be one-off and not cumulative. Let me explain better:

By one-off impairment I mean a modification that produces its effect only the first time it is applied. For example, a color-to-gray filter has that characteristic: if you apply it a second time, the result doesn’t change anymore.

Now we have two things to choose: the impairing filter and the similarity metric.

So let’s suppose we want to find out whether a portion of video has banding or excessive quantization artifacts: in this case we can use quantization in the frequency domain as the impairment. This form of impairment has the characteristic described above: when applied multiple times, only the first application produces a distortion; the subsequent ones do not modify a picture that is already quantized at a known quantization level.

The most commonly used similarity metric is SSIM. It reaches a maximum of 1 when the videos are identical and drops below 1 when dissimilarities arise. It is more perceptually aware than PSNR and less sensitive to small deltas as long as statistical indicators like mean, variance and covariance are similar.

It’s very important to analyze the video divided into small portions and not as a whole, especially during metric fine-tuning, to better understand how to set thresholds and to verify the correct identification of the artifact. It is then also possible to calculate an “area coverage percentage” that provides interesting information about the amount of frame area impacted by the artifact under test (banding or other).

The high-level schema below illustrates the metric calculation. The fine-tuning of the metric requires other processing steps, such as pre-conditioning (which may be useful to exaggerate the artifact), appropriate processing of the SSIM values to keep only the desired information (non-linear mapping and thresholding), and a final aggregation (pooling) that summarizes a significant index for each frame.

FIG. 3 – Extraction of bIndex
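
To make the auto-similarity principle concrete, here is a toy sketch, not the actual bIndex implementation: each frame is impaired with a one-off 8x8 DCT quantization and then compared tile by tile to its unimpaired self with SSIM. Block size, quantization step and tiling are arbitrary choices for the example.

```python
# Toy illustration of the "auto-similarity" idea (not the actual bIndex implementation):
# impair the frame with a one-off 8x8 DCT quantization, then measure tile-wise SSIM
# between the frame and its impaired self. Similarity close to 1 means the tile was
# already "flat"/quantized there, i.e. banding-prone.
import numpy as np
from scipy.fft import dctn, idctn
from skimage.metrics import structural_similarity

def quantize_blocks(luma: np.ndarray, step: float = 16.0, block: int = 8) -> np.ndarray:
    h, w = luma.shape
    out = np.zeros_like(luma, dtype=np.float64)
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            coeffs = dctn(luma[y:y+block, x:x+block].astype(np.float64), norm="ortho")
            coeffs = np.round(coeffs / step) * step          # one-off quantization
            out[y:y+block, x:x+block] = idctn(coeffs, norm="ortho")
    return out

def auto_similarity_per_quadrant(luma: np.ndarray) -> list:
    impaired = quantize_blocks(luma)
    h, w = luma.shape
    tiles = [(0, 0), (0, w // 2), (h // 2, 0), (h // 2, w // 2)]   # Q1..Q4
    scores = []
    for ty, tx in tiles:
        a = luma[ty:ty + h // 2, tx:tx + w // 2].astype(np.float64)
        b = impaired[ty:ty + h // 2, tx:tx + w // 2]
        scores.append(structural_similarity(a, b, data_range=255.0))
    return scores  # values near 1.0 flag quadrants likely already impaired (e.g. banding)

frame = (np.random.rand(720, 1280) * 255).astype(np.uint8)  # stand-in for a decoded luma plane
print(auto_similarity_per_quadrant(frame))
```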

Conclusions

To develop, verify and fine-tune the bIndex metric, I extended a custom player I had developed in the past for frame-by-frame and side-by-side comparison. In the pictures below you can see indexes for each frame area: they are green when banding is not visible and red when banding is visible and annoying. The first picture also shows an overlaid, seekable timeline that plots the banding likelihood for each picture area and the threshold that differentiates between irrelevant and visible/annoying banding. In this way it’s possible to seek quickly to frame sequences that contain banding and evaluate the correctness of the detection.

This approach could be extended to many types of artifacts and used to assess various types of video (sources, mezzanines, compressed video) with different thresholds. Having statistical indicators based on frame coverage percentage is also useful for making decisions like source rejection or content re-encoding with specific profiles to fix the problem. Note that currently the thresholds have been identified using the perception of small panels of golden eyes on big screens, but in the future more complex modeling could be used to correlate the objective numbers with perception and to introduce other improvements like temporal masking and context-aware banding estimation.

HYPER: a decade of challenges and achievements.

Those who have followed me since the years of Flash Player and Adobe Media Server know that I’ve been busy over the last 15 years developing encoders, players and, in general, software architectures to enable, enhance and optimize video streaming at scale. I’ve achieved many professional successes working on innovative projects for companies like NTT Data, Sky, Intel Media, VEVO and many others. In such contexts I’ve had the opportunity to meet inspiring people: managers, engineers, colleagues and ultimately friends who helped me grow as an engineer and as a video streaming architect.

In particular, in 2021 I celebrate the 10th anniversary of Hyper, one of those achievements. But let’s start from the beginning:

Conception

In 2008 I started collaborating with Value Team, a leading system integrator in Italy (later acquired by the global innovator NTT Data). The BBC’s iPlayer had just been released and media clients started asking for something similar, so NTT Data contacted me to design a high-performance platform (encoder and player) for the nascent market of catch-up TV and OTT services. The product, VTenc, powered the launch in 2009 of the first catch-up TV in Italy (La7.tv, owned by Telecom Italia).

The most innovative feature of VTenc was the possibility of encoding a single video in parallel, splitting it into segments that were then distributed over a computing grid for parallel encoding. The idea emerged after a discussion with Antony Rose and his team (the creators of the BBC’s iPlayer), in which they underlined that one of the main problems in encoding for a catch-up TV was the long processing time that delayed the distribution of the encoded stream after the conclusion of the show on TV.

A few months and many technical challenges later, the feature was ready, and the idea of parallel encoding was successfully applied to La7.tv: we received the live program in “parts”, emitted every time there was an advertisement slot. Each part, usually 20-25 minutes long, was divided into smaller chunks, encoded in parallel and then reassembled and packaged with a map-reduce style paradigm. Also thanks to a client-side playlist, the final result was ready for streaming just 10 minutes after the conclusion of the live show.
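
The sketch below gives the general shape of that split / encode-in-parallel / reassemble idea using today’s tools. It is not the VTenc implementation: file names are placeholders, a local worker pool stands in for the computing grid, and the stream-copy split cuts only at keyframes.

```python
# Rough sketch of split -> parallel encode -> reassemble (not the VTenc code).
# File names are placeholders; a thread pool stands in for a computing grid.
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

def split(source: str):
    subprocess.run(["ffmpeg", "-i", source, "-c", "copy", "-f", "segment",
                    "-segment_time", "60", "-reset_timestamps", "1", "chunk_%03d.ts"],
                   check=True)

def encode(chunk: str) -> str:
    out = chunk.replace("chunk_", "enc_").replace(".ts", ".mp4")
    subprocess.run(["ffmpeg", "-y", "-i", chunk, "-c:v", "libx264", "-crf", "22",
                    "-c:a", "aac", out], check=True)
    return out

def reassemble(parts, target="final.mp4"):
    with open("list.txt", "w") as f:
        f.writelines(f"file '{p}'\n" for p in sorted(parts))
    subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0", "-i", "list.txt",
                    "-c", "copy", target], check=True)

split("show_part1.ts")
with ThreadPoolExecutor() as pool:   # in the real system, chunks go to a computing grid
    encoded = list(pool.map(encode, sorted(glob.glob("chunk_*.ts"))))
reassemble(encoded)
```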

An incredible result for the time, because commercial encoders required many hours for an accurate 2-pass encoding of assets 2-3 hours long. It was also one of the very first implementations of adaptive bitrate streaming in Flash (a custom implementation in Flash Player 9 + AMS, when Adobe officially introduced adaptive bitrate only in Flash Player 10).

The birth of Hyper

VTenc was improved in the following years until in 2011 NTT Data signed a deal with Sky Italy to provide the encoder for the new incoming OTT services of the broadcaster.
We implemented new features, and they chose VTenc for a variety of key points:

– flexible queue management system
– rapid customizability
– high video quality
– high density and scalability
– short time to market.

VTenc evolved into something more complex: NTT Data Hyper was born. The model was different from buying a commercial encoder, which usually had long evolution cycles and well-defined but inflexible roadmaps. Hyper has been closer to a focused, tailor-made and optimized encoding engine like those built by Netflix or Amazon, and often also a sandbox in which to conceive and test new technologies and ideas.

In 2021 we celebrate Hyper’s first 10 years.

A decade of innovative achievements and milestones

Since then we have reached many achievements and milestones facing the challenges of the last decade. Looking back, it has been an exciting journey, professionally intense and enriching. Some achievements that deserve to be mentioned:

– A long story of content-aware encoding approaches, from the first empirical versions (2013) to the use of ML to implement peculiar “targetMOS” and “targetDevice” encoding modes (2017+). I’ve always been a fan of “contextual optimizations” and started my experiments in the years of Flash Video, presenting some initial ideas at Adobe Max 2009-11, but then I’ve had the opportunity to implement those paradigms in the industry in various “flavors”.

– Internal caching logic for elementary streams that allows quick repurposing of previously encoded assets without executing a new, expensive encoding. With this logic we have many times been able to repurpose entire libraries (tens of thousands of assets) in a matter of days. In this way, adding a new audio format, changing ident, parental or other elements in the content playlist, or adding a new packaging format (e.g. a new version of HLS) has always been quick and inexpensive.

– In 2016 Hyper evolved from a grid computing to a hybrid cloud paradigm, with on-prem resources that cope with the baseline workload and cloud resources that absorb peaks. Having designed the software from the beginning around agnostic services and flexible work queues, hybridization was a natural step. Resources can be partitioned to obtain maximum throughput and cost efficiency on some queues and minimum time-to-output on others.

– In this last context my team designed a 2-step technique to generate on the fly a smart mezzanine with controlled perceptual quality, to quickly and conveniently move very big, high-quality sources to the cloud for parallelized encoding (with a bandwidth reduction of up to an order of magnitude).

– Full cloud deployments on AWS and Google Cloud that quickly and dynamically scale from just 2 on-demand instances to thousands of spot instances (“elastic texture”), with optimized scaling logic to minimize infrastructure costs and provide higher reactivity than standard scaling systems like autoscaling groups.

Now that Hyper has turned 10, and after several million encoding jobs, in 2021 we are going to finalize Hyper v2 and tackle new challenges (VVC, AV1, a complete refactoring, perceptual-aware delivery, an agnostic architecture to apply massively parallelized processing to other contexts). But that’s the subject of an entirely new story... for now, let’s celebrate:

happy 10th birthday Hyper!  


15 years of blogging about Internet Video

15 years ago I started this blog to share my experiments and points of view around video streaming, playback and encoding. It has provided important opportunities for my professional career and extended my circle of contacts in the world of video streaming professionals, and for that I’m grateful...

Unfortunately (or fortunately, depending on the point of view) I haven’t always had the time a blog deserves, especially in the last 5 years... but after more than a hundred articles and almost 2 million contacts I can say that the objective has nevertheless been achieved.

In the meanwhile, the trends of technical communication have changed profoundly. We’ve seen the rise and transformation of social media platforms like Facebook and Twitter, the increasing role of LinkedIn in presenting and sharing ideas in a professional environment, and the role of YouTube as a one-stop shop for presentations and conferences. I think, however, that a blog can still be a useful place to consolidate, share and persist ideas and to contribute to the community.

For the future, I’m trying to reorganize my activities to find more spare time to disseminate the knowledge and experience gained especially in the last 10 years, writing more posts and participating more in web conferences (hoping to restart live participation asap).

It could be interesting to completely refresh my series FFmpeg-The Swiss Army Knife of Video Internet (there are so many things to say about it and ways to use it more productively), or to analyze technically state-of-the-art codecs like AV1 and VVC as I did for H.264 and H.265 in the past, or again to continue analyzing optimization trends and new challenges, especially related to video processing architectures.

I’m rolling up my sleeves, stay tuned…

Defeat Banding – Part I

In my last post I discussed what I consider to be the current arch-enemy of video encoding: “banding”.

Banding can be the consequence of quantization in various scenarios today, particularly when the source is a gradient or a low-power textured area and your CAE (Content-Aware Encoding) algorithm is using an excessive QP.

Banding is more frequent in 8-bit encoding but is also possible in 10-bit encoding, and it is also frequent in high-quality source files or mezzanines when they have been subjected to many encoding passes.

Modern block-based codecs are all prone to banding. Indeed, I find H.265, VP9 and AV1 to be even more prone to banding than H.264 because of their wider block transforms (and that has contributed to an increase in banding in YouTube and Netflix videos in recent times).

As discussed in the previous post, it is easy to incur banding also because it is subtle and not easy to measure. Metrics like PSNR and SSIM, but even VMAF, are not sensitive to banding, even though it is easy for an average viewer to spot, at least in optimal viewing conditions.

This is an example of banding:

Picture with banding on the wall

The background shows a considerable amount of banding, especially in motion, when the “edges” of the bands move coherently and form a perceptually significant and annoying pattern. Below, the picture has its gamma boosted to better show the banding.

Zoomed picture with boosted gamma
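
For reference, a gamma boost of this kind can be reproduced with a few lines of code; the file names are placeholders.

```python
# Boost gamma on a decoded frame so faint banding becomes visible (paths are placeholders).
import numpy as np
from PIL import Image

img = np.asarray(Image.open("frame_dark_wall.png").convert("RGB")).astype(np.float64) / 255.0
boosted = np.power(img, 1.0 / 2.5)   # strong gamma boost lifts the dark gradient bands
Image.fromarray((boosted * 255).astype(np.uint8)).save("frame_gamma_boost.png")
```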

Seek to prevent

To prevent banding it is first of all necessary to be able to identify it, which is by itself a complex problem.
Recently I’ve tried to find a way (there are many different approaches) to estimate the likelihood of perceptually significant banding in a specific portion of a video.

I’m using an auto-correlation approach that is giving interesting preliminary results. This “banding metric” analyzes only the final picture, without reference to source files (which, in the case of mezzanines or sources, you obviously do not have anyway).

For example: here we have a short video sequence. When you watch it in optimal viewing conditions, you can spot some banding in flat areas. The content is quite dark (maybe you can spot someone familiar in the background 😉) so, as usual, in what follows I’ll preferably show the frames with boosted gamma.

The algorithm produces the following frame-by-frame report where an index of banding is expressed for each quadrant of the picture (Q1 = Top Left quadrant, Q2 = Top Right quadrant, Q3 = Bottom Left quadrant, Q4 = Bottom Right quadrant).

Below you can see Frame 1 with boosted gamma. From the graph above, we see that the quadrant with the highest banding likelihood is Q2. For the moment I have not yet calculated the most appropriate threshold for perceptually visible banding, but empirically it is near 0.98 (horizontal red line). So in this frame we have a low likelihood of banding, with only a minor probability for Q2.

FRAME 1

In the frame below we have an increasing amount of banding, especially in Q1 but also in Q2 (on the tree and sky). The graph above shows an increasing probability of perceptually visible banding in quadrants Q1 and Q2, and in fact they are above the threshold, while Q3 and Q4 are below.

FRAME 173

Then there’s a scene change, and for the new scene the graph reports a high probability of banding for quadrants Q1 and Q3 (click on the image below to zoom), an oscillating behaviour for Q2 (the hands are moving and the dark parts exhibit banding in some parts of the scene), while the Q4 quadrant is completely free of banding.

FRAME 225

Preliminary conclusions

As discussed, it’s very important to start from the identification and measurement of banding, because if you can find it, you can correct encoding algorithms to better retain detail and avoid introducing this annoying artifact. It’s also useful to analyze sources and reject them when banding is found, otherwise any subsequent encoding will only worsen the problem. The journey to defeat banding is only at the beginning... wish me good luck 😉

Thoughts around VMAF, ContentAwareEncoding and no-ref metrics

 

Introduction

Content-Aware Encoding (CAE) and Context-Aware Delivery (CAD) represent the state-of-the-art in video streaming today, independently from the codec used. The industry has taken its time to metabolize these concepts but now they are definitely mainstream:

Every content is different and needs to be encoded differently. Contexts of viewing are different and need to be served differently. Optimization of a streaming service requires CAE and CAD strategies.

I’ve discussed these approaches and the need for CAE and CAD strategies several times, and I’ve implemented different optimizations for my clients over the years.

Speaking about Content-Aware Encoding: at the beginning we used empirical rules to determine a relationship between the characteristics of the source (possibly classified) and the encoding parameterization needed to achieve a satisfying level of quality at the minimum possible bitrate. The “quality metric” used to tune the algorithms was usually the compressionist’s perception (or that of a small team of golden eyes) or, more rarely, a full-featured panel for subjective quality assessment. Following a classical optimization approach (read some thoughts here) we subdivided a complex “domain” like video streaming into subdomains, recursively trying to optimize them individually (and then jointly, if possible) and using human perception tests to guide the decisions.

More recently, the introduction of metrics with high correlation to human perception, like VMAF, has helped greatly in designing more accurate CAE models as well as in verifying the actual quality delivered to clients. But are all problems solved? Can we now completely replace the expert’s eye and subjective tests with inexpensive and fast objective metrics correlated with human perception? The answer is not simple: in my experience, yes and no. It depends on many factors, and one of them is accuracy...

 

A matter of accuracy

In my career I’ve had the fortune and privilege to work with open-minded managers, executives and partners who dared to exit their comfort zone to promote experiments, trials and bold ideas for the sake of quality, optimization and innovation. So in the last decade I’ve had the opportunity to work on a number of innovative and stimulating projects: various CAE deployments, studies on human perception to tune video encoding optimizations and filtering, the definition of metrics similar to VMAF to train ML algorithms in the most advanced CAE implementations, and many others. In the rest of the post I’d like to discuss some problems encountered in this never-ending pursuit of an optimal encoding pipeline.

When VMAF was released, back in 2016, I was intrigued and excited to use it to improve an existing CAE deployment for one of my main clients. If you can replace an expensive and time-consuming subjective panel with a scalable video quality tool, you can multiply the experiments around encoding optimization, video processing, new codecs or other creative ideas about video streaming. A repeatable quality measurement is also useful to “sell” a new idea, because you can demonstrate the benefits it can produce (especially if the metric is developed by Netflix, which brings immediate credibility).

However, since the beginning VMAF showed some sub-optimal behaviors in my experiments, at least in some scenarios. In particular, what I can even now recognize as VMAF’s Achilles’ heel is the drop in accuracy when estimating perceptual quality in dark and/or flat scenes.

In CAE we try to use the minimum possible amount of bits to achieve a desired minimum level of quality. This incidentally leads to very low bitrates in low-complexity, flat scenes. On the other hand, any error in estimating the quantization level or target bitrate in such scenes may produce a significant deterioration of quality, and in particular may introduce a noticeable amount of “banding”. Suddenly, a strength of CAE becomes a potential weakness, because a standard CBR encoding could avoid banding in the same situation (albeit with a waste of bitrate).

Therefore, an accurate metric is necessary to cope with that problem. Banding is a plague for 8-bit AVC/HEVC encoding, but can also appear in 10-bit HEVC video, especially when the energy of the source is low (perhaps because of multiple processing passes) and a wrong quantization level can completely eliminate the higher, delicate residual frequencies and cause banding.

If we use a metric like VMAF to tune a CAE algorithm we need to be careful in such situations and apply “margins”, or re-train VMAF to increase its sensitivity in these problematic cases (there are also other problematic cases, like very grainy noise, but in those I see an underestimation of subjective quality, which is much less problematic to handle).
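
One simple way to apply such a margin is sketched below: detect scenes that are dark and flat, where the metric is less trustworthy, and give the CAE decision some extra headroom. The thresholds and the 30% margin are arbitrary illustrative values, not a recommendation.

```python
# Sketch: add headroom to a CAE bitrate decision on dark/flat scenes where the
# quality metric may overestimate. Thresholds and margin are illustrative only.
import numpy as np

def scene_bitrate_kbps(luma: np.ndarray, cae_bitrate_kbps: float) -> float:
    mean_luma = float(luma.mean())         # darkness indicator
    spatial_energy = float(luma.std())     # rough flatness indicator
    risky = mean_luma < 60 and spatial_energy < 12   # dark and flat: banding-prone
    margin = 1.3 if risky else 1.0         # spend ~30% more bits where the metric is weak
    return cae_bitrate_kbps * margin

luma = (np.random.rand(1080, 1920) * 40).astype(np.uint8)  # stand-in for a dark scene's luma
print(scene_bitrate_kbps(luma, 1800.0))
```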

I’m in good company in saying that VMAF might not be the right choice for all scenarios: even YouTube, at the Big Apple 2019 conference, pointed out that VMAF is often not able to properly recognize the presence of banding.

Figure 1. VMAF overestimates quality on dark, flat, scenes

I could hypothesize that this behavior is due to the way quality was assessed when building VMAF: for example, the viewing distance of 2.5xH could reduce sensitivity in those situations. But the problem is still present in VMAF 4K, where the distance is 1.5xH, so maybe it is a weakness of the elementary metrics.

 

A case in 4K

Let’s analyze a specific case. Recently I conducted a subjective quality test on 4K content, both SDR and HDR/HLG. VMAF 4K is not tuned for HDR, so I’ll limit my considerations to the SDR case. The subjective panel was performed to tune a custom quality metric with support for HDR content, which was then used to train an ML-based CAE deployment for 4K SDR/HDR streaming.

The picture below shows a dark scene used in the panel. On the left you have the original source, on the right you have a strongly compressed version (click on picture to enlarge).

Figure 2. Source (left) vs Compressed (right). Click to Enlarge
Figure 3. Boosted gamma to show artifacts on the encoded version. Click to Enlarge

In Figure 3 you can easily see that the image is very damaged. It’s full of banding, and motion (obviously not visible here) is also affected, with a “marmalade” artifact. However, VMAF reports an average score of 81.8 out of 100, equivalent to 4 on a 1-to-5 MOS scale, which overestimates the subjective quality.

The panel (60 people overall, 9,000+ scores, 1.5xH from a 50” 4K display, DSIS methodology) reports a MOS of 3.2, which is still high in my opinion, while a small team of golden eyes reported a more severe 2.3.

From our study we find that the variance of the opinion scores for this type of artifact increases considerably, maybe because of differences in individual visual acuity and cultural aspects (viewers not trained to recognize specific artifacts). But a golden eye immediately recognizes the poor quality, and a significant percentage of the audience (in our case, 58% of the scores were 3 or below) will consider the quality insufficient, especially given the expectations of 4K.

This is a classic problem of relying on the mean when variance is high. VMAF also provides a confidence interval, which is useful for making better decisions, but the prediction still has an overestimated “center” for the example above, at least 2 JND away from the mean opinion score (not to mention the golden eyes’ score).

Anyway, below we can see the correlation between VMAF 4K and subjective evaluation in a subset of the SDR sequences. The points below the area delimited by the red lines represent content for which the predicted quality is overestimated by VMAF. Any decision based on such an estimation may be wrong and lead to some sort of artifact.

Figure 4. MOS vs VMAF 4K

 

Still a long journey ahead

VMAF is not a perfect tool, at least not yet. However, it has paved the way toward handy estimation of perceptual quality in a variety of scenarios. What we should probably do is consider it for what it is: an important step in a still very long journey toward accurate and all-encompassing quality estimation.

For now, if VMAF is not accurate in your specific scenario, or if you need a different kind of sensitivity, you can re-train VMAF with other data, change or integrate the elementary metrics, or build your own metric that focuses on specific requirements (maybe less universal, but more accurate for your needs). You could also use an ensemble-like approach, mixing various estimators to mitigate the individual points of weakness.

I also see other open points to address in the future:
– better temporal masking
– different approaches to pooling scores in both the time and spatial domains
– extrapolation of quality in different viewing conditions

As a final consideration, I find YouTube’s approach very interesting. They are using no-reference metrics to estimate the quality of source and encoded videos. No-reference metrics are not bound to measuring the perceptual degradation of a source/compressed pair of videos, but are designed to estimate the “absolute” quality of the compressed video alone, without access to the source.

I think they are not only interesting for estimating quality when the source is not accessible (or is costly to retrieve and use), as in the monitoring of existing live services, but they will also be useful as internal metrics for CAE algorithms.

In fact, modern encoding pipelines often try to trade fidelity to the source for “perceptual pleasantness” if this can save bandwidth. Using a no-reference metric instead of a full-reference metric could amplify this behaviour, similarly to what happened in super-resolution when passing from a more traditional cost function in DNN training to an “adversarial-style” cost function in GANs.

But this is another story…