Those who follow my Blog know that I’m specialized in optimization of video streaming. It’s a creative and challenging work, because efficiency in streaming is not only a matter of choosing the best codecs and protocols (because it is probable to have a very limited set to choose from) but it’s more important to have an open mind, a genuine passion for research and devotion to quality.
In short: It’s less important the tool you use than the optimized methods, expertise and vision that guide that tool.
With a bit of research and original approaches it is possible to achieve higher benifits from optimization of existing codecs than from adopting new codecs, especially when you optimize codecs, streaming protocols and playback strategies synergistically.
This is even more true today because we are experiencing a kind of “stall” in the adoption of HEVC, the current state-of-the-art codec, because of uncertainties linked to multiple patent-pools ambiguity and licensing cost. If HEVC continues to be hampered, the alternative will be to use AV1 but it is still under development and will require many years to spread across a fragmented market. I’m sure this scenario will change, soon or later, but that’s not an alibi to wait and continue to deliver unoptimized H.264 video (or VP9), even because the benchmark of the market, Netflix, it’s anything but motionless!
The benefits of that kind of synergic optimizations are also in the fact that they can be applied in large part to different codecs, so when a new efficient codec will become available, the same techniques will be adapted to the new comer and maybe new specific strategies will be invented.
I’ve already spoken about “adaptive encoding” logic in this blog post. Here I want only to summarize interesting trending techniques to optimize video streaming services that are gaining traction and that I’ve applied in recent years and I’m trying to improve continuously in my consultancies.
It has many names. Netflix calls it “Per-Title Encoding”, other call “Adaptive Encoding” or “Content-Aware encoding”. I’ve discussed it extensively here. This approach to optimization has gained traction after the article of Netflix about Per-Title Encoding, but it’s a technique yet to be fully exploited.
Streaming services are quite different and there are various ways to setup a Complexity-Aware encoding pipeline. There are simple ways to extimate complexity but also more complex metrics that take into account multiple variables and models of the HVS (Human Vision System). With such refined approach it is possible to control the level of quality delivered by the encoder and therefore optimize the encoding for a specific purpose.
Such approch can be implemented inside a codec (in loop optimization), or applied externally performing accurate analysis upfront the final encoding (usable only in vod encoding).
The more complex implementation is probably the one that uses Machine Learning to predict optimal parameterizations to achieve the desired result (I’ve worked on this for a client in the last 6 months. More on this in the coming posts…)
This is a variation of the former. The logic is essentially the same but it is not applied to the encoding (a “standard” encoder can be used), but it’s applied at streaming protocol level. The manifests can be manipulated to obtain specific performances according to the analysis of the “cross” qualities you have in the entire ABR set.
For example, if a segment in an HLS set has a too high quality (measurable with traditional objective metrics or with ML-guided metrics) it is possible (but not exactly easy) to manipulate the manifest so to alter how the player navigate across the renditions.
This approach requires accurate setting of the encoder to produce the desired range of qualities and the delivery is optimized at protocol/manifest level. Not straight-forward as the former but still interesting.
I’m not sure, but maybe I’ve forged a neologism! Perception-Aware Streaming. This refined optimization technique is something I’ve played with for a while. The “streaming” in the name is used to indicate that the technique involves both encoding and delivery. “Perception-Aware” indicates that the encoding and the delivery is performed taking into account of perceptual fenomena, and in particular the angular resolution of HVS.
Again I’ve already introduced the concept in a previous post. Essentially in this technique we create a super set of renditions. A sub set is calibrated for big screens, another sub set for tablet/laptop and a final sub set is calibrated for small screen devices.
Leveraging on a simplified model of HVS, angular resolution and known minimum distance from the screen, it is possible to conceal artefacts and provide an higher sense of detail and at the same time reduces the average bitrate especially on smaller screens. We can mix that with a variation of Complexity-Aware encoding to obtain a highly optimized encoding pipeline.
It’s an interesting topic. I’d like to write more about it if I find some time. I don’t exclude that in the next months I could write whitepapers on this topic for a couple of my clients since I’m going to apply this technique again, but in a more evolved form.
Optimize encoding with the forementioned techniques leads very often to VBR encoding (capped, controlled in some more sofisticated ways or also unconstrained sometimes). Such files require dedicated heuristic to properly execute Adaprive Bitrate Streaming.
In recent times, more optimized and efficient heuristics have emerged compared to the traditional bandwidth-based heuristics. Buffer-based or hybrid heuristics allow a much better exploitation of bandwidth, more resilience to bandwidth fluctuations and can cope easily with VBR renditions. Stay tuned to know more about it.
Taken alone, each of that optimization schema provides interesting benefits, but the real gain is when you can optimize them synergistically. Those strategies are strictly correlated and can strengthen each other and enable new levels of efficiency.
For example, complexity-aware encoding and perception-driven streaming are more efficient when you can encode in VBR, but VBR encoding requires a player with custom and optimized heuristic, in return, a custom heuristic not only can cope with VBR but can also implements more efficient ABR handling and contributes greatly to maximization of QoE.
In the previous post of this 2 parts series, I have analyzed the technical features of the codec VP9 and concluded that, technically speaking, VP9 has the basis to compete with HEVC in terms of encoding efficiency.
But, you know, theory is a different thing than reality and in video encoding a big part of the final efficiency is in the encoder implementation more than in the codec specification. In this regard VP9 is not an exception and what I see from my tests is that vpxenc (the open source, command line encoder provided by Google) is not yet fully mature and optimized for every scenarios. I’ll discuss about this latest distinction more over.
VP9 specification has many features that can be used to enhance perceptual-aware encoding (like “segmentation”, to modulate quantization and filters inside frames according to perception of different areas of each frame). But those features are not yet used in vpxenc and this is clearly visible in the results.
At the beinning of 2015 I evaluated the performance of several H265 encoders for my clients and published a quick summary of the advantages and problems I found in (that time) HEVC encoders compared to optimized H264. The main problem that emerged in that evaluation was the inefficiency of “Adaptive Quantization” and other psycovisual techniques implemented in the encoders under test. The situation has partially changed for HEVC encoders during last year (thanks to better psycovisual encoding, especially for x265) but grain and noise retantion, especially in dark areas, is always a challenge for codecs exploiting big “transformations” like H265 and, indeed VP9.
Vp9 today shows the same inefficiencies of HEVC 1 years and half ago. It is quite good in handling motion related complexity, thanks to advanced motion estimation and compensation and reconstructs with high fidelity low and medium spatial frequencies, but has difficulties in retaining very high frequencies. Fine film grain disappears even at medium bitrates and the “banding” artifact is very visible in flat areas, gradients and dark areas even at high bitrates. In this regard H264 is still much better, at least at medium-high bitrates. Those kinds of artifact are quite common on Youtube because they are using now VP9 everytime they can, so try by yourself a 1080p or 2160p video on Chrome and take a look at gradients and shadows.
The sad thing is that common quality metrics like PSNR, SSIM (but also the more sofisticated VQM) are more happy with a flat encoding than with a psyco-visually pleasant, but not exact – encoding, and at the end, VP9 may be superior in PSNR or SSIM to H264/H265 even in a comparison like that of Picture 2 below where is very evident the banding or “posterization” effect.
VP9 profile 2 – 10bit per component
Until now I’ve spoken about traditional 8bits/component encoding in H264, H265 and VP9. But vpxenc supports also a 10bits per component encoding known as VP9 profile 2.
Even if your content is at 8bit and everything remains BT.709 compliant, several studies has demonstrated that 10bit encoding is always capable of better quality/bitrate ratios thanks to higher internal accuracy. In particular the benefits are well visible in gradients and dark areas’ accuracy. See this example of VP9 8bit vs 10bit:
In the picture above we can see the better rendering of soft gradients when encoding at 10bits even if the source is 8bits. Grain (high freq, low power signal) is still not retained compared to the source but banding is pretty much reduced. Note also that in the case of VP9 profile 0 we need to increase the bitrate well above 3Mbps to have a good encoding of gradients (for 1080p) while at only 1Mbps the result is in this case sufficient when using profile 2.
The superiority of 10bits encoding has been always valid also for H264 (high10 profile), so why 10bits have started to gain momentum only with HDR and not before ?
The answear is “lack of players” on consumer’s devices. Let’s remember that H264 has become relatively early the standard in internet video only because Adobe decided to insert (at it’s own expense) a decoder inside Flash Player 9 (2007). This enabled a billion desktops to playback baseline, main and high AVC profile. Few know that originally it should support also high10 but a bug ruined the opportunity to actually use this function.
Apart this missed opportunity, H264 decoders on modern browsers, mobile devices, TVs, STBs are not capable to decode H264 high10 profile and the same is true for VP9.
Where is VP9 available now ?
Today VP9 is supported in lastest Chrome, Firefox, Opera (and Edge in preview) browsers on desktop (PC and Mac) and is supported in Android from version 4.4 on (software or hardware decoding depending by device). It is also available on an increasing number of Connected TV, but all the current (significative) decoders support only VP9 in mode 0, so 8bit.
The same problem is true for H265. On the mobile devices that support it, you can only deliver 8bit H265, but in this case it is also true that the large majority of 4K TVs support HEVC main10 profile as well.
So, when is convenient to use VP9 ?
The problem of “banding artefact” is directly proportional to the size of the display. It is irrelevant on small displays like that of smart phones and tablets. On laptop it starts to become visible and is pretty bad on big TVs.
So, concluding, I think that today VP9 is an interesting option for everyone who wants:
– The maximum quality-bitrate ratio on desktop even with some compromises in terms of quality. HEVC decoding will probably not appear on desktop for a long time, so VP9 is the only viable improvement over H264. The use case of live streaming can better fit the compromises.
– High efficiency on Android with a wide support base (Android >4.4). On an old, 100$ Android Phone I have, VP9 decoding works and HEVC not. Interesting option for markets of developing countries when bandwidth is scarce and Android has a bigger base than iOS.
If the current situation doesn’t change I doubt that players like Netflix will deliver high quality content on Desktop or TV using VP9 in profile 0, especially for 4K. And infact David Ronca of Netflix has said that they are evaluating VP9 especially to lower the level of access for mobile devices (they already use HEVC for HDR-10).
But fortunately the scenario is probably about to change quickly if it’s true that Youtube is planning to deliver HDR (=10bits) with VP9 during summer. This means that TVs with Vp9 profile 2 decoding capabilities are becoming a reality and this should open the way also for profile 2 on desktop browsers. In this case (and I’m optimistic), VP9 has really good chances to definitively become the successor of H.264 at least for Internet Video on Desktop and Android.
Remain to see what Apple will decide to do. In the while I’m starting to push VP9 in my strategies because Indeed I think that their choices are irrelevant. If we want to optimize a video delivery service it is increasingly clear that we will have to optimize for all 3 codecs.
A technical primer
VP9 is a modern video codec developed by Google as the successor of VP8. While VP8 was aimed at offering an open alternative di AVC (aka H.264), VP9 challenges the latest HEVC (aka H.265). Google follows with VP9 the same model of “open” codec used for VP8 (the fact to be really open and free from patents related threats is still object of debates) and this theoretically makes of VP9 an interesting alternative to HEVC which is burdened by unclear and unsettled claims by multiple patents holders and patent pools.
VP9 specification has been freezed in June 2013 but only recently it is starting to attract attention of players that want to optimize video distribution (Youtube has been the only big adopter during last year, but now also Netflix is evaluating to use it). This is because VP9’s and HEVC’s ecosystems have finally reached a minimum level of maturity and is now possible to do evaluations and comparison with a sufficient level of confidency.
In this short serie of blog posts I analyze VP9 and try to understand if it really deserves attention and why. In this first part we will take a look at the technical specifications compared to HEVC (analyzed in this previous post) and in the second part I’ll analyze the actual performances, limits and contexts in which is possible to use VP9 as a valid alternative to AVC or HEVC.
VP9 subdivides the picture in “super blocks”. Similarly to HEVC, in VP9 super blocks can be recursively divided in smaller blocks down to 4×4. Differently from HEVC that can subdivide only in square sub partitions (32×32, 16×16, 8×8) VP9 can also use not square partitions like 32×16, 8×16 and so on (the use of rectangular partitions stops subdivision in the quad-tree branch). Most decisions are taken at level 8×8 (“skip” signaling for example) and 4×4 is a special case of 8×8. prediction mode, reference frame, MV, transform types are specified at block level.
Like VP8, VP9 uses an 8bit arithmetic coding engine known as the bool-coder. It use a static per-frame statistical model compared to an adaptive stat model like cabac used in AVC/HEVC. For each frame, the more convenient statistical model is choosen from a pool of four.
Similarly to H265, VP9 uses 4 transform sizes: 32×32, 16×16, 8×8 and 4×4. Transformations are integer approximations of DCT (Discrete Cosine Transform) or DST (Discrete Sine Transform), a mix of the two are used depending by the type of frame and transform size. Coefficients are scanned with particular patterns (different from the zig-zag patterns of H26x codecs, but with the same logic).
VP9 uses 4 scaling factors: a couple for Luma DC and AC coefficients, and a couple for Chroma DC/AC. The set of quantizers are fixed at frame level, so there is no block-level QP adjustment contrary to AVC/HEVC (but the not mandatory feature “segmentation” should be able to achieve the same effect of an adaptive quantization).
VP9 supports also a special lossless mode that uses only a Walsh transform on 4×4 blocks.
Intra prediction is a bit less complex than what offered by HEVC. Intra prediction acts on transformation blocks and there are 8 directional prediction modes and 2 not-directional compared to the 35 modes of HEVC
VP9 uses 1/8th pel motion compensation (double the precision of AVC). A novel feature is the possibility to use normal, smooth or sharp 8th pel interpolation filter (+bilinear). The proper version of the filter can be changed at block level.
Because of patents VP9 doesn’t use bidirectional motion estimation and compensation, so each block has normally only a single forward motion vector. However VP9 uses “compound prediction” where there are two motion vectors and the two predictions averaged together. To avoid patents, “compound prediction” is enabled only on not visible frames (commonly referred as “AltRef”). AltRef can be “constructed” during decoding, are not visible but can be used later as references. Since it’s possible to anticipate in an AltRef a future frame and use it as reference in compound mode, VP9 officially has no B-frames but in fact it has something completely equivalent.
Motion vectors in a frame can point to one of three possible reference frames usually named Last, Golden and AltRef. Ref frame to be used is signaled at 8×8 granularity. The decoder holds a list of 8 reference frames (slots) from which Last, Golden and AltRef refs are choosen at frame level. After decoding, the current frame can (optionally) substitute one of the 8 slots in the pool. An interesting feature of VP9 is the possibility to scale down frames during encoding (not on iframes). Inter predictors and reference frames are scaled accordingly.
Motion vector prediction is similar in complexity to HEVC. A 2-entry list of predictor is build during encoding and decoding. The first predictor is based on surrounding blocks, the second on previous frame. In case of empty list a vector 0,0 is used. So for each block the bitstream can signal to use:
-the first predictor plus a delta
-the first predictor as is
-the second predictor as is
-simply use motion vector [0,0]
There are 3 possible filters at different strength. VP9 makes a flatness test at boundaries of blocks and if the result is higher than a threshold, one of filter is applied to conceil blockiness.
Segmentation groups together blocks with similar characteristics. It is possible to change some encoding techniques at group level. This feature is dedicated to implement encoding optimizations (including psycovisual optimizations) and require an active support in the encoder.
The standard VP9 (profile 0) supports only a 8bit – 4:2:0 color mode while the (optional for hardware) profile 1 supports also 4:2:0 / 4:4:4 and optional alpha. In August 2015 Google has released a new version of the reference encoder capable to support the new profile 2 profile 2 (10-12bit -4:2:0) and profile 3 (10-12bit -4:2:2 / 4:4:4 + alpha). Profile 2 is aimed at supporting HDR video in Youtube (expected for summer 2016).
VP9 compared to HEVC
From a technical point of view, VP9 appears to be very near to HEVC as potential efficiency. The actual performance depends by the efficiency of the real encoders, but VP9 has all the potentialities to reach (almost, see below) the same performance of HEVC.
VP9 is a bit sub-par in terms of intra frame prediction (less modes) and of entropy coding (static tables vs adaptive). HEVC appears also to have an higher number of modes and small strategies to reduce the cost of syntax and signaling as well as residuals but on the other end, VP9 has some interesting potentialities in psycovisual optimizations and rate-control thanks to segmentation and adaptive frame resolution.
We will see in the next post the level of efficiency now reached by VP9 encoder compared to AVC and HEVC and the level of maturity of the respective ecosystems.
Online Video: infancy, youth and maturity
Over the last decade the consumption of online video has undergone an exponential growth, but online video is as old as the Internet itself. Recently Dan Rayburn has published a blog post about the early history of the streaming media industry, an “era” (1995-2005) where pioneers started experimenting codecs, products and models for the distribution of video over the Internet.
But it’s only with the launch of Youtube (2005-2006) that online video started a really tumultuos growth to become the preminent portion of global IP traffic. The ride of online video has been so intense that today the traffic generated by video is more than 70% of the total Internet traffic, orders of magnitude higher than 10 years ago (and still growing…).
We can say that nowadays online video has entered a phase of maturity. It is a multi-billion business ran not only by giants tech companies like Youtube, Netflix, Facebook, Amazon, Hulu, Apple, Vevo, but also by a multitude of traditional broadcasters (BBC, HBO, Sky just to name a few) with their regional OTT services.
The pressure of competition is now really high and this will bring many benefits to end users on many fronts, even that of QoE’s optimization.
Why optimize video streaming ?
Infact, until very recently, no one really cared about video optimization. Like any business in its early stages it was more important to place on the market the right product (and then find a viable business model before running out of money) than anything else including optimization of QoE. Simply put: If it worked it was enough.
But now things have changed. It cannot simply “work”, user expectations are constantly growing and it’s increasingly harder to engage users (see graph below). In this scenario optimization of streaming is becoming a key technological factor to differentiate a service from competitors, increase the satisfaction / retention and reduce costs.
Source: Conviva CSR2015 -How Consumers Judge Their Viewing Experience
How to optimize ?
If it’s clear what are the reasons to invest in streaming optimization on the other hand it’s not so easy to find the right way(s) to accomplish it. Users push the play button and want only to watch their preferite video flowlessly. But we know that behind-the-scenes there’s a lot of work to do to maximize that user experience. It’s a tangle of codecs, streaming protocols, multiple DRMs and CDNs, advertising, interactions flows, personalized experieces and so on.
At the end of the story, users want the max possible quality through out the video, a fast start and zero rebuffering on every screen. It’s up to us to untangle the skein and fulfill those expectations.
The points to be optimized are many but, in my opinion, the three more important are:
1. Video encoding optimizations (Quality)
2. ABR streaming optimizations (Robustness of distribution)
3. Playback optimizations (Reliability of streaming, start time, other aspect of QoE)
I have touched those points many times in the last 8 years in several projects (optimization of encoding pipelines and/or codecs, optimization of streaming protocols and servers, optimization of players) or during conferences (see Adobe Max 2009 / 2010 / 2011) and I’ve made “online video optimization” one of my distictive competencies.
In general, the matter is complex, the variables are multiple and there are also many boundary conditions so there’s no single recipe. Maximize the QoE requires the coordination of “optimization campaigns” in each of the aforementioned areas.
This requires flexible instead of static approaches, open-mindness instead of dogmas, desire for excellence (both for consultant and customer, paradoxically not so common to find in the latter), but also a mix of scientific approach and inspiration, remembering always that success is in the detail.
Create coordinated optimization strategies in encoding, delivery chain, and players is very complex so in this article I want to talk mainly about encoding optimization. This topic has become hot recently because of this post on the Netflix’s blog. They call it “Per-Title Encode”, I call it “(Content) Adaptive Encoding”.
I have worked on this topic for many companies like for example NTT Data, Sky Italy, Intel Media (acquired by Verizon), EvntLive (acquired by Yahoo!) and lately Vevo. I recently co-authored this article on Vevo’s tech blog on how we have optimized encoding of 200.000+ videos in Vevo during 2015. I suggest to read that article to have an high level introduction of the next topic: Content Adaptive Encoding.
“All fixed set patterns are incapable of adaptability or pliability. The truth is outside of all fixed patterns” Bruce Lee
Encoding Video is a very complex process.There’s often the temptation to over-simplify complex things and encoding is not an exception. So usually everyone encode video with a predefined set of parameters that satisfy some requirements (usually quality and/or target bitrate). But why should we use a single set of parameters (resolution, bitrate, encoding profiles) when we have very different kind of video and/or playback conditions ?
Static solutions to complex problems are rarely capable to produce best results. If we have mutable conditions and mutable data we need to adapt to them if we want to get closer to the optimal solution.
To exemplify the concept let’s make a parallelism with the problem of “function approximation”. If we need to approximate an arbritary function (see picture below), how can we hope to have a useful solution using a single 0-order approximation (red line on the left) ? It is too coarse, and the error that we get using it is very high (at least in some situations, i.e. for x -> 0). It’s clear that a first order approximation would be better (green line on the left) but still sub-optimal. Like in many other situations it’s even more useful to partition the problem in smaller (simpler) ones, in this case also a set of simple 0-order approximations (red lines on the right) would be considerably better at estimating the function than the original, ultra simplified approach, not to mention a “set” of first-order approximations (green lines on the right).
The partioning of the problem’s domain helps to avoid over-simplifications
Making a parallelism between this problem and the encoding, approximate with a 0-order estimator is similar to encode everything with the same resolution-bitrate “mix” (a.k.a ABR ladder).
The one-fits-all solution is simple, but far away from being optimal. We must be “Adaptive” in the sense of elaborating dynamic strategies to optimize the system.
There are many ways to optimize encoding but my preferite is, like said above, to partition this multi-dimensional problem in to sub-domains or clusters. We have not to apply necessarily rigorous math, it’s often more a matter of common sense. If we have a complex problem, let’s try to break it down to simplier pieces, easier to solve.
For example, in the case of encoding for ABR, we have commonly video with different complexities (a first variable to analyze) and we watch video on different devices (a second variable to take into account). A static ladder (for ABR streaming) is usually designed for the worst case and like a 0-order provides a sub-optimal performance.
We know that low complexity videos (like talking heads or fixed camera videos) are indeed much easier to encode than complex videos (like sports or action movies). This is inherently related to the way modern codec compress video data. They exploit temporal and spatial redundancies. Simple motion can be predicted from past frames and high spatial frequencies are stripped away by quantization.
A low complexity content can be compressed much more than a complex one, and this with approximately the same perceptual quality.
This is a first partition we can apply to the problem. Let’s classificate the content according to the complexity and apply specific encoding setups to optimize the overall performance toward desired goals.
Do you want to save bandwidth globally ? Why not encode content at different bitrates according to their complexities ? You will have a consistent perceptual quality but savings in bandwith consumption, globally.
You want higher average quality ? In this case, let’s encode simplier content at higher resolutions compared to the resolution we would use using a single, static setup that’s usually calibrated on the worst case (which is high complexity).
Medium Complexity (click to enlarge): firstname.lastname@example.orgMbps (left) vs 720p @2.0Mbps (right)
Finding the right recipe is not easy because things may get more complex if we go down in this process. For example, complexity is not a scalar property of a video but a local attribute (complexity can change frame by frame, or at least scene by scene). If we join this with the fact that we may have constraints set by other elements of the pipeline the logic with which we try to approximate the optimal solution may become complex.
Just to make an example, in ABR streaming we are usually forced to encode video in capped VBR (if not CBR) because of player’s heuristic (this is why I’ve said before that the “final” optimization would be to set coordinated optimization strategies for encoding, distribution and playback. You need usually an optimized player to handle VBR encodings).
So to improve the optimization level, we may need to consider not only the average complexity, but also the maximum complexity through-out the video and apply dynamic parameterizations accordingly. Furthermode, complexity may be spatial (high frequencies in the image due to nitid picture or noise) or temporal (high level of motion, more difficult to encode for traditional codecs based on motion estimation and compensation). Different complexities deserve different weights inside our “optimization function” and specific parameterizations.
Viewing Context-aware encoding
Another variable is represented by the viewing conditions. Why apply the same resolution-parameterization for the same level of bandwidth when the video is watched on quite different screens ? The human eye has a specific angular resolution, so small defects in the picture quality are not visible at high DPI (like that of a smartphone) while the same is not true for low DPI screens like that of a TV. Mix that with the variable distance of viewing and we have another set of variables that we can optimize encoding for.
Example of different sensitivity of vision. The pictures above simulate the playback of the same video at different screen sizes: approx a smartphone screen for the upper image and a tablet (double diagonal) in the lower, cropped image. The picture is the same, simply enlarged. Note that artefacts of encoding are very visible on the lower image, but much less in the upper.
Considering the different sensitivity to artifacts of the eye at different DPI we can optimize the ABR ladder with resolutions-bitrates-parameterizatons specifically choosen to conceal artifacts in specific viewing conditions.
There are other interesting aspects that enter in the mix of strategies that you can use.
I have no time to analyze them here, but they worth a mention:
– Multi-Codecs encoding: leverage the best codec available on each platform. ie. VP9 on Android / Chrome / FF, HEVC on 4K TVs and H264 every where else.
– VBR vs CBR: use VBR whenever possible. This requires custom player so i.e. is feasible today in DASH for Android and Browser but not for HLS in iOS. Will require multiple encodes but may worth the effort.
– Another interesting topic is the distance and number of renditions inside an ABR ladder. Different network conditions (i.e. mobile vs broadband) may require different setups.
– Special renditions: sometimes I have defined special renditions for special cases that may have specific goals and characteristics (i.e. special renditions to speed up initial buffering efforts).
Concluding, if we mix various strategies, the improvement in QoE and bandwidth consumption may be considerable. Consider that optimize quality/bitrate ratio generates always an increase in QoE both directly and indirectly. Infact, with giants like Netflix that monopolizes the bandwidth (40+% of Internet traffic in USA at peak times) the services that are not optimized will start to suffer (or probably are already suffering). ABR streaming cannot be used any longer an “alibi” for un-optimized encoding, it’s no longer sufficient to be in the market, you’ve to master technology, smooth edges and give the maximum to be competitive. It’s time to optimize.
I must admit, I’m feeling very guilty. This is the only new post in more than 1 year. 2013 has been wonderful from a professional point of view and I have had very few moments, if any, to dedicate to the blog. But for 2014 there are too many interesting trends that I can’t neglect anymore and so I want to return speaking about video encoding, streaming and OTT technologies.
Infact, you know that there are three magic “words” that are outlining the future of video: 4K, HEVC and DASH.
So, as a 2014 new year resolution, I’m planning to speak about ideas and optimizations related to the “magic trio”.
4K or not 4K ?
The first trend is rapidly gaining its momentum. “4K” is on every insiders’ lips and the effort of Youtube, Netflix and others to offer quickly 4K content is also opening new opportunities for selling 4K TVs and Monitors.
I’m focusing part of my researches in finding specific optimizations for H.264 encoding of 4K content. Infact I think that apart from marketing buzz, 4K will be served first using the well known H.264.
There are sereval optimizations to explore for 4K: for example custom quantization matrix, bias toward the use of 8×8 transform, changes in psyco visual optimizations, to name a few. 4K also pushes the limit of H.264 for motion compensation and estimation (too long MVs) creating several efficiency problems. But if is useful to optimized an HD and FullHD stream, it is much more crucial to super optimize a 4K stream because the level of bitrate that we are speaking about is difficult to have in Internet or to have consistently.
ABR streaming can help here but not as usual. Who can accept to watch a 2.5Mbit/s 720p rendition on a 80” 4K display because of low bandwidth on peak times ? (it is the same experience as watching a 360p video on a 40” screen from 1.5 mt of distance, try and tell me) Who buy a 4K wants 4K, no compromise. Further more, as Dan Rayburn underlined, there are few economic reasons to offer 4K because 4K delivery costs 3-4 times Full-HD. This is why I think that optimization is now more important than ever.
HEVC has been finally ratified. Like in 2003, when H.264 was ratified, now the encoders are very raw and inefficient and a lot of work is to be done, but the potentialities are all there. Theoretically HEVC is said to be from 30 to 50% more efficient than H.264 (higher efficiency at higher resolutions). So it is not a mistery that 4K and H.265 are seen as the winning couple. But the increase in pixel to be processed (8x passing from 1080p25/30 to 2160p50/60) and the complexity of the new codec (approx. 10x during encoding compared to H.264) do not draw a simple scenario with increses in required processing power up to a 80x factor. But hey…we are now like in 2003, we have maybe 10 years ahead to squeeze the max out of H.265, and this is very exciting. In thee while, H.264 still have some room for improvements and for at least a couple years will continue to be the king on the hill.
I have started to play with HEVC and probably the amount of time I’ll dedicate to experiment will increase steadily during 2014. By now I have collected interesting results. The bigger Block Transforms (not only 4×4 and 8×8 like in H.264 but also 16×16 and 32×32) plus some advanced deblocking and adaptive filtering are able to produce a much “smoother degradation” of quality when decreasing the bitrate, especially for high complexity scenes. On the other hand, the different handling of fine details is producing now less details retantion than H.264 and new approaches to psycovisual optimizations are all to be invented.
And VP9 ? Interesting technology, good potentiality. Will be successful? Hard to tell, until then I will continue to keep it under observation.
Last but not least there’s the new MPEG standard for ABR streaming MPEG DASH (Dynamic Adaptive Streaming over HTTP). HLS is spreading over various devices but at the same time the implementations are frequently bugged and without control. DASH on the other hand provides plenty of control and it is possible to change heuristic. This is very important to achieve an Higher-as-possible QoE (or QoS), a key factor in the future where CDNs’ cost per GB is flattening while viewers’ number and stream size/quality is increasing .
So stay tuned.
PART I – Introduction (revised 02-jul-2012)
PART II – Parameters and recipes (revised 02-jul-2012)
PART III – Encoding in H.264 (revised 02-jul-2012)
PART IV – FFmpeg for streaming (revised 02-jul-2012)
PART V – Advanced usage (revised, 19-oct-2012)
PART VI – Filtering (new, 19-oct-2012)
The fabulous world of FFmpeg filtering
Transcoding is not a “static” matter, it is dynamic because you may have in input a very wide range of content’s types and you may have to set encoding parameters accordingly (This is particularly true for user generated contents).
Not only, the elaborations that you need to do in a video project may go beyond a simple transcoding and involve a deeper capacity of analysis, handling and “filtering” of video files.
Let’s consider some examples:
1. you have input files of several resolutions and aspect ratios and you have to encode them to two target output formats (one for 16:9 and one for 4:3) . In this case you need to analyze the input file and decide what profile to apply depending by input aspect ratio.
2. now let’s suppose you want also to encode video at the target resolution only if the input has an equal or higher resolution and keep the original otherwise. Again you’d need some external logic to read the metadata of the input and setup a dedicated encoding profile.
3. sometime video needs to be filtered, scaled and filtered again. Like , for istance, deinterlacing, watermarking and denoising. You need to be able to specify a sequence of filtering and/or manipulation tasks.
4. everybody needs thumbnails generation, but it’s difficult to find a shot really representative of the video content. Grabbing shots only on scene changes may be far more efficient.
FFmpeg can satisfy these kinds of complex analysis, handling and filtering tasks even without an external logic using the embedded filtering engine ( -vf ). For very complex workflows an external controller is still necessary but filters come handy when you need to do the job straight and simple.
FFmpeg filtering is a wide topic because there are hundreds of filters and thousands of combinations. So, using the same “recipe” style of the previous articles of this series, I’ll try to solve some common problems with specific command line samples focused on filtering. Note that to simplify command lines I’ll omit the parameters dedicated to H.264 and AAc encoding. Take a look at previous articles for such informations.
1. Adaptive Resize
In FFmpeg you can use the -s switch to set the resolution of the output but this is a not flexible solution. Far more control is provided by the filter “scale”. The following command line scales the input to the desired resolution the same way as -s:
ffmpeg -i input.mp4 -vf "scale=640:360" output.mp4
But scale provides you also with a way to specifing only the vertical or horizontal resolution and calculate the other to keep the same aspect ratio of the input:
ffmpeg -i input.mp4 -vf "scale=640:-1" output.mp4
With -1 in the vertical resolution you delegate to FFmpeg the calculation of the right value to keep the same aspect ratio of input (default) or obtain the aspect radio specified with -aspect switch (if present). Unfortunately, depending by input resolution, this may end with a odd value or an even value witch is not divisable by 2 as requested by H.264. To enforce a “divisible by x” rule, you can simply use the emebedded expression evaluation engine:
ffmpeg -i input.mp4 -vf "scale=640:trunc(ow/a/2)*2" output.mp4
The expression trunc(ow/a/2)*2 as vertical resolution means: use as output height the output width (ow = in this case 640) divided for input aspect ratio and approximated to the nearest multiple of 2 (I’m sure most of you are practiced with this kind of calculation).
2. Conditional resize
Let’s go further and find a solution to the problem 2 mentioned above: how to skip resize if the input resolution is lower than the target ?
ffmpeg -i input.mp4 -vf "scale=min(640,iw):trunc(ow/a/2)*2" output.mp4
This command line uses as width the minimum between 640 and the input width (iw), and then scales the height to maintain the original aspect ratio. Notice that “,” may require to be escaped to “\,” in some shells.
With this kind of filtering you can easily setup a command line for massive batch transcoding that adapts smartly the output resolution to the target. Way to use the original resolution when lower than the target? Well, if you encode with -crf this may help you save alot of bandwidth!
SD content is always interlaced and FullHD is very often interlaced. If you encode for the web you need to deinterlace and produce a progressive video which is also easier to compress. FFmpeg has a good deinterlacer filter named yadif (yet another deinterlacing filter) which is more efficient than the standard -deinterlace switch.
ffmpeg -i input.mp4 -vf "yadif=0:-1:0, scale=trunc(iw/2)*2:trunc(ih/2)*2" output.mp4
This command deinterlace the source (only if it is interlaced) and then scale down to half the horizontal and vertical resolution. In this case the sequence is mandatory: always deinterlace prior to scale!
4. Interlacing aware scaling
Sometimes, especially if you work for ipTV projects, you may need to encode interlaced (this is because legacy STBs require interlaced contents and also because interlaced may have higher temporal resolution). This is simple, just add -tff or -bff (top field first or bottom field first) in the x264 parameters. But there’s a problem: when you start from a 1080i and want to go down to an interlaced SD output (576i or 480i) you need an interlacing aware scaling because a standard scaling will break the interlacing. No fear, recently FFmpeg has introduced this option in the scale filter:
ffmpeg -i input.mp4 -vf "scale=720:576:-1" output.mp4
The third optional flag of filter is dedicated to interlace scaling. -1 means automatic detection, use 1 instead to force interlacing scaling.
When seeking for an high compression ratio it is very useful to reduce the video noise of input. There are several possibilities, my preferite is the hqdn3d filter (high quality de-noising 3d filter)
ffmpeg -i input.mp4 -vf "yadif,hqdn3d=1.5:1.5:6:6,scale=640:360" output.mp4
The filter can denoise video using a spatial function (first two parameters set strength) and a temporal function (last two parameters). Depending by the type of source (level of motion) can be more useful the spatial or the temporal. Pay attention also to the order of the filters: deinterlace -> denoise -> scaling is usually the best.
6. Select only specific frames from input
Sometime you need to control which frames are passed to the encoding stage or more simply change the Fps. Here you find some useful usages of the select filter:
ffmpeg -i input.mp4 -vf "select=eq(pict_type,I)" output.mp4
This sample command filter out every frame that are not an I-frame. This is useful when you know the gop structure of the original and want to create in output a fast preview of the video. Specifing a frame rate for the output with -r accelerate the playback while using -vsync 0 will copy the pts from input and keep the playback real-time.
Note: The previous command is similar to the input switch -skip_frame nokey ( -skip_frame bidir drops b-frames instead during deconding, useful to speedup decoding of big files in special cases).
ffmpeg -i input.mp4 -vf "select=not(mod(n,3))" output.mp4
This command selects a frame every 3, so it is possible to decimate original framerate by an integer factor N, useful for mobile low-bitrate encoding.
7. Speed-up or slow-down the video
It is also funny to play with PTS (presentation time stamps)
ffmpeg -i input.mp4 -vf "setpts=0.5*PTS" output.mp4
Use this to speed-up your video of a factor 2 (frame are dropped accordingly), or this below to slow-down:
ffmpeg -i input.mp4 -vf "setpts=2.0*PTS" output.mp4
8. Generate thumbnails on scene changes
The filter thumbnail tries to find the most representative frames in the video. Good to generate thumbnails.
ffmpeg -i input.mp4 -vf "thumbnail,scale=640:360" -frames:v 1 thumb.png
A different way to achieve this is to use again select filter. The following command selects only frames that have more than 40% of changes compared to previous (and so probably are scene changes) and generates a sequence of 5 png.
ffmpeg -i input.mp4 -vf "select=gt(scene,0.4),scale=640x:360" -frames:v 5 thumb%03d.png
The world of FFmpeg filtering is very wide and this is only a quick and “filtered” view on this world. Let me know in the comments or on twitter (@sonnati) if you need more complex filters or have problems adventuring in this fabulous world 😉
PART I – Introduction (revised 02-jul-2012)
PART II – Parameters and recipes (revised 02-jul-2012)
PART III – Encoding in H.264 (revised 02-jul-2012)
PART IV – FFmpeg for streaming (revised 02-jul-2012)
PART V – Advanced usage (revised, 19-oct-2012)
PART VI – Filtering (new, 19-oct-2012)
Netflix, during June, reached the record level of 1 Billion hours streamed in a month. It is an incredibly huge level of bandwidth, an impetuous and growing stream of bits that makes Netflix one of the TOP10 Internet bandwidth “consumer”. But how much does it cost to Netflix this huge stream ?
I remember an article of a couple years ago by Dan Rayburn in which he estimated an average cost of 3c$ per GByte, a low rate usually applied by CDNs to very large clients. In an article of 2011, Dan corrected the estimation discussing a more complex pricing model for such big players (a mix of per GB and per Gbit/s). The new estimation can, however, be approximated to 1.5c$/GB.
This level of pricing may seem very low and negligible in the overall Netflix’s business, but I think that the growing consumption due to the relatively high average of content streamed per user per month may be a problem for Netflix if not brought under control.
Let’s dig deeper in the numbers.
Let’s suppose that the average bitrate streamed to users is 2.4 Mbit/s (see this post in the netflix blog), this means that every hour of content requires in average 1080 MB (1GB).
If you multiply this for 1Billion hours you have 1 Billion GBs * 1.5c$ = 10M$ / month, 120 M$ per year.
Compared to the cost of CDNs of 2011, 2012 is around the double. This is caused by an increase in the number of clients but most of all by an increase in the average amount of data streamed per client. A wopping 90 minutes per day per user. I think that this may be considered near the maximum possible but a further increase to 120 minutes may be realistic in a worst case simulation. This would mean 160M$ per year.
You know that I’m very sensible to encoding optimization. I have always stated that for this kind of business encoding optimization is of fundamental importance. I have already demostrated in the past that H.264 can be optimized much more then what players like Youtube, Netflix, Hulu, BBC are doing today. Here I specifically addressed Youtube and Netflix.
Netflix could benefits of a 30% to 50% reduction in average bitrate consumption with a strong optimization of the entire encoding pipeline (plus eventually of the Silverlight player). This could mean savings for 60-80M$ per year and at the same time an improvement in the average quality delivered to client, a key feature in the increasingly competitive market of OTT video.