In the previous post of this two-part series, I analyzed the technical features of the VP9 codec and concluded that, technically speaking, VP9 has the basis to compete with HEVC in terms of encoding efficiency.
But, you know, theory is a different thing than reality, and in video encoding a big part of the final efficiency lies in the encoder implementation more than in the codec specification. In this regard VP9 is no exception, and what I see from my tests is that vpxenc (the open source, command line encoder provided by Google) is not yet fully mature and optimized for every scenario. I’ll discuss this last distinction further below.
The VP9 specification has many features that can be used to enhance perceptually aware encoding (like “segmentation”, which modulates quantization and filters inside a frame according to how its different areas are perceived). But those features are not yet used in vpxenc and this is clearly visible in the results.
At the beginning of 2015 I evaluated the performance of several H265 encoders for my clients and published a quick summary of the advantages and problems I found in the HEVC encoders of that time compared to optimized H264. The main problem that emerged in that evaluation was the inefficiency of “Adaptive Quantization” and other psychovisual techniques implemented in the encoders under test. The situation has partially changed for HEVC encoders during the last year (thanks to better psychovisual encoding, especially in x265), but grain and noise retention, especially in dark areas, is always a challenge for codecs exploiting big transforms like H265 and, indeed, VP9.
VP9 today shows the same inefficiencies HEVC showed a year and a half ago. It is quite good at handling motion-related complexity, thanks to advanced motion estimation and compensation, and it reconstructs low and medium spatial frequencies with high fidelity, but it has difficulties in retaining very high frequencies. Fine film grain disappears even at medium bitrates and the “banding” artifact is very visible in flat areas, gradients and dark areas even at high bitrates. In this regard H264 is still much better, at least at medium-high bitrates. These kinds of artifacts are quite common on Youtube because they now use VP9 whenever they can, so try a 1080p or 2160p video on Chrome yourself and take a look at gradients and shadows.
The sad thing is that common quality metrics like PSNR and SSIM (but also the more sophisticated VQM) are happier with a flat encoding than with a psychovisually pleasant, but not exact, encoding. In the end, VP9 may be superior in PSNR or SSIM to H264/H265 even in a comparison like that of Picture 2 below, where the banding or “posterization” effect is very evident.
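To make the point about metrics concrete, here is a minimal sketch of how PSNR reacts to the two kinds of error. It is only an illustration on a made-up row of 16 samples (the values and the names `source`, `flat` and `regrained` are assumptions, not data from my tests): the “flat” version removes the grain, the “re-grained” one keeps grain of the same strength but with a different pattern.

```python
import math

def psnr(reference, distorted, peak=255):
    """Return PSNR in dB between two equally sized lists of 8-bit samples."""
    if len(reference) != len(distorted):
        raise ValueError("inputs must have the same length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)

# Hypothetical dark patch with fine grain: 22, 18, 22, 18, ...
source = [20 + (2 if i % 2 == 0 else -2) for i in range(16)]
# "Flat" encode: the grain is removed completely.
flat = [20] * 16
# "Re-grained" encode: same grain strength, different pattern (22, 22, 18, 18, ...).
regrained = [20 + (2 if i % 4 < 2 else -2) for i in range(16)]

psnr_flat = psnr(source, flat)
psnr_regrained = psnr(source, regrained)
print(f"flat: {psnr_flat:.2f} dB, re-grained: {psnr_regrained:.2f} dB")
```

In this toy case the flat version scores about 3 dB higher, simply because a mismatched grain sample costs twice the error of a missing one. A sample-by-sample metric has no notion of “looks like the source”.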
VP9 profile 2 – 10bit per component
Until now I’ve spoken about traditional 8 bits per component encoding in H264, H265 and VP9. But vpxenc also supports 10 bits per component encoding, known as VP9 profile 2.
Even if your content is 8bit and everything remains BT.709 compliant, several studies have demonstrated that 10bit encoding is always capable of better quality/bitrate ratios thanks to higher internal accuracy. In particular, the benefits are clearly visible in the accuracy of gradients and dark areas. See this example of VP9 8bit vs 10bit:
In the picture above we can see the better rendering of soft gradients when encoding at 10 bits even if the source is 8 bits. Grain (a high frequency, low power signal) is still not retained compared to the source, but banding is much reduced. Note also that with VP9 profile 0 we need to increase the bitrate well above 3Mbps to get a good encoding of gradients (for 1080p), while with profile 2 the result is, in this case, already sufficient at only 1Mbps.
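A rough way to see why the extra bits help with gradients is to count how many distinct code values are available for a soft ramp. The sketch below is an assumption-laden illustration (a hypothetical dark ramp covering 5% of the full range across a 1920-sample row) and it models only the precision of the reconstructed samples, not the internal processing of any encoder.

```python
def distinct_levels(lo, hi, bit_depth, samples=1920):
    """Count the distinct code values used when a linear ramp from lo to hi
    (normalized 0.0..1.0 luminance) is quantized at the given bit depth."""
    max_code = (1 << bit_depth) - 1
    codes = {round((lo + (hi - lo) * i / (samples - 1)) * max_code)
             for i in range(samples)}
    return len(codes)

# Hypothetical dark gradient covering 5% of the full range across one row.
levels_8 = distinct_levels(0.05, 0.10, bit_depth=8)
levels_10 = distinct_levels(0.05, 0.10, bit_depth=10)
print(f"8 bit: {levels_8} steps, 10 bit: {levels_10} steps")
```

With roughly four times as many steps over the same ramp, each band becomes about four times narrower, which is consistent with the reduced banding seen when decoding at 10 bits.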
The superiority of 10-bit encoding has always been valid for H264 too (high10 profile), so why has 10-bit encoding started to gain momentum only with HDR and not before?
The answer is the “lack of players” on consumer devices. Let’s remember that H264 became the standard for internet video relatively early only because Adobe decided to include (at its own expense) a decoder in Flash Player 9 (2007). This enabled a billion desktops to play back the baseline, main and high AVC profiles. Few know that it was originally meant to support high10 as well, but a bug ruined the opportunity to actually use this function.
Apart from this missed opportunity, H264 decoders on modern browsers, mobile devices, TVs and STBs are not capable of decoding the H264 high10 profile, and the same is true for 10-bit VP9.
Where is VP9 available now ?
Today VP9 is supported in the latest Chrome, Firefox and Opera browsers (and in Edge in preview) on desktop (PC and Mac), and in Android from version 4.4 on (software or hardware decoding depending on the device). It is also available on an increasing number of connected TVs, but all the current (significant) decoders support only VP9 profile 0, so 8bit.
The same problem is true for H265. On the mobile devices that support it, you can only deliver 8bit H265, but in this case it is also true that the large majority of 4K TVs support HEVC main10 profile as well.
So, when is it convenient to use VP9 ?
The problem of the “banding artefact” grows with the size of the display. It is irrelevant on small displays like those of smartphones and tablets. On laptops it starts to become visible and it is pretty bad on big TVs.
So, concluding, I think that today VP9 is an interesting option for everyone who wants:
– The maximum quality-bitrate ratio on desktop even with some compromises in terms of quality. HEVC decoding will probably not appear on desktop for a long time, so VP9 is the only viable improvement over H264. The use case of live streaming can better fit the compromises.
– High efficiency on Android with a wide support base (Android 4.4 and later). On an old $100 Android phone I have, VP9 decoding works and HEVC does not. It is an interesting option for the markets of developing countries, where bandwidth is scarce and Android has a bigger base than iOS.
If the current situation doesn’t change, I doubt that players like Netflix will deliver high quality content on desktop or TV using VP9 profile 0, especially for 4K. And in fact David Ronca of Netflix has said that they are evaluating VP9 especially to lower the access threshold for mobile devices (they already use HEVC for HDR-10).
But fortunately the scenario is probably about to change quickly, if it’s true that Youtube is planning to deliver HDR (=10bits) with VP9 during the summer. This means that TVs with VP9 profile 2 decoding capabilities are becoming a reality and this should open the way for profile 2 on desktop browsers too. In this case (and I’m optimistic), VP9 has really good chances to definitively become the successor of H.264, at least for Internet video on desktop and Android.
It remains to be seen what Apple will decide to do. In the meantime I’m starting to push VP9 in my strategies, because I think that their choice is irrelevant in this respect. If we want to optimize a video delivery service, it is increasingly clear that we will have to optimize for all 3 codecs.
A technical primer
VP9 is a modern video codec developed by Google as the successor of VP8. While VP8 was aimed at offering an open alternative to AVC (aka H.264), VP9 challenges the latest HEVC (aka H.265). With VP9 Google follows the same “open” codec model used for VP8 (whether it is really open and free from patent-related threats is still the object of debate), and this theoretically makes VP9 an interesting alternative to HEVC, which is burdened by unclear and unsettled claims from multiple patent holders and patent pools.
The VP9 specification was frozen in June 2013, but only recently has it started to attract the attention of players that want to optimize video distribution (Youtube was the only big adopter during the last year, but now Netflix is also evaluating it). This is because the VP9 and HEVC ecosystems have finally reached a minimum level of maturity and it is now possible to do evaluations and comparisons with a sufficient level of confidence.
In this short series of blog posts I analyze VP9 and try to understand if it really deserves attention and why. In this first part we will take a look at the technical specifications compared to HEVC (analyzed in this previous post), and in the second part I’ll analyze the actual performance, the limits and the contexts in which it is possible to use VP9 as a valid alternative to AVC or HEVC.
VP9 subdivides the picture into “super blocks”. Similarly to HEVC, in VP9 super blocks can be recursively divided into smaller blocks down to 4×4. Unlike HEVC, which can subdivide only into square sub-partitions (32×32, 16×16, 8×8), VP9 can also use non-square partitions like 32×16, 8×16 and so on (the use of a rectangular partition stops subdivision in that branch of the quad-tree). Most decisions are taken at the 8×8 level (“skip” signaling for example) and 4×4 is a special case of 8×8. Prediction mode, reference frame, MV and transform type are specified at block level.
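The partitioning rule described above can be sketched as a small recursion. This is a simplified model under stated assumptions (a 64×64 super block, four choices per square block, rectangular splits treated as terminal, 4×4 as the smallest size), and `count_partitionings` is a hypothetical helper, not code from any encoder.

```python
def count_partitionings(size, min_size=4):
    """Count the ways a size x size block can be partitioned when each square
    block is either kept whole, split into two horizontal rectangles, split
    into two vertical rectangles, or split into four squares that are then
    partitioned recursively. Rectangular splits end the recursion."""
    if size <= min_size:
        return 1  # smallest block: nothing left to decide
    terminal_choices = 3  # whole, horizontal split, vertical split
    return terminal_choices + count_partitionings(size // 2, min_size) ** 4

for size in (8, 16, 32, 64):
    print(f"{size}x{size}: {count_partitionings(size)} possible layouts")
```

Even in this simplified model the number of layouts explodes with block size, which is one reason why the quality of the encoder’s search heuristics matters so much for the final efficiency.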
Like VP8, VP9 uses an 8bit arithmetic coding engine known as the bool-coder. It uses a static per-frame statistical model, compared to an adaptive statistical model like the CABAC used in AVC/HEVC. For each frame, the most convenient statistical model is chosen from a pool of four.
Similarly to H265, VP9 uses 4 transform sizes: 32×32, 16×16, 8×8 and 4×4. Transforms are integer approximations of the DCT (Discrete Cosine Transform) or DST (Discrete Sine Transform); a mix of the two is used depending on the type of frame and the transform size. Coefficients are scanned with particular patterns (different from the zig-zag patterns of H26x codecs, but with the same logic).
VP9 uses 4 scaling factors: a pair for Luma DC and AC coefficients, and a pair for Chroma DC/AC. The set of quantizers is fixed at frame level, so there is no block-level QP adjustment, contrary to AVC/HEVC (but the optional “segmentation” feature should be able to achieve the same effect as adaptive quantization).
VP9 also supports a special lossless mode that uses only a Walsh transform on 4×4 blocks.
Intra prediction is a bit less complex than what is offered by HEVC. Intra prediction acts on transform blocks and there are 8 directional prediction modes and 2 non-directional ones, compared to the 35 modes of HEVC.
VP9 uses 1/8th pel motion compensation (double the precision of AVC). A novel feature is the possibility to choose between a normal, a smooth or a sharp 1/8th pel interpolation filter (plus bilinear). The filter type can be changed at block level.
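As a small illustration of what 1/8th pel precision means in practice, a motion vector component expressed in eighths of a pixel can be split into a whole-pixel offset and a sub-pixel phase that selects the interpolation filter taps. The function name and the storage in eighths are assumptions made for this sketch, not the bitstream syntax.

```python
def split_mv_component(mv_eighths):
    """Split a motion vector component given in 1/8 pel units into a
    whole-pixel offset and a fractional phase in the range 0..7."""
    whole, phase = divmod(mv_eighths, 8)
    return whole, phase

# 19/8 = 2 pixels + 3/8; -13/8 = -2 pixels + 3/8
print(split_mv_component(19), split_mv_component(-13))
```

A phase of 0 means the prediction can be copied directly from the reference, while any other phase requires interpolation with the chosen filter.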
Because of patents, VP9 doesn’t use bidirectional motion estimation and compensation, so each block normally has only a single forward motion vector. However, VP9 uses “compound prediction”, where there are two motion vectors and the two predictions are averaged together. To avoid patents, “compound prediction” is enabled only on non-visible frames (commonly referred to as “AltRef”). AltRef frames can be “constructed” during decoding; they are not displayed but can be used later as references. Since it’s possible to anticipate a future frame in an AltRef and use it as a reference in compound mode, VP9 officially has no B-frames but in fact it has something completely equivalent.
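The averaging step of compound prediction can be sketched in a couple of lines. The two input lists stand for the samples fetched with the two motion vectors; the round-half-up behaviour is an assumption of this sketch.

```python
def compound_predict(pred_a, pred_b):
    """Average two motion-compensated predictions sample by sample,
    rounding halves up (the rounding rule is an assumption)."""
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must have the same size")
    return [(a + b + 1) >> 1 for a, b in zip(pred_a, pred_b)]

# Hypothetical samples predicted from a past frame and from an AltRef frame
from_last = [10, 20, 255, 100]
from_altref = [12, 21, 255, 50]
print(compound_predict(from_last, from_altref))
```

When one of the two references is an AltRef built from a future frame, this average behaves much like the bidirectional prediction of a B-frame.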
Motion vectors in a frame can point to one of three possible reference frames, usually named Last, Golden and AltRef. The reference frame to be used is signaled at 8×8 granularity. The decoder holds a list of 8 reference frames (slots) from which the Last, Golden and AltRef references are chosen at frame level. After decoding, the current frame can (optionally) replace one of the 8 slots in the pool. An interesting feature of VP9 is the possibility to scale down frames during encoding (not on I-frames). Inter predictors and reference frames are scaled accordingly.
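The slot mechanism can be modeled with a tiny class. This is a toy sketch of the bookkeeping only: `ReferencePool`, its method names and the string “frames” are hypothetical, and the real header syntax is not represented.

```python
class ReferencePool:
    """Toy model of the decoder's pool of 8 reference slots."""
    NUM_SLOTS = 8

    def __init__(self):
        self.slots = [None] * self.NUM_SLOTS

    def active_references(self, last_idx, golden_idx, altref_idx):
        """Pick, at frame level, the three references for the current frame."""
        return {
            "LAST": self.slots[last_idx],
            "GOLDEN": self.slots[golden_idx],
            "ALTREF": self.slots[altref_idx],
        }

    def store(self, frame, slot=None):
        """After decoding, optionally replace one slot with the new frame."""
        if slot is not None:
            self.slots[slot] = frame

pool = ReferencePool()
pool.store("frame0", slot=0)
pool.store("frame1", slot=1)
pool.store("hidden_altref", slot=2)
pool.store("frame2")  # decoded and shown, but not kept as a reference
refs = pool.active_references(last_idx=1, golden_idx=0, altref_idx=2)
print(refs)
```

The three names are just per-frame labels pointing into the pool, so an encoder is free to keep an old key frame as Golden for a long time while Last moves forward.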
Motion vector prediction is similar in complexity to HEVC. A 2-entry list of predictors is built during encoding and decoding. The first predictor is based on surrounding blocks, the second on the previous frame. In case of an empty list, a (0,0) vector is used. So for each block the bitstream can signal to use:
-the first predictor plus a delta
-the first predictor as is
-the second predictor as is
-simply use motion vector [0,0]
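The four signaling options listed above can be sketched as a small resolver. The mode names are hypothetical labels chosen for readability and the predictor list is taken as given; how the list is actually built from neighbouring blocks and the previous frame is not modeled here.

```python
ZERO_MV = (0, 0)

def resolve_motion_vector(mode, predictors, delta=ZERO_MV):
    """Return the motion vector selected by the signaled mode.
    Missing predictors are replaced by the (0, 0) vector."""
    first, second = (list(predictors) + [ZERO_MV, ZERO_MV])[:2]
    if mode == "first_plus_delta":
        return (first[0] + delta[0], first[1] + delta[1])
    if mode == "first":
        return first
    if mode == "second":
        return second
    if mode == "zero":
        return ZERO_MV
    raise ValueError(f"unknown mode: {mode}")

candidates = [(16, -8), (12, -4)]  # hypothetical predictors in 1/8 pel units
print(resolve_motion_vector("first_plus_delta", candidates, delta=(3, 1)))
print(resolve_motion_vector("second", candidates))
```

Only the first option needs to transmit a delta, so the other three are very cheap to signal when the motion field is regular.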
The loop filter offers 3 possible filters of different strength. VP9 performs a flatness test at block boundaries and, if the result is higher than a threshold, one of the filters is applied to conceal blockiness.
Segmentation groups together blocks with similar characteristics. It is possible to change some encoding techniques at group level. This feature is dedicated to implementing encoding optimizations (including psychovisual optimizations) and requires active support in the encoder.
The standard VP9 (profile 0) supports only an 8bit 4:2:0 color mode, while profile 1 (optional for hardware) also supports 4:2:2 / 4:4:4 and optional alpha. In August 2015 Google released a new version of the reference encoder capable of supporting the new profile 2 (10-12bit, 4:2:0) and profile 3 (10-12bit, 4:2:2 / 4:4:4 + alpha). Profile 2 is aimed at supporting HDR video on Youtube (expected for summer 2016).
VP9 compared to HEVC
From a technical point of view, VP9 appears to be very near to HEVC in terms of potential efficiency. The actual performance depends on the efficiency of the real encoders, but VP9 has all the potential to reach (almost, see below) the same performance as HEVC.
VP9 is a bit sub-par in terms of intra frame prediction (fewer modes) and of entropy coding (static tables vs adaptive). HEVC also appears to have a higher number of modes and small strategies to reduce the cost of syntax and signaling as well as of residuals, but on the other hand VP9 has some interesting potential in psychovisual optimizations and rate control thanks to segmentation and adaptive frame resolution.
We will see in the next post the level of efficiency now reached by VP9 encoders compared to AVC and HEVC and the level of maturity of the respective ecosystems.
Online Video: infancy, youth and maturity
Over the last decade the consumption of online video has undergone exponential growth, but online video is as old as the Internet itself. Recently Dan Rayburn published a blog post about the early history of the streaming media industry, an “era” (1995-2005) in which pioneers started experimenting with codecs, products and models for the distribution of video over the Internet.
But it’s only with the launch of Youtube (2005-2006) that online video started a really tumultuous growth to become the preeminent portion of global IP traffic. The ride of online video has been so intense that today the traffic generated by video is more than 70% of the total Internet traffic, orders of magnitude higher than 10 years ago (and still growing…).
We can say that nowadays online video has entered a phase of maturity. It is a multi-billion business run not only by giant tech companies like Youtube, Netflix, Facebook, Amazon, Hulu, Apple and Vevo, but also by a multitude of traditional broadcasters (BBC, HBO, Sky just to name a few) with their regional OTT services.
The pressure of competition is now really high and this will bring many benefits to end users on many fronts, including that of QoE optimization.
Why optimize video streaming ?
In fact, until very recently, no one really cared about video optimization. As in any business in its early stages, it was more important to place the right product on the market (and then find a viable business model before running out of money) than anything else, including optimization of QoE. Simply put: if it worked, it was enough.
But now things have changed. It is no longer enough that it simply “works”: user expectations are constantly growing and it’s increasingly hard to engage users (see graph below). In this scenario, optimization of streaming is becoming a key technological factor to differentiate a service from competitors, increase satisfaction and retention, and reduce costs.
Source: Conviva CSR2015 -How Consumers Judge Their Viewing Experience
How to optimize ?
If the reasons to invest in streaming optimization are clear, it’s not so easy to find the right way(s) to accomplish it. Users push the play button and only want to watch their favorite video flawlessly. But we know that behind the scenes there’s a lot of work to do to maximize that user experience. It’s a tangle of codecs, streaming protocols, multiple DRMs and CDNs, advertising, interaction flows, personalized experiences and so on.
At the end of the story, users want the maximum possible quality throughout the video, a fast start and zero rebuffering on every screen. It’s up to us to untangle the skein and fulfill those expectations.
The points to be optimized are many but, in my opinion, the three most important are:
1. Video encoding optimizations (Quality)
2. ABR streaming optimizations (Robustness of distribution)
3. Playback optimizations (Reliability of streaming, start time, other aspect of QoE)
I have touched on those points many times in the last 8 years in several projects (optimization of encoding pipelines and/or codecs, optimization of streaming protocols and servers, optimization of players) or during conferences (see Adobe Max 2009 / 2010 / 2011) and I’ve made “online video optimization” one of my distinctive competencies.
In general, the matter is complex, the variables are multiple and there are also many boundary conditions, so there’s no single recipe. Maximizing the QoE requires the coordination of “optimization campaigns” in each of the aforementioned areas.
This requires flexible instead of static approaches, open-mindedness instead of dogmas, a desire for excellence (both in the consultant and in the customer, paradoxically not so common to find in the latter), but also a mix of scientific approach and inspiration, always remembering that success is in the details.
Creating coordinated optimization strategies for encoding, delivery chain and players is very complex, so in this article I want to talk mainly about encoding optimization. This topic has become hot recently because of this post on Netflix’s blog. They call it “Per-Title Encode”, I call it “(Content) Adaptive Encoding”.
I have worked on this topic for many companies, for example NTT Data, Sky Italy, Intel Media (acquired by Verizon), EvntLive (acquired by Yahoo!) and lately Vevo. I recently co-authored this article on Vevo’s tech blog on how we optimized the encoding of 200,000+ videos at Vevo during 2015. I suggest reading that article for a high level introduction to the next topic: Content Adaptive Encoding.
“All fixed set patterns are incapable of adaptability or pliability. The truth is outside of all fixed patterns” Bruce Lee
Encoding video is a very complex process. There’s often the temptation to over-simplify complex things and encoding is no exception. So usually everyone encodes video with a predefined set of parameters that satisfy some requirements (usually quality and/or target bitrate). But why should we use a single set of parameters (resolution, bitrate, encoding profiles) when we have very different kinds of video and/or playback conditions?
Static solutions to complex problems are rarely capable of producing the best results. If we have mutable conditions and mutable data, we need to adapt to them if we want to get closer to the optimal solution.
To exemplify the concept, let’s draw a parallel with the problem of “function approximation”. If we need to approximate an arbitrary function (see picture below), how can we hope to have a useful solution using a single 0-order approximation (red line on the left)? It is too coarse, and the error that we get using it is very high (at least in some situations, e.g. for x -> 0). It’s clear that a first order approximation would be better (green line on the left) but still sub-optimal. As in many other situations, it’s even more useful to partition the problem into smaller (simpler) ones: in this case even a set of simple 0-order approximations (red lines on the right) would be considerably better at estimating the function than the original, ultra-simplified approach, not to mention a “set” of first-order approximations (green lines on the right).
Partitioning the problem’s domain helps to avoid over-simplifications
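The same idea can be checked numerically. The sketch below is only an illustration with an arbitrary function (1/x on the interval 0.1..2.0, chosen because it grows quickly for x -> 0): it approximates the function with one constant per partition and reports the worst-case error.

```python
def max_error_piecewise_constant(f, lo, hi, pieces, samples_per_piece=200):
    """Approximate f on [lo, hi] with one constant (the mean of the samples)
    per partition and return the largest absolute error found."""
    worst = 0.0
    width = (hi - lo) / pieces
    for p in range(pieces):
        start = lo + p * width
        xs = [start + width * i / (samples_per_piece - 1)
              for i in range(samples_per_piece)]
        ys = [f(x) for x in xs]
        level = sum(ys) / len(ys)  # the 0-order approximation on this piece
        worst = max(worst, max(abs(y - level) for y in ys))
    return worst

def steep(x):
    return 1.0 / x

single = max_error_piecewise_constant(steep, 0.1, 2.0, pieces=1)
partitioned = max_error_piecewise_constant(steep, 0.1, 2.0, pieces=8)
print(f"one constant: {single:.2f}, eight constants: {partitioned:.2f}")
```

Each piece is still the crudest possible estimator, yet the worst-case error drops substantially just because every estimator only has to cover its own sub-domain.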
Drawing a parallel between this problem and encoding, approximating with a 0-order estimator is similar to encoding everything with the same resolution-bitrate “mix” (a.k.a. ABR ladder).
The one-size-fits-all solution is simple, but far from optimal. We must be “Adaptive” in the sense of elaborating dynamic strategies to optimize the system.
There are many ways to optimize encoding but my favorite is, as said above, to partition this multi-dimensional problem into sub-domains or clusters. We do not necessarily have to apply rigorous math; it’s often more a matter of common sense. If we have a complex problem, let’s try to break it down into simpler pieces that are easier to solve.
For example, in the case of encoding for ABR, we commonly have videos with different complexities (a first variable to analyze) and we watch video on different devices (a second variable to take into account). A static ladder (for ABR streaming) is usually designed for the worst case and, like a 0-order approximation, provides sub-optimal performance.
We know that low complexity videos (like talking heads or fixed camera videos) are indeed much easier to encode than complex videos (like sports or action movies). This is inherently related to the way modern codecs compress video data. They exploit temporal and spatial redundancies. Simple motion can be predicted from past frames and high spatial frequencies are stripped away by quantization.
A low complexity content can be compressed much more than a complex one, and this with approximately the same perceptual quality.
This is a first partition we can apply to the problem. Let’s classify the content according to its complexity and apply specific encoding setups to optimize the overall performance toward the desired goals.
Do you want to save bandwidth globally? Why not encode content at different bitrates according to its complexity? You will have a consistent perceptual quality but savings in overall bandwidth consumption.
Do you want higher average quality? In this case, let’s encode simpler content at higher resolutions compared to the resolution we would use with a single, static setup, which is usually calibrated on the worst case (high complexity).
Medium Complexity: email@example.comMbps (left) vs 720p @2.0Mbps (right)
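A minimal sketch of this first partition could look like the following. Everything in it is an assumption made for illustration: the normalized complexity score, the thresholds, the class names and the ladder values (heights in pixels, bitrates in kbps) are hypothetical and are not the setups used in any of the projects mentioned above.

```python
# Hypothetical ladders per complexity class: (height in pixels, bitrate in kbps)
LADDERS = {
    "low": [(1080, 2000), (720, 1000), (540, 600)],
    "medium": [(1080, 3500), (720, 2000), (540, 1000)],
    "high": [(1080, 5000), (720, 3000), (540, 1500)],
}

def classify(complexity):
    """Map a normalized complexity score (0.0..1.0) to a class name."""
    if complexity < 0.33:
        return "low"
    if complexity < 0.66:
        return "medium"
    return "high"

def ladder_for(complexity):
    """Return the encoding ladder for a title with the given complexity."""
    return LADDERS[classify(complexity)]

for title, score in [("talking head", 0.1), ("music video", 0.5), ("sport", 0.9)]:
    print(title, ladder_for(score))
```

A single static ladder would correspond to always returning the “high” entry; with three classes the simple titles get the same resolutions at a fraction of the bitrate.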
Finding the right recipe is not easy, because things may get more complex as we go further down this path. For example, complexity is not a scalar property of a video but a local attribute (complexity can change frame by frame, or at least scene by scene). If we combine this with the fact that we may have constraints set by other elements of the pipeline, the logic with which we try to approximate the optimal solution may become complex.
Just to give an example, in ABR streaming we are usually forced to encode video in capped VBR (if not CBR) because of the players’ heuristics (this is why I said before that the “final” optimization would be to set coordinated optimization strategies for encoding, distribution and playback: you usually need an optimized player to handle VBR encodings).
So, to improve the optimization level, we may need to consider not only the average complexity, but also the maximum complexity throughout the video, and apply dynamic parameterizations accordingly. Furthermore, complexity may be spatial (high frequencies in the image due to a sharp picture or to noise) or temporal (a high level of motion, more difficult to encode for traditional codecs based on motion estimation and compensation). Different complexities deserve different weights inside our “optimization function” and specific parameterizations.
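One possible shape for such an “optimization function” is sketched below. The weights, the 0..1 scores per scene and the blend between average and peak are all hypothetical placeholders; the point is only to show how spatial and temporal complexity, and average and maximum, can enter the same score.

```python
def scene_complexity(spatial, temporal, w_spatial=0.4, w_temporal=0.6):
    """Combine spatial and temporal complexity (both 0.0..1.0) of one scene.
    The weights are illustrative assumptions."""
    return w_spatial * spatial + w_temporal * temporal

def title_complexity(scenes, peak_weight=0.5):
    """Blend the average and the maximum scene complexity of a title.
    A higher peak_weight protects the hardest scenes (useful with capped VBR)."""
    scores = [scene_complexity(s, t) for s, t in scenes]
    average = sum(scores) / len(scores)
    return (1 - peak_weight) * average + peak_weight * max(scores)

# Hypothetical (spatial, temporal) scores for three scenes of the same title
scenes = [(0.2, 0.1), (0.3, 0.2), (0.9, 0.8)]
print(f"title complexity: {title_complexity(scenes):.3f}")
```

With `peak_weight=0` the title is judged on its average only, which tends to starve the one difficult scene; raising it pulls the encoding setup toward the worst case.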
Viewing Context-aware encoding
Another variable is represented by the viewing conditions. Why apply the same resolution and parameterization for the same level of bandwidth when the video is watched on quite different screens? The human eye has a specific angular resolution, so small defects in picture quality are not visible at high DPI (like that of a smartphone), while the same is not true for low DPI screens like that of a TV. Combine that with the variable viewing distance and we have another set of variables that we can optimize encoding for.
Example of different sensitivity of vision. The pictures above simulate the playback of the same video at different screen sizes: approx a smartphone screen for the upper image and a tablet (double diagonal) in the lower, cropped image. The picture is the same, simply enlarged. Note that artefacts of encoding are very visible on the lower image, but much less in the upper.
Considering the different sensitivity of the eye to artifacts at different DPI, we can optimize the ABR ladder with resolutions, bitrates and parameterizations specifically chosen to conceal artifacts in specific viewing conditions.
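The interplay between DPI and viewing distance can be condensed into a single number, the pixels that fall inside one degree of visual angle. The sketch below uses plain geometry; the screen sizes and viewing distances are hypothetical examples, not measurements.

```python
import math

def pixels_per_degree(diagonal_in, res_w, res_h, distance_in):
    """Pixels covered by one degree of visual angle for a screen of the given
    diagonal (inches) and resolution, watched from distance_in inches."""
    ppi = math.hypot(res_w, res_h) / diagonal_in
    inches_per_degree = 2 * distance_in * math.tan(math.radians(0.5))
    return ppi * inches_per_degree

# Hypothetical setups: a 5" 1080p phone at 12 inches, a 55" 1080p TV at 6 feet
phone = pixels_per_degree(5, 1920, 1080, distance_in=12)
tv = pixels_per_degree(55, 1920, 1080, distance_in=72)
print(f"phone: {phone:.0f} px/deg, TV: {tv:.0f} px/deg")
```

The higher the value, the smaller each pixel (and each artifact) appears to the eye, which is why the same rendition can look clean on the phone and show visible defects on the TV.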
There are other interesting aspects that enter in the mix of strategies that you can use.
I have no time to analyze them here, but they are worth a mention:
– Multi-codec encoding: leverage the best codec available on each platform, e.g. VP9 on Android / Chrome / FF, HEVC on 4K TVs and H264 everywhere else.
– VBR vs CBR: use VBR whenever possible. This requires a custom player, so today it is feasible, for example, in DASH for Android and browsers but not for HLS on iOS. It will require multiple encodes but may be worth the effort.
– Another interesting topic is the distance between renditions and the number of renditions inside an ABR ladder. Different network conditions (e.g. mobile vs broadband) may require different setups.
– Special renditions: sometimes I have defined special renditions for special cases that may have specific goals and characteristics (e.g. special renditions to speed up initial buffering).
Concluding, if we mix various strategies, the improvement in QoE and bandwidth consumption may be considerable. Consider that optimizing the quality/bitrate ratio always generates an increase in QoE, both directly and indirectly. In fact, with giants like Netflix that monopolize the bandwidth (40+% of Internet traffic in the USA at peak times), the services that are not optimized will start to suffer (or are probably already suffering). ABR streaming can no longer be used as an “alibi” for un-optimized encoding; it’s no longer sufficient to be in the market, you have to master the technology, smooth the edges and give the maximum to be competitive. It’s time to optimize.
In the first post of this mini series I analyzed the technical features of H.265 (aka HEVC) compared with the good old H.264 (aka AVC). Summarizing, HEVC pushes the traditional block-based video encoding paradigm to higher levels of efficiency (and also of complexity, from an encoding/decoding point of view) thanks mainly to:
– variable size transforms (from 4×4 to 32×32)
– quad-tree structured prediction areas (from 64×64 to 4×4)
– candidate-list-based motion vector prediction
– many intra-frame prediction modes
– higher-accuracy filters for motion compensation
– optimized deblocking, SAO filtering, CABAC, etc…
It’s interesting to note that, compared to any previous step from H.261 to H.264, with H.265 we have a considerable improvement not only (or mainly) in the inter-frame compression domain but in intra-frame compression as well. A considerable amount of data in H.264 streaming is today concentrated in I-frames, and this is because intra-frame compression is considerably less evolved than inter-frame compression where, for example, B-frames help a lot. H.265 introduces a strong improvement in block compression (in any kind of frame) thanks to variable size transforms. The possibility to use smaller transforms for impulsive signals and bigger transforms for stationary signals (smooth areas in the case of pictures) is not new in the signal processing discipline and is used, for example, in AAC and many other codecs. Variable size transforms increase compression efficiency but also introduce some new challenges…but let’s proceed one step at a time.
Video encoding is a complex problem that is highly dependent on the content. It is well known that a low motion scene with a static background and bright lights can be compressed much more than a high motion, dark action scene where most of the picture is moving. So what are the most difficult scenes/situations that a modern codec like H.264 has to cope with? Even an efficient encoder may still find it difficult to compress:
– detailed keyframes: without references to count on (and with not so efficient intra-frame prediction), compressing keyframes is still difficult, especially when they are feature-rich (e.g. a forest). If the keyframe is at the beginning of a quiet scene, the high efficiency of motion prediction and compensation on low motion allows for overall efficient compression (most of the data can be allocated to the keyframe), but a sudden increase in complexity (motion) during the GOP can easily push an encoder into trouble.
– high motion with a “crisp” picture: predicting high complexity motion is quite difficult in itself. Combine this with high spatial complexity and you will have a considerable spike in bitrate and/or an increasing amount of artifacts.
– slow motion in dark areas: encoding dark areas is challenging because the eye is more sensitive to details in the dark than in full light, and if you add slow motion of textured objects, smoke, or small changes in colors and shadows, it is quite easy to spot annoying artifacts even when using adaptive quantization or similar optimizations.
– noise/grain: noise is almost incompressible by definition (it’s random and “unpredictable” by nature). Fortunately the eye is more sensitive to grain and noise in specific areas of the picture, like flat areas and dark areas, and less in bright and detailed areas, so a smart encoder can move the bit budget where it is most needed. Nonetheless it’s quite difficult to compress noisy content, especially noise in fast moving scenes. Compressed noise is easily spotted because it creates ugly patterns at lower frequencies and interferes with motion estimation/compensation (“dragged” artifacts). Denoising is not always suitable and/or desired, and unfortunately noise modelling and reconstruction during playback continues to be an “option” in the HEVC specification (watch this experiment about synthetic grain reconstruction).
H.265 mitigates the first two cases compared to H.264. As said above, it’s quite efficient in intra-frame encoding, so detailed areas can be encoded well, as can smooth areas and gradients. Motion estimation and compensation is effective too, so compared to H.264, H.265 is able to operate at much lower bitrates before artifacts appear. Furthermore, the artifacts produced by H.265 are “smoother” and the degradation of quality is more “harmonious” and good looking even when encoding at very aggressive resolution/bitrate ratios.
However, every coin has a flip side, and the strength of H.265 may become a weakness when processing the last two problematic cases. Dark areas and noise/grain require a more accurate (not mathematically but “perceptually”) retention of high frequencies and of small changes in color levels. This is usually called psy-optimization of encoding. In H.264, which uses a small transform, it is easier to turn a quantization error into features/details that are not identical to the original but perceptually “similar”. The error generated in the approximation of the original frequency domain is stopped by the small boundary of the transform and is thus more controllable. In H.265, with bigger transforms, it is much more complex to use this approach and new ideas have to be put on the table.
H.265 vs H.264 today
In the last few years I have developed optimization approaches that analyze the video specifically for complex sequences and optimize them (adaptive source filtering, adaptive encoding parameterizations, specific rate control optimizations). Today I’m working on porting such optimizations to H.265 and so I’m “playing” with several H.265 encoders (e.g. DivX H.265, x265, f265, NTT H.265 enc).
For the aforementioned reasons we are today (Jan 2015) in a situation where a good H.265 encoder is superior to a good H.264 encoder in encoding feature-rich keyframes (and blocks in general) and high motion, providing a much smoother degradation of quality at lower bitrates. But at the same time, a good H.264 encoder is still able to provide the same or even better quality in dark areas and noisy/grainy pictures. When the playback is done on mobile devices this is not very visible because of the high DPI, but on a big TV screen it is evident on complex sequences.
The picture below shows an example of what I mean:
I’m not saying that H.264 “IS” better than H.265, but that today’s encoders show a level of development that is not completely mature. This is quite normal and expected, as in the past (2003-2005) the same happened to H.264 compared to xvid or to the best MPEG2 encoders (especially when working at medium-high bitrates). The problem is also present in 4K, even if in this case it is slightly mitigated by pixel size. The necessity to offer good quality even in complex situations forces the content providers willing to stream in 4K to use higher bitrates than otherwise necessary. A partial way to mitigate the problem of dark areas is to use 10 bits per color in compression instead of 8. The additional accuracy is usually able to provide a better perceptual quality. When encoding in H.264 the use of 10bit also helped a lot, but it was almost impossible to use in production because of the lack of support in decoders.
Generally speaking, the quality we can achieve today with H.264 in 1080p @3-4Mbit/s can be matched (except for dark areas) by H.265 at around 2-2.5Mbit/s. But difficult areas are…difficult and this requires a lot of attention during compression. For example, my clients usually cannot accept “posterization effects” and “banding artifacts” like the ones shown in the picture above, especially during full screen playback on big screens (possibly 4K TV sets).
Apart from the quality evaluation, the main problem of H.265 is the general availability of decoders today. For 4K streaming we can say that the majority of target devices (4K TV Set) are able to decode a main10 4K profile at least at 24-30Fps (but even 50-60Fps in most cases). Probably we will see soon HEVC also on iOS and Android because many SoC capable to decode HEVC are arriving on the market, but the situation is much problematic for the browsers. H.264 has started to spread the web only when it was supported by Flash Player in 2007 (and Adobe paid the license), now that Flash is out of the game the future of H.265 for the browser is much more uncertain. Google is pushing VP9 (free and already supported in Chrome) as the way to go for the browsers but I doubt that Firefox and IEx will support it and even if a next release of IE will support HEVC soon, an annoying fragmentation will continue to plague the video streaming over the Internet.
Fortunately the development of H.265 encoders is improving quite fastly. I’m planning to make the point on this topic every 6 months. Stay tuned.
HEVC is among us. On January 25, 2013, the ITU announced the completition of the first stage approval of the H.265 video codec standard and in the last 1 year several vendors/entities have started to work on the first implementations of H.265 encoders and decoders. Theoretically HEVC is said to be from 30 to 50% more efficient than H.264 (especially at higher resolutions) but is it really that simple ? is H.264 so close to retirement ? This is what we will try to find. First of all let’s start with a technical analysis of H.265 compared to AVC and then, in the next blog post, we will take a look at the current level of performance that is realistic to obtain in today’s H.265 encoders.
H.265/HEVC – Technical Overview
This part assumes you are sufficiently familiar with the coding techniques implemented in H.264/AVC (if you need to refresh your memory I suggest these posts: H.264 Part I, Part II). HEVC re-uses many of the concepts defined in H.264. Both are block-based video encoding techniques, so they have the same roots and the same approach to encoding:
1. subdivision of the picture into macroblocks, possibly sub-divided into blocks
2. reduction of spatial redundancy using intra-frame compression techniques
3. reduction of temporal redundancy using inter-frame compression techniques (motion estimation and compensation)
4. residual data compression using transformation & quantization
5. reduction of final redundancy in residuals and motion vectors transmission and signaling using entropy coding
HEVC can be seen as a strong evolution of AVC with some very important key features, a number of less important improvements and some simplifications.
Instead of 16×16 macroblocks like in AVC, HEVC divides pictures into “coding tree blocks” (CTBs). Depending on an encoding setting, the size of the CTB can be 64×64 or limited to 32×32 or 16×16. Several studies have shown that bigger CTBs provide higher efficiency (but also higher encoding time). Each CTB can be split recursively, in a quad-tree structure, into 32×32, 16×16 and down to 8×8 sub-regions, called coding units (CUs). See the picture below for an example of partitioning of a 64×64 CTB (numbers report the scan order). Each picture is further partitioned into special groups of CTBs called Slices and Tiles (see also Parallel processing).
CUs are the basic unit of prediction in HEVC. Usually smaller CUs are used around detailed areas (edges and so on), while bigger CUs are used to predict flat areas.
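To get a feeling for the numbers, here is a small Python sketch (my own illustration, not taken from any encoder) that counts how many CTBs are needed to cover a picture for each of the three CTB sizes mentioned above. The function name ctb_grid is hypothetical, and I assume that a partial CTB at the right or bottom border still counts as a whole one.

```python
import math

def ctb_grid(width, height, ctb_size=64):
    """Return (columns, rows, total) of CTBs needed to cover a picture.

    Assumption: a partially covered CTB at the border counts as one CTB.
    """
    cols = math.ceil(width / ctb_size)
    rows = math.ceil(height / ctb_size)
    return cols, rows, cols * rows

for w, h in [(1920, 1080), (3840, 2160)]:
    for size in (64, 32, 16):
        print(f"{w}x{h} with {size}x{size} CTBs:", ctb_grid(w, h, size))
```

A 1080p picture needs 510 CTBs of 64×64, against 8160 blocks of 16×16 (the AVC macroblock size), which gives an idea of how much coarser the top level of the partitioning is.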
Each CU can be recursively split into Transform Units (TUs) with the same quad-tree approach used for CTBs. Unlike AVC, which used mainly a 4×4 transform and occasionally an 8×8 transform, HEVC has several transform sizes: 32×32, 16×16, 8×8 and 4×4. From a mathematical point of view, bigger TUs are able to encode stationary signals better, while smaller TUs are better at encoding smaller “impulsive” signals. The transforms are based on the DCT (Discrete Cosine Transform), but the transform used for intra 4×4 is based on the DST (Discrete Sine Transform) instead, because several tests have shown a small improvement in compression. Transformation is performed with higher accuracy compared to H.264. The adaptive nature of CTB, CU and TU partitions, plus the higher accuracy, plus the larger transform sizes are among the most important features of HEVC and the reason for the performance improvement compared to AVC. HEVC implements a sophisticated scan order and coefficient signaling scheme that improves signaling efficiency. Note that unlike H.264 there’s no Hadamard transform nor 2×2 chroma transform (the minimum chroma transform size is 4×4). HEVC also drops the support for MBAFF or similar techniques to code interlaced video. Interlaced video can still be compressed, but there’s no separation between fields and frames (only frames).
We have introduced the new transform sizes just after the picture partitioning to exploit the analogy between CU and TU trees, but before transform and quantization there’s the prediction phase (inter or intra).
A CU can be predicted using one of eight partition modes (see picture below).
A CU contains one, two or four prediction units (PUs), but all of them must use the same kind of prediction, either inter-frame or intra-frame. Furthermore, intra-coded CUs can use only the square partitions 2Nx2N or NxN, while inter-coded CUs can use both square and asymmetric partitions. A number of other limitations are applied to simplify signaling. For example, no 4×4 prediction is allowed in inter-prediction, and 4×8 and 8×4 are allowed only with uni-directional prediction (so not with bi-prediction). As a rule, inter-prediction stops at the 8×8 level.
HEVC has 35 different intra-prediction modes (9 in AVC): DC mode, Planar mode and 33 directional modes. Like in AVC, intra prediction tries to recover information from surrounding blocks and works particularly well for flat areas. Intra prediction follows the TU partition tree, so prediction modes are applied to 4×4, 8×8, 16×16 and 32×32 TUs.
For motion vector prediction HEVC has two reference lists: L0 and L1. They can hold 16 references each, but the maximum total number of unique pictures is 8. Multiple instances of the same reference frame can be stored with different weights. HEVC motion estimation is much more complex than in AVC. It uses list indexing. There are two main prediction modes: Merge and Advanced MV. Each PU can use one of those methods and can have uni-directional (one MV) or bi-directional prediction (two MVs). In Advanced MV mode a list of candidate MVs is created (spatial and temporal candidates picked with a complex, probabilistic logic). Once the list is created, only the index of the best candidate is transmitted in the bitstream, plus the MV delta (the difference between the real MV and the prediction). On the other side, the decoder builds and continuously updates the same candidate list, using the exact same rules used by the encoder, and picks the MV to use as predictor using the index sent by the encoder in the bitstream.
The Merge mode is similar; the main difference is that the candidate list is calculated from neighboring MVs and no delta MV is added to the selected candidate. It is the equivalent of “skip” mode in AVC.
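The index-plus-delta idea can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the candidate values are invented, the candidate list is simply given (the real derivation from spatial and temporal neighbors is far more elaborate), and the “best” candidate is chosen by the smallest sum of absolute differences, which is not necessarily what a real encoder does.

```python
def encode_mv(mv, candidates):
    """Encoder side: pick the candidate closest to mv, return (index, delta)."""
    def cost(i):
        cx, cy = candidates[i]
        return abs(mv[0] - cx) + abs(mv[1] - cy)
    idx = min(range(len(candidates)), key=cost)
    cx, cy = candidates[idx]
    return idx, (mv[0] - cx, mv[1] - cy)

def decode_mv(idx, delta, candidates):
    """Decoder side: rebuild the MV from the same candidate list."""
    cx, cy = candidates[idx]
    return (cx + delta[0], cy + delta[1])

# Hypothetical candidate list, identical on encoder and decoder.
candidates = [(12, -4), (40, 8)]

# Advanced MV mode: index plus a small delta travel in the bitstream.
idx, delta = encode_mv((38, 9), candidates)
print(idx, delta, decode_mv(idx, delta, candidates))

# Merge mode: only the index is sent, the delta is implicitly zero.
print(decode_mv(idx, (0, 0), candidates))
```

The point is that both sides must build exactly the same list, otherwise the index sent by the encoder would select a different predictor in the decoder.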
Similarly to AVC, HEVC specifies motion vectors in 1/4-pel, but uses an 8-tap filter for luma and a 4-tap 1/8-pel filter for chroma. This is considerably better than the 6-tap filter used for luma and the 2-tap (bilinear) filter used for chroma in AVC. An increased sub-pixel filtering accuracy improves the efficiency of estimation and picture “stability”, but requires many more memory accesses and so more processing power (with higher battery consumption). This is why H.265 doesn’t include inter-estimation on 4×4 regions, limits 4×8 and 8×4 estimation to be uni-directional and limits bi-directional estimation to 8×8 and above. HEVC supports weighted prediction for both uni- and bi-directional PUs (the weights are always explicitly signaled).
HEVC uses up to 16 bits per MV, so at quarter-pel accuracy this means a −8192 to 8191.75 range (for luma), compared to −2048 to 2047.75 horizontally and −512 to 511.75 vertically in AVC (an increased motion compensation range for 4K and 8K resolutions).
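The ranges above follow directly from the size of a signed integer expressed in quarter-pel units. A quick Python check, where the 14 and 12 bit widths for AVC are my own inference from the ranges quoted and mv_range is just a helper name:

```python
def mv_range(bits, units_per_pel=4):
    """Range in pixels of a signed integer MV component stored in
    `bits` bits, where one unit is 1/units_per_pel of a pixel."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo / units_per_pel, hi / units_per_pel

print("HEVC, 16 bit:", mv_range(16))           # (-8192.0, 8191.75)
print("AVC horizontal, 14 bit:", mv_range(14)) # (-2048.0, 2047.75)
print("AVC vertical, 12 bit:", mv_range(12))   # (-512.0, 511.75)
```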
Unlike H.264, where deblocking was performed on 4×4 blocks, in HEVC deblocking is performed on the 8×8 grid only. This allows for parallel processing of deblocking (there’s no filter overlapping). All vertical edges in the picture are deblocked first, followed by all horizontal edges. The filter is similar to the one in AVC.
After deblocking there’s a second, optional filter called Sample Adaptive Offset, or SAO. Like the deblocking filter, it is applied in the prediction loop and the result is stored in the reference frames list. The objective of the filter is to fix mispredictions, encoding drift and banding on wide areas by subdividing the sample values into “bands” and applying adaptive offsets to them.
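A simplified Python sketch of the “band” idea may help. It assumes, as far as I recall from the standard, that the sample range is split into 32 equal bands and that offsets are sent for four consecutive bands; the function name, the sample values and the offsets are invented for illustration, and the other SAO mode (edge offset) is ignored.

```python
def sao_band_offset(samples, start_band, offsets, bit_depth=8):
    """Add an offset to samples that fall into a few consecutive bands.

    Assumptions: 32 equal bands over the sample range, offsets[k]
    applies to band start_band + k, results are clipped to the range.
    """
    shift = bit_depth - 5            # 32 bands -> top 5 bits pick the band
    max_val = (1 << bit_depth) - 1
    out = []
    for s in samples:
        k = (s >> shift) - start_band
        if 0 <= k < len(offsets):
            s = min(max(s + offsets[k], 0), max_val)
        out.append(s)
    return out

# Bands 2..5 cover the 8-bit values 16..47; everything else is untouched.
print(sao_band_offset([15, 16, 23, 24, 47, 48, 200], 2, [1, -1, 2, 0]))
```

Since dark gradients occupy only a few neighboring bands, a handful of small offsets can correct a systematic error over a wide flat area at a very low signaling cost.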
In HEVC there’s only CABAC for entropy coding. CABAC in HEVC is almost identical to CABAC in AVC, with minor changes and simplifications to allow parallel decoding.
Since HEVC decoding is much more complex than AVC decoding, several techniques to allow parallel decoding have been implemented. The most important are Tiles and Wavefront.
With Tiles, the picture is divided into a rectangular grid of groups of CTBs. Motion vector prediction and intra-prediction are not performed across tile boundaries.
With Wavefront, each CTB row can be encoded and decoded by its own thread. The encoding/decoding of multiple rows is synchronized (entropy coding state), guaranteeing that each “wavefront” CTB is surrounded by specific CTBs during encoding and decoding (see picture).
The adaptive subdivision of the picture into prediction areas, the use of advanced intra-prediction and inter-prediction, and the bigger transform sizes can guarantee, in the long term, a considerably higher efficiency of HEVC compared to AVC. But the complexity of the encoding is really much higher. For example, consider that in AVC a 16×16 macroblock could have only 2 possible sub-partitions: 16 4×4 sub-blocks, or 4 8×8 sub-blocks. Now the number of possible sub-splittings of a 64×64 CTU is enormously higher (65536). In AVC it was simple to test which of the two configurations was better for compression, but now? New techniques must be implemented to explore the quad-tree efficiently and avoid testing every one of the possible 65536 configurations.
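How fast the search space grows can be shown with a short recursive count in Python. This is my own back-of-the-envelope model: it counts only the distinct CU quad-tree layouts of one CTB (each block is either kept whole or split into four), ignoring PU modes and TU trees, so the exact figure depends on what you decide to count, but it lands in the same order of magnitude as the one quoted above.

```python
def cu_layouts(size, min_size=8):
    """Number of distinct quad-tree layouts for a square block.

    A block is either left whole (1 layout) or split into four
    sub-blocks, each of which is partitioned independently.
    """
    if size <= min_size:
        return 1
    return 1 + cu_layouts(size // 2, min_size) ** 4

for size in (16, 32, 64):
    print(f"{size}x{size}:", cu_layouts(size))
```

Each additional level raises the previous count to the fourth power, which is why an exhaustive search is out of the question and encoders rely on early-termination heuristics.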
Like AVC before, HEVC is a big optimization challenge, but the potential is enormous. In the next blog post we will take a look at the state of the art in H.265 encoding at the beginning of 2015.
I must admit, I’m feeling very guilty. This is the only new post in more than a year. 2013 has been wonderful from a professional point of view and I have had very few moments, if any, to dedicate to the blog. But for 2014 there are too many interesting trends that I can’t neglect anymore, so I want to get back to talking about video encoding, streaming and OTT technologies.
In fact, you know that there are three magic “words” that are outlining the future of video: 4K, HEVC and DASH.
So, as a 2014 New Year’s resolution, I’m planning to talk about ideas and optimizations related to the “magic trio”.
4K or not 4K ?
The first trend is rapidly gaining momentum. “4K” is on every insider’s lips, and the effort of Youtube, Netflix and others to quickly offer 4K content is also opening new opportunities for selling 4K TVs and monitors.
I’m focusing part of my research on finding specific optimizations for H.264 encoding of 4K content. In fact I think that, apart from the marketing buzz, 4K will be served first using the well known H.264.
There are several optimizations to explore for 4K: for example custom quantization matrices, a bias toward the use of the 8×8 transform and changes in psychovisual optimizations, to name a few. 4K also pushes the limits of H.264 for motion compensation and estimation (MVs become too long), creating several efficiency problems. But if it is useful to optimize an HD or FullHD stream, it is much more crucial to super-optimize a 4K stream, because the level of bitrate we are talking about is difficult to get over the Internet, or to get consistently.
ABR streaming can help here, but not as usual. Who can accept watching a 2.5Mbit/s 720p rendition on an 80” 4K display because of low bandwidth at peak times? (It is the same experience as watching a 360p video on a 40” screen from a distance of 1.5 m, try it and tell me.) Whoever buys a 4K set wants 4K, no compromise. Furthermore, as Dan Rayburn underlined, there are few economic reasons to offer 4K because 4K delivery costs 3-4 times as much as Full-HD. This is why I think that optimization is now more important than ever.
HEVC has finally been ratified. Like in 2003, when H.264 was ratified, the encoders are now very raw and inefficient and a lot of work is still to be done, but the potential is all there. Theoretically HEVC is said to be from 30 to 50% more efficient than H.264 (with higher efficiency at higher resolutions). So it is no mystery that 4K and H.265 are seen as the winning couple. But the increase in pixels to be processed (8x passing from 1080p25/30 to 2160p50/60) and the complexity of the new codec (approx. 10x during encoding compared to H.264) do not make for a simple scenario, with increases in required processing power of up to an 80x factor. But hey…we are now like in 2003, we have maybe 10 years ahead to squeeze the max out of H.265, and this is very exciting. In the meanwhile, H.264 still has some room for improvement and for at least a couple of years will continue to be the king of the hill.
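The 80x figure is simply the product of the two factors quoted above. A quick Python check of the arithmetic, where the 10x codec complexity is the approximate figure from the text and not something I measured here:

```python
def pixel_rate(width, height, fps):
    """Pixels to be processed per second."""
    return width * height * fps

# 4x the pixels per frame and 2x the frames per second
pixel_factor = pixel_rate(3840, 2160, 50) / pixel_rate(1920, 1080, 25)
codec_factor = 10  # approximate H.265 vs H.264 encoding complexity (from the text)

print("pixel rate factor:", pixel_factor)
print("overall processing factor:", pixel_factor * codec_factor)
```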
I have started to play with HEVC and the amount of time I dedicate to experiments will probably increase steadily during 2014. So far I have collected interesting results. The bigger block transforms (not only 4×4 and 8×8 like in H.264 but also 16×16 and 32×32), plus some advanced deblocking and adaptive filtering, are able to produce a much “smoother degradation” of quality when decreasing the bitrate, especially for high complexity scenes. On the other hand, the different handling of fine details currently produces less detail retention than H.264, and new approaches to psychovisual optimization are all to be invented.
And VP9? Interesting technology, good potential. Will it be successful? Hard to tell; until then I will continue to keep it under observation.
Last but not least there’s the new MPEG standard for ABR streaming, MPEG DASH (Dynamic Adaptive Streaming over HTTP). HLS is spreading over various devices, but at the same time the implementations are frequently buggy and offer little control. DASH, on the other hand, provides plenty of control and makes it possible to change the heuristics. This is very important to achieve the highest possible QoE (or QoS), a key factor in a future where the CDNs’ cost per GB is flattening while the number of viewers and the stream size/quality are increasing.
So stay tuned.
PART I – Introduction (revised 02-jul-2012)
PART II – Parameters and recipes (revised 02-jul-2012)
PART III – Encoding in H.264 (revised 02-jul-2012)
PART IV – FFmpeg for streaming (revised 02-jul-2012)
PART V – Advanced usage (revised, 19-oct-2012)
PART VI – Filtering (new, 19-oct-2012)
The fabulous world of FFmpeg filtering
Transcoding is not a “static” matter. It is dynamic, because you may have a very wide range of content types as input and you may have to set encoding parameters accordingly (this is particularly true for user generated content).
Not only that: the processing you need to do in a video project may go beyond a simple transcoding and require a deeper capacity for analysis, handling and “filtering” of video files.
Let’s consider some examples:
1. you have input files of several resolutions and aspect ratios and you have to encode them to two target output formats (one for 16:9 and one for 4:3). In this case you need to analyze the input file and decide which profile to apply depending on the input aspect ratio.
2. now let’s suppose you also want to encode video at the target resolution only if the input has an equal or higher resolution, and keep the original otherwise. Again you’d need some external logic to read the metadata of the input and set up a dedicated encoding profile.
3. sometimes video needs to be filtered, scaled and filtered again: for instance deinterlacing, watermarking and denoising. You need to be able to specify a sequence of filtering and/or manipulation tasks.
4. everybody needs thumbnail generation, but it’s difficult to find a shot really representative of the video content. Grabbing shots only on scene changes may be far more efficient.
FFmpeg can satisfy these kinds of complex analysis, handling and filtering tasks even without external logic, using the embedded filtering engine ( -vf ). For very complex workflows an external controller is still necessary, but filters come in handy when you need to do the job straight and simple.
FFmpeg filtering is a wide topic because there are hundreds of filters and thousands of combinations. So, using the same “recipe” style of the previous articles of this series, I’ll try to solve some common problems with specific command line samples focused on filtering. Note that to simplify the command lines I’ll omit the parameters dedicated to H.264 and AAC encoding. Take a look at the previous articles for such information.
1. Adaptive Resize
In FFmpeg you can use the -s switch to set the resolution of the output, but this is not a flexible solution. Far more control is provided by the “scale” filter. The following command line scales the input to the desired resolution the same way as -s:
ffmpeg -i input.mp4 -vf "scale=640:360" output.mp4
But scale also gives you a way to specify only the vertical or the horizontal resolution and have the other one calculated to keep the same aspect ratio as the input:
ffmpeg -i input.mp4 -vf "scale=640:-1" output.mp4
With -1 as the vertical resolution you delegate to FFmpeg the calculation of the right value to keep the same aspect ratio as the input (default) or to obtain the aspect ratio specified with the -aspect switch (if present). Unfortunately, depending on the input resolution, this may end up with an odd value, while H.264 encoding requires dimensions divisible by 2. To enforce a “divisible by x” rule, you can simply use the embedded expression evaluation engine:
ffmpeg -i input.mp4 -vf "scale=640:trunc(ow/a/2)*2" output.mp4
The expression trunc(ow/a/2)*2 as vertical resolution means: use as output height the output width (ow, in this case 640) divided by the input aspect ratio and rounded down to a multiple of 2 (I’m sure most of you are familiar with this kind of calculation).
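If you want to check what the expression yields before launching a batch, the same calculation can be reproduced in a few lines of Python. This is just a sketch of the arithmetic: even_height is a name of my own, and I use integer math (a is iw/ih, so ow/a/2 equals ow*ih/(2*iw)) to avoid floating point surprises.

```python
def even_height(out_width, in_width, in_height):
    """Mirror of trunc(ow/a/2)*2 with a = in_width / in_height.

    Integer math: trunc(ow / a / 2) == (ow * ih) // (2 * iw)
    for positive values.
    """
    return (out_width * in_height) // (2 * in_width) * 2

print(even_height(640, 1920, 1080))  # 16:9 source
print(even_height(640, 854, 480))    # 359.7 would be the exact height
print(even_height(640, 720, 576))    # 5:4 storage resolution
```

For the 854×480 input the exact height would be about 359.7, and the expression brings it down to 358, the closest even value below.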
2. Conditional resize
Let’s go further and find a solution to problem 2 mentioned above: how to skip the resize if the input resolution is lower than the target?
ffmpeg -i input.mp4 -vf "scale='min(640,iw)':'trunc(ow/a/2)*2'" output.mp4
This command line uses as width the minimum between 640 and the input width (iw), and then scales the height to maintain the original aspect ratio. Notice that a “,” inside an expression has to be protected from FFmpeg’s filtergraph parser, which otherwise reads it as a separator between filters: either wrap the expression in single quotes, as above, or escape it as “\,”.
With this kind of filtering you can easily set up a command line for massive batch transcoding that smartly adapts the output resolution to the target. Why use the original resolution when it is lower than the target? Well, if you encode with -crf this may help you save a lot of bandwidth!
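The same decision can be mirrored in Python, for example in a batch script that wants to know the output size in advance. This is a sketch of the logic only; target_size and the default of 640 are my own choices for the example.

```python
def target_size(in_width, in_height, max_width=640):
    """Never upscale: width is min(max_width, input width), height
    keeps the input aspect ratio and is rounded down to an even value."""
    out_w = min(max_width, in_width)
    out_h = (out_w * in_height) // (2 * in_width) * 2
    return out_w, out_h

print(target_size(1920, 1080))  # bigger than the target: scaled down
print(target_size(480, 270))    # smaller than the target: left as is
```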
3. Deinterlace
SD content is almost always interlaced and FullHD is very often interlaced. If you encode for the web you need to deinterlace and produce a progressive video, which is also easier to compress. FFmpeg has a good deinterlacing filter named yadif (yet another deinterlacing filter) which is more efficient than the standard -deinterlace switch.
ffmpeg -i input.mp4 -vf "yadif=0:-1:0,scale=trunc(iw/4)*2:trunc(ih/4)*2" output.mp4
This command deinterlaces the source (field parity is detected automatically; set the third yadif parameter to 1 to process only the frames flagged as interlaced) and then scales down to half the horizontal and vertical resolution, rounded down to even values. In this case the sequence is mandatory: always deinterlace prior to scaling!
4. Interlacing aware scaling
Sometimes, especially if you work on IPTV projects, you may need to encode interlaced (this is because legacy STBs require interlaced content and also because interlaced video may have higher temporal resolution). This is simple: just add -tff or -bff (top field first or bottom field first) to the x264 parameters. But there’s a problem: when you start from 1080i and want to go down to an interlaced SD output (576i or 480i) you need interlacing-aware scaling, because standard scaling will break the interlacing. No fear, FFmpeg has recently introduced this option in the scale filter:
ffmpeg -i input.mp4 -vf "scale=720:576:interl=-1" output.mp4
The interl option of the scale filter is dedicated to interlaced scaling (older builds also accepted it as a third positional value). -1 means automatic detection; use 1 instead to force interlacing-aware scaling.
5. Denoise
When seeking a high compression ratio it is very useful to reduce the video noise of the input. There are several possibilities; my favorite is the hqdn3d filter (high quality de-noising 3d filter)
ffmpeg -i input.mp4 -vf "yadif,hqdn3d=1.5:1.5:6:6,scale=640:360" output.mp4
The filter can denoise video using a spatial function (the first two parameters set its strength) and a temporal function (the last two parameters). Depending on the type of source (level of motion), either the spatial or the temporal function can be more useful. Pay attention also to the order of the filters: deinterlace -> denoise -> scale is usually the best.
6. Select only specific frames from input
Sometimes you need to control which frames are passed to the encoding stage, or more simply change the fps. Here are some useful usages of the select filter:
ffmpeg -i input.mp4 -vf "select='eq(pict_type,I)'" output.mp4
This sample command filters out every frame that is not an I-frame. This is useful when you know the GOP structure of the original and want to create a fast preview of the video as output. Specifying a frame rate for the output with -r accelerates the playback, while using -vsync 0 copies the pts from the input and keeps the playback real-time.
Note: the previous command is similar to the input switch -skip_frame nokey ( -skip_frame bidir drops b-frames instead during decoding, useful to speed up the decoding of big files in special cases).
ffmpeg -i input.mp4 -vf "select='not(mod(n,3))'" output.mp4
This command selects one frame out of every 3, so it is possible to decimate the original frame rate by an integer factor N, which is useful for mobile low-bitrate encoding.
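To see which frames survive, the expression not(mod(n,3)) can be mimicked in Python, where n is the frame counter starting from 0. The function name and the 30 fps input are just assumptions for the example.

```python
def decimate(frame_numbers, n=3):
    """Keep the frames whose index is a multiple of n,
    like select='not(mod(n,3))' does for n=3."""
    return [f for f in frame_numbers if f % n == 0]

kept = decimate(range(10), 3)
print(kept)                       # indices of the frames that are kept
print("output fps:", 30 / 3)      # a 30 fps input becomes 10 fps
```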
7. Speed-up or slow-down the video
It is also fun to play with PTS (presentation time stamps)
ffmpeg -i input.mp4 -vf "setpts=0.5*PTS" output.mp4
Use this to speed up your video by a factor of 2 (frames are dropped accordingly), or the one below to slow it down:
ffmpeg -i input.mp4 -vf "setpts=2.0*PTS" output.mp4
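What setpts does to the timeline can be shown with a tiny Python model. The timestamps below are invented (four frames at 25 fps, expressed in milliseconds) and retime is a hypothetical helper: multiplying every PTS by a factor below 1 compresses the timeline, a factor above 1 stretches it.

```python
def retime(pts_list, factor):
    """Scale every presentation timestamp, like setpts=factor*PTS."""
    return [p * factor for p in pts_list]

pts_ms = [0, 40, 80, 120]          # four frames at 25 fps
print(retime(pts_ms, 0.5))         # twice as fast
print(retime(pts_ms, 2.0))         # half speed
```

With a fixed output frame rate, the compressed timeline has more frames than slots, so some are dropped; the stretched one has fewer, so frames get repeated.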
8. Generate thumbnails on scene changes
The thumbnail filter tries to find the most representative frames in the video. It is a good way to generate thumbnails.
ffmpeg -i input.mp4 -vf "thumbnail,scale=640:360" -frames:v 1 thumb.png
A different way to achieve this is to use the select filter again. The following command selects only frames that have more than 40% of changes compared to the previous one (and so are probably scene changes) and generates a sequence of 5 png files.
ffmpeg -i input.mp4 -vf "select='gt(scene,0.4)',scale=640:360" -frames:v 5 thumb%03d.png
The world of FFmpeg filtering is very wide and this is only a quick and “filtered” view of it. Let me know in the comments or on twitter (@sonnati) if you need more complex filters or have problems venturing into this fabulous world 😉