In the previous post of this 2 parts series, I have analyzed the technical features of the codec VP9 and concluded that, technically speaking, VP9 has the basis to compete with HEVC in terms of encoding efficiency.
But, you know, theory is a different thing than reality and in video encoding a big part of the final efficiency is in the encoder implementation more than in the codec specification. In this regard VP9 is not an exception and what I see from my tests is that vpxenc (the open source, command line encoder provided by Google) is not yet fully mature and optimized for every scenarios. I’ll discuss about this latest distinction more over.
VP9 specification has many features that can be used to enhance perceptual-aware encoding (like “segmentation”, to modulate quantization and filters inside frames according to perception of different areas of each frame). But those features are not yet used in vpxenc and this is clearly visible in the results.
At the beinning of 2015 I evaluated the performance of several H265 encoders for my clients and published a quick summary of the advantages and problems I found in (that time) HEVC encoders compared to optimized H264. The main problem that emerged in that evaluation was the inefficiency of “Adaptive Quantization” and other psycovisual techniques implemented in the encoders under test. The situation has partially changed for HEVC encoders during last year (thanks to better psycovisual encoding, especially for x265) but grain and noise retantion, especially in dark areas, is always a challenge for codecs exploiting big “transformations” like H265 and, indeed VP9.
Vp9 today shows the same inefficiencies of HEVC 1 years and half ago. It is quite good in handling motion related complexity, thanks to advanced motion estimation and compensation and reconstructs with high fidelity low and medium spatial frequencies, but has difficulties in retaining very high frequencies. Fine film grain disappears even at medium bitrates and the “banding” artifact is very visible in flat areas, gradients and dark areas even at high bitrates. In this regard H264 is still much better, at least at medium-high bitrates. Those kinds of artifact are quite common on Youtube because they are using now VP9 everytime they can, so try by yourself a 1080p or 2160p video on Chrome and take a look at gradients and shadows.
The sad thing is that common quality metrics like PSNR, SSIM (but also the more sofisticated VQM) are more happy with a flat encoding than with a psyco-visually pleasant, but not exact – encoding, and at the end, VP9 may be superior in PSNR or SSIM to H264/H265 even in a comparison like that of Picture 2 below where is very evident the banding or “posterization” effect.
VP9 profile 2 – 10bit per component
Until now I’ve spoken about traditional 8bits/component encoding in H264, H265 and VP9. But vpxenc supports also a 10bits per component encoding known as VP9 profile 2.
Even if your content is at 8bit and everything remains BT.709 compliant, several studies has demonstrated that 10bit encoding is always capable of better quality/bitrate ratios thanks to higher internal accuracy. In particular the benefits are well visible in gradients and dark areas’ accuracy. See this example of VP9 8bit vs 10bit:
In the picture above we can see the better rendering of soft gradients when encoding at 10bits even if the source is 8bits. Grain (high freq, low power signal) is still not retained compared to the source but banding is pretty much reduced. Note also that in the case of VP9 profile 0 we need to increase the bitrate well above 3Mbps to have a good encoding of gradients (for 1080p) while at only 1Mbps the result is in this case sufficient when using profile 2.
The superiority of 10bits encoding has been always valid also for H264 (high10 profile), so why 10bits have started to gain momentum only with HDR and not before ?
The answear is “lack of players” on consumer’s devices. Let’s remember that H264 has become relatively early the standard in internet video only because Adobe decided to insert (at it’s own expense) a decoder inside Flash Player 9 (2007). This enabled a billion desktops to playback baseline, main and high AVC profile. Few know that originally it should support also high10 but a bug ruined the opportunity to actually use this function.
Apart this missed opportunity, H264 decoders on modern browsers, mobile devices, TVs, STBs are not capable to decode H264 high10 profile and the same is true for VP9.
Where is VP9 available now ?
Today VP9 is supported in lastest Chrome, Firefox, Opera (and Edge in preview) browsers on desktop (PC and Mac) and is supported in Android from version 4.4 on (software or hardware decoding depending by device). It is also available on an increasing number of Connected TV, but all the current (significative) decoders support only VP9 in mode 0, so 8bit.
The same problem is true for H265. On the mobile devices that support it, you can only deliver 8bit H265, but in this case it is also true that the large majority of 4K TVs support HEVC main10 profile as well.
So, when is convenient to use VP9 ?
The problem of “banding artefact” is directly proportional to the size of the display. It is irrelevant on small displays like that of smart phones and tablets. On laptop it starts to become visible and is pretty bad on big TVs.
So, concluding, I think that today VP9 is an interesting option for everyone who wants:
– The maximum quality-bitrate ratio on desktop even with some compromises in terms of quality. HEVC decoding will probably not appear on desktop for a long time, so VP9 is the only viable improvement over H264. The use case of live streaming can better fit the compromises.
– High efficiency on Android with a wide support base (Android >4.4). On an old, 100$ Android Phone I have, VP9 decoding works and HEVC not. Interesting option for markets of developing countries when bandwidth is scarce and Android has a bigger base than iOS.
If the current situation doesn’t change I doubt that players like Netflix will deliver high quality content on Desktop or TV using VP9 in profile 0, especially for 4K. And infact David Ronca of Netflix has said that they are evaluating VP9 especially to lower the level of access for mobile devices (they already use HEVC for HDR-10).
But fortunately the scenario is probably about to change quickly if it’s true that Youtube is planning to deliver HDR (=10bits) with VP9 during summer. This means that TVs with Vp9 profile 2 decoding capabilities are becoming a reality and this should open the way also for profile 2 on desktop browsers. In this case (and I’m optimistic), VP9 has really good chances to definitively become the successor of H.264 at least for Internet Video on Desktop and Android.
Remain to see what Apple will decide to do. In the while I’m starting to push VP9 in my strategies because Indeed I think that their choices are irrelevant. If we want to optimize a video delivery service it is increasingly clear that we will have to optimize for all 3 codecs.
A technical primer
VP9 is a modern video codec developed by Google as the successor of VP8. While VP8 was aimed at offering an open alternative di AVC (aka H.264), VP9 challenges the latest HEVC (aka H.265). Google follows with VP9 the same model of “open” codec used for VP8 (the fact to be really open and free from patents related threats is still object of debates) and this theoretically makes of VP9 an interesting alternative to HEVC which is burdened by unclear and unsettled claims by multiple patents holders and patent pools.
VP9 specification has been freezed in June 2013 but only recently it is starting to attract attention of players that want to optimize video distribution (Youtube has been the only big adopter during last year, but now also Netflix is evaluating to use it). This is because VP9’s and HEVC’s ecosystems have finally reached a minimum level of maturity and is now possible to do evaluations and comparison with a sufficient level of confidency.
In this short serie of blog posts I analyze VP9 and try to understand if it really deserves attention and why. In this first part we will take a look at the technical specifications compared to HEVC (analyzed in this previous post) and in the second part I’ll analyze the actual performances, limits and contexts in which is possible to use VP9 as a valid alternative to AVC or HEVC.
VP9 subdivides the picture in “super blocks”. Similarly to HEVC, in VP9 super blocks can be recursively divided in smaller blocks down to 4×4. Differently from HEVC that can subdivide only in square sub partitions (32×32, 16×16, 8×8) VP9 can also use not square partitions like 32×16, 8×16 and so on (the use of rectangular partitions stops subdivision in the quad-tree branch). Most decisions are taken at level 8×8 (“skip” signaling for example) and 4×4 is a special case of 8×8. prediction mode, reference frame, MV, transform types are specified at block level.
Like VP8, VP9 uses an 8bit arithmetic coding engine known as the bool-coder. It use a static per-frame statistical model compared to an adaptive stat model like cabac used in AVC/HEVC. For each frame, the more convenient statistical model is choosen from a pool of four.
Similarly to H265, VP9 uses 4 transform sizes: 32×32, 16×16, 8×8 and 4×4. Transformations are integer approximations of DCT (Discrete Cosine Transform) or DST (Discrete Sine Transform), a mix of the two are used depending by the type of frame and transform size. Coefficients are scanned with particular patterns (different from the zig-zag patterns of H26x codecs, but with the same logic).
VP9 uses 4 scaling factors: a couple for Luma DC and AC coefficients, and a couple for Chroma DC/AC. The set of quantizers are fixed at frame level, so there is no block-level QP adjustment contrary to AVC/HEVC (but the not mandatory feature “segmentation” should be able to achieve the same effect of an adaptive quantization).
VP9 supports also a special lossless mode that uses only a Walsh transform on 4×4 blocks.
Intra prediction is a bit less complex than what offered by HEVC. Intra prediction acts on transformation blocks and there are 8 directional prediction modes and 2 not-directional compared to the 35 modes of HEVC
VP9 uses 1/8th pel motion compensation (double the precision of AVC). A novel feature is the possibility to use normal, smooth or sharp 8th pel interpolation filter (+bilinear). The proper version of the filter can be changed at block level.
Because of patents VP9 doesn’t use bidirectional motion estimation and compensation, so each block has normally only a single forward motion vector. However VP9 uses “compound prediction” where there are two motion vectors and the two predictions averaged together. To avoid patents, “compound prediction” is enabled only on not visible frames (commonly referred as “AltRef”). AltRef can be “constructed” during decoding, are not visible but can be used later as references. Since it’s possible to anticipate in an AltRef a future frame and use it as reference in compound mode, VP9 officially has no B-frames but in fact it has something completely equivalent.
Motion vectors in a frame can point to one of three possible reference frames usually named Last, Golden and AltRef. Ref frame to be used is signaled at 8×8 granularity. The decoder holds a list of 8 reference frames (slots) from which Last, Golden and AltRef refs are choosen at frame level. After decoding, the current frame can (optionally) substitute one of the 8 slots in the pool. An interesting feature of VP9 is the possibility to scale down frames during encoding (not on iframes). Inter predictors and reference frames are scaled accordingly.
Motion vector prediction is similar in complexity to HEVC. A 2-entry list of predictor is build during encoding and decoding. The first predictor is based on surrounding blocks, the second on previous frame. In case of empty list a vector 0,0 is used. So for each block the bitstream can signal to use:
-the first predictor plus a delta
-the first predictor as is
-the second predictor as is
-simply use motion vector [0,0]
There are 3 possible filters at different strength. VP9 makes a flatness test at boundaries of blocks and if the result is higher than a threshold, one of filter is applied to conceil blockiness.
Segmentation groups together blocks with similar characteristics. It is possible to change some encoding techniques at group level. This feature is dedicated to implement encoding optimizations (including psycovisual optimizations) and require an active support in the encoder.
The standard VP9 (profile 0) supports only a 8bit – 4:2:0 color mode while the (optional for hardware) profile 1 supports also 4:2:0 / 4:4:4 and optional alpha. In August 2015 Google has released a new version of the reference encoder capable to support the new profile 2 profile 2 (10-12bit -4:2:0) and profile 3 (10-12bit -4:2:2 / 4:4:4 + alpha). Profile 2 is aimed at supporting HDR video in Youtube (expected for summer 2016).
VP9 compared to HEVC
From a technical point of view, VP9 appears to be very near to HEVC as potential efficiency. The actual performance depends by the efficiency of the real encoders, but VP9 has all the potentialities to reach (almost, see below) the same performance of HEVC.
VP9 is a bit sub-par in terms of intra frame prediction (less modes) and of entropy coding (static tables vs adaptive). HEVC appears also to have an higher number of modes and small strategies to reduce the cost of syntax and signaling as well as residuals but on the other end, VP9 has some interesting potentialities in psycovisual optimizations and rate-control thanks to segmentation and adaptive frame resolution.
We will see in the next post the level of efficiency now reached by VP9 encoder compared to AVC and HEVC and the level of maturity of the respective ecosystems.