In the previous post of this 2 parts series, I have analyzed the technical features of the codec VP9 and concluded that, technically speaking, VP9 has the basis to compete with HEVC in terms of encoding efficiency.
But, you know, theory is a different thing than reality and in video encoding a big part of the final efficiency is in the encoder implementation more than in the codec specification. In this regard VP9 is not an exception and what I see from my tests is that vpxenc (the open source, command line encoder provided by Google) is not yet fully mature and optimized for every scenarios. I’ll discuss about this latest distinction more over.
VP9 specification has many features that can be used to enhance perceptual-aware encoding (like “segmentation”, to modulate quantization and filters inside frames according to perception of different areas of each frame). But those features are not yet used in vpxenc and this is clearly visible in the results.
At the beinning of 2015 I evaluated the performance of several H265 encoders for my clients and published a quick summary of the advantages and problems I found in (that time) HEVC encoders compared to optimized H264. The main problem that emerged in that evaluation was the inefficiency of “Adaptive Quantization” and other psycovisual techniques implemented in the encoders under test. The situation has partially changed for HEVC encoders during last year (thanks to better psycovisual encoding, especially for x265) but grain and noise retantion, especially in dark areas, is always a challenge for codecs exploiting big “transformations” like H265 and, indeed VP9.
Vp9 today shows the same inefficiencies of HEVC 1 years and half ago. It is quite good in handling motion related complexity, thanks to advanced motion estimation and compensation and reconstructs with high fidelity low and medium spatial frequencies, but has difficulties in retaining very high frequencies. Fine film grain disappears even at medium bitrates and the “banding” artifact is very visible in flat areas, gradients and dark areas even at high bitrates. In this regard H264 is still much better, at least at medium-high bitrates. Those kinds of artifact are quite common on Youtube because they are using now VP9 everytime they can, so try by yourself a 1080p or 2160p video on Chrome and take a look at gradients and shadows.
The sad thing is that common quality metrics like PSNR, SSIM (but also the more sofisticated VQM) are more happy with a flat encoding than with a psyco-visually pleasant, but not exact – encoding, and at the end, VP9 may be superior in PSNR or SSIM to H264/H265 even in a comparison like that of Picture 2 below where is very evident the banding or “posterization” effect.
VP9 profile 2 – 10bit per component
Until now I’ve spoken about traditional 8bits/component encoding in H264, H265 and VP9. But vpxenc supports also a 10bits per component encoding known as VP9 profile 2.
Even if your content is at 8bit and everything remains BT.709 compliant, several studies has demonstrated that 10bit encoding is always capable of better quality/bitrate ratios thanks to higher internal accuracy. In particular the benefits are well visible in gradients and dark areas’ accuracy. See this example of VP9 8bit vs 10bit:
In the picture above we can see the better rendering of soft gradients when encoding at 10bits even if the source is 8bits. Grain (high freq, low power signal) is still not retained compared to the source but banding is pretty much reduced. Note also that in the case of VP9 profile 0 we need to increase the bitrate well above 3Mbps to have a good encoding of gradients (for 1080p) while at only 1Mbps the result is in this case sufficient when using profile 2.
The superiority of 10bits encoding has been always valid also for H264 (high10 profile), so why 10bits have started to gain momentum only with HDR and not before ?
The answear is “lack of players” on consumer’s devices. Let’s remember that H264 has become relatively early the standard in internet video only because Adobe decided to insert (at it’s own expense) a decoder inside Flash Player 9 (2007). This enabled a billion desktops to playback baseline, main and high AVC profile. Few know that originally it should support also high10 but a bug ruined the opportunity to actually use this function.
Apart this missed opportunity, H264 decoders on modern browsers, mobile devices, TVs, STBs are not capable to decode H264 high10 profile and the same is true for VP9.
Where is VP9 available now ?
Today VP9 is supported in lastest Chrome, Firefox, Opera (and Edge in preview) browsers on desktop (PC and Mac) and is supported in Android from version 4.4 on (software or hardware decoding depending by device). It is also available on an increasing number of Connected TV, but all the current (significative) decoders support only VP9 in mode 0, so 8bit.
The same problem is true for H265. On the mobile devices that support it, you can only deliver 8bit H265, but in this case it is also true that the large majority of 4K TVs support HEVC main10 profile as well.
So, when is convenient to use VP9 ?
The problem of “banding artefact” is directly proportional to the size of the display. It is irrelevant on small displays like that of smart phones and tablets. On laptop it starts to become visible and is pretty bad on big TVs.
So, concluding, I think that today VP9 is an interesting option for everyone who wants:
– The maximum quality-bitrate ratio on desktop even with some compromises in terms of quality. HEVC decoding will probably not appear on desktop for a long time, so VP9 is the only viable improvement over H264. The use case of live streaming can better fit the compromises.
– High efficiency on Android with a wide support base (Android >4.4). On an old, 100$ Android Phone I have, VP9 decoding works and HEVC not. Interesting option for markets of developing countries when bandwidth is scarce and Android has a bigger base than iOS.
If the current situation doesn’t change I doubt that players like Netflix will deliver high quality content on Desktop or TV using VP9 in profile 0, especially for 4K. And infact David Ronca of Netflix has said that they are evaluating VP9 especially to lower the level of access for mobile devices (they already use HEVC for HDR-10).
But fortunately the scenario is probably about to change quickly if it’s true that Youtube is planning to deliver HDR (=10bits) with VP9 during summer. This means that TVs with Vp9 profile 2 decoding capabilities are becoming a reality and this should open the way also for profile 2 on desktop browsers. In this case (and I’m optimistic), VP9 has really good chances to definitively become the successor of H.264 at least for Internet Video on Desktop and Android.
Remain to see what Apple will decide to do. In the while I’m starting to push VP9 in my strategies because Indeed I think that their choices are irrelevant. If we want to optimize a video delivery service it is increasingly clear that we will have to optimize for all 3 codecs.
In the first post of this mini serie I have analyzed the technical features of H.265 (aka HEVC) compared with the good old H.264 (aka AVC). Summarizing, HEVC pushes the traditional block-based video encoding paradigm to higher levels of efficiency (and also complexity from an encoding/decoding p.o.v.) thanks mainly to:
– variable size transforms (from 4×4 to 32×32)
– quad-tree structured prediction areas (from 64×64 to 4×4)
– candidate-list-based motion vector prediction
– many intra-frame predictions modes
– higher-accuracy filters for motion compensation
– optimized deblocking, SAO filtering, cabac, etc…
It’s interesting to note that, compared to any other previous step from H.261 to H.264, with H.265 we have a considerable improvement not only (or mainly) in inter-frame compression domain but in intra-frame compression as well. A consistent amount of data in H.264 streaming is today concentrated in i-frames and this is because intra-frame compression is considerably less “evoluted” compared to inter-frame where, for example, b-frames help a lot in compression. H.265 introduces a strong improvement in block compression (in any kind of frame) thanks to variable size transforms. The possibility to use smaller transforms for impulsive signals and bigger transforms for stationary signals (smooth areas in case of pictures) is not new in signal-processing discipline and is used for example in AAC and many other codecs. Variable size transforms increase compression efficiency but introduce also some new challenges…but let’s procede one step at a time.
Video encoding is a complex problem that is highly dependent on the content. It is well known that a low motion scene with static background and bright lights can be compressed much more than a high motion, dark action scene with most of the picture that is moving. So what are the most difficult scenes/situations that a modern codec like H.264 has to cope with ? Even an efficient encoder may still find difficulty in compressing:
– detailed keyframes: without references to count on (and with not so efficient intraframe prediction), compressing keyframe is still difficult especially when they are features-rich (ex: a forest). If the keyframe is at the beginning of a quiet scene, the high efficiency of motion predition and compensation on low motion allows for overall efficient compression (most of the data can be allocated on the keyframe), but a sudden increase in complexity (motion) during the GOP can easily push an encoder to crisis.
– high motion with “crisp” picture: predict high complexity motion is quite difficult in itself. Mix this with high spatial complexity and you will have a consistent spike in bitrate and/or an increasing amount of artifacts.
– slow motion in dark areas: encoding dark areas is challenging because eyes are more sensible to details in dark than in full light but if you add slow motions of textured objects or smoke or small changes in colors and shadows, it is quite easy to spot annoing artifacts even using adaptive quantizations or similar optimizations.
– noise/grain: noise is almost incompressible by definition (it’s random and “unpredictable” by nature). Fortunately eye is more sensible to grain and noise in specific areas of picture like flat areas and dark areas and less in bright and detailed areas so a smart encoder can move bit-budget where is more needed. Nonetheless it’s quite difficult di compress noisy content, especially noise in fast moving scenes. Compressed noise is easily spotted because creates ugly patterns at lower frequencies and interfere with motion estimation/compensation (“dragged” artifacts). Denoising is not always suitable and/or desired, and unfortunately noise modelling and reconstruction during playback continue to be an “option” in hevc specification (watch this experiment about syntetic grain reconstruction).
H.265 mitigates the fist two cases compared to H.264. As said above, it’s quite efficient in intra-frame encoding and so detailed area can be encoded well and also smooth areas and gradients. Even motion estimation and compensation is effective and so compared to H.264, H.265 is able to operate at much lower bitrates before the appearance of artifacts. Furthermore, the artifacts produced by H.265 are more “smooth” and the degradation of quality is more “armonious” and good looking even when encoding at very aggressive resolution/bitrate ratios.
However, every coin has a flip side, and the strength of H.265 may become a weakness when processing the last two problematic cases. Dark areas and noise/grain require a more accurate (not matematically but “perceptually”) retention of high frequencies and small changes in color levels. This is usually called psy-optimization of encoding. In H.264, that uses s small transform, is easier to turn a quantization error into features/details that are not identical to the original but perceptually “similar”. The error generated in the approximation of the original frequency domain is stopped by the small boundary of the transform and thus more controllable. In H.265 with bigger trasforms is much more complex to use this approach and new ideas have to be put on the table.
H.265 vs H.264 today
In the last years I have developed optimizations approaches that analyze the video specifically for complex sequences and optimize them (adaptive source filtering, adaptive encoding parameterizations, specific rate control optimizations). Today I’m working into porting such optimizations to H.265 and so I’m “playing” with several H.265 encoders (i.e.: Divx H.265, x265, f265, NTT H.265 enc)
For the reasons forementioned we are today (jan 2015) in a situation where a good H.265 encoder is superior to a good H.264 encoder in encoding feature-rich keyframes (and blocks in general) and high motion providing a much smoother degradation of quality over lower bitrates. But at the same time, a good H.264 is still able to provide the same quality or even better quality in dark areas and noisy/grainy pictures. When the playback is done on Mobile devices this is not much visible because of the high DPI, but on a big TV screen this is evident on complex sequences.
The picture below show you an examples of what I mean:
I’m not saying that H.264 “IS” better than H.265 but that today encoders show a not completely mature level of development. This is quite normal and expected, as in the past (2003-2005) it happened to H.264 compared to xvid or to the best MPEG2 encoders (especially when working at medium-high bitrates). The problem is present also in 4K, even if in this case it is slightly mitigated by pixel size. The necessity to offer a good quality even in complex situations force the content providers willing to stream in 4K to use higher bitrates than otherwise necessary. A partial way to mitigate the problem of dark areas is to use 10bit per color in compression instead of 8bit. The additional accuracy is usually able to provide a better perceptual quality. Also when encoding in H.264 the use of 10bit helped a lot but was almost impossible to use in production because of the lack of support in decoders.
Generally speaking, the quality we can achieve today with H.264 in 1080p @3-4Mbit/s can be matched (except for dark areas) by H.265 at around 2-2.5Mbit/s. But difficult areas are…difficult and this require much attention during compression. For example, my clients usually cannot accept “posterizations effects” and “banding artifacts” like the ones showed in the picture above, especially during full screen playback on big screens (eventually 4K TV sets).
Apart from the quality evaluation, the main problem of H.265 is the general availability of decoders today. For 4K streaming we can say that the majority of target devices (4K TV Set) are able to decode a main10 4K profile at least at 24-30Fps (but even 50-60Fps in most cases). Probably we will see soon HEVC also on iOS and Android because many SoC capable to decode HEVC are arriving on the market, but the situation is much problematic for the browsers. H.264 has started to spread the web only when it was supported by Flash Player in 2007 (and Adobe paid the license), now that Flash is out of the game the future of H.265 for the browser is much more uncertain. Google is pushing VP9 (free and already supported in Chrome) as the way to go for the browsers but I doubt that Firefox and IEx will support it and even if a next release of IE will support HEVC soon, an annoying fragmentation will continue to plague the video streaming over the Internet.
Fortunately the development of H.265 encoders is improving quite fastly. I’m planning to make the point on this topic every 6 months. Stay tuned.
HEVC is among us. On January 25, 2013, the ITU announced the completition of the first stage approval of the H.265 video codec standard and in the last 1 year several vendors/entities have started to work on the first implementations of H.265 encoders and decoders. Theoretically HEVC is said to be from 30 to 50% more efficient than H.264 (especially at higher resolutions) but is it really that simple ? is H.264 so close to retirement ? This is what we will try to find. First of all let’s start with a technical analysis of H.265 compared to AVC and then, in the next blog post, we will take a look at the current level of performance that is realistic to obtain in today’s H.265 encoders.
H.265/HEVC – Technical Overview
This part assumes you are sufficiently familiar with the coding techniques inplemented in H.264/AVC (if you need to refresh your memory I suggest those posts: H.264 Part I, Part II). HEVC re-uses many of the concept defined in H.264. Both are block based video encoding techniques so have the same roots and the same approach to encoding:
1. subdivision of picture in macroblocks, eventually sub-divided in blocks
2. reduction of spatial redundancy using intra-frame compression techniques
3. reduction of temporal redundancy using inter-frame compression techniques (motion estimation and compensation)
4. residual data compression using transformation & quantization
5. reduction of final redundancy in residuals and motion vectors transmission and signaling using entropy coding
HEVC can be seen as a strong evolution of AVC with some very important key features, a number of less important improvements and some simplifications.
Instead of 16×16 macroblocks like in AVC, HEVC divides pictures into “coding tree blocks” (CTBs). Depending by an encoding setting the size of the CTB can be of 64×64 or limited to 32×32 or 16×16. Several studies have shown that bigger CTBs provide higher efficiency (but also higher encoding time). Each CTB can be split recursively, in a quad-tree structure, in 32×32, 16×16 down to 8×8 sub-regions, called coding units (CUs). See the picture below for an example of partitioning of a 64×64 CTB (numbers report the scan order). Each picture is furtherly partitioned in special groups of CTBs called Slices and Tiles (see also Parallel processing)
CUs are the basic unit of prediction in HEVC. Usually smaller CUs are used around detailed areas (edges and so on), while bigger CUs are used to predict flat areas.
Each CU can be recursively splitted in Transform Units (TUs) with the same quad-tree approach used in CTBs. Differently from AVC that used mainly a 4×4 transform and occasionally an 8×8 transform, HEVC has several transform sizes: 32×32, 16×16, 8×8 and 4×4. From a matematical point of view, bigger TUs are able to encode better stationary signals while smaller TUs are better in encoding smaller “impulsive” signals. The transforms are based on DCT (Discrete Cosine Transform) but the transform used for intra 4×4 is based on DST instead (Discrete Sine Transform) because several tests have evidenced a small improvement in compression. Transformation is performed with higher accuracy compared to H.264. The adaptive nature of CBT, CU and TU partitions plus the higher accuracy plus the larger transform size are among the most important features of HEVC and the reason of the performance improvement compared to AVC. HEVC implements a sofisticated scan order and coefficient signaling scheme that improves signaling efficiency. Note that unlike H.264 there’s no Hadamard nor 2×2 chroma (min chroma transform size is 4×4). HEVC drops also the support for MBAFF or similar techniques to code interlaced video. Interlaced video can still be compressed but there’s no separation between fields and frames (only frames).
We have introduced the new transform sizes just after the picture partitioning to exploit the analogy between CU and TU trees, but before transform and quantization there’s the prediction phase (inter or intra).
A CU can be predicted using one of eight partition modes (see picture below).
Even if a CU contains one, two or four prediction units (PUs), it can be predicted using exclusively inter-frame or intra-frame prediction technique, furthermore Intra-coded CUs can use only the square partitions 2Nx2N or NxN. Inter-coded CUs can use both square and asymmetric partitions. A number of other limitations are applied to simplify signaling. For example no 4×4 prediction is allowed in inter-prediction and 4×8 and 8×4 are allowed only in forward prediction (so not in b-frames). Tendentially inter-prediction stops at 8×8 level.
HEVC has 35 different intra-prediction modes (9 in AVC). DC mode, Planar Mode and 33 directional modes. Like in AVC, intra prediction tries to recover information from surraunding blocks and works particularly well for flat areas. Intra prediction follows the TUs partition tree and so prediction modes are applied to 4×4, 8×8, 16×16 and 32×32 TUs.
For motion vector prediction HEVC has two reference lists: L0 and L1. They can hold 16 references each, but the maximum total number of unique pictures is 8. Multiple instance of the same ref frame can be stored with different weights. HEVC motion estimation is much more complex than in AVC. It uses list indexing. There are two main prediction modes: Merge and Advanced MV. Each PU can use one of those methods and can have forward (a MV) or bi-directional prediction (2 MV). In Advanced MV mode a list of candidates MV is created (spatial and temporal candidates picked with a complex, probabilistic logic), when the list is created only the best candidate index is transmitted in the bitstream plus the MV delta (difference between the real MV and the prediction). On the other side, the decoder will build and update continuously the same candidate list using the exact same rules used by the encoder and will pick-up the MV to use as estimator using the index sent by the encoder in the bitstream.
The merge mode is similar, the main difference is that the candidates list is calculated from neighboring MV and is not added to a delta MV. It is the equivalent of “skip” mode in AVC.
Similarly to AVC, HEVC specifies motion vectors in 1/4-pel, but uses an 8-tap filter for luma and a 4-tap 1/8-pel filter for chroma. This is considerably better than 6-tap used for luma and 2-tap (bilinear) for chroma used in AVC. An increased sub-pixel filtering accuracy improves efficiency of estimation and picture “stability” but requires much more memory accesses and so processing power (with higher battery consumption) this is why H.265 doesn’t include an inter-estimation on 4×4 regions, limits 4×8 and 8×4 estimation to be uni-directional (forward prediction) and limit to 8×8 for bi-directional. HEVC supports weighted prediction for both uni- and bi-directional PUs (always implicit weights).
HEVC uses up to 16bit per MV so at quarte-pel accuracy this means a −8192 to 8191.75 rang (for luma) compared to −2048 to 2047.75 horizontally and −512 to 511.75 vertically in AVC (increased motion compensation accuracy fo 4K 8K resolutions).
Unlike h264 where deblocking was performed on 4×4 blocks, in HEVC deblocking is performed on the 8×8 grid only. This allows for parallel processing of deblocking (there’s no filter overlapping). All vertical edges in the picture are deblocked first, followed by all horizontal edges. The filter is similar to AVC.
After deblocking there’s a second optional filter. This filter is called Sample Adaptive Offset, or SAO. Similarly to deblocking filter, it is applied in the prediction loop and the result stored in the reference frames list. The objective of the filter is to fix mispredictions, encoding drift and banding on wide areas subdividing the colors in “bands” and applying adaptive offset to them.
In HEVC threre’s only CABAC for entropy coding. CABAC in HEVC is almost identical to CABAC in AVC with minor changes and simplifications to allow a parallel decoding.
Since HEVC decoding is much more complex than AVC, several technique to allow a parallal decoding have been implemented. The most important are: Tiles and Wavefront.
The picture is divided into a rectangular grid of CTBs (Tiles). Motion vector prediction and intra-prediction is not performed across tile boundaries.
With Wavefront Each CTB row can be encoded & decoded by its own thread. Multiple rows encoding / decoding are sincronized (entropy coding state) guaranting that each “wavefront” CTB is surrounded by specific CTB during encoding and decoding (see picture).
The adaptive subdivision of picture in prediction areas, the use of advanced intra-prediction, inter-prediction and bigger transform sizes can absolutely guarantee, in the long term, a considerably higher efficiency of HEVC compared to AVC. But the complexity of the encoding is really much higher. For example, consider that in AVC a macroblock of 16×16 could have only 2 possible sub-partitions: 16 4×4 sub-blocks, or 4 8×8 sub-blocks. Now the number of possible sub-splitting of a 64×64 CTU is exceptionally higher (65536). In AVC was simple to test what of the two configurations was better for compression, but now ? New techniques must be implemented to efficiently explore the quad-tree and avoid to test every configuration out of the possible 65536.
Like AVC before, HEVC is a big optimization challenge, but the potentialities are enormous. In the next blog post we will take a look at the state of the art in H.265 encoding
in mid-2014 at the beginning of 2015.