Adaptive Bitrate Video and its Inherent Latency
Adaptive Bitrate (ABR) streaming and HTTP content delivery have been responsible for democratizing streaming media and bootstrapping an array of premium OTT services that deliver a rich quality of experience over broadband and 4G/LTE/5G networks.
For the consumer, this manifests as a seamless viewing experience: minimal buffering, resilience to jitter and packet loss, and transparent firewall traversal. The ABR protocol handles all of this complexity based on device and network heuristics, so the viewer does not have to manage their experience; the media player simply retrieves the video quality that matches current network conditions.
ABR video is not one long stream of data, but numerous small fragments of video delivered as individual files. Each file is buffered and reassembled by the player to render a seamless stream. The user experience, particularly on high-latency or lossy networks, is further improved by encoding the video in longer fragments, sometimes up to 10 seconds in length, and buffering 3 to 4 fragments in the player.
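As a rough illustration of where the resulting delay comes from, consider a back-of-the-envelope estimate in Python (the constants below are illustrative assumptions, not measurements):

    # Back-of-the-envelope ABR latency estimate (illustrative assumptions).
    FRAGMENT_DURATION = 10.0   # seconds of video per fragment
    PLAYER_BUFFER = 3          # fragments buffered before playback starts
    CDN_AND_NETWORK = 2.0      # assumed CDN + network overhead, in seconds

    # Buffering three 10-second fragments alone puts the viewer ~30 s behind
    # live, before encoding and delivery overhead are added.
    latency = PLAYER_BUFFER * FRAGMENT_DURATION + CDN_AND_NETWORK
    print(f"~{latency:.0f} s behind live")   # ~32 s with these assumptions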
Traditional ABR video formats such as HLS and DASH prioritize reliability at the expense of latency, which makes them less suitable for streaming live or latency-sensitive content over public networks: there can be tens of seconds between when the action on the field occurred and when it is shown on the screen. A player might still be shooting the ball on an OTT service while the team is already celebrating the goal on broadcast and IPTV. This is an issue for time-sensitive events like live sports, where a viewer’s social media feed may run significantly ahead of their streaming service, and a massive one for platforms offering live sports betting.
Figure 1: Stream Latency per Service
The underlying issue is that content fragments are created in real time, so an encoder requires 10 seconds plus overhead to create a 10-second fragment. The fragment consists of metadata contained in a “moof” box and video samples in an “mdat” box. A player typically requires a buffer of at least 3 fragments before it begins rendering content to the viewer. After adding CDN latency, network overhead, and packet loss, the video stream may be 30 to 40 seconds behind live. In many real-life deployments, the measured delay has been as much as 90 seconds.
Figure 2: MPEG BMFF Fragmented File Format
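The fragment layout in Figure 2 can also be inspected programmatically. The following minimal Python sketch walks the top-level boxes of a fragmented MP4 file, relying only on standard ISO BMFF framing (a 4-byte big-endian size followed by a 4-byte type); the file name is hypothetical:

    import struct

    def walk_boxes(data: bytes):
        """Yield (type, payload) for each top-level ISO BMFF box."""
        offset = 0
        while offset + 8 <= len(data):
            size, box_type = struct.unpack_from(">I4s", data, offset)
            if size < 8:   # size values 0 and 1 (to-end / 64-bit) omitted for brevity
                break
            yield box_type.decode("ascii"), data[offset + 8 : offset + size]
            offset += size

    # In a fragmented file, each fragment is a 'moof' box (metadata) followed
    # by an 'mdat' box (media samples).
    with open("seg204.cmfv", "rb") as f:   # hypothetical fragment file
        for box_type, payload in walk_boxes(f.read()):
            print(box_type, len(payload))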
Attempts have been made to address some of these issues using alternative protocols such as WebRTC, which provides minimal latency of less than a second. The challenge is that WebRTC uses an unreliable, best-effort transport based on UDP. Those characteristics suit applications such as video conferencing, where it has been widely adopted, but they become detrimental when delivering premium live content services.
Chunked Transfer Encoding
In 2015, Apple and Microsoft began an unlikely collaboration that attempted to address several inefficiencies inherent in ABR delivery while ensuring native support for a low latency mode providing functionality similar to WebRTC.
The outcome was a new ABR packaging construct called the Common Media Application Format (CMAF). Several other large and influential companies, including Google and Akamai, quickly joined the effort, and in February 2016 the group prepared a joint submission to MPEG, which was accepted onto a standardization track.
Figure 3: Low Latency CMAF Fragmentation
The low latency approach involves creating much shorter units of ABR video. The implementation is based on a concept defined in the HTTP/1.1 specification known as “chunked transfer encoding”, which allows video fragments to be split into a series of smaller chunks. Disaggregating fragments in this manner allows the encoding and packaging process to be highly optimized: the encoder can output portions of a fragment as soon as they are encoded, without having to wait for the full fragment. The additional moof and mdat boxes do not add significant overhead to the overall encoding time or the bandwidth required for delivery.
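The chunk framing itself is minimal: each chunk is prefixed with its size in hexadecimal followed by CRLF, and a zero-length chunk terminates the body. A small sketch (the payloads are placeholders):

    def frame_chunk(payload: bytes) -> bytes:
        # HTTP/1.1 chunked framing: hex size, CRLF, payload, CRLF.
        return f"{len(payload):X}\r\n".encode("ascii") + payload + b"\r\n"

    # A body of two chunks followed by the zero-length terminating chunk.
    body = (frame_chunk(b"first moof+mdat")
            + frame_chunk(b"second moof+mdat")
            + b"0\r\n\r\n")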
This has a massive impact on overall encoding latency. Assuming a 10-second fragment encoded at 29.97 FPS is created with 4 video samples per chunk, the first video samples can be pushed to the delivery network after only 0.133 seconds of encoding time, roughly 9.87 seconds sooner than a stream without chunked encoding. As the chunks are created, the encoder POSTs them to the CDN origin server using chunked transfer encoding. Since the encoder does not know the final size of the object being sent, it opens a single POST for the initial chunk, omitting the Content-Length header and instead setting Transfer-Encoding: chunked. Subsequent chunks are sent down the open connection to the server as they are created, and the POST is terminated when the encoder sends a zero-length chunk. This operation is illustrated in Figure 5 below.
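A minimal sketch of the encoder-side push using Python's requests library (the origin URL and chunk contents are hypothetical; passing a generator as the body makes requests stream it with Transfer-Encoding: chunked):

    import requests

    # Stand-in for a live encoder: in practice each item would be a freshly
    # encoded CMAF chunk (a moof box plus its mdat box).
    def encoded_chunks():
        for chunk in (b"moof+mdat #1", b"moof+mdat #2", b"moof+mdat #3"):
            yield chunk

    # No Content-Length is sent; the library frames each yielded item as an
    # HTTP/1.1 chunk and terminates the body with a zero-length chunk.
    requests.post("https://origin.example.com/live/seg204.cmfv",
                  data=encoded_chunks())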
Chunked transfer can be further enhanced by optimizing and fine-tuning player functionality. The illustration below shows a player joining a standard ABR stream at fragment 204 on the timeline. Since this fragment has not completed encoding, achieving the best possible latency would mean instantiating the stream at fragment 203, introducing an additional 3 seconds of latency (assuming a 2-second fragment duration). In reality, players may require as much as a 3-fragment buffer; in that case the player may start at fragment 201, introducing latencies as high as 7 seconds within the device itself.
Figure 4: Live Streaming Timeline (ABR). Latency is ~7 seconds
In a chunked transfer stream where fragments are further disaggregated into 500 ms chunks, the player can start playing at chunk A of fragment 204 even though fragment 204 is still being created, reducing latency to 1 second. This can be further optimized to 500 ms if the player decodes through chunk A to retrieve the I-frame information and begins rendering from chunk B on the device.
Figure 5: Chunked Transfer Encoding Latency Improvements. Latency is 500 ms to 1 second
Finally, by using a technique called “Deferred Playback”, a player can wait 1 second and request chunk A of fragment 204 as soon as it is created. While this method introduces a small hit to stream start-up, it results in roughly 500 ms of stream latency. The three strategies are compared in the sketch after Figure 6 below.
Figure 6: Deferred Playback. Latency is 500 ms
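To make the comparison concrete, here is a small illustrative calculation in Python (the fragment and chunk durations are the ones assumed in the examples above; the results are approximations, not measurements):

    FRAGMENT = 2.0   # seconds per fragment (assumed above)
    CHUNK = 0.5      # seconds per chunk (assumed above)

    # Standard ABR: buffer back to fragment 201, plus ~1 s already encoded
    # of the in-progress fragment 204.
    standard = 3 * FRAGMENT + 1.0          # ~7 s

    # Chunked transfer: start at chunk A of the in-progress fragment 204.
    chunked = 2 * CHUNK                    # ~1 s

    # Chunked, optimized: decode through chunk A, render from chunk B.
    optimized = CHUNK                      # ~0.5 s

    # Deferred playback: wait for the next fragment and play its first chunk.
    deferred = CHUNK                       # ~0.5 s

    print(standard, chunked, optimized, deferred)   # 7.0 1.0 0.5 0.5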
In addition to participating in the CMAF initiative, Apple has also announced a low latency version of its own streaming protocol, HTTP Live Streaming (HLS).
Low Latency HLS (LL-HLS) also leverages HTTP/1.1 chunked transfer encoding, allowing fragments to be downloaded as they are being created. The approach is very similar to the chunked CMAF approach described above; the primary difference is that HLS has historically leveraged MPEG transport streams (MPEG-2 TS), which natively provide small, 188-byte packets.
HTTP/2 Server Push is also used, allowing fragments to be sent to a compliant client before the client requests them. Server Push can reduce latency by loading resources preemptively, even before the client knows they will be needed.
LL-HLS also predicts and pre-announces fragments in the manifest file before they are available. A player can therefore anticipate which fragments need to be loaded and seamlessly request each fragment right after the previous one has downloaded. Pre-announcing fragments in this manner also allows players that are not LL-HLS compliant to play the stream as if it were a normal HLS stream and still gain some latency improvement.
Figure 7: LL-HLS Content Manifest
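For illustration (the URIs and durations below are hypothetical, not taken from the figure, though the tags are the ones defined for LL-HLS), a low-latency media playlist announces completed fragments alongside their partial segments and pre-announces the next part with a preload hint:

    #EXTM3U
    #EXT-X-TARGETDURATION:4
    #EXT-X-SERVER-CONTROL:CAN-BLOCK-RELOAD=YES,PART-HOLD-BACK=1.0
    #EXT-X-PART-INF:PART-TARGET=0.5
    #EXTINF:4.0,
    fragment203.mp4
    #EXT-X-PART:DURATION=0.5,URI="fragment204.part0.mp4",INDEPENDENT=YES
    #EXT-X-PART:DURATION=0.5,URI="fragment204.part1.mp4"
    #EXT-X-PRELOAD-HINT:TYPE=PART,URI="fragment204.part2.mp4"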
Last Mile Network Considerations
While CMAF, LL-HLS and WebRTC can all improve overall glass-to-glass latency, it should be remembered that these protocols are not deployed in a vacuum: latency and packet loss in the last-mile network also have a huge impact.
Buffers in the last-mile network layer manage variability in available bandwidth and prevent heavy packet loss when available bandwidth drops abruptly. Mobile networks in particular often rely on deep packet buffers. TCP tends to fill these buffers, which inflates end-to-end latency and ultimately forces decreases in sending rate. The inflated latency also makes TCP slow to respond to network changes, since its sending rate is governed by end-to-end feedback, and it compounds poor QoE for latency-sensitive video services.
Buffers are also used in player applications as a mechanism to queue fragments and absorb the effects of packet loss and jitter. If packets arrive at irregular intervals, an influx of data can quickly fill the player buffer, resulting in further packet loss within the client and adding to congestion and latency. If the buffer is instead starved of video frames due to congestion, the player stalls or freezes until new packets arrive and the buffer is replenished with new frames.
This is a distinct concern with low latency streaming formats such as CMAF and LL-HLS, which typically hold fewer and shorter fragments in the player buffer than traditional ABR protocols. By reducing latency and playing back buffered content almost immediately, they run a higher risk of rebuffering and unstable playback.
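To see why a shallow buffer is fragile, consider a toy Python model (an illustration, not from the source) in which the player drains its buffer in real time while chunks arrive at variable intervals; playback stalls whenever the buffer runs dry:

    def simulate(arrivals, chunk=0.5, target=1.0):
        """arrivals: inter-arrival times in seconds of successive chunks;
        returns total stall time."""
        buffer = target          # playback starts with `target` seconds buffered
        stalls = 0.0
        for gap in arrivals:
            buffer -= gap        # playback drains the buffer in real time
            if buffer < 0:       # the buffer ran dry: the player stalled
                stalls += -buffer
                buffer = 0.0
            buffer += chunk      # the newly arrived chunk is appended
        return stalls

    # Smooth delivery (a 0.5 s chunk every 0.5 s) never stalls; a single
    # 2-second network hiccup with only 1 s buffered stalls playback for ~1 s.
    print(simulate([0.5] * 8))             # 0.0
    print(simulate([0.5, 2.0, 0.5, 0.5]))  # 1.0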
Network loss and congestion must therefore be considered when improving overall end-to-end latency. To optimize the transition to low latency streaming, a more modern congestion control mechanism should be considered: one that reduces delay with minimal adverse impact on QoE, particularly on quality stability and rebuffering. A good example is Performance-oriented Congestion Control (PCC). PCC considers multiple parameters, including packet loss and latency, and leverages observed empirical network data to improve throughput while limiting packet loss and delay.
Unlike TCP, PCC makes no assumptions about the network and does not attempt to infer which network parameters are the cause of bottlenecks or packet loss. Instead, PCC repeatedly executes the following procedure: send data to a client at a given rate x for a short period of time. As clients respond with ACKs (or fail to), the algorithm aggregates the observations into a utility value, a performance score associated with sending at rate x that incorporates the send rate together with any associated delay or packet loss. PCC then has real-world data that allows it to reset its sending rate, and observations continue throughout the video stream to determine the optimal rate for a particular user session.
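A highly simplified sketch of this loop in Python follows. The utility weights, the measurement model, and the rate-update rule are illustrative assumptions, not the published PCC algorithm; they only show the shape of the technique: probe candidate rates, score each with a utility, and move toward the better one.

    # Toy PCC-style rate control loop (illustrative, not the published PCC).
    def utility(goodput, loss_rate, rtt_inflation):
        # Reward throughput; penalize loss and queueing delay (assumed weights).
        return goodput - 10.0 * goodput * loss_rate - 5.0 * rtt_inflation

    def pcc_step(rate, measure, epsilon=0.05, step=0.1):
        """Probe rate*(1+eps) and rate*(1-eps), then move toward the
        higher-utility direction. measure(rate) returns the observed
        (goodput, loss_rate, rtt_inflation) for one short monitor interval."""
        u_up = utility(*measure(rate * (1 + epsilon)))
        u_down = utility(*measure(rate * (1 - epsilon)))
        return rate * (1 + step) if u_up > u_down else rate * (1 - step)

    # Hypothetical last-mile link: ~8 Mbps capacity; exceeding it causes
    # loss and queueing delay.
    def measure(rate, capacity=8.0):
        loss_rate = max(0.0, (rate - capacity) / rate)
        rtt_inflation = max(0.0, rate / capacity - 1.0)
        return min(rate, capacity), loss_rate, rtt_inflation

    rate = 2.0  # Mbps starting rate
    for _ in range(30):
        rate = pcc_step(rate, measure)
    print(f"sending rate settled near {rate:.1f} Mbps")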
Summary
CMAF and LL-HLS using chunked transfer encoding are both well suited to reaching large audiences with acceptable glass-to-glass latencies in the 3 to 4 second range. This provides an experience very similar to legacy cable or IPTV. Both can also be tuned to achieve latencies of 1 to 2 seconds, but it is extremely difficult to get below that without impacting scale and viewer reach.
CMAF is more mature and is broadly available in most encoder, CDN and player applications. LL-HLS is a newer format: it was announced at the Apple Worldwide Developers Conference in 2019, which means that today fewer devices are LL-HLS compliant. CDNs must also fully support HTTP/2 Server Push to take full advantage of the specification, and many do not. However, now that LL-HLS is part of the HLS specification that picture is changing, and support is already available in macOS, tvOS 14, iOS 14, and watchOS 7.
The primary advantage of CMAF and LL-HLS is that they scale to Olympics- and Super Bowl-sized events while delivering low latency, making them extremely well suited to applications targeting large audiences where cost is a consideration.
Both protocols will struggle if the network is not optimized. Deploying the latest streaming protocol as a point solution could have limited impact on the overall user experience; in fact, deploying new technology in a vacuum could even be detrimental. It is easy, particularly for content owners and application vendors, to forget the last-mile network that sits between the video edge and the client device. To deliver the greatest improvement to the end-user experience, the streaming ecosystem must be viewed holistically, including the use of a modern congestion control architecture such as PCC that can dramatically reduce overall network latency and round-trip time.
In summary, most commercial streaming services will continue to rely predominantly on HTTP- and ABR-based protocols for the foreseeable future. These deployments scale elegantly, delivery cost per byte is very reasonable, and common latency targets that position a service against legacy cable and IPTV can readily be achieved with the low latency variants.
About Compira Labs
Compira Labs is a pioneer in ML-powered content delivery. Its software-only solution, which includes PCC congestion control, can easily upgrade any CDN to deliver best-in-class QoE even in the most challenging last-mile networks. The solution can dramatically improve user experience for latency-sensitive and bandwidth-hungry Internet services such as live and on-demand video streaming and game downloads. It is available for both HTTP/TCP and HTTP/3/QUIC delivery.