VP9 Spatial Scalability (SVC) in libvpx
This article explains how the libvpx library implements and manages spatial scalability, a form of Scalable Video Coding (SVC), for the VP9 video codec. It covers the core mechanics of layer configuration, inter-layer prediction, reference frame management, and the API controls required to configure scalable VP9 video streams for adaptive streaming environments.
Core Mechanics of VP9 SVC in libvpx
Spatial scalability in VP9 SVC allows an encoder to output a single bitstream containing multiple resolution layers (e.g., 180p, 360p, and 720p). The libvpx library handles this by encoding a base layer at the lowest resolution and one or more enhancement layers at higher resolutions.
Instead of encoding each resolution independently, which is known as simulcast, libvpx uses inter-layer prediction. This means higher-resolution enhancement layers can use the reconstructed frames of lower-resolution layers as reference frames. Because content shared between layers is not coded again from scratch, the combined stream typically needs less bandwidth than the equivalent set of simulcast streams.
Layer Configuration and Bitrate Allocation
To set up spatial scalability, libvpx requires the application to define the number of spatial layers and allocate a bitrate to each. This is managed using the vpx_codec_enc_cfg_t configuration structure, together with SVC-specific encoder controls.
- Layer Count: The parameter ss_number_layers defines the total number of spatial layers (up to 5 in VP9).
- Scaling Factors: The resolution of each spatial layer is defined as a ratio of the input image size. Typically, these are set using dyadic scaling (where each layer is 1/2 the width and height of the next), configured via layer-specific numerator/denominator pairs (the scaling_factor_num and scaling_factor_den arrays of vpx_svc_extra_cfg_t, passed with the VP9E_SET_SVC_PARAMETERS control). The ss_enable_auto_alt_ref field is a separate setting: it enables automatic alternate reference frames per spatial layer and does not affect scaling.
- Target Bitrates: The total target bandwidth is divided among the active layers through the layer_target_bitrate array, so that each resolution layer receives an allocation appropriate for its complexity.
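The layer settings described above can be sketched without linking against libvpx. The following is a minimal illustration, not libvpx code: the LayerPlan struct and both function names are hypothetical, and splitting the bitrate in proportion to pixel count is only an assumed heuristic (real deployments usually tune the split per layer). It derives dyadic numerator/denominator pairs and layer sizes from the input resolution, then divides a total bitrate among the layers.

```c
#define MAX_SPATIAL_LAYERS 5 /* the 5-layer limit quoted above */

/* Hypothetical container for one spatial layer's settings (not a libvpx type). */
typedef struct {
  int scale_num;    /* scaling factor numerator */
  int scale_den;    /* scaling factor denominator */
  int width;        /* derived layer width in pixels */
  int height;       /* derived layer height in pixels */
  int bitrate_kbps; /* target bitrate assigned to this layer */
} LayerPlan;

/* Fill plan[0..num_layers-1] with dyadic scaling: the top layer is 1/1,
 * the one below 1/2, then 1/4, and so on. Returns 0 on success, -1 on
 * invalid input. */
int plan_dyadic_layers(int input_w, int input_h, int num_layers,
                       LayerPlan *plan) {
  if (num_layers < 1 || num_layers > MAX_SPATIAL_LAYERS) return -1;
  if (input_w <= 0 || input_h <= 0) return -1;
  for (int sl = 0; sl < num_layers; ++sl) {
    int den = 1 << (num_layers - 1 - sl);
    plan[sl].scale_num = 1;
    plan[sl].scale_den = den;
    plan[sl].width = input_w / den;
    plan[sl].height = input_h / den;
    plan[sl].bitrate_kbps = 0;
  }
  return 0;
}

/* Assumed heuristic: split total_kbps in proportion to each layer's pixel
 * count. The top layer takes the remainder so the shares sum exactly. */
void split_bitrate_by_area(int total_kbps, int num_layers, LayerPlan *plan) {
  long long total_area = 0;
  int assigned = 0;
  for (int sl = 0; sl < num_layers; ++sl)
    total_area += (long long)plan[sl].width * plan[sl].height;
  for (int sl = 0; sl < num_layers - 1; ++sl) {
    long long area = (long long)plan[sl].width * plan[sl].height;
    plan[sl].bitrate_kbps = (int)(total_kbps * area / total_area);
    assigned += plan[sl].bitrate_kbps;
  }
  plan[num_layers - 1].bitrate_kbps = total_kbps - assigned;
}
```

For a 1280x720 input with three layers this yields 320x180, 640x360, and 1280x720. In a real encoder, values like these would be copied into the scaling-factor and per-layer bitrate fields of the libvpx configuration rather than kept in a private struct.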
Reference Frame Management and Prediction
The efficiency of VP9 SVC in libvpx relies on how reference frames are managed across spatial and temporal dimensions. Libvpx maintains a buffer of reference frames that layers can share.
When encoding a frame in an enhancement layer, libvpx can refer to:
1. Temporal References: Previous frames from the same spatial layer (e.g., LAST, GOLDEN, or ALTREF frames).
2. Spatial References: The decoded frame from the immediate lower spatial layer at the exact same point in time.
A lower-layer frame has a smaller resolution than the enhancement layer, so it cannot be used for prediction directly. Before using it as a reference, libvpx scales the lower-resolution reference frame up to match the resolution of the current enhancement layer.
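The arithmetic behind reference scaling can be illustrated with a small sketch. It is modelled on the 14-bit fixed-point scale factors and the size limits VP9 places on scaled references (a reference may be at most 2x larger and at most 16x smaller than the current frame in each dimension). The function names are hypothetical and this is not the libvpx implementation; it omits the sub-pixel interpolation filters that actually produce the predicted samples.

```c
#define REF_SCALE_SHIFT 14 /* 1 << 14 represents a scale factor of 1.0 */

/* Ratio ref_size / cur_size as a 14-bit fixed-point number. A half-size
 * reference gives 8192 (0.5). cur_size must be positive. */
int fixed_point_scale(int ref_size, int cur_size) {
  return (int)(((long long)ref_size << REF_SCALE_SHIFT) / cur_size);
}

/* Check the VP9 limits on reference scaling: the reference may be at most
 * twice as large and at most sixteen times smaller than the current frame. */
int is_valid_ref_size(int ref_w, int ref_h, int cur_w, int cur_h) {
  return 2 * cur_w >= ref_w && 2 * cur_h >= ref_h &&
         cur_w <= 16 * ref_w && cur_h <= 16 * ref_h;
}

/* Map an integer pixel coordinate in the current frame to the matching
 * coordinate in the scaled reference (fraction truncated). */
int scaled_position(int cur_pos, int scale_fp) {
  return (int)(((long long)cur_pos * scale_fp) >> REF_SCALE_SHIFT);
}
```

With a 640x360 reference and a 1280x720 enhancement layer, the scale factor is 8192 in both dimensions, so column 100 of the current frame maps to column 50 of the reference.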
Flexible vs. Non-Flexible Encoding Modes
Libvpx offers two modes for handling the prediction structure in VP9 SVC:
Non-Flexible Mode
In this mode, the programmer defines a fixed, repeating pattern of spatial and temporal layers (often referred to as a GOP structure) at the start of the session. Libvpx automatically handles all reference frame assignments, state transitions, and buffer updates based on this pre-determined template. This mode is easier to implement but offers less adaptability to changing network conditions.
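As an illustration of what a fixed, repeating template means, the sketch below assigns temporal layer IDs with a four-frame 0-2-1-2 pattern, a common arrangement for three temporal layers. The table and function names are hypothetical; in libvpx the pattern is chosen through configuration rather than written by hand like this, and the rule that every spatial layer above the base uses inter-layer prediction is an assumed simplification.

```c
#define PATTERN_LENGTH 4

/* Hypothetical template: temporal layer ID for each position in the
 * repeating four-frame cycle (three temporal layers: 0, 1, 2). */
static const int kTemporalPattern[PATTERN_LENGTH] = { 0, 2, 1, 2 };

/* Temporal layer ID for a given frame index, or -1 for a negative index. */
int temporal_layer_for_frame(int frame_index) {
  if (frame_index < 0) return -1;
  return kTemporalPattern[frame_index % PATTERN_LENGTH];
}

/* Assumed simplification: every spatial layer above the base layer
 * predicts from the layer directly below it. */
int uses_inter_layer_prediction(int spatial_id) {
  return spatial_id > 0;
}

/* A receiver that only wants temporal layers up to max_tid can drop every
 * frame whose temporal layer ID is higher. Returns 1 if the frame is kept. */
int frame_is_kept(int frame_index, int max_tid) {
  int tid = temporal_layer_for_frame(frame_index);
  return tid >= 0 && tid <= max_tid;
}
```

Because the pattern never changes during the session, both the encoder and any forwarding server can derive a frame's layer from its index alone, which is what makes this mode simple to operate.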
Flexible Mode
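Flexible mode gives the application frame-by-frame control over reference frames. SVC encoding itself is enabled with the VP9E_SET_SVC control; flexible operation is then selected by setting the temporal layering mode to VP9E_TEMPORAL_LAYERING_MODE_BYPASS. For each frame, the application must use the vpx_svc_ref_frame_config_t structure (passed with VP9E_SET_SVC_REF_FRAME_CONFIG), together with the layer IDs (passed with VP9E_SET_SVC_LAYER_ID), to explicitly declare:
* Which reference buffers the current frame will read from.
* Which reference buffers the current frame will update after being encoded.
* The spatial and temporal layer IDs for the current frame.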
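The bookkeeping that flexible mode hands to the application can be sketched as a small simulation. The types and functions below are hypothetical stand-ins, not the libvpx structures: they model a pool of 8 reference slots (the number VP9 provides), a per-frame declaration of which slots are read and which are overwritten, and a check that a frame never reads an empty slot or uses more than three references.

```c
#define NUM_REF_SLOTS 8      /* VP9 provides 8 reference buffer slots */
#define MAX_REFS_PER_FRAME 3 /* LAST, GOLDEN, ALTREF */

/* Hypothetical per-frame declaration (not vpx_svc_ref_frame_config_t). */
typedef struct {
  int spatial_id;
  int temporal_id;
  int read_slot_mask;   /* bit i set: frame may predict from slot i */
  int update_slot_mask; /* bit i set: slot i is overwritten afterwards */
} FrameRefConfig;

/* Which frame currently occupies each slot; -1 means empty. */
typedef struct {
  int frame_id[NUM_REF_SLOTS];
} RefBufferPool;

void pool_init(RefBufferPool *pool) {
  for (int i = 0; i < NUM_REF_SLOTS; ++i) pool->frame_id[i] = -1;
}

/* Validate and apply one frame's configuration. Returns 0 on success, or
 * -1 (leaving the pool unchanged) if the frame reads an empty slot or
 * declares more than MAX_REFS_PER_FRAME references. */
int apply_frame(RefBufferPool *pool, const FrameRefConfig *cfg, int frame_id) {
  int refs_used = 0;
  for (int i = 0; i < NUM_REF_SLOTS; ++i) {
    if (cfg->read_slot_mask & (1 << i)) {
      if (pool->frame_id[i] < 0) return -1;
      ++refs_used;
    }
  }
  if (refs_used > MAX_REFS_PER_FRAME) return -1;
  for (int i = 0; i < NUM_REF_SLOTS; ++i) {
    if (cfg->update_slot_mask & (1 << i)) pool->frame_id[i] = frame_id;
  }
  return 0;
}
```

A typical sequence would be a key frame on the base layer that reads nothing and updates slot 0, followed by the next spatial layer reading slot 0 (inter-layer prediction) and updating slot 1. Frames with an empty update mask are never referenced, so they can be dropped without affecting later frames.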
This mode is widely used in real-time communication tools (like WebRTC) to dynamically drop layers or alter prediction paths on the fly when packet loss or network congestion occurs.