VP9 Spatial Scalability SVC in libvpx

This article explains how the libvpx library implements spatial scalability, a form of scalable video coding (SVC), for the VP9 video codec. It covers layer configuration, inter-layer prediction, reference frame management, and the API controls used to configure scalable VP9 streams for adaptive streaming.

Core Mechanics of VP9 SVC in libvpx

Spatial scalability in VP9 SVC allows an encoder to output a single bitstream containing multiple resolution layers (e.g., 180p, 360p, and 720p). The libvpx library handles this by encoding a base layer at the lowest resolution and one or more enhancement layers at higher resolutions.

Instead of encoding each resolution independently (an approach known as simulcast), libvpx uses inter-layer prediction: higher-resolution enhancement layers can use the reconstructed frames of lower-resolution layers as reference frames. Because the layers no longer repeat the same picture content from scratch, the combined bitstream typically needs less bandwidth than the equivalent set of simulcast streams.

Layer Configuration and Bitrate Allocation

To set up spatial scalability, libvpx requires the application to define the number of spatial layers and allocate bitrates for each. This is managed using the vpx_codec_enc_cfg_t configuration structure.

Layer Count: The parameter ss_number_layers defines the total number of spatial layers (up to 5 in VP9).
Scaling Factors: The resolution of each spatial layer is defined as a ratio of the input image size. Typically, these are set using dyadic scaling (where each layer is 1/2 the width and height of the next), configured through the per-layer scaling_factor_num and scaling_factor_den arrays of vpx_svc_extra_cfg_t, which are passed to the encoder with the VP9E_SET_SVC_PARAMETERS control.
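Target Bitrates: The application divides the total target bandwidth among the active layers using the layer_target_bitrate array, which holds one value (in kilobits per second) per spatial/temporal layer combination. The encoder's rate control then steers each layer toward its own target, so higher resolutions can be given proportionally more bits.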
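
The layer arithmetic described above can be sketched in a few lines of C. This is an illustration only: svc_layout, set_dyadic_scaling, layer_dimension, and layer_index are hypothetical names local to this example, not libvpx types or functions. The sketch assumes dyadic ratios and a flat bitrate array ordered by spatial layer first and temporal layer second; check both assumptions against the libvpx headers before relying on them.

```c
#include <assert.h>

#define MAX_SPATIAL_LAYERS 5 /* the limit the article gives for VP9 */

/* Local stand-in for per-layer settings; NOT a libvpx structure. */
typedef struct {
  int num_spatial_layers;
  int num_temporal_layers;
  int scale_num[MAX_SPATIAL_LAYERS]; /* numerator of the size ratio */
  int scale_den[MAX_SPATIAL_LAYERS]; /* denominator of the size ratio */
} svc_layout;

/* Dyadic scaling: the top layer is 1/1 and every layer below it
   halves both width and height. */
static void set_dyadic_scaling(svc_layout *s) {
  assert(s->num_spatial_layers >= 1 &&
         s->num_spatial_layers <= MAX_SPATIAL_LAYERS);
  for (int sl = 0; sl < s->num_spatial_layers; ++sl) {
    s->scale_num[sl] = 1;
    s->scale_den[sl] = 1 << (s->num_spatial_layers - 1 - sl);
  }
}

/* Width or height of one layer, rounded down. A real encoder may
   round differently, for example to even dimensions. */
static int layer_dimension(const svc_layout *s, int sl, int full_size) {
  assert(sl >= 0 && sl < s->num_spatial_layers);
  return full_size * s->scale_num[sl] / s->scale_den[sl];
}

/* Position of layer (sl, tl) in a flat per-layer bitrate array,
   assuming spatial-major ordering. */
static int layer_index(const svc_layout *s, int sl, int tl) {
  assert(sl >= 0 && sl < s->num_spatial_layers);
  assert(tl >= 0 && tl < s->num_temporal_layers);
  return sl * s->num_temporal_layers + tl;
}
```

With three spatial layers and a 1280x720 input, this yields 320x180, 640x360, and 1280x720, matching the 180p/360p/720p example used earlier in the article.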

Reference Frame Management and Prediction

The efficiency of VP9 SVC in libvpx relies on how reference frames are managed across spatial and temporal dimensions. Libvpx maintains a buffer of reference frames that layers can share.

When encoding a frame in an enhancement layer, libvpx can refer to:

1. Temporal References: Previous frames from the same spatial layer (e.g., LAST, GOLDEN, or ALTREF frames).
2. Spatial References: The decoded frame from the immediate lower spatial layer at the exact same point in time.

Before using a lower-layer frame as a reference, libvpx scales the lower-resolution reconstruction up to the resolution of the current enhancement layer, so that predicted blocks line up with the frame being encoded.
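
The size relationship between a spatial reference and the current layer can be sketched as follows. The code is modelled on the fixed-point ratio that VP9 uses for scaled references; the 14-bit precision and the 2x/16x size limits are assumptions to verify against the VP9 specification, and the function names are local to this example rather than libvpx API.

```c
#include <assert.h>

#define REF_SCALE_SHIFT 14                  /* fractional bits (assumed) */
#define REF_NO_SCALE (1 << REF_SCALE_SHIFT) /* 1.0: same-size reference */

/* Ratio of reference size to current size in fixed point. A value
   below REF_NO_SCALE means the reference is smaller than the current
   frame and must be scaled up. */
static int fixed_point_scale(int ref_size, int cur_size) {
  assert(ref_size > 0 && cur_size > 0);
  return (ref_size << REF_SCALE_SHIFT) / cur_size;
}

/* A reference is usable only if, in each dimension, it is at most
   2x larger and at most 16x smaller than the current frame. */
static int ref_size_is_valid(int ref_w, int ref_h, int cur_w, int cur_h) {
  return 2 * cur_w >= ref_w && 2 * cur_h >= ref_h &&
         cur_w <= 16 * ref_w && cur_h <= 16 * ref_h;
}
```

For a 640x360 layer used as the spatial reference of a 1280x720 layer, fixed_point_scale(640, 1280) gives half of REF_NO_SCALE, meaning each step of one pixel in the current frame advances half a pixel in the reference.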

Flexible vs. Non-Flexible Encoding Modes

Libvpx offers two modes for handling the prediction structure in VP9 SVC:

Non-Flexible Mode

In this mode, the programmer defines a fixed, repeating pattern of spatial and temporal layers (often referred to as a GOP structure) at the start of the session. Libvpx automatically handles all reference frame assignments, state transitions, and buffer updates based on this pre-determined template. This mode is easier to implement but offers less adaptability to changing network conditions.
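
To make the idea of a fixed, repeating template concrete, here is a minimal sketch of a three-temporal-layer pattern with a period of four frames (0, 2, 1, 2). The pattern is a common choice, and libvpx names one of its temporal layering modes after it, but the functions below are hypothetical helpers for illustration, not how libvpx stores or applies the pattern internally.

```c
#include <assert.h>

/* Assumed template: temporal layer IDs repeat as 0, 2, 1, 2. */
static const int kTemporalPattern[4] = { 0, 2, 1, 2 };

/* Temporal layer of a frame, derived only from its position. */
static int temporal_layer_for_frame(unsigned frame_index) {
  return kTemporalPattern[frame_index % 4];
}

/* A receiver that wants layers 0..max_temporal_layer keeps a frame
   only if the frame's layer is within that range. */
static int frame_is_kept(unsigned frame_index, int max_temporal_layer) {
  assert(max_temporal_layer >= 0 && max_temporal_layer <= 2);
  return temporal_layer_for_frame(frame_index) <= max_temporal_layer;
}
```

Because the layer of every frame follows from its index alone, neither the application nor the receiver has to signal anything per frame; keeping only layer 0 yields a quarter of the frame rate, and layers 0 and 1 yield half.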

Flexible Mode

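Flexible mode gives the application frame-by-frame control over reference frames. SVC encoding itself is switched on with the VP9E_SET_SVC control; flexible mode is then selected by setting temporal_layering_mode to VP9E_TEMPORAL_LAYERING_MODE_BYPASS. For each frame, the application must explicitly declare:
* Which reference buffers the current frame will read from.
* Which reference buffers the current frame will update after being encoded.
* The spatial and temporal layer IDs for the current frame.

The buffer assignments are passed in a vpx_svc_ref_frame_config_t structure through the VP9E_SET_SVC_REF_FRAME_CONFIG control, and the layer IDs in a vpx_svc_layer_id_t structure through the VP9E_SET_SVC_LAYER_ID control.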
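
The per-frame declarations can be pictured with a small sketch. The ref_plan struct below is a local stand-in whose field names are modelled on vpx_svc_ref_frame_config_t; it is not the libvpx type, and the buffer layout (each spatial layer keeps its own history in the buffer with the same index and predicts spatially from the buffer one below) is just one simple assumed scheme.

```c
#include <assert.h>
#include <string.h>

#define MAX_SPATIAL_LAYERS 5

/* Stand-in for the data an application supplies per superframe. */
typedef struct {
  int lst_fb_idx[MAX_SPATIAL_LAYERS];         /* buffer read as LAST */
  int gld_fb_idx[MAX_SPATIAL_LAYERS];         /* buffer read as GOLDEN */
  int reference_last[MAX_SPATIAL_LAYERS];     /* 1 = predict from LAST */
  int reference_golden[MAX_SPATIAL_LAYERS];   /* 1 = predict from GOLDEN */
  int update_buffer_slot[MAX_SPATIAL_LAYERS]; /* bitmask of buffers written */
} ref_plan;

/* Fill the plan for one superframe (all spatial layers of one
   time instant). */
static void plan_superframe(ref_plan *p, int num_layers, int is_key_frame) {
  assert(num_layers >= 1 && num_layers <= MAX_SPATIAL_LAYERS);
  memset(p, 0, sizeof(*p));
  for (int sl = 0; sl < num_layers; ++sl) {
    p->lst_fb_idx[sl] = sl;                  /* temporal reference */
    p->gld_fb_idx[sl] = sl > 0 ? sl - 1 : 0; /* spatial reference */
    p->reference_last[sl] = !is_key_frame;   /* no history on a key frame */
    p->reference_golden[sl] = sl > 0;        /* base layer has no lower layer */
    p->update_buffer_slot[sl] = 1 << sl;     /* refresh this layer's buffer */
  }
}
```

Because the plan is rebuilt for every superframe, the application can change it at any point, for example by clearing reference_golden for a layer to cut its dependency on the layer below.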

This mode is widely used in real-time communication stacks such as WebRTC, where the application drops layers or alters prediction paths on the fly in response to packet loss or network congestion.