How libvpx Performs Spatial Resampling on the Fly

This article explains how libvpx, the reference software codec for VP8 and VP9, handles spatial resampling dynamically during video encoding and decoding. It covers reference frame scaling (a VP9 feature), the interpolation filters involved, and how the process allows resolution switching without keyframes.

Dynamic Resolution Switching

In VP8, as in many earlier codecs, the frame size is signaled only in keyframes (I-frames), so changing the resolution of a stream forces the encoder to insert one. Because a keyframe cannot use temporal prediction from previous frames, it costs far more bits than an inter frame of similar quality and can cause a sharp bitrate spike. VP9 addresses this with reference frame scaling: an inter frame may be coded at a different resolution than the frames it references, so the encoder can change resolution on the fly without inserting a keyframe.

During an active encoding session, libvpx can be asked to change the output resolution (for instance, because of network congestion in a WebRTC application). Instead of resetting the encoder, libvpx continues to emit inter-predicted frames at the new resolution, using previously coded frames of a different resolution as references. (VP9 has no B-frames in the MPEG sense; all temporally predicted frames are simply inter frames.)

The Scaling Mechanism

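To perform spatial resampling on the fly, libvpx does not need to rewrite the reference frames in its buffer pool. Each reference stays stored at the resolution it was coded at, and scaling happens during motion compensation: for every predicted block, the codec samples the reference at positions scaled by the ratio between the reference size and the current frame size. On the encoder side, libvpx can additionally build resized copies of the references so that motion search operates at the current resolution.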
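The VP9 bitstream limits how far a reference may differ in size from the frame that uses it: in each dimension the reference may be at most twice as large as the current frame and at most sixteen times smaller. The check can be sketched in Python as follows; the function name is hypothetical and the code is an illustration of the constraint, not libvpx's implementation.

```python
def is_valid_ref_size(ref_w: int, ref_h: int, cur_w: int, cur_h: int) -> bool:
    """Return True if a ref_w x ref_h reference may predict a cur_w x cur_h frame.

    Encodes the VP9 size constraint: per dimension, the reference may be
    at most 2x larger and at most 16x smaller than the current frame.
    """
    return (2 * cur_w >= ref_w and 2 * cur_h >= ref_h
            and cur_w <= 16 * ref_w and cur_h <= 16 * ref_h)


if __name__ == "__main__":
    print(is_valid_ref_size(1280, 720, 640, 360))  # halving the resolution
    print(is_valid_ref_size(1280, 720, 320, 180))  # quartering it in one step
    print(is_valid_ref_size(320, 180, 1280, 720))  # 4x upscale
```

Under this constraint, a drop from 1280x720 straight to 320x180 cannot use the 720p frames as references, so an encoder would need an intermediate step or a keyframe, while scaling back up by the same factor is permitted.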

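Buffer Allocation: VP9 has no sequence-level header that fixes a maximum resolution. The frame size is carried in each frame header, either explicitly or inherited from one of the references, and each reference buffer holds its frame at the resolution at which that frame was coded. A decoder therefore has to be prepared to allocate or reallocate frame buffers when the size changes mid-stream.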
Filtering and Interpolation: When the current frame's resolution differs from a reference frame's resolution, libvpx derives a horizontal and a vertical scale factor (reference size divided by current size) in fixed point. Prediction then reuses the codec's 8-tap sub-pixel interpolation kernels, which provide 1/16-pel phases, but steps through the reference at a fractional increment instead of one pixel per output sample. These kernels help retain detail when upscaling and limit, but do not eliminate, aliasing when downscaling.
Motion Vector Scaling: Motion vectors (MVs) are expressed in the coordinate system of the current frame, so both the block position and the MV must be mapped into the reference's coordinate system before sampling. libvpx multiplies them by the same horizontal and vertical ratios (reference size divided by current size), held as 14-bit fixed-point values. If the current frame is half the width of its reference, the ratio is 2, so a displacement of 4 pixels in the current frame becomes 8 pixels in the reference; if the current frame is twice the width of its reference, the displacement is halved.
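The position mapping and fractional stepping described above can be sketched in one dimension. This is a simplified Python illustration under stated assumptions, not libvpx's code: all function names are hypothetical, the motion vector is assumed to be given in 1/16-pel units, the position and MV are scaled together rather than separately, and the kernel bank is a stand-in that places bilinear weights in an 8-tap layout instead of the codec's real 8-tap coefficients.

```python
REF_SCALE_SHIFT = 14                  # scale factors are Q14 fixed point
SUBPEL_BITS = 4                       # positions tracked in 1/16-pel units
SUBPEL_MASK = (1 << SUBPEL_BITS) - 1
FILTER_BITS = 7                       # kernel taps sum to 128

# Stand-in kernel bank (assumption): bilinear weights in an 8-tap layout.
# These are NOT the real VP9 8-tap coefficients; they keep the example short.
KERNELS = [[0, 0, 0, 128 - 8 * p, 8 * p, 0, 0, 0] for p in range(16)]


def scale_factor_q14(ref_size: int, cur_size: int) -> int:
    """Ratio ref_size / cur_size as a Q14 fixed-point integer."""
    return (ref_size << REF_SCALE_SHIFT) // cur_size


def scale_value(val: int, scale_q14: int) -> int:
    """Map a current-frame quantity into reference-frame units."""
    return (val * scale_q14) >> REF_SCALE_SHIFT


def predict_row(ref_row, cur_x, mv_q4, width, scale_q14):
    """Predict `width` samples starting at current-frame column `cur_x`.

    ref_row   : one row of reference pixels (0..255)
    mv_q4     : horizontal MV in 1/16 pel, in current-frame coordinates
    scale_q14 : Q14 ratio of reference width to current width
    """
    step_q4 = scale_value(1 << SUBPEL_BITS, scale_q14)   # advance per output
    pos_q4 = scale_value((cur_x << SUBPEL_BITS) + mv_q4, scale_q14)
    out = []
    for _ in range(width):
        base = pos_q4 >> SUBPEL_BITS                     # integer ref column
        kernel = KERNELS[pos_q4 & SUBPEL_MASK]           # fractional phase
        acc = 0
        for k, tap in enumerate(kernel):
            # Taps cover base-3 .. base+4; clamp at the row edges.
            idx = min(max(base - 3 + k, 0), len(ref_row) - 1)
            acc += tap * ref_row[idx]
        rounded = (acc + (1 << (FILTER_BITS - 1))) >> FILTER_BITS
        out.append(min(max(rounded, 0), 255))
        pos_q4 += step_q4
    return out


if __name__ == "__main__":
    ref = [2 * i for i in range(32)]          # 32-pixel reference row
    half = scale_factor_q14(32, 16)           # current frame is half as wide
    print(predict_row(ref, 4, 0, 3, half))    # zero MV
    print(predict_row(ref, 4, 4, 2, half))    # quarter-pel MV in current frame
```

With a current frame half as wide as its reference, the step per output sample is two reference pixels, and a quarter-pel MV in the current frame lands on a half-pel position in the reference, which is the doubling described in the list above.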

Advantages in Real-Time Communication

By executing spatial resampling on the fly, libvpx provides significant advantages for real-time video delivery, such as WebRTC and live streaming:

No Keyframe Spikes: Since the encoder does not need to send a keyframe to change resolution, it avoids the bitrate spike a keyframe would cause, which lowers the risk of adding congestion and packet loss at the moment the network is already struggling.
Fast Adaptation: The encoder can lower the resolution on the very next frame when bandwidth drops, maintaining frame rate and low latency, and then scale back up to high definition once network conditions improve.
Seamless Playback: Decoders process the resolution changes mid-stream without visible pauses or decoding restarts, resulting in a smooth user experience.