How libvpx Selects VP9 Transform Size

In VP9 video encoding, choosing the right transform size is critical for balancing compression efficiency and computational complexity. This article explains how the reference encoder, libvpx, determines the optimal transform size—ranging from 4x4 to 32x32 pixels—using a combination of block partitioning constraints, Rate-Distortion Optimization (RDO), transform mode selection, and fast heuristic speedups.

Block Partitioning Constraints

Before libvpx can choose a transform size, it must respect the boundaries of the block partition. VP9 partitions each 64x64 superblock into prediction blocks as small as 4x4 pixels. A transform cannot be larger than the prediction block it belongs to, and the largest transform VP9 defines is 32x32.

For example:

* A 64x64 block can potentially use 32x32, 16x16, 8x8, or 4x4 transforms.
* An 8x8 block is limited to 8x8 or 4x4 transforms.
* A 4x4 block is strictly restricted to a 4x4 transform.

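Once the block size is determined, libvpx evaluates which of the valid transform sizes yields the best coding efficiency. The chosen size applies to the whole block: when the block is larger than the transform, it is covered by a grid of equally sized transforms rather than a mix of sizes.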
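The size constraint above can be sketched in a few lines of Python. This is an illustration only, not libvpx code: the names `VP9_TX_SIZES`, `largest_tx_size`, and `candidate_tx_sizes` are hypothetical, and the sketch assumes that the shorter side of a rectangular block sets the limit.

```python
# Square transform sizes defined by VP9, in pixels.
VP9_TX_SIZES = (4, 8, 16, 32)

def largest_tx_size(block_w, block_h):
    """Largest transform whose side fits in the shorter block dimension.

    Assumption: for rectangular blocks the shorter side is the limit.
    """
    short_side = min(block_w, block_h)
    return max(tx for tx in VP9_TX_SIZES if tx <= short_side)

def candidate_tx_sizes(block_w, block_h):
    """All transform sizes the block may use, biggest first."""
    limit = largest_tx_size(block_w, block_h)
    return [tx for tx in reversed(VP9_TX_SIZES) if tx <= limit]

print(candidate_tx_sizes(64, 64))  # [32, 16, 8, 4]
print(candidate_tx_sizes(8, 8))    # [8, 4]
print(candidate_tx_sizes(4, 4))    # [4]
```

Listing the candidates largest-first is a convenience for a search that starts from the biggest transform and works downward.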

Rate-Distortion Optimization (RDO)

The primary mathematical engine for deciding transform size is Rate-Distortion Optimization (RDO). libvpx calculates the RD cost for candidate transform sizes using the Lagrangian formula:

\[\text{Cost} = D + \lambda \cdot R\]

Where:

* \(D\) (Distortion): The difference between the original source block and the reconstructed block (usually measured as sum of squared errors).
* \(R\) (Rate): The number of bits required to encode the transform coefficients plus the overhead of signaling the transform size.
* \(\lambda\) (Lambda): A Lagrange multiplier derived from the quantizer setting. A coarser quantizer gives a larger \(\lambda\), which weights rate more heavily against distortion.

For each candidate transform size, the encoder performs the forward transform, quantizes the coefficients, estimates the bit rate, reconstructs the block, and calculates the distortion. The transform size that produces the lowest RD cost is selected.
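A minimal sketch of that selection step, assuming the distortion and rate of each candidate have already been measured. The function names and all numbers are made up for illustration. libvpx itself works with scaled integer costs rather than this floating-point form.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian cost: D + lambda * R."""
    return distortion + lam * rate_bits

def pick_tx_size(candidates, lam):
    """Return the transform size with the lowest RD cost.

    candidates maps tx size -> (distortion, rate_bits).
    """
    return min(candidates, key=lambda tx: rd_cost(*candidates[tx], lam))

# Hypothetical measurements: smaller transforms lower the distortion
# here, but cost more bits.
measured = {32: (1200, 90), 16: (1000, 110), 8: (900, 150)}

print(pick_tx_size(measured, lam=1))   # 8
print(pick_tx_size(measured, lam=4))   # 16
print(pick_tx_size(measured, lam=20))  # 32
```

The same measurements give three different winners as \(\lambda\) grows, which is why the chosen transform size depends on the quantizer and not only on the content.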

VP9 Transform Modes (TX Modes)

To limit signaling overhead, VP9 does not always signal the transform size for every single block individually. Instead, libvpx operates in one of several “TX Modes” decided at the frame level:

Fixed TX Mode (ONLY_4X4): Forces all transforms to be 4x4.
Allow-level Modes (ALLOW_8X8, ALLOW_16X16, ALLOW_32X32): Cap the transform size at the named size. No per-block size is signaled. Each block uses the largest transform that fits both the cap and the block, so a 64x64 block under ALLOW_16X16 uses 16x16 transforms.
Select Mode (TX_MODE_SELECT): The most flexible mode. It allows the encoder to dynamically choose the optimal transform size for every block and explicitly signals this decision in the bitstream.

libvpx evaluates these modes during the frame encoding process to choose the frame-level configuration that maximizes overall coding efficiency.
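The sketch below shows how a frame-level TX mode can be turned into a per-block transform size. It reflects my reading of the VP9 bitstream rules, in which every mode other than TX_MODE_SELECT implies the largest size that both the mode and the block allow. The mode names come from the list above. The table, the function name, and its parameters are hypothetical.

```python
# Largest transform each frame-level mode permits, in pixels.
BIGGEST_TX_FOR_MODE = {
    "ONLY_4X4": 4,
    "ALLOW_8X8": 8,
    "ALLOW_16X16": 16,
    "ALLOW_32X32": 32,
    "TX_MODE_SELECT": 32,
}

def resolve_tx_size(tx_mode, block_max_tx, selected_tx=None):
    """Transform size a block ends up with under a given TX mode.

    block_max_tx: largest transform the block size allows.
    selected_tx:  per-block choice, only honored in TX_MODE_SELECT.
    """
    cap = min(block_max_tx, BIGGEST_TX_FOR_MODE[tx_mode])
    if tx_mode == "TX_MODE_SELECT" and selected_tx is not None:
        if selected_tx > cap:
            raise ValueError("selected size exceeds what the block allows")
        return selected_tx
    # Implicit modes: nothing is signaled, the size follows from the cap.
    return cap

print(resolve_tx_size("ALLOW_16X16", block_max_tx=32))                    # 16
print(resolve_tx_size("ALLOW_32X32", block_max_tx=8))                     # 8
print(resolve_tx_size("TX_MODE_SELECT", block_max_tx=32, selected_tx=8))  # 8
```

Only the last call models a decision that costs bits in the stream. The first two are fully determined by the frame-level mode and the block size.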

Heuristics and Speed-ups

Performing a full RDO search for every transform size on every block is computationally expensive. To maintain high encoding speeds (especially in real-time mode), libvpx employs several heuristic shortcuts based on the --cpu-used parameter:

Early Termination: If the quantized coefficients for a larger transform size (e.g., 16x16) result in all zeros (a “skip” block), the encoder immediately skips evaluating smaller transform sizes (8x8 and 4x4) because they are highly unlikely to yield a better RD cost.
Variance-Based Decisions: The encoder analyzes the spatial variance of the residual block. Smooth residual areas favor larger transform sizes (e.g., 32x32) to capture low-frequency energy, while highly textured or edge-heavy residuals favor smaller transform sizes (e.g., 4x4) to prevent ringing artifacts.
Coding History and Neighboring Blocks: libvpx often predicts the optimal transform size of a current block by looking at the transform sizes chosen for spatially neighboring blocks or collocated blocks in previously encoded frames.
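The early-termination idea can be sketched as a largest-first search that stops as soon as a candidate quantizes to all zeros. This is a simplified illustration, not the libvpx implementation: `search_tx_sizes` and the `evaluate` callback are hypothetical, and the measurements are made up.

```python
def search_tx_sizes(candidates, evaluate, lam):
    """Search transform sizes largest-first with early termination.

    candidates: transform sizes, biggest first.
    evaluate:   callback returning (distortion, rate_bits, all_zero).
    Returns (best size, list of sizes actually evaluated).
    """
    best_tx, best_cost = None, float("inf")
    evaluated = []
    for tx in candidates:
        distortion, rate_bits, all_zero = evaluate(tx)
        evaluated.append(tx)
        cost = distortion + lam * rate_bits
        if cost < best_cost:
            best_tx, best_cost = tx, cost
        if all_zero:
            # Skip block: smaller sizes are unlikely to do better.
            break
    return best_tx, evaluated

# Hypothetical per-size results: (distortion, rate_bits, all_zero).
fake_results = {
    32: (500, 40, False),
    16: (450, 2, True),
    8: (440, 60, False),
    4: (430, 120, False),
}

best, tried = search_tx_sizes([32, 16, 8, 4], fake_results.get, lam=4)
print(best, tried)  # 16 [32, 16]
```

In this run the 8x8 and 4x4 candidates are never transformed or quantized, which is where the time saving comes from.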