How libvpx Selects VP9 Transform Size
In VP9 video encoding, choosing the right transform size is critical
for balancing compression efficiency and computational complexity. This
article explains how the reference encoder, libvpx,
determines the optimal transform size—ranging from 4x4 to 32x32
pixels—using a combination of block partitioning constraints,
Rate-Distortion Optimization (RDO), transform mode selection, and fast
heuristic speedups.
Block Partitioning Constraints
Before libvpx can choose a transform size, it must
respect the boundaries of the block partition. VP9 supports coding
blocks from 64x64 down to 4x4 pixels. The transform size cannot be
larger than the prediction block, and the largest transform is 32x32.
For example:
- A 64x64 block can potentially use 32x32, 16x16, 8x8, or 4x4 transforms.
- An 8x8 block is limited to 8x8 or 4x4 transforms.
- A 4x4 block is strictly restricted to a 4x4 transform.
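The constraint in these examples can be sketched in a few lines of Python. This is an illustration, not libvpx code: `TX_SIZES` and `valid_tx_sizes` are hypothetical names, and the handling of non-square blocks (limiting by the smaller dimension) is an assumption that goes beyond the square examples listed here.

```python
# Transform edge lengths available in VP9, in pixels.
TX_SIZES = (4, 8, 16, 32)


def valid_tx_sizes(block_w, block_h):
    """Return the transform sizes that fit a block of the given dimensions.

    Assumption: the limit is the block's smaller dimension, capped at the
    largest transform (32), so a 64x64 block tops out at 32x32.
    """
    limit = min(block_w, block_h, TX_SIZES[-1])
    return [size for size in TX_SIZES if size <= limit]


if __name__ == "__main__":
    # Print the candidates for the three block sizes used as examples.
    for dims in [(64, 64), (8, 8), (4, 4)]:
        print(dims, valid_tx_sizes(*dims))
```

The returned list is the candidate set that any later search would iterate over.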
Once the block size is determined, libvpx evaluates
which valid transform size yields the best coding efficiency. VP9
assigns one transform size to each block; when that size is smaller
than the block, the transform is tiled across it.
Rate-Distortion Optimization (RDO)
The primary mathematical engine for deciding transform size is
Rate-Distortion Optimization (RDO). libvpx calculates the
RD cost for candidate transform sizes using the Lagrangian formula:
\[\text{Cost} = D + \lambda \cdot R\]
Where:
- \(D\) (Distortion): The difference between the original source block and the reconstructed block (usually measured as the sum of squared errors).
- \(R\) (Rate): The number of bits required to encode the transform coefficients, plus the transform-size signaling overhead when the size is signaled per block.
- \(\lambda\) (Lambda): A Lagrange multiplier determined by the quantization parameter (QP).
The encoder tests different transform sizes, performs the forward transform, quantizes the coefficients, estimates the bit rate, reconstructs the block, and calculates the distortion. The transform size that produces the lowest RD cost is selected.
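The selection step can be illustrated with a minimal Python sketch. The function names are hypothetical and the distortion and rate figures are invented for illustration; they are not measurements from libvpx, which uses scaled integer arithmetic for its RD cost.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian cost: Cost = D + lambda * R."""
    return distortion + lam * rate_bits


def pick_tx_size(candidates, lam):
    """Return the transform size with the lowest RD cost.

    candidates maps a transform size to a (distortion, rate_bits) pair
    obtained by transforming, quantizing, and reconstructing the block.
    """
    return min(candidates, key=lambda size: rd_cost(*candidates[size], lam))


# Invented figures: smaller transforms reconstruct more accurately here
# but need more bits to code.
EXAMPLE = {4: (900.0, 120), 8: (1000.0, 80), 16: (1400.0, 60)}

if __name__ == "__main__":
    for lam in (1, 10, 50):
        print("lambda", lam, "->", pick_tx_size(EXAMPLE, lam))
```

With these figures, a small \(\lambda\) favors the low-distortion 4x4 candidate, while a large \(\lambda\) shifts the choice toward the cheaper 16x16 candidate, which is the trade-off the formula encodes.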
VP9 Transform Modes (TX Modes)
To reduce signaling overhead, VP9 does not always signal the transform
size for every single block individually. Instead, libvpx
operates in one of several “TX Modes” decided at the frame level:
- Fixed TX Mode (ONLY_4X4): Forces all transforms to be 4x4.
- Allow-level Modes (ALLOW_8X8, ALLOW_16X16, ALLOW_32X32): Cap the transform size at the specified value. No per-block size is signaled; each block uses the largest transform that fits both the block and the cap.
- Select Mode (TX_MODE_SELECT): The most flexible mode. It allows the encoder to choose the transform size for each block and explicitly signals this decision in the bitstream.
libvpx evaluates these modes during the frame encoding
process to choose the frame-level configuration that maximizes overall
coding efficiency.
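How a frame-level TX mode translates into a per-block transform size can be sketched as follows. This is a simplified model, not libvpx code: `TX_MODE_CAP` and `tx_size_for_block` are hypothetical names, and the sketch assumes that under the non-select modes a block takes the largest size that fits both the block and the frame-level cap, since nothing is signaled per block in those modes.

```python
# Largest transform edge permitted by each non-select TX mode.
TX_MODE_CAP = {
    "ONLY_4X4": 4,
    "ALLOW_8X8": 8,
    "ALLOW_16X16": 16,
    "ALLOW_32X32": 32,
}


def tx_size_for_block(tx_mode, block_w, block_h, signaled_size=None):
    """Resolve the transform size for one block under a frame-level TX mode."""
    largest_fit = min(block_w, block_h, 32)
    if tx_mode == "TX_MODE_SELECT":
        # The encoder's per-block choice travels in the bitstream.
        # Assumption: fall back to the largest fitting size if none is given.
        if signaled_size is None:
            return largest_fit
        if signaled_size > largest_fit:
            raise ValueError("signaled transform is larger than the block allows")
        return signaled_size
    # Non-select modes: nothing is signaled per block.
    return min(largest_fit, TX_MODE_CAP[tx_mode])


if __name__ == "__main__":
    print(tx_size_for_block("ALLOW_16X16", 64, 64))
    print(tx_size_for_block("TX_MODE_SELECT", 32, 32, signaled_size=8))
```

Only the select mode carries a per-block choice, which is why it costs extra bits but gives the RD search the most freedom.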
Heuristics and Speed-ups
Performing a full RDO search for every transform size on every block
is computationally expensive. To maintain high encoding speeds
(especially in real-time mode), libvpx employs several
heuristic shortcuts based on the --cpu-used parameter:
- Early Termination: If the quantized coefficients for a larger transform size (e.g., 16x16) result in all zeros (a “skip” block), the encoder immediately skips evaluating smaller transform sizes (8x8 and 4x4) because they are highly unlikely to yield a better RD cost.
- Variance-Based Decisions: The encoder analyzes the spatial variance of the residual block. Smooth residual areas favor larger transform sizes (e.g., 32x32) to capture low-frequency energy, while highly textured or edge-heavy residuals favor smaller transform sizes (e.g., 4x4) to prevent ringing artifacts.
- Coding History and Neighboring Blocks: libvpx often predicts the optimal transform size of a current block by looking at the transform sizes chosen for spatially neighboring blocks or collocated blocks in previously encoded frames.
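The early-termination shortcut can be sketched as a largest-to-smallest search that stops as soon as a candidate quantizes to all zeros. This is an illustration under stated assumptions rather than the libvpx implementation: `search_tx_size` and the `evaluate` callback are hypothetical, and the callback is assumed to return the RD cost together with a flag saying whether every quantized coefficient was zero.

```python
def search_tx_size(sizes_desc, evaluate):
    """Search transform sizes from largest to smallest with early exit.

    evaluate(size) is a hypothetical callback returning (rd_cost, all_zero).
    Returns the best size found and the list of sizes actually evaluated.
    """
    best_size, best_cost = None, float("inf")
    tested = []
    for size in sizes_desc:
        cost, all_zero = evaluate(size)
        tested.append(size)
        if cost < best_cost:
            best_size, best_cost = size, cost
        if all_zero:
            # A "skip" block: smaller transforms are unlikely to do better,
            # so the remaining candidates are not evaluated.
            break
    return best_size, tested


if __name__ == "__main__":
    # Invented costs; the 16x16 candidate quantizes to all zeros.
    table = {32: (500.0, False), 16: (400.0, True), 8: (350.0, False), 4: (600.0, False)}
    print(search_tx_size([32, 16, 8, 4], lambda size: table[size]))
```

In the invented table, the search stops at 16x16 and never sees the slightly cheaper 8x8 candidate, which shows the speed-for-quality trade that such heuristics accept.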