How libvpx Uses Presentation Timestamps in Video Encoding

In video encoding, presentation timestamps (PTS) determine when each frame should be displayed and keep playback synchronized. This article explains how the libvpx library, the reference encoder for the VP8 and VP9 video formats, uses PTS and frame durations during encoding: to drive rate control, to support variable frame rates, and in relation to frame type decisions and multi-pass compression.

The Role of PTS in the API

In libvpx, the primary encoding function is vpx_codec_encode. Alongside the encoder context, per-frame flags, and a deadline, each call takes three per-frame inputs, two of which carry timing:

* Input Image: The raw video frame.
* PTS (Presentation Timestamp): A timestamp indicating when the frame should be displayed, measured in stream timebase units.
* Duration: How long the frame should be displayed before the next frame replaces it, also in timebase units.

Unlike encoders that rely strictly on a fixed frame rate, libvpx uses these explicit temporal markers to understand the exact real-time spacing of the input video.
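As a rough illustration of the values an application might pass as PTS and duration for a constant-rate source, the sketch below derives both from a frame index. The names rational_t, cfr_pts, and cfr_duration are hypothetical helpers written for this article, not part of libvpx; the struct only mimics a numerator/denominator pair such as the encoder timebase.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a rational such as the encoder timebase (num/den). */
typedef struct { int64_t num, den; } rational_t;

/* PTS of frame `index` for a constant-rate source, in timebase ticks.
 * Computed from the index each time so rounding error does not accumulate. */
int64_t cfr_pts(int64_t index, rational_t fps, rational_t timebase) {
    assert(fps.num > 0 && fps.den > 0 && timebase.num > 0 && timebase.den > 0);
    /* seconds = index * fps.den / fps.num; ticks = seconds * timebase.den / timebase.num */
    return (index * fps.den * timebase.den) / (fps.num * timebase.num);
}

/* Duration of frame `index`: the gap to the next frame's PTS. */
int64_t cfr_duration(int64_t index, rational_t fps, rational_t timebase) {
    return cfr_pts(index + 1, fps, timebase) - cfr_pts(index, fps, timebase);
}
```

With a 1/90000 timebase, a 30 fps source yields a duration of 3000 ticks per frame and a 30000/1001 fps source yields 3003. These are the kinds of values that would go into the pts and duration arguments of vpx_codec_encode.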

1. Rate Control and Bit Allocation

The libvpx rate control algorithm relies heavily on PTS and frame duration to allocate bits across the video stream.

Dynamic Frame Rate Calculation: From the difference between consecutive PTS values and the duration passed with each frame, libvpx estimates the current frame rate. In the VP8 and VP9 encoders this estimate is averaged over recent frames rather than taken from a single gap, so one irregular frame does not swing it abruptly.
Temporal Bit Distribution: The target bitrate is a budget per unit of time, so the per-frame budget scales with frame duration. At a fixed target bitrate, a frame with a longer duration (a larger gap before the next PTS) implies a lower frame rate and therefore a larger share of bits, while frames with short durations receive fewer. A side effect is that long-lived frames, on which compression artifacts stay visible for longer, tend to be coded at higher quality.
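The relationship between duration and bit budget can be sketched with simple arithmetic. This is a deliberately simplified model, assuming the budget is just bitrate times display time; libvpx's actual rate control also accounts for buffer state, frame type, and content. The names rational_t and nominal_frame_bits are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the encoder timebase (seconds per tick = num/den). */
typedef struct { int64_t num, den; } rational_t;

/* Nominal bit budget for one frame: target bitrate (bits per second)
 * multiplied by the frame's display time. Simplified model only. */
int64_t nominal_frame_bits(int64_t target_bps, int64_t duration_ticks,
                           rational_t timebase) {
    assert(timebase.num > 0 && timebase.den > 0 && duration_ticks >= 0);
    return (target_bps * duration_ticks * timebase.num) / timebase.den;
}
```

At 1,000,000 bits per second and a 1/90000 timebase, a 3000-tick frame (1/30 s) gets a nominal 33,333 bits, while a 9000-tick frame (1/10 s) gets 100,000.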

2. Timebase Scaling

For libvpx to interpret PTS values correctly, you must define a timebase in the encoder configuration (the g_timebase field of vpx_codec_enc_cfg_t, a rational with a numerator and a denominator). The timebase is the length of one tick expressed as a fraction of a second (for example, 1/90000 for a 90 kHz clock, or 1/1000 for milliseconds).

The encoder translates ticks to actual time using the formula: \[\text{Time in Seconds} = \text{PTS} \times \text{Timebase}\]

If the timebase configuration does not match the scale of the incoming PTS values, rate control will misjudge the frame rate and, with it, the per-frame bit budget. The result is a bitrate far off target, overshoot or underrun of the rate-control buffer, or degraded visual quality.
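The conversion formula, and the effect of a mismatched timebase, can be checked with a few lines of arithmetic. The helpers below (rational_t, ticks_to_seconds, apparent_fps) are hypothetical names for this sketch and do not come from libvpx.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the encoder timebase (seconds per tick = num/den). */
typedef struct { int num, den; } rational_t;

/* Time in seconds = PTS * timebase. */
double ticks_to_seconds(int64_t ticks, rational_t timebase) {
    assert(timebase.den != 0);
    return (double)ticks * timebase.num / timebase.den;
}

/* Frame rate implied by the gap between two consecutive PTS values. */
double apparent_fps(int64_t pts_delta, rational_t timebase) {
    assert(pts_delta > 0);
    return 1.0 / ticks_to_seconds(pts_delta, timebase);
}
```

Frames spaced 40 ticks apart are 25 fps under a 1/1000 timebase. The same 40-tick gap read against a 1/90000 timebase looks like 2250 fps, which is the kind of error that leaves each frame with a small fraction of the bits it should have.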

3. Variable Frame Rate (VFR) Support

Because libvpx calculates timing from each frame's PTS and duration rather than from a static global FPS value, it supports Variable Frame Rate (VFR) encoding natively. When encoding security camera footage, video game captures, or presentations where the frame rate drops during static scenes, libvpx adapts its per-frame bit targets, and with them the quantizer it selects, to the spacing of the incoming frames.
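For a VFR source, the application usually knows each frame's PTS and has to derive the duration itself. A minimal sketch, assuming strictly increasing PTS values and assuming the last frame reuses the previous gap because it has no successor (durations_from_pts is a hypothetical helper):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill durations[i] with the gap between frame i and frame i + 1.
 * The final frame has no successor, so this sketch reuses the previous gap;
 * a capture pipeline may know the real value and should prefer it. */
void durations_from_pts(const int64_t *pts, size_t n, int64_t *durations) {
    for (size_t i = 0; i + 1 < n; i++) {
        assert(pts[i + 1] > pts[i]); /* sketch assumes strictly increasing PTS */
        durations[i] = pts[i + 1] - pts[i];
    }
    if (n == 1) {
        durations[0] = 1; /* single frame: arbitrary one-tick duration */
    } else if (n > 1) {
        durations[n - 1] = durations[n - 2];
    }
}
```

With millisecond timestamps 0, 33, 66, 1066, 1099, the third frame receives a duration of 1000 ticks, reflecting a one-second static stretch, while its neighbors receive 33.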

4. Frame Type Decisions and GOP Structure

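Timing also interacts with the Group of Pictures (GOP) structure, which includes keyframes and alternate reference (alt-ref) frames, although the relevant limits are expressed in frames rather than in timebase units:

Keyframe Placement: The configuration parameters kf_min_dist and kf_max_dist define the minimum and maximum distance between keyframes. Both are counted in frames, so with a variable frame rate the same kf_max_dist can span very different amounts of PTS time. A keyframe is forced once kf_max_dist frames have passed without one, for example when no scene change has occurred. An application that needs a time-based interval can request a keyframe on a specific frame with the VPX_EFLAG_FORCE_KF flag.
Alt-Ref Frames: In VP8 and VP9, alt-ref frames are non-displayed frames used as predictors for frames coded after them. They are built from frames held in the encoder's lookahead buffer, whose depth is set in frames by g_lag_in_frames. Each buffered frame keeps its PTS and duration so that the packets eventually produced carry the correct timing.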
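If a keyframe interval has to hold in wall-clock terms regardless of how many frames arrive, the application can track elapsed PTS itself and set the force-keyframe flag when the limit is reached. The sketch below is one way to do that. SKETCH_FORCE_KF is a local stand-in for libvpx's VPX_EFLAG_FORCE_KF so the example compiles without the library, and kf_timer_t and kf_flags_for are hypothetical names. It does not track keyframes the encoder inserts on its own, such as at scene cuts.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for VPX_EFLAG_FORCE_KF (assumption: used only in this sketch). */
#define SKETCH_FORCE_KF 1u

typedef struct {
    int64_t last_kf_pts;  /* PTS of the most recent forced keyframe */
    int64_t max_interval; /* maximum keyframe spacing, in timebase ticks */
    int have_kf;          /* 0 until the first frame has been seen */
} kf_timer_t;

/* Returns the flags for this frame: the force-keyframe bit on the first frame
 * and whenever max_interval ticks have elapsed since the last forced keyframe. */
unsigned kf_flags_for(kf_timer_t *t, int64_t pts) {
    assert(t->max_interval > 0);
    if (!t->have_kf || pts - t->last_kf_pts >= t->max_interval) {
        t->last_kf_pts = pts;
        t->have_kf = 1;
        return SKETCH_FORCE_KF;
    }
    return 0u;
}
```

The returned value would be OR-ed into the flags argument of vpx_codec_encode for that frame, with SKETCH_FORCE_KF replaced by the real constant.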

5. Multi-Pass Encoding

In two-pass encoding, PTS usage is split across two phases:

First Pass: The encoder processes the video to analyze scene complexity and emits statistics, which the application typically saves to a stats file. Each per-frame entry records the frame's duration alongside its complexity measurements.
Second Pass: The application hands the saved statistics back to libvpx through the rc_twopass_stats_in configuration field. The entries are consumed in frame order, so the second pass must be fed the same frames, in the same order and with the same timing, as the first. With the complexity of the entire timeline known in advance, the encoder can distribute the target bitrate budget across both high-motion and static scenes.
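The idea of spreading a whole-timeline budget can be illustrated with a toy allocator. This is not libvpx's two-pass algorithm: it assumes a hypothetical per-frame record holding only a duration and a single complexity score, and it hands out bits in proportion to that score. The names pass1_record_t and allocate_bits are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-frame record; real first-pass stats hold many more fields. */
typedef struct {
    double duration_s; /* display time in seconds */
    double complexity; /* any positive difficulty score from the first pass */
} pass1_record_t;

/* Total budget = bitrate * total duration. Each frame then receives a share
 * of that budget proportional to its complexity. Toy model only. */
void allocate_bits(const pass1_record_t *recs, size_t n, double target_bps,
                   double *bits_out) {
    double total_s = 0.0, total_c = 0.0;
    for (size_t i = 0; i < n; i++) {
        total_s += recs[i].duration_s;
        total_c += recs[i].complexity;
    }
    assert(total_c > 0.0);
    for (size_t i = 0; i < n; i++) {
        bits_out[i] = target_bps * total_s * recs[i].complexity / total_c;
    }
}
```

Two one-second frames with complexity scores 1 and 3 at 1,000,000 bits per second share a 2,000,000-bit budget as 500,000 and 1,500,000 bits. The total depends on the summed durations, which is why the timing seen in both passes has to agree.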