Dealing with potential frame shape mismatch for batch APIs #433

NicolasHug · 2024-12-13T10:01:37Z

Pre-read: the current state of output tensor allocation as described in the comment below:

torchcodec/src/torchcodec/decoders/_core/VideoDecoder.h

Lines 430 to 473 in 84cef50

    
    // --------------------------------------------------------------------------
    // FRAME TENSOR ALLOCATION APIs
    // --------------------------------------------------------------------------
    // Note [Frame Tensor allocation and height and width]
    //
    // We always allocate [N]HWC tensors. The low-level decoding functions all
    // assume HWC tensors, since this is what FFmpeg natively handles. It's up to
    // the high-level decoding entry-points to permute that back to CHW, by calling
    // MaybePermuteHWC2CHW().
    //
    // Also, importantly, the way we figure out the height and width of the
    // output frame tensor varies, and depends on the decoding entry-point. In
    // *decreasing order of accuracy*, we use the following sources for determining
    // height and width:
    // - getHeightAndWidthFromResizedAVFrame(). This is the height and width of the
    //   AVFrame, *post*-resizing. This is only used for single-frame decoding APIs,
    //   on CPU, with filtergraph.
    // - getHeightAndWidthFromOptionsOrAVFrame(). This is the height and width from
    //   the user-specified options if they exist, or the height and width of the
    //   AVFrame *before* it is resized. In theory, i.e. if there are no bugs within
    //   our code or within FFmpeg code, this should be exactly the same as
    //   getHeightAndWidthFromResizedAVFrame(). This is used by single-frame
    //   decoding APIs, on CPU with swscale, and on GPU.
    // - getHeightAndWidthFromOptionsOrMetadata(). This is the height and width from
    //   the user-specified options if they exist, or the height and width from the
    //   stream metadata, which itself got its value from the CodecContext, when the
    //   stream was added. This is used by batch decoding APIs, for both GPU and
    //   CPU.
    //
    // The source of truth for height and width really is the (resized) AVFrame: it
    // comes from the decoded output of FFmpeg. The info from the metadata (i.e.
    // from the CodecContext) may not be as accurate. However, the AVFrame is only
    // available late in the call stack, when the frame is decoded, while the
    // CodecContext is available early when a stream is added. This is why we use
    // the CodecContext for pre-allocating batched output tensors (we could
    // pre-allocate those only once we decode the first frame to get the info from
    // the AVFrame, but that's more complex logic).
    //
    // Because the sources for height and width may disagree, we may end up with
    // conflicts: e.g. if we pre-allocate a batch output tensor based on the
    // metadata info, but the decoded AVFrame has a different height and width.
    // It is very important to check the height and width assumptions where the
    // tensor's memory is used/filled in order to avoid segfaults.
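To make the last point of the comment concrete, here is a minimal sketch of the kind of check it asks for. It is not torchcodec code: `FrameDims`, `DecodedFrame` and `copyFrameIntoBatch` are hypothetical names, and a plain `std::vector<uint8_t>` stands in for the pre-allocated NHWC tensor (3 channels assumed).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for illustration only, not torchcodec or FFmpeg types.
struct FrameDims {
  int height;
  int width;
};

struct DecodedFrame {
  FrameDims dims;            // dims reported by the decoded frame itself
  std::vector<uint8_t> data; // HWC layout, 3 channels assumed
};

// Copies `frame` into slot `index` of a pre-allocated NHWC buffer. It refuses
// to write when the decoded dims differ from the dims used at allocation time,
// so a mismatch surfaces as an exception instead of an out-of-bounds write.
void copyFrameIntoBatch(
    const DecodedFrame& frame,
    std::vector<uint8_t>& batch,
    FrameDims allocatedDims,
    size_t index) {
  if (frame.dims.height != allocatedDims.height ||
      frame.dims.width != allocatedDims.width) {
    throw std::runtime_error(
        "decoded frame shape differs from the pre-allocated shape");
  }
  const size_t frameBytes =
      static_cast<size_t>(allocatedDims.height) * allocatedDims.width * 3;
  if (frame.data.size() != frameBytes) {
    throw std::runtime_error("decoded frame buffer has an unexpected size");
  }
  // Internal invariant: the requested slot lies within the batch buffer.
  assert((index + 1) * frameBytes <= batch.size());
  std::memcpy(batch.data() + index * frameBytes, frame.data.data(), frameBytes);
}
```

The check sits right where the memory is filled, which is the place the comment singles out.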

This issue is about addressing that specific part of the comment:

torchcodec/src/torchcodec/decoders/_core/VideoDecoder.h

Lines 464 to 474 in 84cef50

    
    // CodecContext is available early when a stream is added. This is why we use
    // the CodecContext for pre-allocating batched output tensors (we could
    // pre-allocate those only once we decode the first frame to get the info from
    // the AVFrame, but that's more complex logic).
    //
    // Because the sources for height and width may disagree, we may end up with
    // conflicts: e.g. if we pre-allocate a batch output tensor based on the
    // metadata info, but the decoded AVFrame has a different height and width.
    // It is very important to check the height and width assumptions where the
    // tensor's memory is used/filled in order to avoid segfaults.

Basically, the idea is that instead of allocating batch output tensors based on H and W from the metadata, we could allocate them only after we have decoded the first frame, and rely on the H and W from that decoded frame, which is the source of truth for dimensions.
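A rough sketch of what that could look like, under simplifying assumptions: `LazyBatchOutput`, `RawFrame` and `Dims` are hypothetical names that do not exist in torchcodec, a `std::vector<uint8_t>` stands in for the NHWC output tensor, and 3 channels are assumed. The buffer is only sized when the first decoded frame arrives, and later frames are still checked against it since frames within a stream could disagree with each other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for illustration only, not torchcodec types.
struct Dims {
  int height;
  int width;
};

struct RawFrame {
  Dims dims;                 // dims taken from the decoded frame
  std::vector<uint8_t> data; // HWC layout, 3 channels assumed
};

class LazyBatchOutput {
 public:
  explicit LazyBatchOutput(size_t numFrames) : numFrames_(numFrames) {}

  // Stores `frame` in slot `index`. The NHWC buffer is allocated on the first
  // call, using the dims of that first decoded frame.
  void add(size_t index, const RawFrame& frame) {
    assert(index < numFrames_);
    if (!allocated_) {
      dims_ = frame.dims;
      frameBytes_ = static_cast<size_t>(dims_.height) * dims_.width * 3;
      data_.assign(numFrames_ * frameBytes_, 0);
      allocated_ = true;
    } else if (
        frame.dims.height != dims_.height || frame.dims.width != dims_.width) {
      throw std::runtime_error(
          "frame shape differs from the first decoded frame");
    }
    if (frame.data.size() != frameBytes_) {
      throw std::runtime_error("frame buffer has an unexpected size");
    }
    std::memcpy(
        data_.data() + index * frameBytes_, frame.data.data(), frameBytes_);
  }

  bool allocated() const { return allocated_; }
  Dims dims() const { return dims_; }
  const std::vector<uint8_t>& data() const { return data_; }

 private:
  size_t numFrames_;
  bool allocated_ = false;
  Dims dims_{0, 0};
  size_t frameBytes_ = 0;
  std::vector<uint8_t> data_;
};
```

The cost of this approach is that the output shape is unknown until the first decode, so any caller that needs the shape up front would have to wait for it.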

NicolasHug mentioned this issue Dec 13, 2024

Lazily init filtergraph so it can respect raw decoded resolution #432

Merged

scotts mentioned this issue Dec 18, 2024

Add public, nonbatch option for benchmarking #438

Merged
