[perf] boosting video decoding performance by 1.3~2x with opencv as decoding backend by WingEdge777 · Pull Request #13565 · sgl-project/sglang

WingEdge777 · 2025-11-19T07:48:41Z

Motivation

We benchmarked video decoding and found a consistently large performance gap between decord2 and OpenCV in the sparse/random frame sampling scenario. In some cases OpenCV is nearly 2x faster than decord2.

Modifications

We should therefore offer OpenCV as at least an optional backend for video decoding. For now, an environment variable acts as the switch.
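
A minimal sketch of what such an environment-variable switch could look like. The variable name `VIDEO_DECODE_BACKEND` and the helper `select_video_backend` are hypothetical, chosen only for illustration; the description above does not name the actual variable.

```python
import os

# Hypothetical variable name; the PR text does not state the real one.
BACKEND_ENV_VAR = "VIDEO_DECODE_BACKEND"
SUPPORTED_BACKENDS = ("decord", "opencv")


def select_video_backend(default: str = "decord") -> str:
    """Return the backend name requested through the environment."""
    name = os.environ.get(BACKEND_ENV_VAR, default).strip().lower()
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"{BACKEND_ENV_VAR}={name!r} is not one of {SUPPORTED_BACKENDS}"
        )
    return name
```

Keeping decord as the default in this sketch means nothing changes for existing users unless the variable is set.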

benchmark

test_data: here

benchmark code

import time
from decord import VideoReader, cpu, gpu
import cv2
import numpy as np
from tqdm import tqdm

from typing import List, Any

def list_to_markdown_table(data_list: List[List[Any]]) -> str:
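    """
    Render the benchmark output as a Markdown table.

    Args:
        data_list: [sampling_rates, cv_rows, decord_rows], where each row is
            [label, time_1, time_2, ...]; the first entry becomes the header.

    Returns:
        Markdown str
    """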
    if not data_list or not data_list[0]:
        return "Warning: Input list is empty."

    header = data_list[0]
    data_rows = data_list[1:]
    
    num_columns = len(header) + 1

    header_line = "| sampling fps | " + " | ".join(str(h) for h in header) + " |"

    separator_line = "| " + " | ".join(["---"] * num_columns) + " |"

    data_lines = []
    for cv, de in zip(data_rows[0], data_rows[1]):
        for row in [cv, de]:
            data_line = "| " + " | ".join([str(item) for item in row]) + " |"
            data_lines.append(data_line)

    markdown_table = [header_line, separator_line] + data_lines
    return "\n".join(markdown_table)



def benchmark():
    try:
        from decord.bridge import decord_bridge

        ctx = gpu(0)
        _ = decord_bridge.get_ctx_device(ctx)
    except Exception:
        ctx = cpu(0)
    print("Using context:", ctx)

    test_files = ["test_2k.mp4", "test_1080p.mp4", "test_720p.mp4", "test_480p.mp4", "test_1080p_mobile.mp4", "test_720p_mobile.mp4", "test_480p_mobile.mp4"]

    sampling_rate = [1, 3, 5, 10, 120]
    cv_cost = [[] for _ in test_files]
    decord_cost = [[] for _ in test_files]
    for x, file in enumerate(tqdm(test_files)):
        cv_cost[x].append("cv_"+file)
        decord_cost[x].append("de_"+file)

        # use opencv get video meta
        vc = cv2.VideoCapture(file)
        total_frames = int(vc.get(cv2.CAP_PROP_FRAME_COUNT))
        video_fps = vc.get(cv2.CAP_PROP_FPS)
        width = int(vc.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(vc.get(cv2.CAP_PROP_FRAME_HEIGHT))
        # print(video_fps, width, height, file)
        vc.release()
        
        # run benchmark
        for fps in sampling_rate:
            if fps > video_fps:
                fps = video_fps
            nframes = int(total_frames / video_fps * fps)
            frame_idx = np.linspace(0, total_frames - 1, num=nframes, dtype=np.int64)
            frame_idx = np.unique(frame_idx)
            nframes = frame_idx.shape[0]

            cnt = 10
            st = time.time()
            for _ in range(cnt):
                vc = cv2.VideoCapture(file)
                total_frames = int(vc.get(cv2.CAP_PROP_FRAME_COUNT))
                video_np = np.empty((nframes, height, width, 3), dtype=np.uint8)
                mx_idx = min(total_frames, max(frame_idx) + 1)
                i = 0
                for idx in range(mx_idx):
                    ok = vc.grab()
                    if not ok:
                        break
                    if idx in frame_idx:
                        ret, frame = vc.retrieve()
                        if ret:
                            video_np[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                            i += 1
                vc.release()  # release inside the loop so every capture is closed
            ed = time.time()
            t = f"{(ed-st) / cnt:.3f}"
            cv_cost[x].append(t)

            # print(video_np.shape)
            # print(i,"cv", cv_cost)


            st = time.time()
            for _ in range(cnt):
                vr = VideoReader(file, ctx=ctx)
                video_np = vr.get_batch(frame_idx).asnumpy()
            ed = time.time()
            t = f"{(ed-st) / cnt:.3f}"
            decord_cost[x].append(t)

            # print(video_np.shape)
            # print(i,"de", decord_cost)
    return [sampling_rate, cv_cost, decord_cost]


if __name__ == "__main__":
    output = list_to_markdown_table(benchmark())
    print(output)
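
The OpenCV half of the script calls `grab()` for every frame and `retrieve()` only for the sampled ones. The same pattern can be sketched as a reusable helper. The function name `read_sampled_frames` is hypothetical, and the set lookup is a swapped-in detail that replaces the script's `idx in frame_idx` array check; it is not taken from the PR.

```python
from typing import Any, Iterable, List


def read_sampled_frames(capture: Any, frame_indices: Iterable[int]) -> List[Any]:
    """Collect the frames at `frame_indices` from a cv2.VideoCapture-like object.

    `capture` only needs grab() -> bool and retrieve() -> (bool, frame),
    which is the interface cv2.VideoCapture exposes.
    """
    wanted = set(int(i) for i in frame_indices)  # O(1) membership checks
    if not wanted:
        return []
    frames = []
    for idx in range(max(wanted) + 1):
        if not capture.grab():  # end of stream or read error
            break
        if idx in wanted:
            ok, frame = capture.retrieve()
            if ok:
                frames.append(frame)
    return frames
```

With a real `cv2.VideoCapture`, the returned frames are in BGR order, so each one would still need `cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)` as in the benchmark, and the caller remains responsible for calling `release()`.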

benchmark result

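Cases: sampling rates from 1 fps up to full-frame extraction (the 120 column stands for full frames, because the script caps the requested rate at the video's native fps);
Time unit: seconds (average per decode over 10 runs)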
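
How a sampling rate maps to a set of frame indices can be sketched without NumPy. This approximates the script's `np.linspace` + `np.unique` step in pure Python; the function name `sample_frame_indices` is hypothetical and the rounding may differ from NumPy in edge cases.

```python
from typing import List


def sample_frame_indices(total_frames: int, video_fps: float, target_fps: float) -> List[int]:
    """Evenly spaced, de-duplicated frame indices for a target sampling rate."""
    target_fps = min(target_fps, video_fps)  # cannot sample faster than the video
    n = int(total_frames / video_fps * target_fps)
    if n <= 0:
        return []
    if n == 1:
        return [0]
    step = (total_frames - 1) / (n - 1)
    indices = {int(i * step) for i in range(n - 1)}
    indices.add(total_frames - 1)  # pin the endpoint against float rounding
    return sorted(indices)
```

For a video whose native rate is below 120 fps, requesting 120 fps yields every frame index, which is why that column represents full decoding.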

result table

Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz + T4

sampling fps	1	3	5	10	120
cv_test_2k.mp4	0.829	0.891	0.953	1.325	3.642
de_test_2k.mp4	1.164	1.591	1.870	2.690	4.610
cv_test_1080p.mp4	1.193	1.311	1.439	2.267	6.244
de_test_1080p.mp4	1.671	2.155	2.555	3.278	9.514
cv_test_720p.mp4	0.694	0.739	0.777	1.075	2.933
de_test_720p.mp4	0.908	1.202	1.365	1.626	4.422
cv_test_480p.mp4	0.408	0.424	0.440	0.513	1.273
de_test_480p.mp4	0.584	0.716	0.791	0.950	1.448
cv_test_1080p_mobile.mp4	0.801	0.952	1.213	2.199	6.224
de_test_1080p_mobile.mp4	1.289	2.023	2.817	4.579	7.533
cv_test_720p_mobile.mp4	0.425	0.492	0.650	1.099	2.891
de_test_720p_mobile.mp4	0.636	0.966	1.308	1.868	3.567
cv_test_480p_mobile.mp4	0.242	0.266	0.319	0.477	1.233
de_test_480p_mobile.mp4	0.348	0.457	0.560	0.789	1.518

AMD EPYC 7W83 64-Core Processor + A10

sampling fps	1	3	5	10	120
cv_test_2k.mp4	0.633	0.666	0.667	0.743	1.608
de_test_2k.mp4	0.862	1.057	1.122	1.235	2.819
cv_test_1080p.mp4	0.919	0.950	0.987	1.237	2.825
de_test_1080p.mp4	1.289	1.527	1.755	2.132	5.056
cv_test_720p.mp4	0.539	0.552	0.573	0.681	1.262
de_test_720p.mp4	0.775	0.863	0.919	1.086	2.348
cv_test_480p.mp4	0.327	0.331	0.343	0.357	0.628
de_test_480p.mp4	0.459	0.533	0.579	0.589	1.032
cv_test_1080p_mobile.mp4	0.586	0.683	0.701	1.086	2.633
de_test_1080p_mobile.mp4	0.923	1.315	1.935	2.634	4.695
cv_test_720p_mobile.mp4	0.332	0.356	0.381	0.477	1.584
de_test_720p_mobile.mp4	0.491	0.605	0.772	1.037	2.099
cv_test_480p_mobile.mp4	0.173	0.183	0.192	0.260	0.571
de_test_480p_mobile.mp4	0.252	0.301	0.328	0.418	1.034

AMD EPYC 9K84 96-Core Processor + L20

sampling fps	1	3	5	10	120
cv_test_2k.mp4	0.583	0.640	0.694	0.955	2.365
de_test_2k.mp4	0.789	1.100	1.412	2.007	3.345
cv_test_1080p.mp4	0.863	0.966	1.078	1.594	4.234
de_test_1080p.mp4	1.131	1.485	1.824	2.620	5.582
cv_test_720p.mp4	0.511	0.541	0.576	0.756	1.985
de_test_720p.mp4	0.652	0.828	0.977	1.285	2.556
cv_test_480p.mp4	0.293	0.300	0.315	0.378	0.830
de_test_480p.mp4	0.398	0.493	0.564	0.714	1.307
cv_test_1080p_mobile.mp4	0.592	0.716	0.891	1.501	4.067
de_test_1080p_mobile.mp4	0.860	1.422	2.057	3.349	6.102
cv_test_720p_mobile.mp4	0.333	0.364	0.427	0.683	1.933
de_test_720p_mobile.mp4	0.444	0.688	0.942	1.357	2.735
cv_test_480p_mobile.mp4	0.174	0.184	0.210	0.335	0.808
de_test_480p_mobile.mp4	0.238	0.324	0.431	0.619	1.337

AMD EPYC 9K84 96-Core Processor + H20

sampling fps	1	3	5	10	120
cv_test_2k.mp4	0.620	0.651	0.661	0.789	1.590
de_test_2k.mp4	0.837	1.030	1.117	1.233	2.922
cv_test_1080p.mp4	0.930	0.941	1.017	1.189	2.783
de_test_1080p.mp4	1.209	1.424	1.675	2.097	5.129
cv_test_720p.mp4	0.529	0.543	0.540	0.610	1.222
de_test_720p.mp4	0.731	0.826	0.893	1.081	2.439
cv_test_480p.mp4	0.328	0.334	0.343	0.350	0.600
de_test_480p.mp4	0.449	0.532	0.575	0.580	0.986
cv_test_1080p_mobile.mp4	0.598	0.665	0.703	1.145	2.804
de_test_1080p_mobile.mp4	0.908	1.334	1.802	2.652	5.105
cv_test_720p_mobile.mp4	0.308	0.339	0.353	0.461	1.256
de_test_720p_mobile.mp4	0.463	0.599	0.753	0.995	2.026
cv_test_480p_mobile.mp4	0.185	0.186	0.191	0.241	0.591
de_test_480p_mobile.mp4	0.267	0.302	0.337	0.439	1.081
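
The speedup can be recomputed from any cv_/de_ pair of rows as decord time divided by OpenCV time. A small sketch using one pair copied from the Xeon 8255C + T4 table; the helper name `speedups` is only for illustration.

```python
from typing import List, Sequence


def speedups(opencv_s: Sequence[float], decord_s: Sequence[float]) -> List[float]:
    """Per-column speedup of OpenCV over decord (decord time / OpenCV time)."""
    return [round(d / c, 2) for c, d in zip(opencv_s, decord_s)]


# Rows copied from the Xeon 8255C + T4 table, columns 1/3/5/10/120 fps.
cv_1080p_mobile = [0.801, 0.952, 1.213, 2.199, 6.224]
de_1080p_mobile = [1.289, 2.023, 2.817, 4.579, 7.533]

print(speedups(cv_1080p_mobile, de_1080p_mobile))
```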

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process.

gemini-code-assist · 2025-11-19T07:48:44Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

WingEdge777 · 2025-11-20T04:48:13Z

@yhyang201 @JustinTong0323 @mickqian @hnyls2002, could you please take a look at this, thanks.

yhyang201 · 2025-11-20T05:00:34Z

We are currently reviewing it and setting up the video bench methodology. We will validate this PR and merge it as soon as possible!

yhyang201 · 2025-11-21T04:41:50Z

Great work, and thank you for your contribution! We will trigger the CI and proceed with the merge.

WingEdge777 · 2025-11-21T06:09:00Z

Thank you too, for taking the time to review!

add opencv as optional video decode backend

377ba47

WingEdge777 requested review from JustinTong0323, mickqian and yhyang201 as code owners November 19, 2025 07:48

yhyang201 added the run-ci label Nov 21, 2025

yhyang201 mentioned this pull request Nov 21, 2025

[Performance] Replace preprocess_video logic from GLM multimodal processor with transformer impl for speed up (up to 27% faster) and addressing OOM (up to 50x improvements) #13487

Merged

5 tasks

WingEdge777 added 6 commits November 26, 2025 10:43

merge main

b45aaba

merge main

40ca965

merge main

c23181d

update read_video_frames_opencv : more robust frame sampling

8f1afa1

Merge branch 'main' into opt.video_cv2_decoding

e1aff6b

merge main

26a66b6

WingEdge777 force-pushed the opt.video_cv2_decoding branch from c3292f3 to 26a66b6 Compare February 8, 2026 14:02

WingEdge777 closed this Feb 12, 2026

WingEdge777 wants to merge 7 commits into sgl-project:main from WingEdge777:opt.video_cv2_decoding
