Migrate performance, accuracy, and quantization tests to CI registry #17177

Merged: Kangyan-Zhou merged 26 commits into main from ci/migrate-perf-accuracy-quant-tests on Jan 19, 2026
Conversation


@alisonshao alisonshao commented Jan 15, 2026

Summary

  • Add CI registration to quantization tests in test/srt/:

    • quant/test_awq.py (163s) → stage-b-test-small-1-gpu
    • quant/test_marlin_moe.py (200s) → stage-b-test-small-1-gpu
    • test_bnb.py (5s) → stage-b-test-small-1-gpu
    • test_gptqmodel_dynamic.py (102s) → stage-b-test-small-1-gpu
    • test_quantization.py (185s) → stage-b-test-small-1-gpu
    • test_gguf.py (96s) → stage-b-test-small-1-gpu
  • Add CI registration to accuracy tests:

    • test_eval_accuracy_large.py (300s) → stage-b-test-small-1-gpu (1-GPU)
    • test_moe_eval_accuracy_large.py (500s) → stage-b-test-large-2-gpu (TP=2)
  • Split performance tests by GPU requirements:

    • test_bench_serving_1gpu.py → stage-b-test-small-1-gpu
      • Basic LLM throughput/latency, VLM, LoRA, score/embeddings API (5090-compatible)
    • test_bench_serving_1gpu_large.py → stage-b-test-large-1-gpu
      • FP8, EAGLE speculative decoding (needs H200/SM90+)
    • test_bench_serving_2gpu.py → stage-b-test-large-2-gpu
      • MoE (TP=2), Pipeline Parallel (PP=2)
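The registrations above can be sketched as a simple file-to-suite mapping with estimated runtimes. This is a minimal illustration of the idea, not sglang's actual registry API: `CI_REGISTRY` and `register_ci_test` are hypothetical names.

```python
# Hypothetical sketch of per-file CI registration: each test file is recorded
# under a suite together with its estimated runtime, so the scheduler can
# balance jobs. Names here are illustrative, not the real sglang API.
CI_REGISTRY = {}

def register_ci_test(path, suite, est_time_s):
    """Record a test file under a CI suite with its estimated runtime in seconds."""
    CI_REGISTRY.setdefault(suite, []).append({"path": path, "est_time_s": est_time_s})

# A few of the registrations listed above, using the timings from the summary.
register_ci_test("test/srt/quant/test_awq.py", "stage-b-test-small-1-gpu", 163)
register_ci_test("test/srt/test_bnb.py", "stage-b-test-small-1-gpu", 5)
register_ci_test("test/srt/test_moe_eval_accuracy_large.py", "stage-b-test-large-2-gpu", 500)

# Total estimated runtime per suite is what makes partitioning decisions possible.
totals = {s: sum(t["est_time_s"] for t in files) for s, files in CI_REGISTRY.items()}
print(totals["stage-b-test-small-1-gpu"])  # 168
```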

Test plan

  • Verify CI picks up the registered tests
  • Verify tests run on appropriate GPU runners


@github-actions github-actions bot added the `quant` (LLM Quantization) label on Jan 15, 2026
@alisonshao alisonshao force-pushed the ci/migrate-perf-accuracy-quant-tests branch from 6fbfff8 to 836a2bd on January 15, 2026 at 23:07
@alisonshao
Collaborator Author

/tag-and-rerun-ci

Add CI registration to quantization tests:
- quant/test_awq.py (163s) -> stage-b-test-small-1-gpu
- quant/test_marlin_moe.py (200s) -> stage-b-test-small-1-gpu
- test_bnb.py (5s) -> stage-b-test-small-1-gpu
- test_gptqmodel_dynamic.py (102s) -> stage-b-test-small-1-gpu
- test_quantization.py (185s) -> stage-b-test-small-1-gpu
- test_gguf.py (96s) -> stage-b-test-small-1-gpu

Add CI registration to accuracy tests:
- test_eval_accuracy_large.py (300s) -> stage-b-test-small-1-gpu
- test_moe_eval_accuracy_large.py (500s) -> stage-b-test-large-2-gpu

Split performance tests by GPU requirements:
- test_bench_serving_1gpu.py -> stage-b-test-small-1-gpu (5090-compatible)
  - Basic LLM throughput/latency, VLM, LoRA, score/embeddings API
- test_bench_serving_1gpu_large.py -> stage-b-test-large-1-gpu
  - FP8, EAGLE speculative decoding (need H200/SM90+)
- test_bench_serving_2gpu.py -> stage-b-test-large-2-gpu
  - MoE (TP=2), Pipeline Parallel (PP=2)
@alisonshao alisonshao force-pushed the ci/migrate-perf-accuracy-quant-tests branch from 836a2bd to 7b18bf6 on January 15, 2026 at 23:10
- Move quant tests (test_awq, test_marlin_moe, test_bnb, test_gptqmodel_dynamic, test_quantization, test_gguf) to test/registered/quant/
- Move accuracy tests (test_eval_accuracy_large, test_moe_eval_accuracy_large) to test/registered/eval/
- Remove quantization_test suite from test/srt/run_suite.py
- Remove migrated tests from __not_in_ci__ section
- Split test_bench_one_batch.py into 1gpu and 2gpu versions
- Move to test/registered/perf/ with CI registration
- Remove performance-test-* jobs (tests now in stage-b suites)
- Remove accuracy-test-* jobs (tests now in stage-b suites)
- Remove from pr-test-finish needs list
- Create stage-b-test-small-1-gpu-performance job
- Create stage-b-test-large-1-gpu-performance job
- Create stage-b-test-large-2-gpu-performance job
- Create stage-b-test-small-1-gpu-accuracy job
- Create stage-b-test-large-2-gpu-accuracy job
- Update test registrations to use new suite names
- Add new jobs to pr-test-finish needs list
- Update performance test paths to test/registered/perf/
- Update accuracy test paths to test/registered/eval/
- Update class names to match new test file structure
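The new stage-b jobs have to spread registered files across partitions so no single job exceeds its timeout. A plausible way to do that, sketched below, is greedy longest-first bin packing over the estimated runtimes; this is an illustration of the scheduling idea, not the actual `run_suite.py` implementation.

```python
# Illustrative greedy partitioning of registered test files across CI job
# partitions: assign each file, longest first, to the currently lightest
# partition. A sketch only, not sglang's real run_suite.py logic.
def partition_tests(files, num_partitions):
    """files: list of (path, est_time_s) tuples.
    Returns (partitions, loads): per-partition file lists and total seconds."""
    parts = [[] for _ in range(num_partitions)]
    loads = [0] * num_partitions
    for path, est in sorted(files, key=lambda f: -f[1]):
        i = loads.index(min(loads))  # lightest partition so far
        parts[i].append(path)
        loads[i] += est
    return parts, loads

# The quantization files and timings from the PR summary.
files = [("test_quantization.py", 185), ("test_marlin_moe.py", 200),
         ("test_awq.py", 163), ("test_gguf.py", 96), ("test_bnb.py", 5)]
parts, loads = partition_tests(files, 2)
print(sorted(loads))  # [301, 348]
```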
@github-actions github-actions bot added the `amd` label on Jan 16, 2026
alisonshao and others added 5 commits January 16, 2026 13:18
…unner

- Add timeout-minutes: 60 to all CI jobs for safety
- Move test_bench_serving_1gpu.py from stage-b-test-small-1-gpu-performance
  to stage-b-test-large-1-gpu-performance since it requires more GPU memory
  and was timing out on the small GPU runner
- Change runs-on from 1-gpu-runner to 1-gpu-5090
- Add IS_BLACKWELL: "1" environment variable
- Add source /etc/profile.d/sglang-ci.sh to Install and Run steps
The 5090 runner achieves ~100 tok/s but the test expects >135 tok/s.
Moving this test to run on H200 where it passes.

Note: stage-b-test-small-1-gpu-performance suite is now empty.
Add test_vlm_perf_5090.py with VLM offline throughput and online latency
tests registered to stage-b-test-small-1-gpu-performance suite.

These tests passed on 5090 in https://github.com/sgl-project/sglang/actions/runs/21022442069
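The throughput gate discussed above (the 5090 reaching ~100 tok/s against a >135 tok/s expectation) boils down to an assertion on benchmark metrics. A minimal sketch, with the benchmark call stubbed out; the threshold comes from this thread, everything else is illustrative.

```python
# Sketch of the throughput floor that made test_bench_serving_1gpu fail on
# the 5090 (~100 tok/s) but pass on H200. The bench call is a stub here.
import unittest

THROUGHPUT_FLOOR_TOK_S = 135  # the expectation cited in the discussion

def run_bench_serving_stub():
    # Stand-in for a real serving benchmark; returns metrics in the same shape.
    return {"output_throughput_tok_s": 150.0}

class TestServingThroughput(unittest.TestCase):
    def test_throughput_floor(self):
        metrics = run_bench_serving_stub()
        # Fails on a runner that cannot clear the floor, which is why the
        # test moved to a larger GPU rather than lowering the threshold.
        self.assertGreater(metrics["output_throughput_tok_s"], THROUGHPUT_FLOOR_TOK_S)
```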
@github-actions github-actions bot added the `Multi-modal` (multi-modal language model) label on Jan 16, 2026
Move test_awq.py and test_gptqmodel_dynamic.py to stage-b-test-large-1-gpu
suite as they fail on 5090 GPUs.

Reference: https://github.com/sgl-project/sglang/actions/runs/21022442069
Change from non-existent 1-gpu-runner-large to 1-gpu-runner (H200).

Eagle speculative decoding test with max_running_requests=64 causes
KV cache pool to fill up and crash the server on 5090 (32GB VRAM).
Move to H200 (80GB) where it has sufficient memory.
GPTQ-INT4 model scored 0.0 on 5090, indicating it failed to run
properly. Move to H200 where quantization kernels work correctly.

alisonshao and others added 7 commits January 16, 2026 16:05
- Increase Run test timeout from 20 to 30 minutes for
  stage-b-test-large-1-gpu-performance job
- Add performance and accuracy suites to PER_COMMIT_SUITES to fix
  "Unknown suite" warning
- Increase step timeout from 30 to 40 minutes
- Increase per-file timeout from 1200s to 1800s (30 min)

test_bench_serving_1gpu.py was timing out at 97% completion (487/500).
- Split test_bench_serving_1gpu.py into:
  - test_bench_serving_1gpu_part1.py (LLM + LoRA tests, est_time=1000s)
  - test_bench_serving_1gpu_part2.py (VLM + Score + Embeddings, est_time=900s)
- Add 2 partitions to stage-b-test-large-1-gpu-performance job
- This fixes timeout issue where single file took 31min but had 30min timeout
Update test_bench_serving_1gpu → test_bench_serving_1gpu_part1 references
in pr-test-amd.yml to match the split test files.
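The per-file timeout bump (1200s to 1800s) described above amounts to running each registered file under a hard wall-clock limit. A sketch of that enforcement, assuming one subprocess per test file; illustrative only, not the actual runner code.

```python
# Sketch of per-file timeout enforcement: run each registered test file as a
# subprocess with a wall-clock limit, treating a timeout as a failure so CI
# surfaces it. Illustrative; not the real suite runner.
import subprocess
import sys

PER_FILE_TIMEOUT_S = 1800  # 30 min, raised from the earlier 1200s limit

def run_test_file(path, timeout_s=PER_FILE_TIMEOUT_S):
    """Run one test file; return its exit code, or -1 on timeout."""
    try:
        proc = subprocess.run([sys.executable, path], timeout=timeout_s)
        return proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on expiry; report a distinct failure.
        return -1
```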

@Kangyan-Zhou Kangyan-Zhou merged commit 8916b9d into main on Jan 19, 2026 (97 of 101 checks passed)
@Kangyan-Zhou Kangyan-Zhou deleted the ci/migrate-perf-accuracy-quant-tests branch on January 19, 2026 at 07:25
DotSlash-A pushed a commit to DotSlash-A/sglang that referenced this pull request on Jan 19, 2026

Labels: amd, high priority, Multi-modal (multi-modal language model), quant (LLM Quantization), run-ci
