Migrate performance, accuracy, and quantization tests to CI registry #17177

Merged: Kangyan-Zhou merged 26 commits into main from ci/migrate-perf-accuracy-quant-tests on Jan 19, 2026
Conversation


@alisonshao alisonshao commented Jan 15, 2026

Summary

  • Add CI registration to quantization tests in test/srt/:

    • quant/test_awq.py (163s) → stage-b-test-small-1-gpu
    • quant/test_marlin_moe.py (200s) → stage-b-test-small-1-gpu
    • test_bnb.py (5s) → stage-b-test-small-1-gpu
    • test_gptqmodel_dynamic.py (102s) → stage-b-test-small-1-gpu
    • test_quantization.py (185s) → stage-b-test-small-1-gpu
    • test_gguf.py (96s) → stage-b-test-small-1-gpu
  • Add CI registration to accuracy tests:

    • test_eval_accuracy_large.py (300s) → stage-b-test-small-1-gpu (1-GPU)
    • test_moe_eval_accuracy_large.py (500s) → stage-b-test-large-2-gpu (TP=2)
  • Split performance tests by GPU requirements:

    • test_bench_serving_1gpu.py → stage-b-test-small-1-gpu
      • Basic LLM throughput/latency, VLM, LoRA, score/embeddings API (5090-compatible)
    • test_bench_serving_1gpu_large.py → stage-b-test-large-1-gpu
      • FP8, EAGLE speculative decoding (needs H200/SM90+)
    • test_bench_serving_2gpu.py → stage-b-test-large-2-gpu
      • MoE (TP=2), Pipeline Parallel (PP=2)
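The registrations above can be sketched as a simple file-to-suite mapping with estimated runtimes. This is a minimal illustration of the idea, not sglang's actual registry API: `CI_REGISTRY` and `register_ci_test` are hypothetical names.

```python
# Hypothetical sketch of per-file CI registration: each test file is recorded
# under a suite together with its estimated runtime, so the scheduler can
# balance jobs. Names here are illustrative, not the real sglang API.
CI_REGISTRY = {}

def register_ci_test(path, suite, est_time_s):
    """Record a test file under a CI suite with its estimated runtime in seconds."""
    CI_REGISTRY.setdefault(suite, []).append({"path": path, "est_time_s": est_time_s})

# A few of the registrations listed above, using the timings from the summary.
register_ci_test("test/srt/quant/test_awq.py", "stage-b-test-small-1-gpu", 163)
register_ci_test("test/srt/test_bnb.py", "stage-b-test-small-1-gpu", 5)
register_ci_test("test/srt/test_moe_eval_accuracy_large.py", "stage-b-test-large-2-gpu", 500)

# Total estimated runtime per suite is what makes partitioning decisions possible.
totals = {s: sum(t["est_time_s"] for t in files) for s, files in CI_REGISTRY.items()}
print(totals["stage-b-test-small-1-gpu"])  # 168
```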

Test plan

  • Verify CI picks up the registered tests
  • Verify tests run on appropriate GPU runners


@github-actions github-actions bot added the `quant` (LLM Quantization) label on Jan 15, 2026
@alisonshao alisonshao force-pushed the ci/migrate-perf-accuracy-quant-tests branch from 6fbfff8 to 836a2bd on January 15, 2026 at 23:07
@alisonshao
Collaborator Author

/tag-and-rerun-ci

Add CI registration to quantization tests:
- quant/test_awq.py (163s) -> stage-b-test-small-1-gpu
- quant/test_marlin_moe.py (200s) -> stage-b-test-small-1-gpu
- test_bnb.py (5s) -> stage-b-test-small-1-gpu
- test_gptqmodel_dynamic.py (102s) -> stage-b-test-small-1-gpu
- test_quantization.py (185s) -> stage-b-test-small-1-gpu
- test_gguf.py (96s) -> stage-b-test-small-1-gpu

Add CI registration to accuracy tests:
- test_eval_accuracy_large.py (300s) -> stage-b-test-small-1-gpu
- test_moe_eval_accuracy_large.py (500s) -> stage-b-test-large-2-gpu

Split performance tests by GPU requirements:
- test_bench_serving_1gpu.py -> stage-b-test-small-1-gpu (5090-compatible)
  - Basic LLM throughput/latency, VLM, LoRA, score/embeddings API
- test_bench_serving_1gpu_large.py -> stage-b-test-large-1-gpu
  - FP8, EAGLE speculative decoding (need H200/SM90+)
- test_bench_serving_2gpu.py -> stage-b-test-large-2-gpu
  - MoE (TP=2), Pipeline Parallel (PP=2)
@alisonshao alisonshao force-pushed the ci/migrate-perf-accuracy-quant-tests branch from 836a2bd to 7b18bf6 on January 15, 2026 at 23:10
- Move quant tests (test_awq, test_marlin_moe, test_bnb, test_gptqmodel_dynamic, test_quantization, test_gguf) to test/registered/quant/
- Move accuracy tests (test_eval_accuracy_large, test_moe_eval_accuracy_large) to test/registered/eval/
- Remove quantization_test suite from test/srt/run_suite.py
- Remove migrated tests from __not_in_ci__ section
- Split test_bench_one_batch.py into 1gpu and 2gpu versions
- Move to test/registered/perf/ with CI registration
- Remove performance-test-* jobs (tests now in stage-b suites)
- Remove accuracy-test-* jobs (tests now in stage-b suites)
- Remove from pr-test-finish needs list
- Create stage-b-test-small-1-gpu-performance job
- Create stage-b-test-large-1-gpu-performance job
- Create stage-b-test-large-2-gpu-performance job
- Create stage-b-test-small-1-gpu-accuracy job
- Create stage-b-test-large-2-gpu-accuracy job
- Update test registrations to use new suite names
- Add new jobs to pr-test-finish needs list
- Update performance test paths to test/registered/perf/
- Update accuracy test paths to test/registered/eval/
- Update class names to match new test file structure
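The new stage-b jobs have to spread registered files across partitions so no single job exceeds its timeout. A plausible way to do that, sketched below, is greedy longest-first bin packing over the estimated runtimes; this is an illustration of the scheduling idea, not the actual `run_suite.py` implementation.

```python
# Illustrative greedy partitioning of registered test files across CI job
# partitions: assign each file, longest first, to the currently lightest
# partition. A sketch only, not sglang's real run_suite.py logic.
def partition_tests(files, num_partitions):
    """files: list of (path, est_time_s) tuples.
    Returns (partitions, loads): per-partition file lists and total seconds."""
    parts = [[] for _ in range(num_partitions)]
    loads = [0] * num_partitions
    for path, est in sorted(files, key=lambda f: -f[1]):
        i = loads.index(min(loads))  # lightest partition so far
        parts[i].append(path)
        loads[i] += est
    return parts, loads

# The quantization files and timings from the PR summary.
files = [("test_quantization.py", 185), ("test_marlin_moe.py", 200),
         ("test_awq.py", 163), ("test_gguf.py", 96), ("test_bnb.py", 5)]
parts, loads = partition_tests(files, 2)
print(sorted(loads))  # [301, 348]
```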
@github-actions github-actions bot added the `amd` label on Jan 16, 2026
alisonshao and others added 5 commits January 16, 2026 13:18
…unner

- Add timeout-minutes: 60 to all CI jobs for safety
- Move test_bench_serving_1gpu.py from stage-b-test-small-1-gpu-performance
  to stage-b-test-large-1-gpu-performance since it requires more GPU memory
  and was timing out on the small GPU runner
- Change runs-on from 1-gpu-runner to 1-gpu-5090
- Add IS_BLACKWELL: "1" environment variable
- Add source /etc/profile.d/sglang-ci.sh to Install and Run steps
The 5090 runner achieves ~100 tok/s but the test expects >135 tok/s.
Moving this test to run on H200 where it passes.

Note: stage-b-test-small-1-gpu-performance suite is now empty.
Add test_vlm_perf_5090.py with VLM offline throughput and online latency
tests registered to stage-b-test-small-1-gpu-performance suite.

These tests passed on 5090 in https://github.com/sgl-project/sglang/actions/runs/21022442069
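The throughput gate discussed above (the 5090 reaching ~100 tok/s against a >135 tok/s expectation) boils down to an assertion on benchmark metrics. A minimal sketch, with the benchmark call stubbed out; the threshold comes from this thread, everything else is illustrative.

```python
# Sketch of the throughput floor that made test_bench_serving_1gpu fail on
# the 5090 (~100 tok/s) but pass on H200. The bench call is a stub here.
import unittest

THROUGHPUT_FLOOR_TOK_S = 135  # the expectation cited in the discussion

def run_bench_serving_stub():
    # Stand-in for a real serving benchmark; returns metrics in the same shape.
    return {"output_throughput_tok_s": 150.0}

class TestServingThroughput(unittest.TestCase):
    def test_throughput_floor(self):
        metrics = run_bench_serving_stub()
        # Fails on a runner that cannot clear the floor, which is why the
        # test moved to a larger GPU rather than lowering the threshold.
        self.assertGreater(metrics["output_throughput_tok_s"], THROUGHPUT_FLOOR_TOK_S)
```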
@github-actions github-actions bot added the `Multi-modal` (multi-modal language model) label on Jan 16, 2026
Move test_awq.py and test_gptqmodel_dynamic.py to stage-b-test-large-1-gpu
suite as they fail on 5090 GPUs.

Reference: https://github.com/sgl-project/sglang/actions/runs/21022442069
Change from non-existent 1-gpu-runner-large to 1-gpu-runner (H200).

Eagle speculative decoding test with max_running_requests=64 causes
KV cache pool to fill up and crash the server on 5090 (32GB VRAM).
Move to H200 (80GB) where it has sufficient memory.
GPTQ-INT4 model scored 0.0 on 5090, indicating it failed to run
properly. Move to H200 where quantization kernels work correctly.

alisonshao and others added 7 commits January 16, 2026 16:05
- Increase Run test timeout from 20 to 30 minutes for
  stage-b-test-large-1-gpu-performance job
- Add performance and accuracy suites to PER_COMMIT_SUITES to fix
  "Unknown suite" warning
- Increase step timeout from 30 to 40 minutes
- Increase per-file timeout from 1200s to 1800s (30 min)

test_bench_serving_1gpu.py was timing out at 97% completion (487/500).
- Split test_bench_serving_1gpu.py into:
  - test_bench_serving_1gpu_part1.py (LLM + LoRA tests, est_time=1000s)
  - test_bench_serving_1gpu_part2.py (VLM + Score + Embeddings, est_time=900s)
- Add 2 partitions to stage-b-test-large-1-gpu-performance job
- This fixes timeout issue where single file took 31min but had 30min timeout
Update test_bench_serving_1gpu → test_bench_serving_1gpu_part1 references
in pr-test-amd.yml to match the split test files.
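The per-file timeout bump (1200s to 1800s) described above amounts to running each registered file under a hard wall-clock limit. A sketch of that enforcement, assuming one subprocess per test file; illustrative only, not the actual runner code.

```python
# Sketch of per-file timeout enforcement: run each registered test file as a
# subprocess with a wall-clock limit, treating a timeout as a failure so CI
# surfaces it. Illustrative; not the real suite runner.
import subprocess
import sys

PER_FILE_TIMEOUT_S = 1800  # 30 min, raised from the earlier 1200s limit

def run_test_file(path, timeout_s=PER_FILE_TIMEOUT_S):
    """Run one test file; return its exit code, or -1 on timeout."""
    try:
        proc = subprocess.run([sys.executable, path], timeout=timeout_s)
        return proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on expiry; report a distinct failure.
        return -1
```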

@Kangyan-Zhou Kangyan-Zhou merged commit 8916b9d into main on Jan 19, 2026 (97 of 101 checks passed)
@Kangyan-Zhou Kangyan-Zhou deleted the ci/migrate-perf-accuracy-quant-tests branch on January 19, 2026 at 07:25
DotSlash-A pushed a commit to DotSlash-A/sglang that referenced this pull request on Jan 19, 2026

Labels: amd, high priority, Multi-modal (multi-modal language model), quant (LLM Quantization), run-ci
