Migrate performance, accuracy, and quantization tests to CI registry#17177
Merged

Kangyan-Zhou merged 26 commits into main (Jan 19, 2026)
Conversation
Force-pushed 6fbfff8 to 836a2bd
Author (Collaborator):

/tag-and-rerun-ci
Add CI registration to quantization tests:
- quant/test_awq.py (163s) -> stage-b-test-small-1-gpu
- quant/test_marlin_moe.py (200s) -> stage-b-test-small-1-gpu
- test_bnb.py (5s) -> stage-b-test-small-1-gpu
- test_gptqmodel_dynamic.py (102s) -> stage-b-test-small-1-gpu
- test_quantization.py (185s) -> stage-b-test-small-1-gpu
- test_gguf.py (96s) -> stage-b-test-small-1-gpu

Add CI registration to accuracy tests:
- test_eval_accuracy_large.py (300s) -> stage-b-test-small-1-gpu
- test_moe_eval_accuracy_large.py (500s) -> stage-b-test-large-2-gpu

Split performance tests by GPU requirements:
- test_bench_serving_1gpu.py -> stage-b-test-small-1-gpu (5090-compatible)
  - Basic LLM throughput/latency, VLM, LoRA, score/embeddings API
- test_bench_serving_1gpu_large.py -> stage-b-test-large-1-gpu
  - FP8, EAGLE speculative decoding (need H200/SM90+)
- test_bench_serving_2gpu.py -> stage-b-test-large-2-gpu
  - MoE (TP=2), Pipeline Parallel (PP=2)
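The registration scheme above can be sketched in a few lines: each test file declares the suite it belongs to plus an estimated runtime, and the CI driver can then total up each suite. This is a minimal illustrative sketch — `register_test` and `CI_REGISTRY` are hypothetical names, not sglang's actual registry API.

```python
# Hypothetical sketch of a CI test registry. The suite names and timings
# come from this PR; the function/variable names are illustrative only.
CI_REGISTRY = {}

def register_test(path, suite, est_time_s):
    """Record which CI suite a test file runs in and its estimated runtime."""
    CI_REGISTRY.setdefault(suite, []).append((path, est_time_s))

register_test("quant/test_awq.py", "stage-b-test-small-1-gpu", 163)
register_test("quant/test_marlin_moe.py", "stage-b-test-small-1-gpu", 200)
register_test("test_moe_eval_accuracy_large.py", "stage-b-test-large-2-gpu", 500)

# Total estimated time per suite helps decide when a suite needs splitting.
totals = {s: sum(t for _, t in files) for s, files in CI_REGISTRY.items()}
print(totals["stage-b-test-small-1-gpu"])  # 363
```

Keeping an `est_time` per file is what later lets the CI split oversized suites into partitions that fit under the per-job timeout.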
Force-pushed 836a2bd to 7b18bf6
- Move quant tests (test_awq, test_marlin_moe, test_bnb, test_gptqmodel_dynamic, test_quantization, test_gguf) to test/registered/quant/
- Move accuracy tests (test_eval_accuracy_large, test_moe_eval_accuracy_large) to test/registered/eval/
- Remove quantization_test suite from test/srt/run_suite.py
- Remove migrated tests from __not_in_ci__ section
- Split test_bench_one_batch.py into 1gpu and 2gpu versions
- Move to test/registered/perf/ with CI registration
- Remove performance-test-* jobs (tests now in stage-b suites)
- Remove accuracy-test-* jobs (tests now in stage-b suites)
- Remove from pr-test-finish needs list
- Create stage-b-test-small-1-gpu-performance job
- Create stage-b-test-large-1-gpu-performance job
- Create stage-b-test-large-2-gpu-performance job
- Create stage-b-test-small-1-gpu-accuracy job
- Create stage-b-test-large-2-gpu-accuracy job
- Update test registrations to use new suite names
- Add new jobs to pr-test-finish needs list
- Update performance test paths to test/registered/perf/
- Update accuracy test paths to test/registered/eval/
- Update class names to match new test file structure
…unner
- Add timeout-minutes: 60 to all CI jobs for safety
- Move test_bench_serving_1gpu.py from stage-b-test-small-1-gpu-performance to stage-b-test-large-1-gpu-performance, since it requires more GPU memory and was timing out on the small GPU runner
- Change runs-on from 1-gpu-runner to 1-gpu-5090
- Add IS_BLACKWELL: "1" environment variable
- Add source /etc/profile.d/sglang-ci.sh to the Install and Run steps
The 5090 runner achieves ~100 tok/s but the test expects >135 tok/s. Moving this test to run on H200 where it passes. Note: stage-b-test-small-1-gpu-performance suite is now empty.
Add test_vlm_perf_5090.py with VLM offline throughput and online latency tests registered to stage-b-test-small-1-gpu-performance suite. These tests passed on 5090 in https://github.com/sgl-project/sglang/actions/runs/21022442069
Move test_awq.py and test_gptqmodel_dynamic.py to stage-b-test-large-1-gpu suite as they fail on 5090 GPUs. Reference: https://github.com/sgl-project/sglang/actions/runs/21022442069
Change from non-existent 1-gpu-runner-large to 1-gpu-runner (H200).
The EAGLE speculative decoding test with max_running_requests=64 fills the KV cache pool and crashes the server on the 5090 (32 GB VRAM). Move it to the H200 (141 GB), where there is sufficient memory.
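A back-of-the-envelope KV-cache estimate shows why 64 concurrent requests can exhaust a 32 GB 5090. All model parameters below (layer count, KV heads, head dim, sequence length) are assumptions for illustration, not the actual test configuration.

```python
# Rough KV-cache sizing. K and V each store layers * kv_heads * head_dim
# values per token, so total bytes scale linearly with batch and seq_len.
def kv_cache_bytes(layers, kv_heads, head_dim, dtype_bytes, seq_len, batch):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len * batch

# Assumed 8B-class model: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
gib = kv_cache_bytes(32, 8, 128, 2, seq_len=4096, batch=64) / 2**30
print(f"{gib:.1f} GiB")  # 32.0 GiB of KV cache alone at full occupancy
```

Under these assumptions the KV cache alone consumes the 5090's entire 32 GB before weights and activations are counted, so a larger-memory runner (or a smaller max_running_requests) is required.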
GPTQ-INT4 model scored 0.0 on 5090, indicating it failed to run properly. Move to H200 where quantization kernels work correctly.
Author (Collaborator):

CI (need to increase timeout): https://github.com/sgl-project/sglang/actions/runs/21084213109?pr=17177
- Increase Run test timeout from 20 to 30 minutes for the stage-b-test-large-1-gpu-performance job
- Add performance and accuracy suites to PER_COMMIT_SUITES to fix the "Unknown suite" warning
- Increase step timeout from 30 to 40 minutes
- Increase per-file timeout from 1200s to 1800s (30 min)

test_bench_serving_1gpu.py was timing out at 97% completion (487/500).
- Split test_bench_serving_1gpu.py into:
  - test_bench_serving_1gpu_part1.py (LLM + LoRA tests, est_time=1000s)
  - test_bench_serving_1gpu_part2.py (VLM + Score + Embeddings, est_time=900s)
- Add 2 partitions to the stage-b-test-large-1-gpu-performance job
- This fixes the timeout issue where a single file took 31 min against a 30 min timeout
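The split above is an instance of balancing tests across partitions by estimated runtime. A minimal sketch of the idea, greedy longest-first bin packing, is below; the test names and times are illustrative, not the real contents of the file.

```python
# Assign (name, est_seconds) tests to n_parts partitions: take tests
# largest-first and always drop the next one into the lightest partition.
def partition_by_time(tests, n_parts):
    bins = [{"total": 0, "tests": []} for _ in range(n_parts)]
    for name, est in sorted(tests, key=lambda t: -t[1]):
        lightest = min(bins, key=lambda b: b["total"])
        lightest["tests"].append(name)
        lightest["total"] += est
    return bins

# Hypothetical timings for the serving benchmarks being split.
tests = [("llm_throughput", 600), ("lora", 400), ("vlm", 500),
         ("score_api", 200), ("embeddings", 200)]
bins = partition_by_time(tests, 2)
print([b["total"] for b in bins])  # [1000, 900]
```

With two partitions each side stays well under the 1800s per-file timeout, whereas the unsplit file's ~1900s total overran it.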
Update test_bench_serving_1gpu → test_bench_serving_1gpu_part1 references in pr-test-amd.yml to match the split test files.
Author (Collaborator):

CI passed for all stage B: https://github.com/sgl-project/sglang/actions/runs/21092654922?pr=17177
DotSlash-A pushed a commit to DotSlash-A/sglang that referenced this pull request on Jan 19, 2026 (a sync commit including "Migrate performance, accuracy, and quantization tests to CI registry (sgl-project#17177)").
Summary
Add CI registration to quantization tests in test/srt/:
- quant/test_awq.py (163s) → stage-b-test-small-1-gpu
- quant/test_marlin_moe.py (200s) → stage-b-test-small-1-gpu
- test_bnb.py (5s) → stage-b-test-small-1-gpu
- test_gptqmodel_dynamic.py (102s) → stage-b-test-small-1-gpu
- test_quantization.py (185s) → stage-b-test-small-1-gpu
- test_gguf.py (96s) → stage-b-test-small-1-gpu

Add CI registration to accuracy tests:
- test_eval_accuracy_large.py (300s) → stage-b-test-small-1-gpu (1-GPU)
- test_moe_eval_accuracy_large.py (500s) → stage-b-test-large-2-gpu (TP=2)

Split performance tests by GPU requirements:
- test_bench_serving_1gpu.py → stage-b-test-small-1-gpu
- test_bench_serving_1gpu_large.py → stage-b-test-large-1-gpu
- test_bench_serving_2gpu.py → stage-b-test-large-2-gpu

Test plan