[NV] Update Qwen3.5 FP4 B200 SGLang#1018
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
This PR is marked [WIP] — deferring for human review once it's ready.
Overview
The PR modifies two files: the nvidia-master.yaml CI config (adding a tp=2 search space entry and narrowing tp=4 to conc=4 only) and the qwen3.5_fp4_b200.sh benchmark launch script (removing deprecated env exports and conditional TP8 logic, updating server flags including --expert-parallel-size, --enable-symm-mem, --tokenizer-path, and tuning memory/prefill parameters).
Security Risks
No security risks. These are benchmark configuration and shell script changes with no auth, crypto, or data-access implications.
Level of Scrutiny
Low-risk changes — benchmark scripts and CI sweep configs are not production-critical paths. The changes are consistent in intent and follow established patterns in the repo.
Other Factors
The PR title includes [WIP], indicating the author considers it incomplete. No bugs were found. A human should confirm the PR is finished before merging.
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 128 }
- { tp: 4, ep: 1, conc-start: 4, conc-end: 4 }
- { tp: 2, ep: 2, conc-start: 4, conc-end: 128 }
Please change:
- { tp: 2, ep: 2, conc-start: 4, conc-end: 128 }
to: - { tp: 2, ep: 1, conc-start: 4, conc-end: 128 }
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 128 }
- { tp: 4, ep: 1, conc-start: 4, conc-end: 4 }
- { tp: 2, ep: 2, conc-start: 4, conc-end: 128 }
{ tp: 2, ep: 2, conc-start: 4, conc-end: 128 }
to:
{ tp: 2, ep: 1, conc-start: 4, conc-end: 128 }
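As a rough illustration of what a `conc-start`/`conc-end` pair covers, the sketch below expands one search-space entry into a list of concurrency values. The doubling step is an assumption (the thread does not state how the sweep steps between start and end), and `expand_conc` is a hypothetical helper, not part of the repo.

```shell
# Hypothetical helper: list the concurrency values between start and end.
# ASSUMPTION: the sweep doubles at each step; the real step rule is not
# stated in this thread.
expand_conc() {
  start="$1"
  end="$2"
  # Guard against a non-positive start, which would never grow by doubling.
  if [ "$start" -lt 1 ]; then
    echo "conc-start must be >= 1" >&2
    return 1
  fi
  c="$start"
  out=""
  while [ "$c" -le "$end" ]; do
    out="${out:+$out }$c"   # append with a single separating space
    c=$((c * 2))
  done
  printf '%s\n' "$out"
}

expand_conc 4 128   # the tp: 2 entry -> 4 8 16 32 64 128
expand_conc 4 4     # the tp: 4 entry -> 4
```

Under that assumption, the tp: 2 entry covers six concurrency points while the tp: 4 entry contributes a single one.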
@claude - add PR description
Claude finished @hshrivastava-droid's task in 1m 3s
Adding PR Description
PR description has been added with a summary of all changes across the three modified files.
functionstackx left a comment
lgtm. ty for the recipe sgl-project/sgl-cookbook#264
- config-keys:
    - qwen3.5-fp4-mi355x-sglang
  description:
    - "TP2/TP4 seach space exploration for Qwen3.5 fp4 on SGL"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1022
@Ankur-singh one small nit: don't remove whitespace plz or it will break
f5dd427 to ca9fc7d
See the unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25232812945
Summary
Update Qwen3.5 FP4 B200 SGLang benchmark configuration and launch script for improved throughput sweeps and tuned server parameters.
Config changes (.github/configs/nvidia-master.yaml)
- lmsysorg/sglang:v0.5.10.post1-cu130 → lmsysorg/sglang:nightly-dev-20260422-de962f32
- tp: 2, ep: 1, conc: 4-128 sweep for both 1k1k and 8k1k sequence lengths
- tp: 4, ep: 1, conc: 4 as low-latency baseline

Script changes (benchmarks/single_node/qwen3.5_fp4_b200.sh)
- --max-running-requests $CONC (let SGLang auto-manage)
- --max-prefill-tokens and --chunked-prefill-size from 81920 → 16384
- --scheduler-recv-interval: 30 when CONC > 4, 10 otherwise
- --enable-symm-mem, --tokenizer-path, --mem-fraction-static 0.8, --stream-interval 50

Changelog (perf-changelog.yaml)
- qwen3.5-fp4-b200-sglang reflecting all above changes