Closed
Commits
3142 commits
da07b22
[Docs] quick fix delete --enable-dp-attention in sgl-jax (#24052)
JamesBrianD Apr 30, 2026
125f75d
fix(lora): avoid CUDA graph-breaking scalar assignment in seg_indptr …
yushengsu-thu Apr 30, 2026
99c0b62
allow requests with exactly context_len total tokens (#22546)
opherlieber Apr 30, 2026
583929c
fix the compatibility between --moe-dense-tp-size 1 and piecewise cud…
Qiaolin-Yu Apr 30, 2026
b1ef99f
[CI] Remove orphaned test/srt/ascend and test/srt/configs (#24145)
merrymercy Apr 30, 2026
577dbc4
[4/N] Quantization Refactor: AWQ schemes and Kernel call and weight i…
Alisehen Apr 30, 2026
aa74911
[NPU] fix some npu error with OffloaderV2 (#19541)
Hide-on-bushsh Apr 30, 2026
cf4f462
[AMD] Nightly image release for deepseek v4 (#24155)
yctseng0211 Apr 30, 2026
651af06
[Feature] Xiaomi MiMo-V2.5 day0 support (#23811)
Abatom Apr 30, 2026
7bb7f60
ci: add per-host utilization view to runner-utilization report (#24102)
alisonshao Apr 30, 2026
dc395bc
ci: run setup_ld_library_path before install_sglang_kernel (#24141)
alisonshao Apr 30, 2026
71e89e9
[MUSA][19/N] Support qwen series models (#23654)
froststeam Apr 30, 2026
c04b20d
Fix KeyError in prepare_lora_batch when lora_ids contains None (#21974)
erikwijmans Apr 30, 2026
340efca
[sgl-kernel] Prep for torch 2.11 upgrade and switch PyPI default to c…
Kangyan-Zhou Apr 30, 2026
989a161
[Bench] Fix bench_serving missing reasoning_content stream chunks (#2…
JustinTong0323 Apr 30, 2026
2e027b1
chore: bump sgl-kernel version to 0.4.2 (#24170)
sglang-bot Apr 30, 2026
d576715
Fix LFM2 ShortConv Mamba State Indexing (#23975)
hubertlu-tw Apr 30, 2026
f75a8b6
fix: support HybridLinearAttnBackend in TboAttnBackend (#20114)
lawrence-harmonic Apr 30, 2026
cdc4078
ci: add rebase-required mode to check-maintenance action (#23109)
alisonshao Apr 30, 2026
694ef51
Revert "[ci] split stage-c-test-4-gpu-b200 to enable a low-disk runne…
alisonshao Apr 30, 2026
e45b8ec
[CI] Publish nightly sglang wheel under both cu129 and cu130 indexes …
Kangyan-Zhou Apr 30, 2026
c5f1339
Revert "ci: add rebase-required mode to check-maintenance action" (#2…
Kangyan-Zhou Apr 30, 2026
8a9e424
Replace hardcoded CUDA device with get_device() for XPU support (#13599)
kalyank007 Apr 30, 2026
9c5cad3
Use device-agnostic helpers for Mamba tests and core ops (#20234)
roopaksrivastav Apr 30, 2026
918f910
ci: temporarily disable multimodal-gen-test-1-b200 (#24174)
alisonshao Apr 30, 2026
8b23d32
feat: implement workflow to sync LMSYS SGLang blog (#23438)
zijiexia Apr 30, 2026
e35ac95
[Test] Add XPU device support to unit tests (#22236)
singhalshubham03 Apr 30, 2026
da7f890
[Intel GPU] Integrate flash_mla_decode in Intel XPU attention backend…
polisettyvarma Apr 30, 2026
f672925
spec: gate dp mlp sync with server args (#24177)
happierpig Apr 30, 2026
5f88c85
[Misc] Redirect default sglang nightly wheel to cuda 130 (#24183)
Fridge003 Apr 30, 2026
cf346bb
[CI] Skip worker-dependent SMG e2e tests pending runner-image debug (…
Kangyan-Zhou Apr 30, 2026
1742bfb
[CI] release-whl-kernel: skip musa wheels in update_kernel_whl_index …
Fridge003 May 1, 2026
0acc569
[Bench] extend MMMU answer extractor with explicit-commit patterns (#…
AgainstEntropy May 1, 2026
108bfd8
[MoE] Add Aiter MoE runner backend and purge aiter.fused_moe from qua…
ch-wan May 1, 2026
9d84268
[diffusion] refactor: introduce component residency manager (#23771)
mickqian May 1, 2026
9d95783
Add Docker image provenance metadata (#24090)
alec-flowers May 1, 2026
357e378
ci: limit nightly test parallelism to 1 job per hardware type (#23314)
alisonshao May 1, 2026
7bc7775
[CI] Fix stage-b-test-4-gpu-b200 silently skipped, hanging wait-for-s…
Fridge003 May 1, 2026
8975479
[LoRA][MOE] Fix EP correctness in MoE LoRA slicing and virtual-expert…
jybsuper May 1, 2026
6fa499c
[diffusion] CI: remove parametrized-only from diffusion PR test (#24202)
mickqian May 1, 2026
a578bf8
[AMD] fix moriep unittest failure (#24205)
billishyahao May 1, 2026
87dad74
Add benchmark/hicache/bench_warm_cache.py for exact warm-cache shared…
clintg6 May 1, 2026
d9e8a4a
[SWA] Ensure we use pre-computed SWA cache location during prefill (#…
merrymercy May 1, 2026
d48095b
Bypass torch.cuda.use_mem_pool generator-CM in SymmetricMemoryContext…
ch-wan May 1, 2026
7e17a7a
Upd: AITER->(#2879)a6bb499 (#24218)
HaiShaw May 1, 2026
4197c55
[spec decoding] add tests for chain-style multi layer eagle + return_…
Qiaolin-Yu May 1, 2026
5b7ce41
[P/D disagg] - support decode side radix cache (#19746)
ishandhanani May 1, 2026
4a50cd7
[BugFix][HiMamba] Fix host-protected node deletion in HiMamba tombsto…
icepoint666 May 1, 2026
ece8a1a
Refactor device timer, clean up metrics collector, and add fwd occupa…
merrymercy May 1, 2026
c7566a8
Add CI permissions for user luccafong, 842974287 (#24238)
merrymercy May 1, 2026
b58fa60
[core/attention] Add SGLANG_FLASHINFER_USE_PAGED env to force paged w…
luccafong May 1, 2026
79bc250
[Bug Fix] Resolve EAGLE cuda graph IMA under PD + DP + MTP with GLM-5…
zRzRzRzRzRzRzR May 1, 2026
8a53046
[Bug] Size mamba mappings from req pool, not mamba pool (#24244)
hnyls2002 May 1, 2026
05de73e
[core/model] Use explicit model arch for Llama4 attention backend aut…
luccafong May 1, 2026
b47fab6
[bugfix] Support MIXED forward mode in TBO splitter for DP attention …
ch-wan May 1, 2026
cb8fbd5
Reserve slot 0 as padding in all req pools (#24243)
hnyls2002 May 1, 2026
193b977
[diffusion] chore: clean scheduler (#24229)
mickqian May 2, 2026
bfccc8e
Allow configuring NIXL backend parameters from env (#24169)
aurickq May 2, 2026
4c2ed9a
Flux2 nvfp4 quantization correctness on Blackwell (B200) (#23625)
Johnsonms May 2, 2026
321298d
[SKILL] Upgrade sglang profile and auto_benchmark skills (#24250)
BBuf May 2, 2026
d41e8c4
Support RunAI loading for quantized checkpoints (#23850)
sshleifer May 2, 2026
63f225c
[session] fix mamba pool leak in StreamingSession.release_session + p…
sshleifer May 2, 2026
b7d4647
[diffusion] CI: change ground truth repo (#24219)
mickqian May 2, 2026
674a80d
[CI] Add 1-gpu-h100-h200 to rerun-test runner_label choices (#24264)
Kangyan-Zhou May 2, 2026
cd27baa
[ci][cu13] Bump torch_memory_saver to 0.0.9.post1; restore manual tes…
Kangyan-Zhou May 2, 2026
5ec3b26
[diffusion] model: support JoyAI-Image-Edit (#22625)
lahmuller May 2, 2026
b939d54
[AMD] enable sdma for moriep unittest (#24259)
billishyahao May 2, 2026
2e72a36
[CI] Restore SMG e2e on 2-gpu-h100 / 4-gpu-h100 runners (#24222)
Kangyan-Zhou May 2, 2026
f3dbadb
fix: accept 0-indexed safetensors shard names in CI weight validator …
alisonshao May 2, 2026
589f90b
[diffusion] chore: use lmsys as org for modelopt checkpoints (#23924)
BBuf May 2, 2026
3259a2c
Encode routed_experts in the detokenizer, off the tokenizer hot path …
hnyls2002 May 2, 2026
ebbaab5
[NPU] Add GitHub test summary and deduplicate test code. Part 1 (#23835)
e-martirosian May 2, 2026
83bf5d6
[NPU]TP Communications compression For Qwen3 models for NPU (#20520)
egvenediktov May 2, 2026
1360848
Optimize large GroupNorm SiLU apply (#23938)
BBuf May 2, 2026
b712dd4
[codex] diffusion: enable group norm silu fuse by default (#23148)
BBuf May 2, 2026
4128f1f
[SKILLS] Tiny upgrade diffusion skills (#24273)
BBuf May 2, 2026
e0474fd
throw ValueError for DoRA adapters (#22125)
glenliu21 May 2, 2026
d74d9bd
[gateway] Align /v1/loads and /model_info with sglang server; drop de…
Kangyan-Zhou May 2, 2026
24a6b30
[CI] drop --prerelease allow from uv pip install suffix (#24265)
Kangyan-Zhou May 2, 2026
88bb5df
[Dependency] Upgrade to Torch 2.11.0 (#21247)
b8zhong May 2, 2026
200944b
Update kernel installation instructions after shifting default cuda t…
Fridge003 May 2, 2026
76b9c8d
[Feature] add LoRADrainer to address high P99 TTFT (#17913)
glenliu21 May 2, 2026
266d994
[CI] Fix nightly NV jobs cancelling each other via shared concurrency…
Kangyan-Zhou May 3, 2026
b7d8ceb
[CI] Keep custom sgl-kernel wheel in CUDA CI (#24291)
mmangkad May 3, 2026
fcc8b7b
Rename SGLANG_USE_JIT_ALL_REDUCE to SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_…
hnyls2002 May 3, 2026
8b12367
[diffusion] CI: add diffusion GT generation (official implementation)…
mickqian May 3, 2026
2bfc5d3
[diffusion] optimize LTX2.3 HQ denoising split passes (#24298)
mickqian May 3, 2026
44ca2d0
[pd]: (Bug Fix) Incorrect out_cache_loc slicing in prepare_for_prebui…
hzh0425 May 3, 2026
e37f46f
[NPU] Fix Z-Image negative-branch rotary embeddings for CFG (#23538)
gxxx-hum May 3, 2026
c0f5950
[UnifiedRadixTree]: Support HiCache Framework for UnifiedRadixTree (…
hzh0425 May 3, 2026
5925572
[diffusion] CI: switch CI data references to sgl-project/ci-data (#24…
mickqian May 3, 2026
f2d1390
[Diffusion] Add Qwen Image ModelOpt FP8 support (#23155)
BBuf May 3, 2026
62265ca
[diffusion] feat: initial support for dynamic batching (#18764)
qimcis May 3, 2026
9a5450a
[PD]: Support incremental transfer for mooncake transfer engine (#24257)
hzh0425 May 3, 2026
53df43d
rerun-test: route deepep h200 suite to deepep runner (#24325)
hnyls2002 May 3, 2026
c3b6d20
Register deepseek_v32 alias instead of rewriting config.json (#24295)
hnyls2002 May 3, 2026
00d620b
introduce arg_groups/ with nemotron_h hook (#24328)
hnyls2002 May 3, 2026
c611a3f
[diffusion] chore: disable VAE cpu offload by default (#24315)
mickqian May 4, 2026
b7fefc0
feat(lora): enable csgmv backend with virtual experts for MoE LoRA (#…
yushengsu-thu May 4, 2026
91fa234
extract adjust_hybrid_swa_layers_for_pp (#24334)
hnyls2002 May 4, 2026
1dd8f6d
dedup state_kv_args setup into helper (#24340)
hnyls2002 May 4, 2026
a91ae6a
nextn subclass owns post_load_weights is_nextn (#24333)
hnyls2002 May 4, 2026
c2d60c3
chore(codeowners): add @kpham-sgl as codeowner for ngram files (#24349)
kpham-sgl May 4, 2026
aea527a
Fix swa chunk req deferred (#24318)
ispobock May 4, 2026
5eff3c4
[AMD] Deepseek v4 Flash / Pro nightly tests for MI35x ROCm 7.2 (#24203)
bingxche May 4, 2026
52b4609
[Docker] Prep for torch 2.11: cu129 fix, image validator, dep cleanup…
Kangyan-Zhou May 4, 2026
de08c80
Add release workflow for sgl-deep-gemm wheels (#24348)
Fridge003 May 4, 2026
ef2b1b6
Fix flashinfer workspace OOM (#24172)
kpham-sgl May 4, 2026
952b3ca
feat: use structural tags to enable strict tool calling and reasoning…
Seven-Streams May 4, 2026
84f3b44
[tiny] misc cleanups across configs, attention, jit_kernel (#24350)
hnyls2002 May 4, 2026
05aed5e
[UnifiedRadixTree]: Add KL accuracy CI for UnifiedTree with HiCache (…
hzh0425 May 4, 2026
4b6d446
[diffusion] chore: enable channels-last 3D VAE convs by default (#23200)
BBuf May 4, 2026
62a4df0
[docker] Fix silently-masked cubin download failure; skip prebuilt cu…
mispa-ms May 4, 2026
1be3163
[diffusion] fix: use direct all-to-all for USP collectives (#24366)
mickqian May 4, 2026
e5c58eb
[VLM] Optimize Gemma4 VLM with PCG and fuse RMSNorm + residual add + …
yuan-luo May 4, 2026
c545a5b
[SMG][CI] Add K8s integration tests + wire into pr-test-rust (#24278)
Kangyan-Zhou May 4, 2026
e6f252e
Cache FlashInfer autotune configs (#24156)
sshleifer May 4, 2026
6dd7aeb
Minor scheduler fixes (#24359)
ispobock May 4, 2026
8ffd39e
[CI] Exclude flaky h20 stage from check-stage-health root cause set (…
Kangyan-Zhou May 4, 2026
60a1dac
[HiCache] return cached_tokens_details in sglext for streaming respon…
vladnosiv May 4, 2026
0f283a5
[Docker] fix: install nixl stub alongside nixl-cuXX binary (#24369)
Kangyan-Zhou May 4, 2026
e93bb63
[docs]update sm75 to sm80 min gpu support (#24336)
ppraneth May 4, 2026
29dd3a3
Refactor device timer installation and rename prefill prealloc to boo…
merrymercy May 4, 2026
f059e02
Fix sgl-deep-gemm release workflow (#24385)
Fridge003 May 4, 2026
4743cf6
misc: add marlin to moe runner choices; drop dead env var doc (#24384)
hnyls2002 May 4, 2026
d7c93e1
[sgl] reduce specdec cpu overhead (#23321)
2022tgoel May 4, 2026
6a62eab
consolidate NSA pool construction (#24389)
hnyls2002 May 4, 2026
4b487ca
[Fix] NGRAMWorker.update_weights_from_tensor — delegate to target wor…
stargazerZJ May 4, 2026
078f84d
[SKILL] Add diffusion benchmark presets for edit and Hunyuan3D models…
BBuf May 5, 2026
2f7d99b
[diffusion] cli: support component attention backend overrides (#24320)
mickqian May 5, 2026
2b769d3
(2/n - prefill optimize)perf(lora): remove GPU-CPU sync barrier (.ite…
yushengsu-thu May 5, 2026
e483e60
[diffusion] CI: pin diffusion consistency GT revision (#24400)
mickqian May 5, 2026
04926e1
[diffusion] feat: cache encoder results for default negative prompt (…
mickqian May 5, 2026
6279aee
[docs] Update B300 Pro cookbook with accuracy-verified serving config…
yhyang201 May 5, 2026
244531b
[AMD] Add Kimi-K2.6 in nightly tests for MI30x and MI35x (#23848)
michaelzhang-ai May 5, 2026
c2db19f
[AMD] Enable EAGLE speculative decoding for Qwen3.5 FP8 and MXFP4 mod…
hubertlu-tw May 5, 2026
6461354
[AMD] fix(docker): unbreak nightly when archive.ubuntu.com:80 is unre…
yctseng0211 May 5, 2026
177babc
[diffusion] optimize: fuse LTX2 split rotary embedding (#24411)
mickqian May 5, 2026
e0af36a
[Fix] Fix pypi release workflow (#24417)
Fridge003 May 5, 2026
80ccb6b
[AMD] fix tbo specv2 seq_lens_cpu NoneType error (#24319)
billishyahao May 5, 2026
8c703f2
Add HunyuanVideo ModelOpt FP8 diffusion support (#23199)
BBuf May 5, 2026
67e8bd7
[codex] Optimize Helios fused norm modulation (#24059)
BBuf May 5, 2026
9fb9a1c
[sgl] expose swa and mamba cache metrics (#24396)
bixue2010 May 5, 2026
d228534
Fix deterministic inference on models with `SWAKVPool` (#24395)
kpham-sgl May 5, 2026
e299ec1
[UnifiedRadixTree]: Fix flaky ci (#24421)
hzh0425 May 5, 2026
fdfc46f
[Intel GPU] Enable DeepSeek V3.2 inference on XPU (#24356)
polisettyvarma May 5, 2026
cc54d8e
[diffusion] chore: clean CUDA cache only at explicit release points (…
mickqian May 5, 2026
932d896
Gemma4-mtp cookbook (#24433)
kpham-sgl May 5, 2026
d23ef40
[diffusion] fix: fix RowParallel LoRA merged forwarding (#24410)
mickqian May 5, 2026
8d1b6f0
Add zRzRzRzRzRzRzR to CI permissions (#24432)
JustinTong0323 May 5, 2026
c4c0376
consolidate routed-experts capturer onto reusable base (#24403)
hnyls2002 May 5, 2026
e0faed8
[Docs] Add B200, GB200, GB300 NVIDIA hardware platform support for Ki…
zijiexia May 5, 2026
83b48fd
[codex] update Nemotron3 Nano Omni cookbook benchmarks (#23998)
zijiexia May 5, 2026
47a416f
add indexer-topk capture (V3.2 NSA + infra) (#24392)
hnyls2002 May 5, 2026
08d4c20
move topk capturers to srt/state_capturer/ (#24450)
hnyls2002 May 5, 2026
6764155
chore: bump sgl-kernel version to 0.4.2.post1 (#24457)
sglang-bot May 5, 2026
431ca54
[fix] /pause_generation and /continue_generation wrong for --tokenize…
maocheng23 May 5, 2026
710fed1
Revert "[fix] /pause_generation and /continue_generation wrong for --…
merrymercy May 5, 2026
1e404af
fix(req_pool): bump pool.size to match actual tensor row count after …
JustinTong0323 May 5, 2026
46bde1f
Add fwd_occupancy metric to SchedulerStats and Prometheus collector (…
merrymercy May 6, 2026
64f80ea
Register aten::rms_norm and aten::mm.dtype in batch invariant mode (#…
merrymercy May 6, 2026
22cf7d2
[Fix] Handle nixlRemoteDisconnectError in NixlKVSender (#24296)
cctry May 6, 2026
b91b05a
Add --random-input-len to send_one.py (#24464)
merrymercy May 6, 2026
a965f88
[misc] update CI_PERMISSIONS.json (#24468)
Qiaolin-Yu May 6, 2026
c7019ff
[NIXL][XPU] Use np.uint64 for pointer/length arrays in disaggregation…
Jianhong-Zhang May 6, 2026
d7385b5
[Diffusion] Optimize Hunyuan3D shape denoising (#24287)
BBuf May 6, 2026
b2420d7
[RL] DeepEP support for `--enable-return-routed-experts` (#16859)
PrinsYin May 6, 2026
3da8790
[HiSparse] Support FP8 KV cache by routing to flashmla_kv backend (#2…
whybeyoung May 6, 2026
660a77f
Silence noisy health-check race log in TokenizerManager (#24466)
cctry May 6, 2026
fbebfde
[diffusion] fix: fix diffusion FSDP sharding (#24431)
mickqian May 6, 2026
094b90b
ci: drop 1-gpu-h100-h200 shared label (#24495)
alisonshao May 6, 2026
2317222
chore: Gitignore Claude scheduled_tasks.lock (#24505)
fzyzcjy May 6, 2026
61104d7
Add prefixed _log helper in dumper (#24506)
fzyzcjy May 6, 2026
58487e6
Support cross-system tensor grafting in dumper (#24507)
fzyzcjy May 6, 2026
9a65f0a
Support t2b direction and overlap protection in dumper grafter (#24508)
fzyzcjy May 6, 2026
ebd64f5
Support user-supplied recv-side transform in dumper grafter (#24509)
fzyzcjy May 6, 2026
833279e
Support multi-rank exchange via all_gather_object in dumper grafter (…
fzyzcjy May 6, 2026
75943cf
Support per-call extras and dataclass transform input in dumper graft…
fzyzcjy May 6, 2026
8527db0
Enhance diff and tensor-info logging in dumper grafter (#24512)
fzyzcjy May 6, 2026
a858fda
Add e2e test with log snapshot in dumper grafter (#24513)
fzyzcjy May 6, 2026
c8bc235
Refactor: decouple segment tracking from comm registration (#21392)
wangfakang May 6, 2026
b67df7c
[Codex] Diffusion handle non-contiguous CFG communication (#24332)
BBuf May 6, 2026
7ec18f7
[Doc] Fix instruction on Cuda 13 environments (#24516)
Fridge003 May 6, 2026
11b0e51
Fix lint (#24520)
fzyzcjy May 6, 2026
163bf1b
[PD] Fix KV transfer metrics (#24416)
cctry May 6, 2026
d4d4b04
[PD] Fix missing update_status call in abort() across all KV backends…
merrymercy May 6, 2026
864f963
Cherry pick weight_checker fp8 dequant fix and non-persistent buffer …
fzyzcjy May 6, 2026
0d40931
Cherry pick weight_checker non-persistent buffer pattern list from #2…
fzyzcjy May 6, 2026
491051c
Cherry pick weight_checker `_weight_fp32` buffer skip from #22663 (#2…
fzyzcjy May 6, 2026
eb5f0fb
Support swa HiCache for unified radix cache (#23391)
ispobock May 6, 2026
800deaa
Add unit and end-to-end tests for weight checker (#24536)
fzyzcjy May 6, 2026
ae5ae84
Refactor buffer patterns in weight checker (#24538)
fzyzcjy May 6, 2026
c4c5541
Support getting checksums in weight checker (#24537)
fzyzcjy May 6, 2026
bfc1aea
[CP] Register KV cache allgather buffer with symmetric memory (#24040)
wangfakang May 6, 2026
32d9998
[PD] Prevent update_status to Failed from cleared entries (#24539)
ShangmingCai May 6, 2026
d86f291
Fix diffusion fallback guards and validation (#23335)
BBuf May 6, 2026
b859f7f
Improve metrics, observability, and PD deploy tooling (#24521)
merrymercy May 6, 2026
bc70488
[CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (…
fortunecookiee May 6, 2026
e72246c
ci: bump test_mimo_models.py est_time 330 → 610 (#24551)
alisonshao May 6, 2026
ece7e95
[LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_h…
gh1595 May 6, 2026
9e1336d
[Misc] Fix breaking weight checker test (#24553)
Fridge003 May 6, 2026
a9a8b20
[codex] Optimize Z-Image packed QKV (#24117)
BBuf May 6, 2026
7fa4891
Expand support matrix for pypi wheel release (#24565)
Fridge003 May 7, 2026
2e642ea
[diffusion] chore: align LTX-2 with official (#24313)
mickqian May 7, 2026
3fe8bc9
Support Triton MLA FP8 KV cache (#20479)
b8zhong May 7, 2026
27445f9
Add ChatCompletionRequest-style support to /v1/tokenize (#23981)
huangtingwei9988 May 7, 2026
4a279d9
[R3] Avoid implicit CUDA sync in routed experts DP slicing (#24550)
zyzshishui May 7, 2026
eaf074d
propagate pytest exit code from test __main__ entries (#24487)
hnyls2002 May 7, 2026
ecb786c
[Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-…
Fridge003 May 7, 2026
65ce996
[Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, …
junliu-mde May 7, 2026
9ba4aac
Fix weight_checker e2e OOM on 32GB GPU + move to nightly (#24559)
fzyzcjy May 7, 2026
8bb0ce1
Add bypass-fastfail label for check-stage-health (#24577)
hnyls2002 May 7, 2026
d363315
fix(router): make HTTP pool idle timeout configurable (#24329)
revanthreddy-hai May 7, 2026
f0368a6
[LoRA] Use deterministic lora_id for --lora-paths so multi-node ranks…
jybsuper May 7, 2026
a2586f1
[CI] pin NeMo-Skills install to known-good SHA in accuracy_test_runne…
Kangyan-Zhou May 7, 2026
9dfb1d2
[Intel GPU] Fix flash_mla_get_workspace_size call in intel_xpu (#24372)
polisettyvarma May 7, 2026
684638e
Fix prefill batch iter logging under overlap (#20845)
weireweire May 7, 2026
a6dc495
Let bypass-fastfail label skip stage-to-stage wait (#24598)
hnyls2002 May 7, 2026
92f281f
[Spec][trtllm] use decode kernel for draft extend (#24566)
hanming-lu May 7, 2026
3b2c730
[AMD] Enable dual-stream MoE on ROCm (#24005)
inkcherry May 7, 2026
e264b57
[PD] Centralize per-room cleanup in common backend (#24601)
ShangmingCai May 7, 2026
ec45603
[Bug Fix] Preserve decode state across retract-resume of GLM-5.1 (#23…
zRzRzRzRzRzRzR May 7, 2026
811d138
Nixl async transfer (#23967)
ovidiusm May 7, 2026
263cb3b
[LoRA] Torch Native enhancement: embedding and graph optimization (#2…
vlserov May 7, 2026
9c41b10
[diffusion] refactor: refactor cfg parallelism framework to support m…
ykcai-daniel May 7, 2026
c188790
[AMD] Route PR multimodal tests to MI325 (#24614)
yctseng0211 May 7, 2026
b0225a6
[diffusion] optimize: precompute LTX2 guidance perturbation states (#…
mickqian May 7, 2026
7d397ad
[NPU]Support model Trinity-mini for Npu, accuracy 90% (#18172)
McZyWu May 7, 2026
80a6014
:sparkles: [diffusion][npu][quant] Add MXFP8 quantization support for…
TallMessiWu May 7, 2026
90a618e
[NPU][diffusion] add selectable parallel VAE decode strategies (#23248)
gxxx-hum May 7, 2026
be088f8
fix(router): configure HTTP client connection settings (#24330)
revanthreddy-hai May 7, 2026
9cffa5e
[MUSA] Bump torchada to 0.1.54 (#24592)
yeahdongcn May 7, 2026
f1395af
fix(openai): map reasoning.enabled to thinking AND enable_thinking (#…
JustinTong0323 May 7, 2026
d2c1034
[Gemma 4] Adding MTP support (#24436)
kpham-sgl May 7, 2026
d8f9d32
feat(reasoning): auto-detect reasoning/tool-call parser from chat tem…
JustinTong0323 May 7, 2026
af2a2ac
fix(function_call): handle Kimi-K2.5 bare numeric tool call IDs (#23950)
JustinTong0323 May 7, 2026
5b589ed
feat(constrained): two-phase reasoning grammar + --enable-strict-thin…
JustinTong0323 May 7, 2026
c2c5706
fix(http): apply SGLANG_TIMEOUT_KEEP_ALIVE in common.py (#24323)
revanthreddy-hai May 7, 2026
95fb722
Add registry for custom speculative algorithms (#23991)
hnyls2002 May 7, 2026
c4bb3ce
Fix stuck when enabling MTP on DSA models (#24635)
Fridge003 May 8, 2026
3c3f0bd
Cache empty MatchResult in RadixCache (#24470)
merrymercy May 8, 2026
55224ff
Add Arm64 CPU Phase 1A CI bootstrap (#22123)
ranimandepudi May 8, 2026
35870d5
Deepseek V4 (#23882)
hnyls2002 May 8, 2026
d9dddd4
[SPEC V2][2/N] feat: adaptive spec support spec v2 (#23336)
alphabetc1 May 8, 2026
5fa3bb2
Enable `flashinfer::trtllm_allreduce_fusion` with PDL (#23765)
b8zhong May 8, 2026
461bc8a
[NPU][Doc] Update GLM-5 docs, enabling deepep by default (#23708)
cen121212 May 8, 2026
15e6572
[MUSA][18/N] Add MUSA-optimized kernel implementations for hot ops (#…
Joey-gvwal May 8, 2026
aaf0803
feat: SM120 (Blackwell Desktop) DSv4-Flash support — rebased on main
AliceChenyy May 8, 2026
5858131
fix: add backend positional arg to flash_mla_with_kvcache_entrypoint …
AliceChenyy May 8, 2026
e7c7a53
fix(sm120): route DSv4 MXFP4 MoE to SM120 Triton fallback
AliceChenyy May 8, 2026
The diff you're trying to view is too large. We only load the first 3000 changed files.
630 changes: 630 additions & 0 deletions .claude/skills/add-jit-kernel/SKILL.md

Large diffs are not rendered by default.

367 changes: 367 additions & 0 deletions .claude/skills/add-sgl-kernel/SKILL.md
@@ -0,0 +1,367 @@
---
name: add-sgl-kernel
description: Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)
---

# Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight)

This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.

## Goal

Add a new operation that scales each element of a tensor by a scalar factor:

- Input: tensor `x` (CUDA) and scalar `factor` (float)
- Output: `x * factor` (element-wise, in-place or into pre-allocated `out`)
- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
- Dispatched via `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro (defined in `sgl-kernel/include/utils.h`)
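
As a point of reference for the semantics listed above, the following pure-Python sketch models what the op is expected to compute for FP16 storage: multiply in single precision, then round the result back to the storage dtype. It uses only the standard library (`struct` format `e` is IEEE half precision). The helper names `to_fp32`, `to_fp16`, and `scale_reference_fp16` are illustrative assumptions and are not part of `sgl-kernel`.

```python
import struct
from typing import List


def to_fp32(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def scale_reference_fp16(values: List[float], factor: float) -> List[float]:
    """Hypothetical model of scale(x, factor) for FP16 tensors."""
    f32_factor = to_fp32(factor)
    out = []
    for v in values:
        stored = to_fp16(v)                     # value as held in an FP16 tensor
        product = to_fp32(stored * f32_factor)  # multiply in single precision
        out.append(to_fp16(product))            # round back to the storage dtype
    return out


if __name__ == "__main__":
    print(scale_reference_fp16([1.0, -2.5, 0.25], 0.5))  # [0.5, -1.25, 0.125]
```

A model like this can help when choosing test tolerances, because it makes the rounding step on store explicit.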

## Rules of thumb (must follow)

1. **Prefer `python/sglang/jit_kernel` first** when the kernel does **not** depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.
2. **Prefer `sgl-kernel`** when the kernel **does** depend on CUTLASS or another large C++ project, or when it should be part of the AOT wheel / torch op registration flow.
3. **Exception**: if the dependency is `flashinfer`, or CUTLASS that is already provided through `flashinfer`, the kernel can still be implemented as `jit_kernel`.
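
The rules above can be read as a small decision procedure. The sketch below is one possible encoding, assuming four yes/no questions about the kernel; the function name `choose_kernel_home` and its parameters are hypothetical and exist only for illustration.

```python
def choose_kernel_home(
    depends_on_cutlass: bool = False,
    depends_on_large_cpp_project: bool = False,
    dependency_provided_by_flashinfer: bool = False,
    needs_aot_wheel_or_torch_op: bool = False,
) -> str:
    """Return the directory a new kernel would most likely live in."""
    heavy = depends_on_cutlass or depends_on_large_cpp_project
    # Exception: flashinfer, or CUTLASS shipped through flashinfer,
    # does not by itself force the AOT path.
    if heavy and dependency_provided_by_flashinfer:
        heavy = False
    if heavy or needs_aot_wheel_or_torch_op:
        return "sgl-kernel"
    # Default path for lightweight kernels.
    return "python/sglang/jit_kernel"


if __name__ == "__main__":
    print(choose_kernel_home())                         # python/sglang/jit_kernel
    print(choose_kernel_home(depends_on_cutlass=True))  # sgl-kernel
```

The sketch simplifies the case where a kernel has several heavy dependencies and only some of them come from `flashinfer`; that case still needs human judgment.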

In addition, every new kernel must ship with:

- **Tests** (pytest)
- **A benchmark script** (triton.testing)

---

## Repository integration map

You will typically touch these files/areas:

- Implementation: `sgl-kernel/csrc/elementwise/scale.cu` (pick the right subdirectory)
- Public declarations: `sgl-kernel/include/sgl_kernel_ops.h`
- Torch extension registration: `sgl-kernel/csrc/common_extension.cc`
- Build: `sgl-kernel/CMakeLists.txt` (`set(SOURCES ...)`)
- Python API: `sgl-kernel/python/sgl_kernel/` and `sgl-kernel/python/sgl_kernel/__init__.py`
- Tests: `sgl-kernel/tests/test_scale.py`
- Benchmarks: `sgl-kernel/benchmark/bench_scale.py`
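
Before opening a PR it can be handy to confirm that every file from this map exists. The following is a minimal sketch of such a check, assuming it is run against a local checkout; `INTEGRATION_FILES` and `missing_integration_files` are hypothetical names, and the list covers only the `scale` example from this tutorial.

```python
from pathlib import Path
from typing import List

# Paths taken from the integration map above (scale example only).
INTEGRATION_FILES = [
    "sgl-kernel/csrc/elementwise/scale.cu",
    "sgl-kernel/include/sgl_kernel_ops.h",
    "sgl-kernel/csrc/common_extension.cc",
    "sgl-kernel/CMakeLists.txt",
    "sgl-kernel/python/sgl_kernel/__init__.py",
    "sgl-kernel/tests/test_scale.py",
    "sgl-kernel/benchmark/bench_scale.py",
]


def missing_integration_files(repo_root: str) -> List[str]:
    """Return the expected files that are absent under repo_root."""
    root = Path(repo_root)
    return [rel for rel in INTEGRATION_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_integration_files("."):
        print("missing:", rel)
```

The check only verifies that files exist. It does not verify that the declaration, registration, or `SOURCES` entry was actually added inside them.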

---

## Step 1: Implement the kernel in `csrc/`

Pick the right subdirectory:

- `csrc/elementwise/` — for element-wise ops (our example)
- `csrc/gemm/`, `csrc/attention/`, `csrc/moe/` — for other categories

Create `sgl-kernel/csrc/elementwise/scale.cu`:

```cpp
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>

#include "utils.h"  // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16

// scale_kernel: out[i] = input[i] * factor
// Supports float, half (__half), __nv_bfloat16 via template T
template <typename T>
__global__ void scale_kernel(T* __restrict__ out,
                             const T* __restrict__ input,
                             float factor,
                             int64_t n) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < n) {
    // Multiply in FP32, then round back to the storage dtype.
    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
  }
}

void scale(at::Tensor& out, const at::Tensor& input, double factor) {
  TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
  TORCH_CHECK(out.is_cuda(), "out must be a CUDA tensor");
  TORCH_CHECK(out.is_contiguous(), "out must be contiguous");
  TORCH_CHECK(out.sizes() == input.sizes(), "out and input must have the same shape");
  TORCH_CHECK(out.scalar_type() == input.scalar_type(),
              "out and input must have the same dtype");

  const int64_t n = input.numel();
  if (n == 0) {
    return;  // Empty tensor: skip the launch (a zero-block grid is invalid).
  }
  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);

  // Select the input's device first so the stream lookup refers to that device.
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  // Dispatches over float, float16, bfloat16
  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
        static_cast<c_type*>(out.data_ptr()),
        static_cast<const c_type*>(input.data_ptr()),
        static_cast<float>(factor),
        n);
    cudaError_t status = cudaGetLastError();
    TORCH_CHECK(status == cudaSuccess,
                "scale_kernel launch failed: ", cudaGetErrorString(status));
    return true;
  });
}
```

**Key points:**

- Use `at::Tensor` (PyTorch tensors), `TORCH_CHECK` for validation, `at::cuda::getCurrentCUDAStream()` for stream
- Keep Python wrappers thin; do shape/dtype/device validation in C++ right around the launch path
- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` covers `float`, `half` (FP16), `__nv_bfloat16` (BF16)
- Add device error checking after every kernel launch
- If a kernel only works on certain architectures, enforce that with `TORCH_CHECK` and skip logic in tests
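
The launch geometry in the launcher is a ceiling division followed by a per-thread bounds guard. The sketch below simulates that arithmetic on the host in plain Python, assuming the same block size of 256. `grid_size` and `covered_indices` are illustrative names, not functions from the repository.

```python
from typing import List


def grid_size(n: int, threads: int = 256) -> int:
    """Ceiling division: the fewest blocks such that blocks * threads >= n."""
    return (n + threads - 1) // threads


def covered_indices(n: int, threads: int = 256) -> List[int]:
    """Simulate the launch and return every idx that passes the `idx < n` guard."""
    hits = []
    for block in range(grid_size(n, threads)):
        for thread in range(threads):
            idx = block * threads + thread  # blockIdx.x * blockDim.x + threadIdx.x
            if idx < n:                     # same bounds guard as the kernel
                hits.append(idx)
    return hits


if __name__ == "__main__":
    print(grid_size(1000))  # 4
    print(grid_size(1025))  # 5
    print(covered_indices(1000) == list(range(1000)))  # True
```

Without the guard, the last block would touch up to `threads - 1` elements past the end of the tensor. Note also that `grid_size(0)` is `0`, so an empty tensor is worth handling before the launch.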

---

## Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`

Edit `sgl-kernel/include/sgl_kernel_ops.h`, add to the elementwise section:

```cpp
void scale(at::Tensor& out, const at::Tensor& input, double factor);
```

---

## Step 3: Register the op in `csrc/common_extension.cc`

Edit `sgl-kernel/csrc/common_extension.cc`, inside `TORCH_LIBRARY_FRAGMENT(sgl_kernel, m)`:

```cpp
// From csrc/elementwise
m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
m.impl("scale", torch::kCUDA, &scale);
```

**Key points:**

- `Tensor!` means in-place / mutable output argument
- The schema is important for `torch.compile` and for consistent call signatures
- Write the torch schema with PyTorch scalar types (`float` here); the C++ launcher must use the matching C++ type, and a schema `float` maps to a C++ `double`, which is why `scale` takes `double factor`
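
To make the relationship between the schema string and the C++ declaration concrete, here is a small sketch that derives one from the other. It assumes only the subset of the schema language used in this tutorial (`Tensor!`, `Tensor`, `float`, `int`, `bool`, and a `()` return). `cpp_signature` and `SCHEMA_TO_CPP` are hypothetical helpers, not a PyTorch API.

```python
# Scalar schema types and the C++ types a registered function receives.
SCHEMA_TO_CPP = {"float": "double", "int": "int64_t", "bool": "bool"}


def cpp_signature(schema: str) -> str:
    """Derive a matching C++ declaration from a simple op schema."""
    call, returns = schema.split(" -> ")
    if returns.strip() != "()":
        raise ValueError("this sketch only handles ops that return nothing")
    name, arg_text = call.rstrip(")").split("(", 1)
    cpp_args = []
    for arg in arg_text.split(","):
        type_name, arg_name = arg.split()
        if type_name == "Tensor!":
            cpp_args.append(f"at::Tensor& {arg_name}")        # mutable output
        elif type_name == "Tensor":
            cpp_args.append(f"const at::Tensor& {arg_name}")  # read-only input
        else:
            cpp_args.append(f"{SCHEMA_TO_CPP[type_name]} {arg_name}")
    return f"void {name}({', '.join(cpp_args)});"


if __name__ == "__main__":
    print(cpp_signature("scale(Tensor! out, Tensor input, float factor) -> ()"))
    # void scale(at::Tensor& out, const at::Tensor& input, double factor);
```

The output matches the declaration added to `sgl_kernel_ops.h` in Step 2, which is a quick way to spot a schema and signature that have drifted apart.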

---

## Step 4: Add the new source file to `CMakeLists.txt`

Edit `sgl-kernel/CMakeLists.txt`, add to `set(SOURCES ...)`:

```cmake
csrc/elementwise/scale.cu
```

**Key points:**

- Keep the list **alphabetically sorted** (the file explicitly requires this)
- If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic
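
Keeping the list sorted by hand is easy to get wrong. The sketch below shows one way to check the ordering and find the insertion point, assuming plain lexicographic order on the path strings. `is_sorted` and `add_source` are illustrative helpers, and `csrc/gemm/example_gemm.cu` is a made-up entry used only for the demonstration.

```python
import bisect
from typing import List


def is_sorted(sources: List[str]) -> bool:
    """True if every entry is <= the entry that follows it."""
    return all(a <= b for a, b in zip(sources, sources[1:]))


def add_source(sources: List[str], new_source: str) -> List[str]:
    """Return a copy of sources with new_source inserted in sorted position."""
    if new_source in sources:
        return list(sources)
    updated = list(sources)
    bisect.insort(updated, new_source)
    return updated


if __name__ == "__main__":
    existing = [
        "csrc/elementwise/activation.cu",
        "csrc/gemm/example_gemm.cu",  # hypothetical entry
    ]
    print(add_source(existing, "csrc/elementwise/scale.cu"))
```

`bisect.insort` preserves order only if the input is already sorted, so run `is_sorted` on the existing list first.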

---

## Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/`

Prefer following the existing module organization first. For elementwise kernels, the usual pattern is:

- implement the Python wrapper in `sgl-kernel/python/sgl_kernel/elementwise.py`
- then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py`

For example, in `sgl-kernel/python/sgl_kernel/elementwise.py`, add:

```python
import torch


def scale(
    input: torch.Tensor,
    factor: float,
    out: torch.Tensor | None = None,
) -> torch.Tensor:
    """
    Element-wise scale: out = input * factor.

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    input  : CUDA input tensor
    factor : scale factor (float)
    out    : optional pre-allocated CUDA output tensor (same shape/dtype as input)
    """
    if out is None:
        out = torch.empty_like(input)
    torch.ops.sgl_kernel.scale.default(out, input, factor)
    return out
```

Then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py` following the existing import style used by other kernels.

---

## Step 6: Write tests (required)

Create `sgl-kernel/tests/test_scale.py`:
```python
import pytest

import torch
import sgl_kernel

@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
def test_scale_correctness(dtype, size, factor):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)

    result = sgl_kernel.scale(input, factor, out=out)
    assert result is out

    expected = input * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


def test_scale_shape_mismatch():
    input = torch.randn(128, dtype=torch.float16, device="cuda")
    out = torch.empty(256, dtype=torch.float16, device="cuda")
    with pytest.raises(RuntimeError, match="same shape"):
        sgl_kernel.scale(input, 2.0, out=out)


def test_scale_cpu_input():
    input = torch.randn(128, dtype=torch.float16)  # CPU
    out = torch.empty_like(input)
    with pytest.raises(RuntimeError, match="CUDA"):
        sgl_kernel.scale(input, 2.0, out=out)


if __name__ == "__main__":
    import sys

    sys.exit(pytest.main([__file__, "-q"]))
```
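
The tolerances in the test follow from the precision of each dtype. The sketch below, using only the standard library, computes the spacing of representable values near 1.0 from the number of stored mantissa bits and restates the closeness criterion that `torch.testing.assert_close` applies. `MANTISSA_BITS`, `machine_epsilon`, and `is_close` are illustrative names.

```python
# Stored (explicit) mantissa bits of each floating-point format.
MANTISSA_BITS = {"float16": 10, "bfloat16": 7, "float32": 23}


def machine_epsilon(dtype: str) -> float:
    """Gap between 1.0 and the next representable value for the given format."""
    return 2.0 ** -MANTISSA_BITS[dtype]


def is_close(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Closeness criterion: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


if __name__ == "__main__":
    for name in MANTISSA_BITS:
        print(name, machine_epsilon(name))
    # float16 0.0009765625
    # bfloat16 0.0078125
    # float32 1.1920928955078125e-07
```

With a BF16 epsilon of about `7.8e-3`, a shared `rtol` of `1e-2` for both half-precision dtypes leaves a small margin over one rounding step, while FP32 can afford the much tighter `1e-5`.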

---

## Step 7: Add a benchmark (required)

Create `sgl-kernel/benchmark/bench_scale.py`:

```python
import itertools

import torch
import triton
import triton.testing

import sgl_kernel
from sglang.utils import is_in_ci

IS_CI = is_in_ci()

dtypes = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
sizes = [4096] if IS_CI else [2**n for n in range(10, 20)]  # 1K … 512K

configs = list(itertools.product(dtypes, sizes))


def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
    return input * factor


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["dtype", "size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["sglang", "torch"],
        line_names=["SGL Kernel", "PyTorch"],
        styles=[("green", "-"), ("red", "--")],
        ylabel="µs (median)",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(dtype, size, provider):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)
    factor = 2.0

    if provider == "sglang":
        fn = lambda: sgl_kernel.scale(input, factor, out=out)
    else:
        fn = lambda: torch_scale(input, factor)

    # quantiles=[0.5, 0.2, 0.8] yields (median, 20th, 80th percentile) in ms.
    ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
        fn, quantiles=[0.5, 0.2, 0.8]
    )
    # perf_report expects (y, y_min, y_max); convert ms to µs.
    return 1000 * ms, 1000 * min_ms, 1000 * max_ms


if __name__ == "__main__":
    benchmark.run(print_data=True)
```
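
An element-wise op like `scale` is memory-bound, so the timings are often easier to interpret as effective bandwidth. The following sketch converts a measured time in microseconds into GB/s, assuming the kernel reads each input element once and writes each output element once. `effective_bandwidth_gbps` is a hypothetical helper and is not part of the benchmark script.

```python
def effective_bandwidth_gbps(numel: int, itemsize: int, time_us: float) -> float:
    """Effective memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_moved = 2 * numel * itemsize  # one read of input + one write of out
    seconds = time_us * 1e-6
    return bytes_moved / seconds / 1e9


if __name__ == "__main__":
    # Example: 2**20 FP16 elements (2 bytes each) processed in 10 µs.
    print(round(effective_bandwidth_gbps(2**20, 2, 10.0), 2))  # 419.43
```

Comparing this figure with the GPU's peak memory bandwidth gives a rough sense of how much headroom remains. For small sizes, launch overhead dominates and the number says little about the kernel itself.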

---

## Step 8: Build

Build the extension from the `sgl-kernel` directory:

```bash
cd sgl-kernel
make build -j16
```

If you need to limit host resource usage:

```bash
cd sgl-kernel
make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
```

---

## Step 9: Validate

After a successful build, run the test and benchmark from the repository root:

```bash
pytest sgl-kernel/tests/test_scale.py -q
python sgl-kernel/benchmark/bench_scale.py
```

PR CI also runs `pr-test-sgl-kernel.yml`, including the B200 job
`sgl-kernel-b200-test` when kernel changes are detected. Use that job as the
Blackwell coverage signal for AOT `sgl-kernel` changes.

---

## Troubleshooting

- **Async CUDA errors**: rerun with `CUDA_LAUNCH_BLOCKING=1` so the error is reported at the failing launch rather than at a later call
- **Memory errors**: run under `compute-sanitizer --tool memcheck python ...`
- **Build is too slow / OOM**: reduce `MAX_JOBS` and `SGL_KERNEL_COMPILE_THREADS`
- **Binary bloat**: use `sgl-kernel/analyze_whl_kernel_sizes.py`
- **CMake sources list**: if your `.cu` file is missing from `SOURCES`, the symbol will be undefined at link time
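
The first two items can be combined into a small debugging helper. The sketch below only builds the environment and command line and leaves execution to the caller. It assumes `compute-sanitizer` is on `PATH` and that the test is run through `python -m pytest`. `blocking_launch_env` and `memcheck_command` are hypothetical names.

```python
import os
from typing import Dict, List, Optional


def blocking_launch_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and force synchronous kernel launches in it."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def memcheck_command(test_path: str) -> List[str]:
    """Command line that runs one test file under compute-sanitizer memcheck."""
    return [
        "compute-sanitizer", "--tool", "memcheck",
        "python", "-m", "pytest", test_path, "-q",
    ]


if __name__ == "__main__":
    print(" ".join(memcheck_command("sgl-kernel/tests/test_scale.py")))
```

A caller could pass both to `subprocess.run(memcheck_command(path), env=blocking_launch_env())`. Expect the run to be much slower than a normal test run.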

---

## References

- `sgl-kernel/README.md`
- `sgl-kernel/include/sgl_kernel_ops.h`
- `sgl-kernel/csrc/common_extension.cc`
- `sgl-kernel/CMakeLists.txt`
- `sgl-kernel/include/utils.h` — `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro and friends
- `sgl-kernel/csrc/elementwise/activation.cu` — reference for the FP16/BF16/FP32 dispatch pattern

## Summary of Files Created/Modified

```
sgl-kernel/csrc/elementwise/scale.cu # NEW: CUDA kernel + launcher
sgl-kernel/include/sgl_kernel_ops.h # MODIFIED: C++ declaration
sgl-kernel/csrc/common_extension.cc # MODIFIED: schema + dispatch registration
sgl-kernel/CMakeLists.txt # MODIFIED: add source file (alphabetical)
sgl-kernel/python/sgl_kernel/elementwise.py # MODIFIED: Python wrapper
sgl-kernel/python/sgl_kernel/__init__.py # MODIFIED: re-export Python API
sgl-kernel/tests/test_scale.py # NEW: tests
sgl-kernel/benchmark/bench_scale.py # NEW: benchmark
```