Changes from all commits (924 commits)
0b3ddbc
[fix typo] seperated_timestep -> separated_timestep (#19622)
SoluMilken Mar 1, 2026
98224de
[Bugfix] Add missing auto_create_handle_loop to communicator methods …
Kangyan-Zhou Mar 1, 2026
8a0b757
[Test] add unit test for skipping already preempted request (#18912)
glenliu21 Mar 1, 2026
f51ddba
feat: add FA4 SM90 paged KV decode support & update attention docs (#…
zwang86 Mar 2, 2026
20282f5
[fix typo] expert_indicies -> expert_indices (#19627)
SoluMilken Mar 2, 2026
e3e71f2
docs: refactor speculative decoding doc (#19186)
alphabetc1 Mar 2, 2026
e5edf22
[WIP]enable mxfp8 on nvidia sm120 (#19112)
wolfcomos Mar 2, 2026
0a6678b
[PD] Remove unused server args for disaggregation (#19618)
ShangmingCai Mar 2, 2026
f6ee6dc
[JIT-kernel] Add unit test for nsa indexer fused_store_k_cache (#19389)
yuan-luo Mar 2, 2026
57c5c34
[diffusion] model: support Hunyuan3D-2 (#18170)
Prozac614 Mar 2, 2026
ec97754
Add bisect ci claude code skill (#19649)
Kangyan-Zhou Mar 2, 2026
0e53cee
[CI] Disable test_lora_tp CUDA CI during H100 to H200 transition (#19…
sglang-bot Mar 2, 2026
922aad2
Cleanup disagg decode prebuilt flow and add cross-stream sync in merg…
Baidu-AIAK Mar 2, 2026
4726073
Fix mamba2 mixer ci test (#19658)
ispobock Mar 2, 2026
07ef5f7
Remove sync points in mamba cache + prefill cudagraph plumbing for DP…
YazhiGao Mar 2, 2026
e42fa00
[Diffusion] diffusion profile and opt skills (#19540)
BBuf Mar 2, 2026
f7da379
feat: TTL-based prefix pinning with refresh-on-hit for HiRadixCache (…
ishandhanani Mar 2, 2026
15af26d
Add aiter attention support in prefill-attention-backend of gpt-oss (…
kkHuang-amd Mar 2, 2026
98f47d8
[AMD] Add Qwen3-Coder-Next accuracy and functionality test scripts fo…
yichiche Mar 2, 2026
f2c5503
[AMD] AMD AITER Scout Workflow (#19467)
yctseng0211 Mar 2, 2026
2e15c01
[diffusion] feat: Add --model-id for config resolution; deprecate mod…
mickqian Mar 2, 2026
ec44bc8
Support presets and arbitrary skipping keys in dump comparator (#19676)
fzyzcjy Mar 2, 2026
15e83ee
Enhance replication check, matching pattern, logging in dump comparat…
fzyzcjy Mar 2, 2026
a70dd11
Support flattened dims in dump comparator (#19678)
fzyzcjy Mar 2, 2026
6980416
Support non orthogonal parallel axes and explicit replication annotat…
fzyzcjy Mar 2, 2026
abdc0ee
Support directory detection in dump comparator (#19680)
fzyzcjy Mar 2, 2026
3ebd85b
Enhance sglang engine dumping tests in dump comparator (#19681)
fzyzcjy Mar 2, 2026
5bf3deb
Trace execution information in dump comparator (#19682)
fzyzcjy Mar 2, 2026
3dd4649
Beautify text output in dump comparator (#19683)
fzyzcjy Mar 2, 2026
e5ef845
Support multiple verbosity in dump comparator (#19684)
fzyzcjy Mar 2, 2026
7579ab3
Enhance error resilience in dump comparator (#19685)
fzyzcjy Mar 2, 2026
da2a024
Add GLM45 tool interruption support (#17714)
Leoyzen Mar 2, 2026
8df9b8d
[diffusion] fix: skip USP for cross-attention with replicated KV for …
AichenF Mar 2, 2026
3f36f27
[Bugfix] Fix nixl and mori backend for missing decode tp size in PD m…
ShangmingCai Mar 2, 2026
b371898
[Feature] add feature mla_ag_after_qlora for dsv3.2 (#19428)
JiaruiChang5268 Mar 2, 2026
eaf18eb
[sgl]add pin_mem to avoid cpu->gpu copy sync point (#19590)
bixue2010 Mar 2, 2026
714c53d
[NPU] support PD disaggregation on ascend when using PP (#14908)
Hexq0210 Mar 2, 2026
53de53f
[jit_kernel] Tiny unify jit_kernel tests style (#19694)
BBuf Mar 2, 2026
5833ea6
[diffusion] fix: make input/output file save paths configurable and d…
ruihanglix Mar 2, 2026
2d183c4
[Feat] add PP Support for minimax-m2 series (#19577)
LuYanFCP Mar 2, 2026
468e3dc
[Qwen3.5] Set full attn_backend to trtllm_mha on SM100 by default whe…
hlu1 Mar 2, 2026
c64274c
Piecewise Cuda Graph set default (#16331)
Oasis-Git Mar 2, 2026
0595085
[Diffusion] [NPU] Add CI tests for FLUX (#19001)
Makcum888e Mar 2, 2026
bdffb02
[CI] fix: handle missing repo in lora notebook (#19700)
alphabetc1 Mar 2, 2026
51ee17c
[diffusion] move skills dir (#19697)
BBuf Mar 2, 2026
cc860a2
[TestFix] change LoRA tests to use NVIDIA adapter instead of Nutanix …
glenliu21 Mar 2, 2026
3f9fc8b
[Qwen3.5] Fix missing `quant_config` in `Qwen3VL` (#19291)
mmangkad Mar 2, 2026
6822941
[FlashInfer] Bump FlashInfer version from 0.6.3 to 0.6.4 (#19005)
mmangkad Mar 3, 2026
145ae51
[Diffusion] Revert 18619 (#19510)
BBuf Mar 3, 2026
e6e02ec
[diffusion]: Add model detectors and warning for quantized diffusion …
Ratish1 Mar 3, 2026
8b4c387
[AMD] Update AITER Scout Workflow (#19735)
yctseng0211 Mar 3, 2026
1041f24
[NPU]grok2 model support (#17119)
KnightLTC Mar 3, 2026
8dfb6e1
[HiCache] fix compatibility bugs with eagle and HiCacheStorage (#19570)
huangtingwei9988 Mar 3, 2026
fe9d85d
Fix CompressedTensorsMxInt4MoE abstract method and relax GPQA baselin…
alisonshao Mar 3, 2026
060720c
[AMD] AMD new CI runner (#19739)
yctseng0211 Mar 3, 2026
6b8e62f
[AMD] [Qwen 3.5 Day 0] Add Qwen 3.5 nightly accuracy tests (#19479)
michaelzhang-ai Mar 3, 2026
0abb9f4
Piecewise Cuda Graph Docs (#19738)
Oasis-Git Mar 3, 2026
63003a3
[BUG] Support tuple hidden_states from fused MXFP4/FP8 quantization (…
zyzshishui Mar 3, 2026
dbf1247
Add KimiK2Detector with tool interruption support (#19696)
JustinTong0323 Mar 3, 2026
43a9249
CI: update CI_PERMISSIONS.json (#19744)
mickqian Mar 3, 2026
3c01b44
[Fix] NPU deepep hccl buffer and fix IPC safe check (#17804)
1StepForever Mar 3, 2026
62480eb
[SGLang-Diffusion] Fix custom op fake impl missing eps default for to…
zhaochenyang20 Mar 3, 2026
07b8d76
feat: Add FP8 KV cache support for Triton attention backend (#18882)
zack041 Mar 3, 2026
7a2d3df
Apply default stream to priority 0 in scheduling. (#16438)
hnyls2002 Mar 3, 2026
6af0448
[Bugfix] Catch errors when DeepSeek-V3.2 generates malformed JSON (#1…
Muqi1029 Mar 3, 2026
af0d35b
Fix: Reject requests with a duplicate request ID which can cause serv…
pyc96 Mar 3, 2026
4c95953
Fix/nemotron mtp quantaized (#19433)
shaunkotek Mar 3, 2026
666caaf
[Tool Call] Stream DeepSeek-V3.2 function call parameters in JSON for…
Muqi1029 Mar 3, 2026
365ca1e
[NPU] bugs fix: fix a condition bug when using speculative inference …
shengzhaotian Mar 3, 2026
facde4c
[PD] Enable all CP ranks for KVCache transfer (#19765)
ShangmingCai Mar 3, 2026
d939e26
[model gateway][0/N] router EPD support: add encoder grpc server back…
JasonZhang517 Mar 3, 2026
c6377bb
feat(gdn): add FlashInfer K-last SSM layout support for GDN prefill a…
xutizhou Mar 3, 2026
daabfe7
[hotfix] fix apply function name compressed_tensors_w4a4_mxint4_moe (…
TamirBaydasov Mar 3, 2026
95e4a25
[fix typo]: funtion -> function (1 line change) (#19790)
SoluMilken Mar 3, 2026
85f7a0a
feat: support Kimi K2.5 for Eagle3 (#19689)
yefei12 Mar 3, 2026
05f68e1
[AMD] Fix the hipDeviceGetName issue in ROCm based docker images (#19…
hubertlu-tw Mar 3, 2026
2e1b9e2
Fix routed_dp_rank boundary validation (#19762)
doujiang24 Mar 3, 2026
cedb86a
Feature:Reserve HTTP server port before model loading to immediately …
xrwang8 Mar 3, 2026
d6ac5f2
[Docs] Add GDN attention backends matrix documentation (#19755)
zwang86 Mar 3, 2026
85ab6a7
cli: Add lazy imports and fail-fast config validation (RFC #9853) (#1…
karthikreddy-02 Mar 3, 2026
ac2819c
Fix assertion tolerance for bf16 precision in triton attention UT (#1…
rahulvijayaraghavan Mar 3, 2026
b0f2669
feat(benchmark script): add similar to vllm --ready-check-timeout-sec…
almaslof Mar 3, 2026
f749802
[Score API][18132] return token usage in Score API response (#18381)
GuyStone Mar 3, 2026
dc92f88
Enhance bench_multiturn.py with OpenAI API support and richer metrics…
Kangyan-Zhou Mar 3, 2026
5b2e275
Enable XQA for SM90 and SM120 (#17115)
samuellees Mar 3, 2026
1135e21
[CI] support `/rerun-ut` command in slash handler (#19800)
hnyls2002 Mar 3, 2026
9305f0e
Support `triton_kernels` for GPT-OSS on SM120 (#19718)
b8zhong Mar 3, 2026
f7897de
[Feature] Improve weight loading log (#18651)
ppraneth Mar 3, 2026
fb37c0a
[args] Add Expert Parallelism Argument To SRT Runner (#18492)
Jonahcb Mar 3, 2026
0c760c4
Add tuned triton==3.5.1 b200 tp2, tp4 for qwen 3 next (#15917)
vincentzed Mar 3, 2026
b8c71f8
Add tuned triton==3.5.1 h200 tp2, tp4 for qwen 3 next (#15948)
vincentzed Mar 3, 2026
069c7e4
Fix CI failures (#19303)
CaoE Mar 3, 2026
753da27
[Bugfix] fix parse_lscpu_topology bug (#18520)
1195343015 Mar 3, 2026
c7ffbf2
[CI] Fix `rerun-ut` workflow: add DeepEP install, RDMA env, Blackwell…
hnyls2002 Mar 3, 2026
e6411ba
Increase max_concurrent_jobs in job queue (#19797)
ekzhang Mar 3, 2026
441045a
[AMD] Fix EAGLE3 speculative decoding with aiter attention backend (#…
hubertlu-tw Mar 4, 2026
c18cff4
[CI] Add DeepGEMM warmup to stage-c-test-deepep-4-gpu (#19806)
alisonshao Mar 4, 2026
a69b943
[SGLang-Diffusion] Add offline throughput benchmark script for multi-…
haojin2 Mar 4, 2026
e2af840
Various SM120 improvements (#19721)
b8zhong Mar 4, 2026
eb6bcc5
[CI] Register test_quant_config_parsing.py in CI suite (#19809)
alisonshao Mar 4, 2026
d22c6a3
fix: Properly return abort error for streaming requests if the abort …
pyc96 Mar 4, 2026
ee5ccde
support fused_moe_triton and moe_sum_all_reduce kernel fusion[reduce …
xieminghe1 Mar 4, 2026
115e9a1
[Diffusion] Delete useless _ulysses_input_split func (#19786)
BBuf Mar 4, 2026
525d046
[AMD] CI - new runner label for MI325 8gpu (#19815)
yctseng0211 Mar 4, 2026
82e7139
[VLM] Support cos sin cache for Ernie4.5-VL (#19743)
yuan-luo Mar 4, 2026
4348976
[Diffusion] Refactor diffusion benchmark/profile skill to reuse diffu…
BBuf Mar 4, 2026
8829069
Add kpham-sgl into CI Permission list (#19819)
kpham-sgl Mar 4, 2026
6851613
[Bugfix] For cp: Fixed hang problem in prefix cache and kvcache suppo…
Baidu-AIAK Mar 4, 2026
da9dcbc
[diffusion] fix: fix corrupted image editing outputs in Multi-GPU SP …
ruihanglix Mar 4, 2026
ca44aa2
Fix dp_attention crash when dp_size < tp_size in warmup dummy run (#1…
yhyang201 Mar 4, 2026
b7f7df7
[NSA] Fix line-too-long lint in `can_nsa_prefill_cp_round_robin_split…
sglang-bot Mar 4, 2026
73bf2c5
[sgl]add pin_mem to remove cpu->gpu copy sync point (#19795)
bixue2010 Mar 4, 2026
ac1f074
Fix triton alloc extend kernel (#19780)
whybeyoung Mar 4, 2026
5972f97
Remove naive rotary forward overriding. (#19263)
LorrinWWW Mar 4, 2026
c6850ac
[AMD] Fix Qwen3-Coder-Next: Add missing k_scale/v_scale args to exten…
michaelzhang-ai Mar 4, 2026
e9b5706
[diffusion] feat: support torch compile for diffusers backend (#19673)
DefTruth Mar 4, 2026
562c3ff
[Feature] implement the standard multi-layer MTP for step3p5 (#18564)
zhaziqwe Mar 4, 2026
c287d9b
chore: add flashinfer version bump workflow (#19837)
Fridge003 Mar 4, 2026
115f879
Helios: Real Real-Time Long Video Generation Model (#19782)
yhyang201 Mar 4, 2026
09fa012
Fix /health regression from early prebound socket listen (#19805)
mmangkad Mar 4, 2026
78ddf05
[Fix] Install tomli in flashinfer bumping workflow (#19841)
Fridge003 Mar 4, 2026
52dcade
Fix flashinfer bump workflow (#19855)
Fridge003 Mar 4, 2026
f07d668
Remove flashinfer version argument from cu13 docker release workflow …
Fridge003 Mar 4, 2026
17119a6
Optimization: Reduce the number of D2H operations (#19424)
wangfakang Mar 4, 2026
88cfa6c
[NPU]Releasing redundant memory of w13_weight and nz when the ascend_…
chenxu214 Mar 4, 2026
e33e833
update model names (#19870)
amote-i Mar 4, 2026
c2b66d3
[HiCache] Add an env var to control transfer engine reuse (#19867)
ShangmingCai Mar 4, 2026
6910c1b
[Feature][NPU]: add runtime support for GPTQ-quantized MoE models (#1…
YChange01 Mar 4, 2026
738ebfd
KDA: fuse qkv conv and support stride for fused_sigmoid_gating_delta_…
strgrb Mar 4, 2026
34c19a3
fix flaky test for test_kda_kernels (#19864)
strgrb Mar 4, 2026
c03deb8
Fix disagg PD bootstrap and KV transfer metrics (#19009)
Kangyan-Zhou Mar 4, 2026
44208d2
[vlm][minicpm] support input formats of processor output and embeddin…
jiangyukunok Mar 4, 2026
1b76eb9
[Doc] Update version references and add automation (#18409)
mmangkad Mar 4, 2026
329817e
[AMD] Move get_global_server_args import out of CUDA-only block to fi…
bingxche Mar 4, 2026
0ee9d3c
fix(grpc): send last chunk before completion during streaming (#19895)
CatherineSue Mar 4, 2026
9457c04
[Qwen3.5] Enable MTP spec_v2 and add test for nvidia/Qwen3.5-397B-A17…
hlu1 Mar 4, 2026
28c931e
feat: Priority-based scheduling optimization (including default prior…
zhuxinjie-nz Mar 4, 2026
376dfb0
Fix issue 19717 by making `qo_indptr` uniform strided instead of pack…
kpham-sgl Mar 4, 2026
a710b7d
[Sarvam] Add inference support for Sarvam MoE LLMs (#18938)
rakesh-sarvam Mar 4, 2026
33c9273
[Triton] Use dynamic loop bound in `alloc_extend_kernel` (#19898)
hnyls2002 Mar 5, 2026
43bdee7
Fix Fp8 MTP layer a2a backend without EP. (#18515)
wenscarl Mar 5, 2026
861d786
[CI] remove itl testing due to unstable networking (#19904)
hnyls2002 Mar 5, 2026
0eb64c1
[smg] Extract tokenizer_path from /model_info into discovered labels …
Kangyan-Zhou Mar 5, 2026
d8427d0
[NPU][CI] Cache pytorch dependency in ci (#19754)
monkeyLoveding Mar 5, 2026
e555a6c
[feat] Enhance lora_update_weight_from_tensor for RL training (#19314)
yushengsu-thu Mar 5, 2026
9795b4c
[Diffusion] Open t5 encoder parallel folding for wan2.2 and mova vide…
BBuf Mar 5, 2026
fc53307
[diffusion] hardware: SiluAndMul/RMSNorm/LayerNorm MUSA implementatio…
yeahdongcn Mar 5, 2026
9c11a7a
[diffusion] fix: fix the frame interpolation testcase in CI regarding…
yyy1000 Mar 5, 2026
198381d
Add SSL/TLS support for HTTP and gRPC servers (#18973)
Kangyan-Zhou Mar 5, 2026
0e6a647
[bugfix] Fix PPMissingLayer AttributeError when Using PP (#19804)
RolaoDenthu Mar 5, 2026
feda2b1
[AMD] Add AWQ AMD CI coverage and quantization platform compatibility…
brucechanglongxu Mar 5, 2026
10c65df
[Bug] Fix lora tp bug on H200 (#19769)
Fridge003 Mar 5, 2026
86c5617
[BUG]: fix prevent illegal memory access in Mamba SSM tracking during…
ConnorLi96 Mar 5, 2026
1bbfed0
[misc] add env for http keep alive timeout (#19847)
happierpig Mar 5, 2026
2bdd89a
[Kernel Slimming] Migrate NVFP4 kernels to JIT (#19437)
mmangkad Mar 5, 2026
c1df359
Add XPU profiler activity support in benchmark code (#12981)
kalyank007 Mar 5, 2026
727face
[DLLM] Add initial radix cache support (#18724)
btw616 Mar 5, 2026
dbc896f
[Test] Enhance JIT kvcache store kernel test coverage (#19630)
xingsy97 Mar 5, 2026
c36de62
[diffusion] fix images/edit with 2 images (#17520)
qimcis Mar 5, 2026
472eef4
fa4 cleanup (#19727)
rainj-me Mar 5, 2026
806d41a
[quant] fix fp32 downcasting (#19844)
zhooooong Mar 5, 2026
0de0d74
[EPD][Feat]support adaptive forward (#18118)
ZhengWG Mar 5, 2026
b5edab5
[AMD] CI - Add MI35x nightly/PR tests for kv-cache-fp8 and allreduce-…
yctseng0211 Mar 5, 2026
73d272b
Revised fix for HybridAttnBackend forward for linear attn (#19369)
akhilg-nv Mar 5, 2026
203cd8e
[AMD] [Z-Image-Turbo Day 0] Add Z-Image-Turbo nightly test for AMD GP…
michaelzhang-ai Mar 5, 2026
d605d81
update CODEOWNERS (#19969)
mickqian Mar 5, 2026
41fd53f
Fix `profile_activities` parameter name in `bench_one_batch_server_in…
mmangkad Mar 5, 2026
e58391d
Add --json-log flag to enable structured JSON logging (#19968)
jonalee99 Mar 5, 2026
ff6048f
rename nemotron reasoning parser (#19865)
danielafrimi Mar 5, 2026
ebb66cc
[misc] Priority scheduling metrics cleanup (#19927)
sglang-bot Mar 5, 2026
07e7603
Update sgl-attn to include SWA decode optimizations (#19655)
zminglei Mar 5, 2026
b3cfad0
Add Ray actor support for scheduler process management (DP=1) (#17684)
xyuzh Mar 5, 2026
1c1712d
[CI] Skip flashinfer-cubin reinstall when version matches (#19470)
alisonshao Mar 5, 2026
346a413
[Spec] Refactor NaN/OOB checks to async `maybe_detect_*` with env-var…
kpham-sgl Mar 5, 2026
46dced6
Adjust padding size to improve triton_kernels moe performance (#19174)
Qiaolin-Yu Mar 5, 2026
dd2bbe6
fix(grpc): use context.abort() with proper status codes instead of in…
CatherineSue Mar 5, 2026
13af7cb
fix: use consistent time denominator for throughput metrics in bench_…
AjAnubolu Mar 5, 2026
9ebffef
[FIX] NSA backend page_table overflow in speculative decoding target_…
JustinTong0323 Mar 6, 2026
261be85
Support mrope_position_delta cache
shadowxz109 Mar 6, 2026
b912d7a
[OPT]Skip the first delayer to maximize the BS of the decoding. (#19836)
chenxu214 Mar 6, 2026
5471e4a
[NPU][Feature] eliminate dsv3 redundant rotary embed calculation (#19…
liupeng374 Mar 6, 2026
9502369
fix(grpc): add server-side keepalive options to prevent GOAWAY (#19986)
slin1237 Mar 6, 2026
6e5a2de
[diffusion] fix: fix reading multiple prompts from prompt file (#19075)
sushildubey171 Mar 6, 2026
51e5dc8
Revert "[Kernel Slimming] Migrate NVFP4 kernels to JIT" (#20005)
Fridge003 Mar 6, 2026
be9a9e4
refactor(multimodal/test): centralize model names and shared utilitie…
Godmook Mar 6, 2026
04e364d
[V32] Enhance deepseek v32 related tests (#19985)
Fridge003 Mar 6, 2026
da27d9b
[Bug-Fix][EPD]: skip log waiting-image-req for zmq_to_tokenzer/moonca…
ZhengWG Mar 6, 2026
23eb24d
[AMD] Update/fix AMD CI workflow dispatch mechanism (#20014)
yctseng0211 Mar 6, 2026
0252ca8
[Bugfix] Fix the bug blocking the startup of Llama-3.2-11b-Vision-Ins…
xdtbynd Mar 6, 2026
27053aa
Fix MLA decode path returning unwritten (padded) rows (#19902)
clintg6 Mar 6, 2026
84aaa69
[AMD] Use bfloat16 for correction_bias in AITER FP8 path to avoid run…
inkcherry Mar 6, 2026
5505068
[AMD] Upgrade aiter version (#19936)
1am9trash Mar 6, 2026
25e678d
[diffusion] endpoint: add /server_info and /model_info endpoints for …
Kangyan-Zhou Mar 6, 2026
e3b581c
[diffusion] fix: remove num_frames in wan2_1_t2v_1_3b_lora_1gpu test …
Prozac614 Mar 6, 2026
f7de937
[GDN][Qwen3-Next][Qwen3.5] Fuse fused_gdn_gating and fused_recurrent_…
yuan-luo Mar 6, 2026
6d22c9f
[Diffusion] Move hf kernels diffusion cuda kernels skills to SGLD (#2…
BBuf Mar 6, 2026
2d266c7
Migrate renorm kernels from sgl-kernel to FlashInfer JIT (#18854)
Johnsonms Mar 6, 2026
54634b9
[Kernel] Dispatch exp/sin/cos through dtype_trait (#19798)
xingsy97 Mar 6, 2026
e8f2b80
[diffusion] improve: improve code readability of DenoisingStage (#20003)
JohnHerry Mar 6, 2026
de1a0af
[MUSA][10/N] Add GGUF support (#18357)
yeahdongcn Mar 6, 2026
759700c
Fix SM120 `triton_kernels` MXFP4 `block_k` for GPT-OSS (#20040)
mmangkad Mar 6, 2026
8cdb7e1
[CI] Add GPT-OSS test for SM120 (#20056)
mmangkad Mar 6, 2026
7a6cf0e
[Core] Extract `_calculate_mamba_ratio` and `_init_pools` from `init_…
hnyls2002 Mar 6, 2026
ac453b2
Add Qwen3.5-397B-A17B nightly test (8-GPU) (#19906)
alisonshao Mar 6, 2026
604db44
[Core] Clarify memory variable naming in model runner (#20060)
hnyls2002 Mar 6, 2026
e89069e
Fallback to torch.cuda.mem_get_info() when nvidia-smi is unavailable …
Kangyan-Zhou Mar 6, 2026
61de303
Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4…
wenscarl Mar 6, 2026
5c8e286
Add cleanup for _ATTN_TP in parallel_state.py (#19978)
aurickq Mar 6, 2026
7a12255
fix: set first_token_time before computing decode_throughput for sing…
Kangyan-Zhou Mar 7, 2026
ddcecde
[Core] Unify `max_num_reqs` `dp_size` division for pool sizing (#20063)
hnyls2002 Mar 7, 2026
069d4c5
Fix Kimi K2.5 PP layer range exposure for PD disaggregation (#19959)
yafengio Mar 7, 2026
0c4f98e
[diffusion] hardware: add set_musa_arch on MUSA (misc, 15/N) (#19381)
yeahdongcn Mar 7, 2026
e818f82
Fix none-comparison (E711) warnings (#19745)
yeahdongcn Mar 7, 2026
a0d085c
Adding correct path for module not found error while collecting test …
singhalshubham03 Mar 7, 2026
50bbdcf
Relax flaky test thresholds for MLA DeepSeek V3 and AutoRound (#20068)
alisonshao Mar 7, 2026
6ffc74e
[Metrics] Add overlap bubble timing, full KV usage gauge, and prefill…
happierpig Mar 7, 2026
f88acf8
[JIT Kernel] Reland NVFP4 kernels to JIT (#20012)
mmangkad Mar 7, 2026
8a411a9
[Tracing] Remove the deprecated tracing code from mini_lb (#19409)
sufeng-buaa Mar 7, 2026
925185f
Fix flashinfer backend with pcg (#20061)
Qiaolin-Yu Mar 7, 2026
c584158
[CI] Temporarily disable flaky test_priority_metrics on CUDA (#20075)
alisonshao Mar 7, 2026
31e93e4
[diffusion] fix: fix TeaCache silently fails with --enable-teacache (…
eitanturok Mar 7, 2026
b91fb83
[diffusion] fix: fix multi-prompt generation and support multiple pro…
RuixiangMa Mar 7, 2026
f8bbf56
Refactor NemotronHConfig to canonical layers_block_type and add MTP b…
danielafrimi Mar 7, 2026
011806c
[Nightly] Add Kimi K2.5 nightly test (base + Eagle3 MTP), replace Kim…
alisonshao Mar 7, 2026
c267bdb
[AMD] Fix AMD CI - stage-b-small-1-gpu-amd (partition 7) (#20028)
yctseng0211 Mar 7, 2026
43d6a32
[sgl-kernel] rebase FlashMLA 0217 (#18902)
FlamingoPg Mar 7, 2026
1aa6ab4
[Nightly] Replace MiniMax-M2 with MiniMax-M2.5 (#20083)
alisonshao Mar 7, 2026
bd108a5
Add workaround for aiter triton gemm config issue (#20090)
kkHuang-amd Mar 7, 2026
f016738
fix syntax error: "&&" unexpected (#20093)
kkHuang-amd Mar 7, 2026
19c51fe
fix(rope): restore K writeback in fused rope + kv store kernel (#19636)
dcw02 Mar 7, 2026
ef6540b
[diffusion]: add width/height passthrough for OpenAI image API (#19970)
Ratish1 Mar 7, 2026
f8d4eb7
[Docs] Add docstrings to JIT kernel include headers (#19770)
xingsy97 Mar 7, 2026
5297b02
[Diffusion] [NPU] Wan2.2-T2V-A14B-Diffusers modelslim quantization su…
OrangeRedeng Mar 7, 2026
fd79cd8
[Skills] Refine jit_kernel and sgl-kernel skills (#20095)
BBuf Mar 7, 2026
13bdc7b
[Feature][NPU]: add runtime support for AutoRound quantized models (#…
YChange01 Mar 7, 2026
7da590d
[Qwen3.5] Support Qwen3.5 Pipeline Parallelism (#19670)
yuan-luo Mar 7, 2026
17721b0
[AMD] Fix Tensor Memory Aliasing (#19928)
bingxche Mar 7, 2026
5867c3f
Support HiCache for MambaRadixCache (#19663)
ispobock Mar 7, 2026
45bd30e
[NPU] make torch_native lora backend a little bit faster (#17228)
VDV1985 Mar 7, 2026
7bd3dd9
fix: image URL in notebook to use raw.githubusercontent.com (#20100)
alphabetc1 Mar 7, 2026
0f62da6
[CI] Show test partition assignments after checkout (#20085)
alisonshao Mar 7, 2026
d28f352
[V32/GLM5] Change default setting of V32 nvfp4 on TP4 (#20086)
Fridge003 Mar 7, 2026
69ec15a
[CI] Re-enable streaming session tests with correctness and stress co…
hnyls2002 Mar 8, 2026
d02c515
Decouple scheduler log printing from metrics collection (#20107)
hnyls2002 Mar 8, 2026
72f6dfc
fix: add ModelScope cache lookup and speculative path support (#20098)
alphabetc1 Mar 8, 2026
a73369c
[diffusion] chore: ensure CFG Zero Star numerical stability for Helio…
RuixiangMa Mar 8, 2026
7f9f85d
[diffusion] feat: make QwenImageLayered resolution configurable (#20044)
xingsy97 Mar 8, 2026
7fb282a
[diffusion] fix: fix bug of copy_if (#20094)
linfann Mar 8, 2026
97a2a9b
[VLM] Replace conv3d proj with linear for GLM4V (#20033)
yuan-luo Mar 8, 2026
4836f76
[AMD] Add Claude skills for AMD CI workflows
michaelzhang-ai Mar 8, 2026
561 changes: 561 additions & 0 deletions .claude/skills/add-jit-kernel/SKILL.md

Large diffs are not rendered by default.

364 changes: 364 additions & 0 deletions .claude/skills/add-sgl-kernel/SKILL.md
@@ -0,0 +1,364 @@
---
name: add-sgl-kernel
description: Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)
---

# Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight)

This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.

## Goal

Add a new operation that scales each element of a tensor by a scalar factor:

- Input: tensor `x` (CUDA) and scalar `factor` (float)
- Output: `x * factor` (element-wise, in-place or into pre-allocated `out`)
- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
- Dispatched via `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro (defined in `sgl-kernel/include/utils.h`)
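
In plain PyTorch, the target semantics are just the following one-liner (the benchmark in Step 7 uses the same expression as its baseline):

```python
import torch


def scale_ref(x: torch.Tensor, factor: float) -> torch.Tensor:
    # Pure-PyTorch reference; the CUDA kernel must match this element-wise
    return x * factor
```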

## Two rules of thumb (must follow)

1. **Prefer `python/sglang/jit_kernel` first** when the kernel does **not** depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.
2. **Prefer `sgl-kernel`** when the kernel **does** depend on CUTLASS or another large C++ project, or when it should be part of the AOT wheel / torch op registration flow.

**Exception**: if the only dependency is `flashinfer`, or CUTLASS as already provided through `flashinfer`, the kernel can still be implemented as a `jit_kernel`.

In addition, every new kernel must ship with:

- **Tests** (pytest)
- **A benchmark script** (triton.testing)

---

## Repository integration map

You will typically touch these files/areas:

- Implementation: `sgl-kernel/csrc/elementwise/scale.cu` (pick the right subdirectory)
- Public declarations: `sgl-kernel/include/sgl_kernel_ops.h`
- Torch extension registration: `sgl-kernel/csrc/common_extension.cc`
- Build: `sgl-kernel/CMakeLists.txt` (`set(SOURCES ...)`)
- Python API: `sgl-kernel/python/sgl_kernel/` and `sgl-kernel/python/sgl_kernel/__init__.py`
- Tests: `sgl-kernel/tests/test_scale.py`
- Benchmarks: `sgl-kernel/benchmark/bench_scale.py`

---

## Step 1: Implement the kernel in `csrc/`

Pick the right subdirectory:

- `csrc/elementwise/` — for element-wise ops (our example)
- `csrc/gemm/`, `csrc/attention/`, `csrc/moe/` — for other categories

Create `sgl-kernel/csrc/elementwise/scale.cu`:

```cpp
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>

#include "utils.h"  // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16

// scale_kernel: out[i] = input[i] * factor
// Supports float, half (__half), __nv_bfloat16 via template T
template <typename T>
__global__ void scale_kernel(T* __restrict__ out,
                             const T* __restrict__ input,
                             float factor,
                             int64_t n) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
  }
}

void scale(at::Tensor& out, const at::Tensor& input, double factor) {
  TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
  TORCH_CHECK(out.is_cuda(), "out must be a CUDA tensor");
  TORCH_CHECK(out.is_contiguous(), "out must be contiguous");
  TORCH_CHECK(out.sizes() == input.sizes(), "out and input must have the same shape");
  TORCH_CHECK(out.scalar_type() == input.scalar_type(),
              "out and input must have the same dtype");

  const int64_t n = input.numel();
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;

  // Set the device guard before querying the stream so both refer to input's device
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  // Dispatches over float, float16, bfloat16
  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
        static_cast<c_type*>(out.data_ptr()),
        static_cast<const c_type*>(input.data_ptr()),
        static_cast<float>(factor),
        n);
    cudaError_t status = cudaGetLastError();
    TORCH_CHECK(status == cudaSuccess,
                "scale_kernel launch failed: ", cudaGetErrorString(status));
    return true;
  });
}
```

**Key points:**

- Use `at::Tensor` (PyTorch tensors), `TORCH_CHECK` for validation, `at::cuda::getCurrentCUDAStream()` for stream
- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` covers `float`, `half` (FP16), `__nv_bfloat16` (BF16)
- Add device error checking after every kernel launch
- If a kernel only works on certain architectures, enforce that with `TORCH_CHECK` and skip logic in tests
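
For the architecture-gated case above, the test-side skip usually looks like this (a sketch; the `(9, 0)` threshold is hypothetical, use your kernel's real requirement):

```python
import pytest
import torch

# Illustrative gate: skip the whole test module on GPUs older than SM90
if not torch.cuda.is_available() or torch.cuda.get_device_capability() < (9, 0):
    pytest.skip("scale kernel requires an SM90+ GPU", allow_module_level=True)
```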

---

## Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`

Edit `sgl-kernel/include/sgl_kernel_ops.h`, add to the elementwise section:

```cpp
void scale(at::Tensor& out, const at::Tensor& input, double factor);
```

---

## Step 3: Register the op in `csrc/common_extension.cc`

Edit `sgl-kernel/csrc/common_extension.cc`, inside `TORCH_LIBRARY_FRAGMENT(sgl_kernel, m)`:

```cpp
// From csrc/elementwise
m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
m.impl("scale", torch::kCUDA, &scale);
```

**Key points:**

- `Tensor!` means in-place / mutable output argument
- The schema is important for `torch.compile` and for consistent call signatures
- If your underlying C++ API uses `float` but PyTorch bindings expect `double`, the implicit cast is fine for scalars; use shims if needed for other types
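
Once the extension is built, a quick way to sanity-check the registration (a minimal sketch, assuming the schema above) is to call the raw op directly:

```python
import torch
import sgl_kernel  # importing the package loads the extension and registers the op

x = torch.randn(8, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)

# Matches the schema: scale(Tensor! out, Tensor input, float factor) -> ()
torch.ops.sgl_kernel.scale(out, x, 2.0)
torch.testing.assert_close(out, x * 2.0, rtol=1e-2, atol=1e-2)
```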

---

## Step 4: Add the new source file to `CMakeLists.txt`

Edit `sgl-kernel/CMakeLists.txt`, add to `set(SOURCES ...)`:

```cmake
csrc/elementwise/scale.cu
```

**Key points:**

- Keep the list **alphabetically sorted** (the file explicitly requires this)
- If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic

---

## Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/`

Follow the existing module organization. For elementwise kernels, the usual pattern is:

- implement the Python wrapper in `sgl-kernel/python/sgl_kernel/elementwise.py`
- then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py`

For example, in `sgl-kernel/python/sgl_kernel/elementwise.py`, add:

```python
import torch


def scale(
    input: torch.Tensor,
    factor: float,
    out: torch.Tensor | None = None,
) -> torch.Tensor:
    """
    Element-wise scale: out = input * factor.

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    input : CUDA input tensor
    factor : scale factor (float)
    out : optional pre-allocated CUDA output tensor (same shape/dtype as input)
    """
    if out is None:
        out = torch.empty_like(input)
    torch.ops.sgl_kernel.scale.default(out, input, factor)
    return out
```

Then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py` following the existing import style used by other kernels.
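
In most cases this is a one-line addition (placement shown here is illustrative; match the neighboring exports):

```python
# sgl-kernel/python/sgl_kernel/__init__.py
from sgl_kernel.elementwise import scale
```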

---

## Step 6: Write tests (required)

Create `sgl-kernel/tests/test_scale.py`:
```python
import pytest
import torch

import sgl_kernel


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
def test_scale_correctness(dtype, size, factor):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)

    result = sgl_kernel.scale(input, factor, out=out)
    assert result is out

    expected = input * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


def test_scale_shape_mismatch():
    input = torch.randn(128, dtype=torch.float16, device="cuda")
    out = torch.empty(256, dtype=torch.float16, device="cuda")
    with pytest.raises(RuntimeError, match="same shape"):
        sgl_kernel.scale(input, 2.0, out=out)


def test_scale_cpu_input():
    input = torch.randn(128, dtype=torch.float16)  # CPU tensor
    out = torch.empty_like(input)
    with pytest.raises(RuntimeError, match="CUDA"):
        sgl_kernel.scale(input, 2.0, out=out)


if __name__ == "__main__":
    pytest.main([__file__, "-q"])
```

---

## Step 7: Add a benchmark (required)

Create `sgl-kernel/benchmark/bench_scale.py`:

```python
import itertools
import os

import torch
import triton
import triton.testing

import sgl_kernel

IS_CI = (
    os.getenv("CI", "false").lower() == "true"
    or os.getenv("GITHUB_ACTIONS", "false").lower() == "true"
)

dtypes = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
sizes = [4096] if IS_CI else [2**n for n in range(10, 20)]  # 1K … 512K

configs = list(itertools.product(dtypes, sizes))


def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
    return input * factor


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["dtype", "size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["sglang", "torch"],
        line_names=["SGL Kernel", "PyTorch"],
        styles=[("green", "-"), ("red", "--")],
        ylabel="µs (median)",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(dtype, size, provider):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)
    factor = 2.0

    if provider == "sglang":
        fn = lambda: sgl_kernel.scale(input, factor, out=out)
    else:
        fn = lambda: torch_scale(input, factor)

    ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
        fn, quantiles=[0.5, 0.2, 0.8]
    )
    return 1000 * ms, 1000 * max_ms, 1000 * min_ms


if __name__ == "__main__":
    benchmark.run(print_data=True)
```

---

## Step 8: Build

Build the extension:

```bash
cd sgl-kernel
make build -j16
```

If you need to limit host resource usage:

```bash
cd sgl-kernel
make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
```

---

## Step 9: Validate

After building successfully, run the test and benchmark:

```bash
pytest sgl-kernel/tests/test_scale.py -q
python sgl-kernel/benchmark/bench_scale.py
```

---

## Troubleshooting

- **Async CUDA errors**: `CUDA_LAUNCH_BLOCKING=1`
- **Memory errors**: `compute-sanitizer --tool memcheck python ...`
- **Build is too slow / OOM**: reduce `MAX_JOBS` and `SGL_KERNEL_COMPILE_THREADS`
- **Binary bloat**: use `sgl-kernel/analyze_whl_kernel_sizes.py`
- **CMake sources list**: if your `.cu` file is missing from `SOURCES`, the symbol will be undefined at link time
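
For the async-error case above, a minimal reproduction pattern looks like this (a sketch; the env var must be set before CUDA initializes to take effect):

```python
# Set before importing torch so CUDA initializes with synchronous kernel launches
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
import sgl_kernel

x = torch.randn(1024, dtype=torch.float16, device="cuda")
sgl_kernel.scale(x, 2.0)  # a launch failure now raises here with a usable stack trace
```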

---

## References

- `sgl-kernel/README.md`
- `sgl-kernel/include/sgl_kernel_ops.h`
- `sgl-kernel/csrc/common_extension.cc`
- `sgl-kernel/CMakeLists.txt`
- `sgl-kernel/include/utils.h` — `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro and friends
- `sgl-kernel/csrc/elementwise/activation.cu` — reference for the FP16/BF16/FP32 dispatch pattern

## Summary of Files Created/Modified

```
sgl-kernel/csrc/elementwise/scale.cu          # NEW: CUDA kernel + launcher
sgl-kernel/include/sgl_kernel_ops.h           # MODIFIED: C++ declaration
sgl-kernel/csrc/common_extension.cc           # MODIFIED: schema + dispatch registration
sgl-kernel/CMakeLists.txt                     # MODIFIED: add source file (alphabetical)
sgl-kernel/python/sgl_kernel/elementwise.py   # MODIFIED: Python wrapper
sgl-kernel/python/sgl_kernel/__init__.py      # MODIFIED: re-export Python API
sgl-kernel/tests/test_scale.py                # NEW: tests
sgl-kernel/benchmark/bench_scale.py           # NEW: benchmark
```