Merged
396 commits
da0c026
Tiny assert EPLB is used together with expert parallel (#8381)
fzyzcjy Jul 26, 2025
b7094a5
model: support intern-s1 (#8350)
RunningLeon Jul 26, 2025
5c705b1
Add perf tests for LoRA (#8314)
lifuhuang Jul 26, 2025
7615463
Remove slot usage in code to be backward-compatible with python 3.9 (…
lifuhuang Jul 27, 2025
62a6b7c
Add docker release flow for gb200 (#8394)
kyleliang-nv Jul 27, 2025
528bd1e
HiCache, check before terminate prefetching (#8372)
xiezhq-hermann Jul 27, 2025
426b749
Add nvfp4 scaled mm benchmark. (#8401)
HydraQYH Jul 27, 2025
b602f42
Urgent Fix: intern-s1 chat-template matching (#8403)
JustinTong0323 Jul 27, 2025
ed0fdbf
Tool to dump and compare internal activation tensors (#7976)
fzyzcjy Jul 27, 2025
62222bd
Minor tool for comparison of benchmark results (#7974)
fzyzcjy Jul 27, 2025
e34cf6a
Fix bench script making input data on L2 cache (#7739)
fzyzcjy Jul 27, 2025
85486b6
[NVIDIA] Add Flashinfer MoE blockscale fp8 backend (#8036)
kaixih Jul 27, 2025
91e3d15
Update Cutlass in sgl-kernel to v4.1 (#8392)
Fridge003 Jul 27, 2025
0bcc195
fix: minor fix TransportProxyTensor under tp (#8382)
mickqian Jul 27, 2025
2ab9702
[router] add different policies for p node and d node (#8395)
slin1237 Jul 27, 2025
2a1936d
Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-In…
lambert0312 Jul 27, 2025
36d6f0b
fix: fix the missing metrics on non-rank0 nodes (#7720)
acelyc111 Jul 27, 2025
bf0f448
[2/N] MoE Refactor: Unify weight loader and quant methods (#8397)
ch-wan Jul 27, 2025
5c9c275
Use FlashInfer FP4 gemm. (#8241)
elfiegg Jul 27, 2025
44d600c
Support precomputed_embeddings for Llama 4 (#8156)
AlienKevin Jul 27, 2025
4d921f2
[hotfix] fix merge conflicts in FlashInferEPMoE (#8405)
ch-wan Jul 27, 2025
bf3352c
chore: update CODEOWNERS (#8407)
zhyncs Jul 27, 2025
10ee895
chore: upgrade flashinfer v0.2.9rc2 (#8406)
zhyncs Jul 27, 2025
b3eac16
Support triton kernels v3.4.0 for fused_moe (#8258)
yuan-luo Jul 27, 2025
22e00ee
[Bugfix] Prevent PD server crash from invalid grammar (#8062)
ShangmingCai Jul 27, 2025
95217a9
Change to use native arm runner (#8414)
kyleliang-nv Jul 27, 2025
df90645
Support overlapped lora updates (#8213)
lifuhuang Jul 27, 2025
b58c3c2
Support ue8m0 for triton quant kernel (#7603)
fzyzcjy Jul 27, 2025
e983d66
Fix: Improve test_openai_function_calling unit test and fix reasoning…
byjiang1996 Jul 27, 2025
b47eda3
bugfix: Fix multiple finish_reason chunks and tool_calls finish reaso…
CatherineSue Jul 27, 2025
58dd95f
Fix test_openai_server (#8419)
CatherineSue Jul 27, 2025
bb81dae
Fix docker buildx push error (#8425)
kyleliang-nv Jul 28, 2025
dd487e5
bugfix: Fix XGrammar backend to use model's EOS tokens for constraine…
CatherineSue Jul 28, 2025
fe6a445
[router] improve router logs and request id header (#8415)
slin1237 Jul 28, 2025
2810338
[feat] Support different attention backends for prefill and decode (…
Qiaolin-Yu Jul 28, 2025
4ad9737
chore: bump transformer to 4.54.0 (#8416)
hebiao064 Jul 28, 2025
2fd5c70
[PD] Fix abort_request for PD disaggregation (#8352)
ShangmingCai Jul 28, 2025
6d6a8bc
GLM-4.5 Model Support (#8224)
zRzRzRzRzRzRzR Jul 28, 2025
5922c0c
Remove zstd compression for building Dockerfile.gb200 (#8442)
kyleliang-nv Jul 28, 2025
484d0e0
doc: add bench_one_batch_server in the benchmark doc (#8441)
Qiaolin-Yu Jul 28, 2025
581e7dc
GLM-4.5 Model Support Follow-up (#8445)
byjiang1996 Jul 28, 2025
25f73c6
fix GLM4_MOE launch with compressed_tensor quant model (#8456)
zminglei Jul 28, 2025
fb4ce17
Fix per_token_group_quant_8bit when hidden_dim // group_size is not d…
strgrb Jul 28, 2025
2262369
Revert "[kernel] opt moe align block kernel by block/warp scan algori…
BBuf Jul 28, 2025
45bc170
chore: bump v0.4.9.post5 (#8458)
zhyncs Jul 28, 2025
a9dd3ec
fix: reorder topk experts to ensure shared expert replaces minimal sco…
erictanjn Jul 28, 2025
b582159
Update PR template (#8465)
ispobock Jul 28, 2025
747dd45
feat: throttle requests at scheduler based on --max_queued_requests (…
harrisonlimh Jul 28, 2025
ccfe52a
fix: update dep (#8467)
zhyncs Jul 28, 2025
134fa43
[NVIDIA] Change to use `num_local_experts` (#8453)
kaixih Jul 28, 2025
c8f549d
Fix parsing ChatCompletionMessage (#7273)
Onyad Jul 28, 2025
9c138a0
[3/N] MoE Refactor: Simplify DeepEP Output (#8421)
ch-wan Jul 28, 2025
1466c1b
feat: support glm4 tuning (#8473)
zhyncs Jul 28, 2025
74e7e45
Fix DEEPEP BF16 compatibility for Deepseek Style model like GLM 4.5 (…
hebiao064 Jul 28, 2025
bd51694
Update codeowner (#8476)
merrymercy Jul 28, 2025
3a04aa4
chore: add glm4 fp8 tp8 config (#8478)
zhyncs Jul 28, 2025
8240a6b
chore: add glm 4.5 fp8 tp4 config (#8480)
zhyncs Jul 28, 2025
7c96971
[CI]Add genai-bench Performance Validation for PD Router (#8477)
key4ng Jul 28, 2025
001bffc
Update CODEOWNERS (#8485)
merrymercy Jul 29, 2025
69712e6
Rename the last step in pr-test.yml as pr-test-finish (#8486)
merrymercy Jul 29, 2025
7df2c0c
Reduce memory usage for fp4 moe (#8413)
fzyzcjy Jul 29, 2025
59d0bf0
Tiny add warnings for DeepEP when it is suboptimal (#8426)
fzyzcjy Jul 29, 2025
0ce84c8
Support colocating requests (#7973)
fzyzcjy Jul 29, 2025
fb16fba
Fix incorrect KV cache allocation for MTP models. (#8482)
lifuhuang Jul 29, 2025
2e1d2d7
Add PVC and update resource limits in k8s config (#8489)
haitwang-cloud Jul 29, 2025
6478831
chore: bump v0.4.9.post6 (#8517)
zhyncs Jul 29, 2025
263c923
Always trigger pr-test (#8527)
merrymercy Jul 29, 2025
8136706
Update README.md (#8528)
merrymercy Jul 29, 2025
7a4309c
[sgl-kernel performance] fix fp8 quant kernels dispatch __nv_fp8_e4m3 …
BBuf Jul 29, 2025
4d16c88
Update cutlass_moe.py (#8535)
elfiegg Jul 29, 2025
5973675
Fix moe align kernel test (#8531)
ispobock Jul 29, 2025
a4c3b12
Split the scheduler into multiple mixin classes to reduce the file si…
merrymercy Jul 29, 2025
c0fd77e
bring back kimi vl ci (#8537)
hebiao064 Jul 29, 2025
1992ef9
fix: temporarily disable cuda-ipc for mm data tensor (#8431)
mickqian Jul 29, 2025
9effeb5
Support EPLB in FusedMoE (#8448)
ch-wan Jul 29, 2025
a85ebf5
feat(hicache): support file backend reading directory config from env…
hzh0425 Jul 30, 2025
2fbb754
feature(pd-hicache): Prefill instances support reusing the RemoteStor…
hzh0425 Jul 30, 2025
a9fd803
[router] allow longer time out for router e2e (#8560)
slin1237 Jul 30, 2025
e3f08c7
Update cutlass_moe.py (#8545)
elfiegg Jul 30, 2025
55ecdc0
Update CODEOWNERS (#8562)
ShangmingCai Jul 30, 2025
a730ce8
[feature] [sgl-router] Add a dp-aware routing strategy (#6869)
oldsharp Jul 30, 2025
3bdcdd1
[Hot-Fix] moe_aligned_block_size CI failed in AMD (#8461)
yuan-luo Jul 30, 2025
ec5f944
[Model] Add support for Arcee Foundational Model (#8154)
adarshxs Jul 30, 2025
a79a5d7
Revert "Fix the input tools format and history tool_calls in OpenAI A…
CatherineSue Jul 30, 2025
2998033
Add hf3fs support for hicache storage (based on #7704) (#7280)
pansicheng Jul 31, 2025
66a398f
[router] migrate router from actix to axum (#8479)
slin1237 Jul 31, 2025
9b9e825
[Fix]Fix index oob in get_group_gemm_starts kernel. (#8564)
HydraQYH Jul 31, 2025
67e53b1
Bump transformers to 4.54.1 to fix Gemma cache issue. (#8541)
lifuhuang Jul 31, 2025
659bfd1
Add GKE's default CUDA runtime lib location to PATH and LD_LIBRARY_PA…
pyc96 Jul 31, 2025
59aab76
Bug: Fix google gemma3n-mm audio input not working bug (#8365)
byjiang1996 Jul 31, 2025
a5f5ab4
update sgl-kernel for EP: kernel part (#8514)
ch-wan Jul 31, 2025
43118f5
chore: bump sgl-kernel v0.2.8 (#8599)
zhyncs Jul 31, 2025
5963e50
[bugfix] Fix 2 minor bugs in the hicache storage layer (#8404)
yapple Jul 31, 2025
26c8a31
fix incorrect increase of hit count (#8533)
huangtingwei9988 Jul 31, 2025
d904959
Support l3 cache (mooncake store) for hiradix cache (#7211)
huangtingwei9988 Jul 31, 2025
e179e0b
update sgl-kernel for EP: python part (#8550)
ch-wan Jul 31, 2025
e7dc163
add SVG logo (#8603)
hnyls2002 Jul 31, 2025
32fa1e9
[4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE (#8515)
ch-wan Jul 31, 2025
09f1a24
fix: fork should not run pypi router (#8604)
yihong0618 Jul 31, 2025
51c3816
model: support Step3V (#8583)
CatherineSue Jul 31, 2025
7a1f7fc
[Feature] Hybrid EP and TP (#8590)
ch-wan Jul 31, 2025
0232886
chore: bump v0.4.10 (#8608)
zhyncs Jul 31, 2025
016fd25
[PD] Use batch transfer for rdma transport and add notes for mnnvl us…
ShangmingCai Jul 31, 2025
5d15fb8
[bugfix] QWen-1M context support[2/3] using current cuda stream in th…
sighingnow Jul 31, 2025
3c307dc
Fix hf3fs_fuse import error (#8623)
ispobock Jul 31, 2025
8fbcfd0
Update step3v default config (#8626)
ispobock Jul 31, 2025
ae80777
[ci] fix genai-bench execution cmd (#8629)
slin1237 Jul 31, 2025
aee0ef5
[router] update router pypi version (#8628)
slin1237 Jul 31, 2025
4acf690
[Optimization][Perf] Disable the GC during CUDA graph capture to spee…
b8zhong Jul 31, 2025
061c895
Fix typos in py_test/test_launch_server.py (#6227)
windsonsea Jul 31, 2025
743638b
misc: Remove debug print to logger.info (#8633)
CatherineSue Jul 31, 2025
2cd2e27
SGLang HiCache NIXL Connector (#8488)
vvenkates27 Jul 31, 2025
5c14515
[bug] remove pdlb from minilb since it's no longer available (#8634)
slin1237 Jul 31, 2025
b7170cc
[bugfix] Fix flashinfer cutlass EP moe after MoE refactor (#8630)
trevor-m Jul 31, 2025
3dde861
Conditionally import HiCacheHF3FS (#8598)
pansicheng Jul 31, 2025
4b04998
TRTLLM Gen MLA Decode Kernel Integration (same as #7938) (#8632)
farazkh80 Jul 31, 2025
4a6e7a6
Fix nan value generated after custom all reduce (#8532)
kkHuang-amd Jul 31, 2025
0ad098b
Revert "Fix nan value generated after custom all reduce (#8532)" (#8642)
zhyncs Aug 1, 2025
0491343
Feature/modelscope model download (#8083)
yrk111222 Aug 1, 2025
fe5086f
chore: speedup NPU CI by cache (#8270)
pkking Aug 1, 2025
99795d6
[Bugfix] fix w8a8_int8 load issue (#8308)
iforgetmyname Aug 1, 2025
2886e23
[bugfix] fix router python parser for pd urls (#8644)
slin1237 Aug 1, 2025
f6f46f4
[router] add basic usage doc (#8640)
slin1237 Aug 1, 2025
39decec
[router] upgrade router version to 0.1.8 (#8645)
slin1237 Aug 1, 2025
aa4c66b
[NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE (#8450)
kaixih Aug 1, 2025
9305ea6
HiCache, fixing hash value indexing (#8636)
xiezhq-hermann Aug 1, 2025
dd7ca00
Interface change for kvcache io to support page first layout (#8318)
xiezhq-hermann Aug 1, 2025
e7e5a30
Update batch size limitation of dsv3_router_gemm kernel to 16 (#8051)
Fridge003 Aug 1, 2025
33f0de3
chore: bump v0.4.10.post1 (#8652)
ispobock Aug 1, 2025
20b5563
Add hf3fs_utils.cpp to package-data (#8653)
pansicheng Aug 1, 2025
7e831ef
Fix chat template handling for OpenAI serving (#8635)
JustinTong0323 Aug 1, 2025
c8d3a40
Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay…
byjiang1996 Aug 1, 2025
6c88f6c
[5/N] MoE Refactor: Update MoE parallelism arguments (#8658)
ch-wan Aug 1, 2025
46e9d1c
Increase tolerance to address CI failures (#8643)
lifuhuang Aug 1, 2025
6bdd278
[Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (#8013)
panpan0000 Aug 1, 2025
533cb5b
[DOC]Update sgl-kernel README (#8665)
Hongbosherlock Aug 1, 2025
db7343c
fix per token cuda kernel hidden dim cannot divide by 16 (#8543)
hebiao064 Aug 1, 2025
b17c5b0
fix arg typo for --disaggregation-transfer-backend (#8664)
ZacWang Aug 1, 2025
2d401bd
[fix] fix pd disagg error of vlms (#8094)
ccw1996 Aug 1, 2025
2ae95d1
Disable tp for shared experts under expert parallelism for GLM4.5 mod…
zminglei Aug 1, 2025
6a7528e
[bugfix] Fix page size for create_flashmla_kv_indices_triton() for cu…
trevor-m Aug 1, 2025
ab9b893
[bug] limit bootstrap room to [0, 2^63 - 1] (#8684)
slin1237 Aug 1, 2025
07e46ec
Update CODEOWNERS (#8686)
merrymercy Aug 1, 2025
e252192
Fix deepgemm masked grouped gemm jit compile (#8679)
ispobock Aug 1, 2025
1fe691a
Fix FP8 block quantization when N or K is not multiples of 128 (#8648)
yanbing-j Aug 1, 2025
d1c4d51
bugfix(hicache): Fix 'MooncakeStore' not defined error. (#8668)
hzh0425 Aug 1, 2025
5deab12
upgrade xgrammar 0.1.22 (#8522)
Swipe4057 Aug 1, 2025
b89d37c
[bugfix] Add 'disaggregation_mode' parameter to warmup function when …
lbh2001 Aug 1, 2025
82e6c3a
Add support for NCCL symmetric memory for TP allreduces (#8238)
nvcastet Aug 1, 2025
f642524
[1/2] sgl-kernel: Fuse routed scaling factor into select_experts (#8364)
trevor-m Aug 2, 2025
b27b119
chore(gb200): update dockerfile to handle fp4 disaggregation (#8694)
ishandhanani Aug 2, 2025
89caf7a
[bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 (#8…
trevor-m Aug 2, 2025
4bec99e
Fix: resolve prefill of retracted request out-of-memory issue when ig…
GaoYusong Aug 2, 2025
ea93079
model: adapt mllama4 to VisionAttention (#8512)
wenchen76 Aug 2, 2025
4ca43b0
Add tensor.detach() back to update weight util (#8691)
hebiao064 Aug 2, 2025
ac6962c
[Doc] Polish sgl-kernel readme for cu126 build error (#8704)
FlamingoPg Aug 2, 2025
f9f0138
Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_expe…
hnyls2002 Aug 2, 2025
6d4fd88
[router] minor code clean up and and refactoring (#8711)
slin1237 Aug 2, 2025
603f5ce
[Bug] fix green context's incompatibility with `cuda < 12.4` (#8701)
hnyls2002 Aug 2, 2025
0a56b72
chore: bump sgl-kernel v0.2.9 (#8713)
zhyncs Aug 2, 2025
403566b
Remove assertions about per group quant fp8 (#8717)
fzyzcjy Aug 3, 2025
e314b08
[FIX] Fix the nightly CI by disabling swa mem pool for gemma2 (#8693)
merrymercy Aug 3, 2025
8ada1ab
Fix triton moe error caused by TopK refactor (#8705)
fzyzcjy Aug 3, 2025
828a4fe
[router] Implement HTTP Dependency Injection Pattern for Router Syste…
slin1237 Aug 3, 2025
e273aa6
[Feature] Radix Tree in C++ (#7369)
DarkSharpness Aug 3, 2025
d9def43
[Perf]Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwis…
HydraQYH Aug 3, 2025
9f47d68
Fix fused MoE when `routed_scaling_factor is None` (#8709)
hnyls2002 Aug 3, 2025
0e612db
Tiny fix CI pytest error (#8524)
fzyzcjy Aug 3, 2025
a437aa9
[hotfix] fix mixtral with tensor-level compressed-tensor quantization…
ch-wan Aug 3, 2025
8675bdf
Support limiting max loaded loras in CPU. (#8650)
lifuhuang Aug 3, 2025
0305c50
Reduce memory accumulation in long-running server (#8306)
Edenzzzz Aug 3, 2025
b0add2d
HiCache storage, style change and bug fix (#8719)
xiezhq-hermann Aug 3, 2025
f7b2853
[feat] support minimum token load balance in dp attention (#7379)
WANG-GH Aug 3, 2025
32f2815
Do layernorm before allgather for DP attention (#8631)
trevor-m Aug 3, 2025
7ed8e51
[fix] Fix divide by zero error for llama4. (#8683)
shenoyvvarun Aug 3, 2025
a31b7a7
feat: Add new moe triton for NVIDIA RTX 6000 Ada (#8547)
17Reset Aug 3, 2025
6f9baf1
[Improvements] Merge health check route (#8444)
whybeyoung Aug 3, 2025
5ce5093
chore: bump sgl-kernel 0.3.0 with torch 2.8.0 (#8718)
zhyncs Aug 3, 2025
7a91330
Save cuda graph memory for fa3 (#8567)
ch-wan Aug 3, 2025
cb099d2
[CUDA Graph] save cuda graph memory by using next_token_logits_buffer…
ch-wan Aug 3, 2025
0e0eef0
[DP] fix the compatibility issue between DP attention and `--attentio…
ch-wan Aug 3, 2025
8cd3445
chore: bump v0.4.10.post2 (#8727)
zhyncs Aug 3, 2025
00da906
feat: Support DP Attention for step3_vl (#8699)
yhyang201 Aug 3, 2025
3435a24
[RL] fix update weight for FusedMoE with EP (#8676)
zhuzilin Aug 3, 2025
760286e
use fp32 for e_score_correction_bias in GLM-4.5 (#8729)
zRzRzRzRzRzRzR Aug 3, 2025
0242bb9
Fix triton kernels topk with keyword arguments (#8732)
ispobock Aug 3, 2025
e67276e
feat: support cutlass_moe_fp8 kernel for fusedmoe in sm90 (#8678)
TianQiLin666666 Aug 3, 2025
ed6f759
Fix the missing 'lof' choice of --schedule-policy server args (#7114)
acelyc111 Aug 3, 2025
76ba5bb
fix args typo in memory_pool_host (#8662)
huangtingwei9988 Aug 3, 2025
7a27e79
[CI] Do not trigger pd-disaggregation CI in draft PR (#8737)
hnyls2002 Aug 3, 2025
b102353
[MoE] Enable `renormalize=False` in Triton kernels (#8735)
ch-wan Aug 4, 2025
f024795
Replace torch.jit.script with torch.compile in get_masked_input_and_m…
YyWangCS Aug 4, 2025
3b87a9e
Fix bug of refactoring TopKOutput in w4afp8 (#8745)
yuan-luo Aug 4, 2025
f2d68de
Rename lora_path to lora_id in batches (#8437)
Fridge003 Aug 4, 2025
f57d2dc
[sgl-kernel] avoid per_token_quant_fp8.cu hardcode sm_count (#8738)
BBuf Aug 4, 2025
fee0ab0
[CI] Ascend NPU CI enhancement (#8294)
iforgetmyname Aug 4, 2025
36fc926
[bugfix] fix import path in HiCacheController (#8749)
lbh2001 Aug 4, 2025
915140f
[NVIDIA] Add Low Latency NVFP4 decode kernels from Flashinfer (#8552)
azhurkevich Aug 4, 2025
2fa0462
[router] introduce dp worker abstraction (#8639)
slin1237 Aug 4, 2025
9bd4872
[bugfix] Fix typo in modelopt quant: 'FusedMoE' object has no attribu…
trevor-m Aug 4, 2025
fc8c8e5
Integrate triton_kernels in sgl-kernel (#8762)
Qiaolin-Yu Aug 4, 2025
02bc1c7
chore: bump sgl-kernel v0.3.1 (#8771)
zhyncs Aug 4, 2025
6d0646d
[NVIDIA] Fix breakage of using trtllm-gen fp8 moe (#8773)
kaixih Aug 4, 2025
7cb2075
[Fix] Fix several issues preventing gemma3n LoRA support. (#8776)
lifuhuang Aug 5, 2025
d4bf5a8
Support OCP MXFP4 quantization on AMD GPUs (#8255)
kkHuang-amd Aug 5, 2025
08f8f49
[CPU][sgl-kernel] biased_grouped_topk: fix correction_bias dtype to f…
chunyuan-w Aug 5, 2025
d98a491
[PD] Refactor parallel sizes and add pp support for mooncake (#8571)
ShangmingCai Aug 5, 2025
354ac43
[pd-router] Add Configurable Retry Logic for reduce backend pressure …
slin1237 Aug 5, 2025
1ea94d3
chore: upgrade flashinfer v0.2.9 (#8780)
zhyncs Aug 5, 2025
b01eeb8
[NVIDIA]Fix local_num_experts for EP (#8779)
wenscarl Aug 5, 2025
873f384
[feat] Add detail in image_data (#8596)
yuhyao Aug 5, 2025
5e91fed
Revert "[NVIDIA]Fix local_num_experts for EP (#8779)" (#8797)
zhyncs Aug 5, 2025
194561f
feat: support sgl-kernel cu129 (#8800)
zhyncs Aug 5, 2025
75df31b
chore: bump sgl-kernel v0.3.2 (#8802)
zhyncs Aug 5, 2025
40e3b2b
feat: add trtllm-gen mha from direct call (#8782)
yyihuang Aug 5, 2025
a4b0d5c
GLM-4.5 and GLM-4.5-Air both support (#8804)
zRzRzRzRzRzRzR Aug 5, 2025
8e8545c
fix: update cmake (#8817)
zhyncs Aug 5, 2025
901ab75
chore: upgrade transformers 4.55.0 (#8823)
zhyncs Aug 5, 2025
4f4e0e4
chore: upgrade flashinfer 0.2.10 (#8827)
zhyncs Aug 5, 2025
32d9e39
Fix potential memory fault issue and ncclSystemError in CI test (#8681)
kkHuang-amd Aug 5, 2025
4ef4783
feat: use py312 (#8832)
zhyncs Aug 5, 2025
556e414
fix: remove unused import (#8809)
zhyncs Aug 5, 2025
c1d2061
Add initial support for gpt-oss (#8824)
Ying1123 Aug 5, 2025
3ae8e3e
chore: upgrade torch 2.8.0 (#8836)
zhyncs Aug 6, 2025
5d62b56
[router] complete router oai spec (#8828)
slin1237 Aug 6, 2025
8128e08
Turn off hybrid cache by default (#8839)
ispobock Aug 6, 2025
d26ca84
Support bailing moe (#8680)
ppraneth Aug 6, 2025
ca47e24
[Feature] improve TBO: two chunk overlap (#8144)
House-West Aug 6, 2025
8c7bb39
[router] PD Router Simplification and Reorganization (#8838)
slin1237 Aug 6, 2025
8958817
[1/3] Optimize Slime Update Weights: Remove QWen3MOE Load Weight Over…
hebiao064 Aug 6, 2025
cbbb738
[2/3] Optimize Slime Update Weights: Avoid GPU-to-CPU Device Sync wh…
hebiao064 Aug 6, 2025
168033d
Support mxfp4 for GPT-OSS (#8843)
Ying1123 Aug 6, 2025
4fc5f2f
Add unit test for triton swa kernel (#8853)
ispobock Aug 6, 2025
aeac900
fix: resolve ci issue (#8859)
zhyncs Aug 6, 2025
1bd5316
fix benchmark fp8 blockwise group gemm (#8815)
yuan-luo Aug 6, 2025
399e7ec
Refine naming (#8868)
ispobock Aug 6, 2025
0475448
Optimize triton swa kernel by skipping computation (#8860)
ispobock Aug 6, 2025
b114a81
Support B200 in CI (#8861)
fzyzcjy Aug 6, 2025
01c99a9
chore: update Dockerfile (#8872)
mickqian Aug 6, 2025
288ae41
[NVIDIA] Fix num_experts in modelopt_quant (#8811)
wenscarl Aug 6, 2025
78aad91
[CI] fix pip upgrade (#8881)
ch-wan Aug 6, 2025
cbbd685
chore: use torch 2.8 stable (#8880)
zhyncs Aug 6, 2025
92cc32d
Support v1/responses and use harmony in serving_chat (#8837)
CatherineSue Aug 6, 2025
c0e8429
Use reduce scatter for DP (#8539)
trevor-m Aug 6, 2025
4373df5
add flashinfer mxfp4 (#8847)
BBuf Aug 6, 2025
5b6acc1
fix glm4 moe (#8883)
ch-wan Aug 7, 2025
6ad6c8c
feat: openai oss attention sink support with trtllm-gen backend #8825…
yyihuang Aug 7, 2025
6210e2c
Support GPU pinning for LoRA (#8697)
lifuhuang Aug 7, 2025
3fa3c6c
Enables force reasoning based on chat template for Qwen3-Thinking (#8…
JustinTong0323 Aug 7, 2025
4f2e149
[AMD] Pull latest SGLang version for AMD CI (#8787)
michaelzhang-ai Aug 7, 2025
29 changes: 13 additions & 16 deletions .github/CODEOWNERS
@@ -1,23 +1,20 @@
/3rdparty/amd @HaiShaw
.github @merrymercy @zhyncs
/docker @zhyncs @HaiShaw @ByronHsu
/docs @zhaochenyang20
/python/sglang/lang @merrymercy @Ying1123 @hnyls2002 @ByronHsu
/python/sglang/srt @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock @ByronHsu
/python/pyproject.toml @merrymercy @zhyncs
/python/sglang/* @merrymercy @Ying1123 @zhyncs @hnyls2002
/python/sglang/srt/constrained @hnyls2002
/python/sglang/srt/disaggregation @hnyls2002 @ByronHsu
/python/sglang/srt/distributed @yizhang2077
/python/sglang/srt/entrypoints @zhaochenyang20
/python/sglang/srt/entrypoints/openai @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock @ByronHsu @CatherineSue
/python/sglang/srt/layers @merrymercy @Ying1123 @zhyncs @ispobock @HaiShaw @ch-wan @BBuf
/python/sglang/srt/disaggregation @ByronHsu @hnyls2002
/python/sglang/srt/disaggregation/mooncake @ShangmingCai
/python/sglang/srt/distributed @yizhang2077 @merrymercy
/python/sglang/srt/entrypoints @ispobock @CatherineSue @slin1237 @merrymercy
/python/sglang/srt/eplb @fzyzcjy
/python/sglang/srt/function_call @CatherineSue
/python/sglang/srt/layers @merrymercy @Ying1123 @zhyncs @ispobock @HaiShaw @ch-wan @BBuf @kushanam
/python/sglang/srt/lora @Ying1123 @Fridge003
/python/sglang/srt/managers @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
/python/sglang/srt/mem_cache @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
/python/sglang/srt/model_executor @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock
/python/sglang/srt/models @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock @ByronHsu @zhaochenyang20
/python/sglang/srt/sampling @merrymercy @hnyls2002
/python/sglang/srt/speculative @Ying1123 @merrymercy @rkooo567 @kssteven418
/python/sglang/srt/multimodal @mickqian @JustinTong0323
/test/lang @merrymercy @Ying1123 @ByronHsu
/test/srt @merrymercy @Ying1123 @zhyncs
/sgl-router @ByronHsu @Ying1123 @slin1237
/sgl-kernel @zhyncs @ispobock @HandH1998 @BBuf @yizhang2077 @merrymercy @yinfan98 @HaiShaw
/python/sglang/srt/speculative @Ying1123 @merrymercy @rkooo567 @kssteven418
/sgl-kernel @zhyncs @ispobock @HandH1998 @BBuf @yizhang2077 @merrymercy @FlamingoPg @HaiShaw
/sgl-router @slin1237 @ByronHsu
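In the CODEOWNERS file above, more specific paths (e.g. `/python/sglang/srt/eplb`) appear after broader ones (e.g. `/python/sglang/srt`) because GitHub resolves ownership with last-match-wins semantics: for each changed file, the last rule whose pattern matches determines the reviewers. A minimal shell sketch of that lookup, simplified to anchored directory prefixes only (real CODEOWNERS also supports globs and unanchored patterns), using two rules taken from the file above:

```shell
# Hypothetical changed file to resolve owners for.
path="python/sglang/srt/eplb/expert_location.py"

owners=""
# Read rules in file order; a later matching rule overwrites an earlier one.
while read -r pattern rule_owners; do
  case "/$path" in
    "$pattern" | "$pattern"/*) owners="$rule_owners" ;;
  esac
done <<'EOF'
/python/sglang/srt @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock @ByronHsu
/python/sglang/srt/eplb @fzyzcjy
EOF

echo "owners for $path: $owners"
```

Here both rules match, but the second (more specific) one is listed later, so `@fzyzcjy` wins — which is why ordering within CODEOWNERS matters as much as the patterns themselves.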
8 changes: 8 additions & 0 deletions .github/pull_request_template.md
@@ -8,6 +8,14 @@

<!-- Describe the changes made in this PR. -->

## Accuracy Test

<!-- If this PR affects model-side code (e.g., kernels, model architecture), please provide accuracy test results. Ref: https://docs.sglang.ai/references/accuracy_evaluation.html -->

## Benchmark & Profiling

<!-- If this PR is expected to impact performance, please provide benchmark and profiling results. Ref: https://docs.sglang.ai/references/benchmark_and_profiling.html -->

## Checklist

- [ ] Format your code according to the [Code Formatting with Pre-Commit](https://docs.sglang.ai/references/contribution_guide.html#code-formatting-with-pre-commit).
2 changes: 1 addition & 1 deletion .github/workflows/pr-test-amd.yml
@@ -291,7 +291,7 @@ jobs:
bash scripts/amd_ci_exec.sh python3 run_suite.py --suite per-commit-8-gpu-amd --timeout-per-file 3600

- name: Run CustomAllReduce test
- timeout-minutes: 10
+ timeout-minutes: 20
run: |
bash scripts/amd_ci_exec.sh -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m unittest test_custom_allreduce.TestCustomAllReduce

130 changes: 130 additions & 0 deletions .github/workflows/pr-test-npu.yml
@@ -0,0 +1,130 @@
name: PR Test (Ascend NPU)

on:
push:
branches: [ main ]
paths:
- "python/**"
- "scripts/**"
- "test/**"
- ".github/workflows/pr-test-npu.yml"
pull_request:
branches: [ main ]
paths:
- "python/**"
- "scripts/**"
- "test/**"
- ".github/workflows/pr-test-npu.yml"
workflow_dispatch:

concurrency:
group: pr-test-npu-${{ github.ref }}
cancel-in-progress: true

jobs:
per-commit-1-ascend-npu:
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
github.event.pull_request.draft == false
runs-on: linux-arm64-npu-1
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.2.rc1.alpha003-910b-ubuntu22.04-py3.11
steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Install dependencies
run: |
bash scripts/npu_ci_install_dependency.sh
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy download through proxy
curl -o /tmp/test.jsonl -L https://gh-proxy.test.osinfra.cn/https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl

- name: Run test
timeout-minutes: 30
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
run: |
cd test/srt
python3 run_suite.py --suite per-commit-1-ascend-npu

per-commit-2-ascend-npu:
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
github.event.pull_request.draft == false
runs-on: linux-arm64-npu-2
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.2.rc1.alpha003-910b-ubuntu22.04-py3.11
steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Install dependencies
run: |
bash scripts/npu_ci_install_dependency.sh
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy download through proxy
curl -o /tmp/test.jsonl -L https://gh-proxy.test.osinfra.cn/https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl

- name: Run test
timeout-minutes: 30
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
run: |
cd test/srt
python3 run_suite.py --suite per-commit-2-ascend-npu

per-commit-4-ascend-npu:
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
github.event.pull_request.draft == false
runs-on: linux-arm64-npu-4
container:
image: swr.cn-southwest-2.myhuaweicloud.com/base_image/ascend-ci/cann:8.2.rc1.alpha003-910b-ubuntu22.04-py3.11
steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Install dependencies
run: |
bash scripts/npu_ci_install_dependency.sh
# copy required file from our daily cache
cp ~/.cache/modelscope/hub/datasets/otavia/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json /tmp
# copy download through proxy
curl -o /tmp/test.jsonl -L https://gh-proxy.test.osinfra.cn/https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl

- name: Run test
timeout-minutes: 30
env:
SGLANG_USE_MODELSCOPE: true
SGLANG_IS_IN_CI: true
HF_ENDPOINT: https://hf-mirror.com
TORCH_EXTENSIONS_DIR: /tmp/torch_extensions
run: |
cd test/srt
python3 run_suite.py --suite per-commit-4-ascend-npu --timeout-per-file 3600

finish:
if: always()
needs:
- per-commit-1-ascend-npu
- per-commit-2-ascend-npu
- per-commit-4-ascend-npu
runs-on: ubuntu-latest
steps:
- name: Check all dependent job statuses
run: |
results=(${{ join(needs.*.result, ' ') }})
for result in "${results[@]}"; do
if [ "$result" = "failure" ] || [ "$result" = "cancelled" ]; then
echo "Job failed with result: $result"
exit 1
fi
done
echo "All jobs completed successfully"
exit 0
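The `finish` job above fans in the three NPU jobs via `needs` and fails the gate if any dependent job failed or was cancelled (`skipped` and `success` both pass). The same loop can be exercised locally; here a hard-coded array of hypothetical job outcomes stands in for `${{ join(needs.*.result, ' ') }}`:

```shell
# Local simulation of the "finish" gate. In CI, `results` comes from
# `${{ join(needs.*.result, ' ') }}`; these outcomes are hypothetical.
results=(success success skipped)

status=0
for result in "${results[@]}"; do
  # Each dependent job reports success, failure, cancelled, or skipped;
  # only failure and cancelled should fail the gate.
  if [ "$result" = "failure" ] || [ "$result" = "cancelled" ]; then
    echo "Job failed with result: $result"
    status=1
  fi
done
if [ "$status" -eq 0 ]; then
  echo "All jobs completed successfully"
fi
```

Treating `skipped` as passing is what lets the gate run under `if: always()` without failing when an upstream job is intentionally skipped (e.g. on draft PRs).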