[NVIDIA] Add Low Latency NVFP4 decode kernels from Flashinfer#8552
Conversation
| # Additional parameter needed for TRT-LLM | ||
| layer.g1_scale_c = Parameter( | ||
| (layer.w2_input_scale_quant * layer.g1_alphas).to(torch.float32), |
FC1 is nvfp4 x nvfp4 -> nvfp4, so the scaleC factor for FC1 is dequantA * dequantB * quantC. I am not sure what g1_alphas is here.
g1_alphas is used for scaleGated. ScaleGate should be dequantA * dequantB.
scaleC for FC2 must be dequantA * dequantB, as it takes nvfp4 inputs and outputs bf16. Just checking that the logic is as expected.
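The scale-factor algebra being discussed can be sketched in plain Python (a hedged illustration only: per-tensor scaling is assumed for clarity, and the names follow the quoted code, not the kernel's actual scale layout):

```python
# Hedged sketch of how the epilogue scale factors compose for the two MoE
# GEMMs discussed above. Simple per-tensor scaling is an assumption.

def fc1_scale_c(dequant_a: float, dequant_b: float, quant_c: float) -> float:
    # FC1 is nvfp4 x nvfp4 -> nvfp4: the accumulator is dequantized by both
    # input scales, then re-quantized for the nvfp4 output.
    return dequant_a * dequant_b * quant_c

def scale_gate(dequant_a: float, dequant_b: float) -> float:
    # scaleGate (what g1_alphas feeds): the gate branch is consumed in higher
    # precision, so only the two dequant factors apply.
    return dequant_a * dequant_b

def fc2_scale_c(dequant_a: float, dequant_b: float) -> float:
    # FC2 is nvfp4 x nvfp4 -> bf16: no output re-quantization, so scaleC is
    # just dequantA * dequantB.
    return dequant_a * dequant_b
```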
Seems to work. I ran some quick evals: flashinfer trtllmgen MoE vs. the baseline and flashinfer CUTLASS MoE (CUDA graph disabled for the latter, as it was segfaulting at high question counts):
@azhurkevich @kushanam Could you help rebase this? Thanks!
Yeah, I'll rebase it. Working on perf now.
Commands used to launch and bench the server, for repro steps:
- Default backend:
- CUTLASS fused_moe flashinfer backend:
- trtllmgen backend:
- Universal command to run benchmarking:
Default backend perf:

CUTLASS fused_moe perf:

trtllmgen perf:
@ch-wan Let me help with resolving the merge conflict.
Force-pushed 04f3c1f to 0ff4840.
@ch-wan oof, sorry, just noticed your comment. I squashed everything and am doing the final rebase. Do you want me to do it, or would you like to?

@azhurkevich go ahead
Force-pushed 0ff4840 to bc9fb6c.
| self.disable_shared_experts_fusion = True | ||
| logger.warning( | ||
| "FlashInfer TRTLLM MoE is enabled. --disable-shared-experts-fusion is automatically set." | ||
| ) |
Is this not required anymore?
Hi @trevor-m, can you help check this? Thanks! If so, could you submit a PR for this, @pavanimajety?
What flashinfer_python version does it require? I tried 0.2.3 through 0.2.7.post1; they lack RoutingMethodType, which is newly added in this PR.
@yuan-luo you need at least flashinfer v0.2.9rc1. These kernels didn't exist in flashinfer before that release.
Is it possible to make this feature optional if we are not using fp4? I guess it may impact lots of users; for example, the local flashinfer repo mirror in my environment only goes up to v0.2.7.post1.
…oject#8552) Co-authored-by: Cheng Wan <cwan@x.ai>


Motivation
Bring the best low-latency NVFP4 kernels for Blackwell MoE. Currently enabling DSR1.
Modifications
Changing some weight-preprocessing logic as well as exposing these kernels, plus various plumbing to make it work.
Accuracy Test
Ran accuracy tests; see the description and repro steps below.
Benchmark & Profiling
Added below with repros.
Checklist