
[refactor] Move trtllm_fp8_kv_kernel to triton_ops directory#15044

Merged
Fridge003 merged 2 commits into sgl-project:main from harvenstar:move-trtllm-fp8-kv-to-triton-ops on Dec 14, 2025
Conversation

@harvenstar (Collaborator) commented Dec 13, 2025

PR Description:

Move trtllm_fp8_kv_kernel.py to python/sglang/srt/layers/attention/triton_ops/ for
better code organization, as suggested in #14553.

@b8zhong @Fridge003

Updated all import references accordingly.
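The move itself is mechanical; as a rough sketch (the caller file and the imported symbol below are hypothetical stand-ins, not actual sglang call sites), updating the import references amounts to a path rewrite like this:

```shell
# Sketch only: how import call sites get rewritten after moving a module.
# The caller file and imported symbol are illustrative stand-ins.
set -eu
tmp=$(mktemp -d)
cat > "$tmp/caller.py" <<'EOF'
from sglang.srt.layers.attention.trtllm_fp8_kv_kernel import some_kernel
EOF
# Rewrite the old module path to the new triton_ops location in all callers.
sed -i.bak \
  's/attention\.trtllm_fp8_kv_kernel/attention.triton_ops.trtllm_fp8_kv_kernel/g' \
  "$tmp"/*.py
rewritten=$(grep -c 'triton_ops\.trtllm_fp8_kv_kernel' "$tmp/caller.py")
echo "rewritten imports: $rewritten"
rm -rf "$tmp"
```

In a real repo you would run the same `sed` over `git grep -l 'trtllm_fp8_kv_kernel'` output rather than a temp directory.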

Testing

Tested with the serving benchmark; no regression observed.

Command:
CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --trust-remote-code \
  --tp 4 \
  --attention-backend trtllm_mha \
  --kv-cache-dtype fp8_e4m3

python3 -m sglang.bench_serving \
  --backend sglang-oai \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --random-range-ratio 0.98 \
  --num-prompts 80 \
  --max-concurrency 16

Results:
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     80
Benchmark duration (s):                  79.99
Total input tokens:                      81050
Total input text tokens:                 81050
Total input vision tokens:               0
Total generated tokens:                  81085
Total generated tokens (retokenized):    10185
Request throughput (req/s):              1.00
Input token throughput (tok/s):          1013.21
Output token throughput (tok/s):         1013.65
Peak output token throughput (tok/s):    1200.00
Peak concurrent requests:                23
Total token throughput (tok/s):          2026.85
Concurrency:                             15.92
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   15922.70
Median E2E Latency (ms):                 16067.41
---------------Time to First Token----------------
Mean TTFT (ms):                          414.21
Median TTFT (ms):                        282.49
P99 TTFT (ms):                           839.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.31
Median TPOT (ms):                        15.47
P99 TPOT (ms):                           16.62
---------------Inter-Token Latency----------------
Mean ITL (ms):                           15.32
Median ITL (ms):                         13.45
P95 ITL (ms):                            13.82
P99 ITL (ms):                            28.76
Max ITL (ms):                            454.27
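As a quick consistency check, the headline throughput figures above follow from the raw counts divided by the benchmark duration (small differences arise because the duration is displayed rounded to two decimals):

```python
# Recompute the reported throughputs from the raw benchmark counts above.
duration_s = 79.99      # benchmark duration (displayed, rounded)
requests = 80           # successful requests
input_tokens = 81050    # total input tokens
output_tokens = 81085   # total generated tokens

req_per_s = requests / duration_s           # reported: 1.00 req/s
in_tok_per_s = input_tokens / duration_s    # reported: 1013.21 tok/s
out_tok_per_s = output_tokens / duration_s  # reported: 1013.65 tok/s

print(f"{req_per_s:.2f} {in_tok_per_s:.2f} {out_tok_per_s:.2f}")
```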

@gemini-code-assist (Contributor) commented:

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions bot added the blackwell SM100/SM120 label Dec 13, 2025
@harvenstar (Collaborator, Author) commented:

/tag-and-rerun-ci

@harvenstar (Collaborator, Author) commented:

A small follow-up PR to #14553 (file location adjustment); please review. @Fridge003, thanks!

@Fridge003 (Collaborator) commented:

/tag-and-rerun-ci

@b8zhong (Collaborator) commented Dec 14, 2025:

[screenshot attached: 2025-12-14 at 1:00:15 PM]
@b8zhong b8zhong enabled auto-merge (squash) December 14, 2025 21:00
@Fridge003 Fridge003 disabled auto-merge December 14, 2025 23:22
@Fridge003 Fridge003 merged commit 99cb2ed into sgl-project:main Dec 14, 2025
70 of 75 checks passed