Improve docs #17

merrymercy · 2024-01-17T03:50:06Z

No description provided.

* Use rms norm kernel instead of vllm * update

* fix decode Signed-off-by: Ivan Butygin <[email protected]> * fix Signed-off-by: Ivan Butygin <[email protected]> --------- Signed-off-by: Ivan Butygin <[email protected]>

* Fix dtype mismatch in rotary embedding with FP8 KV cache When using FP8 KV cache quantization (e.g., with ModelOpt FP8 models), the query and key tensors may have different dtypes during CUDA graph capture. The query tensor remains in bfloat16 for computation, while the key tensor might need to be in FP8 format for KV cache storage. The issue was in DeepseekScalingRotaryEmbedding.forward_native() which only captured query's dtype and then converted both query and key to that same dtype. This caused a dtype mismatch error during CUDA graph capture: "query and key must have the same dtype". The fix preserves the original dtypes of both query and key tensors separately, ensuring they maintain their intended dtypes after the rotary position embedding computation. This resolves the CUDA graph capture failure with Qwen3MoE and other models using FP8 KV cache quantization. * Fix FA4 dtype mismatch with FP8 KV cache When using FlashAttention 4 (FA4) with FP8 KV cache quantization, there was a dtype mismatch between the query tensor (bfloat16) and the cached key/value tensors (FP8). FA4 requires all input tensors (q, k, v) to have the same dtype. The previous code only converted the query to FP8 when NOT using FA4 (fa_impl_ver != 4). This was based on the assumption that FA4 doesn't support FP8, but actually FA4 CAN work with FP8 tensors as long as all tensors have matching dtypes. The key difference is that FA4 doesn't support descale parameters for on-the-fly dequantization (unlike FA3). So we: 1. Convert query to FP8 to match the KV cache dtype for both FA3 and FA4 2. Only set k_descale/v_descale for FA3 (FA4 doesn't support them) This resolves the "query and key must have the same dtype" error when using FP8 KV cache with FA4. --------- Co-authored-by: Cursor Agent <[email protected]>

This reverts commit fc80fbd.

merrymercy added 2 commits January 17, 2024 03:49

improve docs

0740a51

update

038a868

merrymercy merged commit c4707f1 into main Jan 17, 2024

merrymercy deleted the doc branch January 17, 2024 03:53

CSEEduanyu mentioned this pull request Jan 26, 2025

[Bug] NCCL Crash with SIGSEGV Frequently when deploying deepseek v3 #2803

Closed

5 tasks

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025

Improve docs (sgl-project#17)

9202886

chunyuan-w pushed a commit to chunyuan-w/sglang that referenced this pull request Mar 24, 2025

Use rms norm kernel instead of vllm (sgl-project#17)

1d6d2ef

* Use rms norm kernel instead of vllm * update

pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025

Added device agnostic changes to HF Runner. (sgl-project#17)

5820d0c

yuleiqin mentioned this pull request May 26, 2025

[Bug] main pd version Exception: Failed to encode tensor map: 700 #6590

Closed

5 tasks

chunyuan-w pushed a commit to chunyuan-w/sglang that referenced this pull request May 28, 2025

Use rms norm kernel instead of vllm (sgl-project#17)

2b5cf90

* Use rms norm kernel instead of vllm * update

yanbing-j added a commit to yanbing-j/sglang that referenced this pull request Jun 3, 2025

Use rms norm kernel instead of vllm (sgl-project#17)

4e8bc55

* Use rms norm kernel instead of vllm * update

yanbing-j added a commit to yanbing-j/sglang that referenced this pull request Jun 4, 2025

Use rms norm kernel instead of vllm (sgl-project#17)

454b351

* Use rms norm kernel instead of vllm * update

yanbing-j added a commit to yanbing-j/sglang that referenced this pull request Jun 10, 2025

Use rms norm kernel instead of vllm (sgl-project#17)

061b52d

* Use rms norm kernel instead of vllm * update

yanbing-j added a commit to yanbing-j/sglang that referenced this pull request Jun 18, 2025

Use rms norm kernel instead of vllm (sgl-project#17)

56f391c

* Use rms norm kernel instead of vllm * update

pengxin99 pushed a commit to pengxin99/sglang that referenced this pull request Jun 19, 2025

Integrate FP8 GEMV/GEMM into sglang (sgl-project#17)

483f236

Zhou-sx mentioned this pull request Jun 19, 2025

[Bug] Deepseek EP + DP Fail and Accuracy Crush #7041

Closed

5 tasks

ericschreiber mentioned this pull request Aug 13, 2025

[Bug] CUDA error: uncorrectable ECC error encountered when using HiCache with xPyD disaggregation. #9151

Open

5 tasks

gaolaobao mentioned this pull request Aug 25, 2025

[Bug] RTX 5060: RMSNorm failed, same as the #7249 issue, when running qwen2.5-0.5b-instruct model. #9600

Closed

5 tasks

JustinTong0323 added a commit that referenced this pull request Oct 30, 2025

Revert "Fix cuda graph capture dtype mismatch (#17)"

92b9c59

This reverts commit fc80fbd.

key4ng pushed a commit to key4ng/sglang that referenced this pull request Nov 9, 2025

fix: Resolve all mypy type errors (sgl-project#17)

ed5d395

0xymoro mentioned this pull request Nov 10, 2025

[Bug] 0.5.5 custom all reduce crashing #13016

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve docs #17

Improve docs #17

Uh oh!

merrymercy commented Jan 17, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Improve docs #17

Improve docs #17

Uh oh!

Conversation

merrymercy commented Jan 17, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants