Conversation

@merrymercy
Contributor

No description provided.

@merrymercy merrymercy merged commit c4707f1 into main Jan 17, 2024
@merrymercy merrymercy deleted the doc branch January 17, 2024 03:53
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
chunyuan-w pushed a commit to chunyuan-w/sglang that referenced this pull request Mar 24, 2025
* Use rms norm kernel instead of vllm

* update
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025
chunyuan-w pushed a commit to chunyuan-w/sglang that referenced this pull request May 28, 2025
yanbing-j added a commit to yanbing-j/sglang that referenced this pull request Jun 3, 2025
yanbing-j added a commit to yanbing-j/sglang that referenced this pull request Jun 4, 2025
yanbing-j added a commit to yanbing-j/sglang that referenced this pull request Jun 10, 2025
yanbing-j added a commit to yanbing-j/sglang that referenced this pull request Jun 18, 2025
pengxin99 pushed a commit to pengxin99/sglang that referenced this pull request Jun 19, 2025
yichiche pushed a commit to yichiche/sglang that referenced this pull request Jul 30, 2025
* fix decode

Signed-off-by: Ivan Butygin <[email protected]>

* fix

Signed-off-by: Ivan Butygin <[email protected]>

---------

Signed-off-by: Ivan Butygin <[email protected]>
yichiche pushed a commit to yichiche/sglang that referenced this pull request Aug 7, 2025
yichiche pushed a commit to yichiche/sglang that referenced this pull request Aug 11, 2025
JustinTong0323 added a commit that referenced this pull request Oct 30, 2025
* Fix dtype mismatch in rotary embedding with FP8 KV cache

When using FP8 KV cache quantization (e.g., with ModelOpt FP8 models),
the query and key tensors may have different dtypes during CUDA graph
capture. The query tensor remains in bfloat16 for computation, while
the key tensor might need to be in FP8 format for KV cache storage.

The issue was in DeepseekScalingRotaryEmbedding.forward_native() which
only captured query's dtype and then converted both query and key to
that same dtype. This caused a dtype mismatch error during CUDA graph
capture: "query and key must have the same dtype".

The fix preserves the original dtypes of both query and key tensors
separately, ensuring they maintain their intended dtypes after the
rotary position embedding computation.

This resolves the CUDA graph capture failure with Qwen3MoE and other
models using FP8 KV cache quantization.
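
The dtype handling described above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the `Tensor` class is a hypothetical stand-in for a framework tensor, `rotate` is a placeholder for the rotary math, and the function names `apply_rope_single_dtype` and `apply_rope_per_tensor_dtype` are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    """Hypothetical stand-in for a framework tensor: values plus a dtype tag."""
    values: list
    dtype: str

    def to(self, dtype):
        # Mimics a dtype cast by returning a copy tagged with the new dtype.
        return Tensor(list(self.values), dtype)


def rotate(t):
    # Placeholder for the rotary position embedding math (assumed to run
    # in float32); the values are passed through unchanged here.
    return Tensor(list(t.values), "float32")


def apply_rope_single_dtype(query, key):
    # Behaviour described as the issue: only the query dtype is captured,
    # and both outputs are cast to it.
    dtype = query.dtype
    return rotate(query).to(dtype), rotate(key).to(dtype)


def apply_rope_per_tensor_dtype(query, key):
    # Behaviour described as the fix: each tensor keeps its own dtype.
    q_dtype, k_dtype = query.dtype, key.dtype
    return rotate(query).to(q_dtype), rotate(key).to(k_dtype)
```

With a bfloat16 query and an FP8 key, the first variant returns a bfloat16 key, while the second returns the key in its original FP8 dtype.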

* Fix FA4 dtype mismatch with FP8 KV cache

When using FlashAttention 4 (FA4) with FP8 KV cache quantization,
there was a dtype mismatch between the query tensor (bfloat16) and
the cached key/value tensors (FP8). FA4 requires all input tensors
(q, k, v) to have the same dtype.

The previous code only converted the query to FP8 when NOT using FA4
(fa_impl_ver != 4). This was based on the assumption that FA4 doesn't
support FP8, but actually FA4 CAN work with FP8 tensors as long as
all tensors have matching dtypes.

The key difference is that FA4 doesn't support descale parameters for
on-the-fly dequantization (unlike FA3). So we:
1. Convert query to FP8 to match the KV cache dtype for both FA3 and FA4
2. Only set k_descale/v_descale for FA3 (FA4 doesn't support them)

This resolves the "query and key must have the same dtype" error when
using FP8 KV cache with FA4.
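
The two-step rule above might look like the following in outline. This is a sketch under stated assumptions, not the backend code: `plan_attention_call`, the `FP8_DTYPES` set, and the returned dictionary layout are all hypothetical names chosen for illustration; only `fa_impl_ver`, `k_descale`, and `v_descale` come from the commit message.

```python
# Hypothetical set of dtype names treated as FP8 in this sketch.
FP8_DTYPES = {"float8_e4m3fn", "float8_e5m2"}


def plan_attention_call(fa_impl_ver, q_dtype, kv_cache_dtype,
                        k_scale=None, v_scale=None):
    """Decide the query dtype and descale arguments for an attention call."""
    plan = {"q_dtype": q_dtype, "k_descale": None, "v_descale": None}
    if kv_cache_dtype not in FP8_DTYPES:
        # No FP8 KV cache: nothing to convert, no descale needed.
        return plan
    # Step 1: the query is cast to the KV cache dtype for both FA3 and FA4,
    # so that q, k and v share one dtype.
    plan["q_dtype"] = kv_cache_dtype
    # Step 2: descale parameters are passed only for FA3.
    if fa_impl_ver == 3:
        plan["k_descale"] = k_scale
        plan["v_descale"] = v_scale
    return plan
```

For example, `plan_attention_call(4, "bfloat16", "float8_e4m3fn", 0.5, 0.25)` yields an FP8 query dtype with both descale entries left as `None`.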

---------

Co-authored-by: Cursor Agent <[email protected]>
JustinTong0323 added a commit that referenced this pull request Oct 30, 2025
key4ng pushed a commit to key4ng/sglang that referenced this pull request Nov 9, 2025