
[CPU] Add native support for Qwen3-next #12305

Closed
blzheng wants to merge 6 commits into sgl-project:main from blzheng:beilei/qwen3_next_native

Conversation


@blzheng blzheng commented Oct 29, 2025

Motivation

This PR adds native support for Qwen3-next on CPU.

Modifications

  1. Add CPU-native implementations for the following operations:
    a. causal_conv1d_fn
    b. causal_conv1d_update
    c. chunk_gated_delta_rule
    d. fused_sigmoid_gating_delta_rule_update
    e. fused_gdn_gating
    f. Qwen3NextRMSNormGated
  2. Fix issues in the AMX backend:
    a. Weight packing dtype check: weight packing did not support torch.float, so this PR adds dtype validation before packing weights.
    b. HybridLinearKVPool layer ID handling: only full attention layers can access get_value_buffer, but layer_id = 0 is not always a full attention layer. This PR updates the logic to handle such cases correctly.
    c. Top-k kernel support: the top-k kernels lacked support for num_experts = 512. This PR adds that configuration.
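As a rough illustration of the decode-path op listed above, here is a minimal, hypothetical sketch of the causal_conv1d_update semantics for a single channel (a rolling window over the last K inputs). The function signature and variable names are illustrative stand-ins, not the actual sglang implementation, which operates on batched tensors.

```python
def causal_conv1d_update(x_t, conv_state, weight, bias=None):
    """One decode step of a depthwise causal 1D conv for a single channel.

    x_t: newest scalar input.
    conv_state: the previous K-1 inputs, oldest first.
    weight: K filter taps, oldest first.
    Returns (y_t, new_conv_state).
    """
    window = conv_state + [x_t]  # last K inputs, oldest first
    y_t = sum(w * v for w, v in zip(weight, window))
    if bias is not None:
        y_t += bias
    return y_t, window[1:]  # drop the oldest input from the state
```

Each decode step consumes one new token and shifts the convolution state by one, so no past activations beyond the last K-1 inputs need to be kept.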

Accuracy Tests

Accuracy on GSM8k:
command line: SGLANG_USE_CPU_ENGINE=1 python -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --trust-remote-code --device cpu --tp 4 --dtype bfloat16 --mem-fraction-static 0.8 --max-total-tokens 65536 --disable-overlap-schedule
Accuracy: 0.942
Invalid: 0.000
Latency: 3855.785 s
Output throughput: 42.622 token/s

Benchmarking and Profiling

Checklist

start_q = 0
for i in range(batch_size):
    end_q = query_start_loc[i + 1]
    x_i, final_states = causal_conv1d_ref(
Collaborator

Let's wait until sgl-kernel is merged and then replace all the ref implementations with real kernels.
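For reference, the per-sequence slicing pattern in the snippet above can be sketched in plain Python as follows. This is a simplified single-channel version; causal_conv1d_ref and the helper below are illustrative stand-ins for the ref implementation, not the actual sglang code.

```python
def causal_conv1d_ref(x, weight, bias=None):
    # Depthwise causal conv over one sequence/channel; weight taps oldest first.
    K = len(weight)
    padded = [0.0] * (K - 1) + list(x)  # left-pad so output length == input length
    out = [sum(weight[k] * padded[t + k] for k in range(K)) for t in range(len(x))]
    if bias is not None:
        out = [v + bias for v in out]
    return out

def run_varlen(flat_x, query_start_loc, weight):
    # Mirrors the loop above: slice each packed sequence by its start/end offsets.
    outputs = []
    start_q = 0
    for i in range(len(query_start_loc) - 1):
        end_q = query_start_loc[i + 1]
        outputs.append(causal_conv1d_ref(flat_x[start_q:end_q], weight))
        start_q = end_q
    return outputs
```

Each sequence is convolved independently, so the left padding resets at every query_start_loc boundary rather than leaking state across sequences.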

self.variance_epsilon = eps

def forward(self, hidden_states, gate=None):
    input_dtype = hidden_states.dtype
Collaborator

can we directly use forward_cpu in RMSNorm?
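For context, a gated RMSNorm of this shape can be sketched in plain Python as below, assuming the SiLU gate is applied before normalization as in Mamba-style gated RMSNorm (an assumption; check the actual Qwen3NextRMSNormGated code). Whether plain RMSNorm's forward_cpu can be reused likely hinges on this extra gating step.

```python
import math

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def rms_norm_gated(hidden, weight, eps=1e-6, gate=None):
    # hidden, weight, gate: per-element lists for one row (illustrative only).
    if gate is not None:
        # Assumed gating order: gate is applied before the norm statistics.
        hidden = [h * silu(g) for h, g in zip(hidden, gate)]
    variance = sum(h * h for h in hidden) / len(hidden)
    inv_rms = 1.0 / math.sqrt(variance + eps)
    return [w * h * inv_rms for w, h in zip(weight, hidden)]
```

With gate=None this reduces to ordinary RMSNorm, which is why reusing forward_cpu for the ungated path seems plausible.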

@@ -525,6 +528,9 @@ topk_softmax_cpu(at::Tensor& hidden_states, at::Tensor& gating_output, int64_t t
    case 256:
      LAUNCH_TOPK_SOFTMAX_KERNEL(256);
      break;
    case 512:
      LAUNCH_TOPK_SOFTMAX_KERNEL(512);
      break;
Collaborator

We need to split this into another PR.
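For reference, the computation this kernel dispatches (softmax over the router's gating logits, then top-k selection) can be sketched in plain Python; supporting num_experts = 512 only extends the switch above to that size. The function name and shapes are illustrative, not the sgl-kernel API.

```python
import math

def topk_softmax(gating_logits, k):
    # gating_logits: one token's router logits, length num_experts.
    m = max(gating_logits)
    exps = [math.exp(g - m) for g in gating_logits]  # max-subtracted for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest routing probabilities, highest first.
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [probs[i] for i in top], top
```

The templated kernel fixes num_experts at compile time for vectorization, hence the explicit case 512 rather than a runtime loop bound.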

blzheng commented Jan 23, 2026

@yizhang2077 Thanks for the review. This PR is being closed because the necessary changes are already included in #12525.

@blzheng blzheng closed this Jan 23, 2026
