…ding issue & add swiglu activation
Fix the problem that the MLP weight is not actually loaded, and add MLP bias support. See merge request jinyanc/sglang_fork!2
…ry and key tensors
…er in OpenAIMoeConfig
This update introduces a new optional parameter 'bias' to the FusedMoE implementation and its associated quantization methods. When enabled, bias parameters are created for both gate_up_proj and down_proj layers. If disabled, these parameters are registered as None. This change enhances flexibility in model configuration.
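The optional-bias behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual SGLang FusedMoE code; the class and attribute names here are hypothetical:

```python
import torch
from torch import nn


class FusedMoESketch(nn.Module):
    """Illustrative sketch only: when `bias` is enabled, bias parameters
    are created for gate_up_proj and down_proj; otherwise they are
    registered as None."""

    def __init__(self, num_experts: int, hidden: int, inter: int, bias: bool = False):
        super().__init__()
        self.w_gate_up = nn.Parameter(torch.empty(num_experts, 2 * inter, hidden))
        self.w_down = nn.Parameter(torch.empty(num_experts, hidden, inter))
        if bias:
            self.b_gate_up = nn.Parameter(torch.zeros(num_experts, 2 * inter))
            self.b_down = nn.Parameter(torch.zeros(num_experts, hidden))
        else:
            # Registering None keeps the attribute present, so loading
            # and quantization code can check `is None` uniformly.
            self.register_parameter("b_gate_up", None)
            self.register_parameter("b_down", None)
```

Registering `None` rather than omitting the attribute lets downstream code branch on a single `is None` check instead of `hasattr`.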
…n OpenAIMoeSparseMoeBlock
1. Fix mismatches including truncation, max_positions, etc. 2. Simply fall back to the native apply-rope implementation, since the sgl-kernel version does not support the OAI variant.
Align with reference code. See merge request jinyanc/sglang_fork!3
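For context, a native (non-fused) rotate-half RoPE application looks roughly like the sketch below. This is the generic form only; the OAI-specific scaling and truncation handling mentioned above is omitted, and the function name is hypothetical:

```python
import torch


def apply_rope_native(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Generic rotate-half RoPE in plain PyTorch (fallback sketch).

    q:        [seq, heads, head_dim]
    cos, sin: [seq, head_dim]
    """
    d = q.shape[-1]
    q1, q2 = q[..., : d // 2], q[..., d // 2 :]
    # rotate-half: (q1, q2) -> (-q2, q1), then combine with cos/sin
    rotated = torch.cat((-q2, q1), dim=-1)
    return q * cos[:, None, :] + rotated * sin[:, None, :]
```

With `cos = 1` and `sin = 0` (position zero), the function is the identity, which is a handy sanity check.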
…salLM. The update ensures that parameters with ".mlp.router." in their names are correctly mapped to ".mlp.gate." and loaded appropriately, enhancing model compatibility.
…eters in OpenAIMoeForCausalLM. The change specifies that OpenAIMoe uses 'router' to refer to 'gate', improving code readability and understanding.
Fix gate-not-loaded bug. See merge request jinyanc/sglang_fork!4
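The router-to-gate remapping described above can be sketched as a simple name rewrite applied before weight lookup. The helper name here is hypothetical; the real logic lives in the weight-loading code of OpenAIMoeForCausalLM:

```python
def remap_weight_name(name: str) -> str:
    """GPT-OSS checkpoints use 'router' for what SGLang's module tree
    calls 'gate', so checkpoint parameter names are rewritten before
    being matched against model parameters."""
    if ".mlp.router." in name:
        name = name.replace(".mlp.router.", ".mlp.gate.")
    return name
```

All other parameter names pass through unchanged, so the remap is safe to apply unconditionally.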
Seems not needed; if the value is not >= 0, FlashInfer will not use sliding window.
So this will not work on Ada series GPUs?
We never tested on Ada. Feel free to try it.
Nope, I guess it's due to MXFP4. Any plans for this?
@chriswritescode-dev please refer to #8833 |
```python
    return output_q, scale

def swizzle(x: torch.Tensor):
    assert len(x.shape) == 3
```
Hi @hlu1, great work!
May I know the reason for swizzling a tensor outside of the main GPU kernel? Swizzling is related to shared-memory layout and has nothing to do with the GPU main-memory layout. I am really curious why this helps.
Swizzling is very important for the kernel performance for block scale formats on Hopper and Blackwell. It's required to get optimal global memory access.
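As a toy illustration only (the real swizzle pattern is format- and architecture-specific, and this function is hypothetical), reordering a 3-D scale tensor into contiguous tiles might look like:

```python
import torch


def tile_swizzle(x: torch.Tensor, tile: int = 4) -> torch.Tensor:
    """Toy sketch, NOT the actual kernel layout: regroup the last
    dimension into tiles of `tile` elements so the values each block
    consumes together are adjacent in global memory.

    x: [B, M, K] with K divisible by `tile`.
    Returns: [B, K // tile, M, tile]."""
    assert len(x.shape) == 3
    b, m, k = x.shape
    return x.view(b, m, k // tile, tile).transpose(1, 2).contiguous()
```

The point of doing this on the host side is that the kernel can then issue wide, coalesced global loads instead of gathering scattered elements.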
Motivation
Authors @xutizhou, @liz-badada, @zhuofan1123, @linhu-nv from NVIDIA Solution Architect Team
Many thanks to @qsang-nv, @PerkzZheng, @yzh119 for helping integrate attention sink with FlashInfer attention kernels, @dongfengy for helping integrate the Triton MoE kernel.
SGLang now supports GPT-OSS. This guide provides step-by-step instructions for deploying GPT-OSS with SGLang and running benchmark tests. (Note: the current implementation focuses on functional completeness rather than performance optimization. Community contributions for performance improvements are welcome. Some workarounds exist in the current codebase and will be addressed in future updates.)
Main Modifications
`--attention-backend [flashinfer, torch_native_sink]`; `--enable-fp8-act` for MoE MXFP4 W4A8

Prerequisites
`lmsysorg/sglang:latest`; on Blackwell: `lmsysorg/sglang:v0.4.8-cu128-b200`

Model Weights
Get model weights from HuggingFace Model Card
Command Example
To serve the GPT-OSS model, refer to the following command:
Launch Server
```bash
python3 -m sglang.launch_server \
  --model-path <path-to-gpt-oss> \
  --tp-size 4 \
  --enable-fp8-act \
  --cuda-graph-bs 256 \
  --max-running-requests 256 \
  --attention-backend flashinfer
```

Bench Serving
```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 6144 \
  --random-input-len 1000 \
  --random-output-len 2000 \
  --random-range-ratio 1
```

Note:
`--enable-fp8-act`.

Bench offline
To benchmark the GPT-OSS model offline, refer to the following command:
```bash
python3 -m sglang.bench_offline_throughput \
  --model-path <path-to-gpt-oss> \
  --tp-size 4 \
  --num-prompts 6144 \
  --enable-fp8-act \
  --cuda-graph-bs 256 \
  --dataset-name random \
  --random-input 1000 \
  --random-output 2000 \
  --random-range-ratio 1 \
  --max-running-requests 256 \
  --attention-backend [flashinfer]
```

Simple Test
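Once the server is up, a quick sanity request can be sent through SGLang's OpenAI-compatible endpoint. This is a sketch using only the standard library; it assumes the default port 30000, and the `"default"` model name is a placeholder:

```python
import json
import urllib.request


def chat_once(prompt: str, base_url: str = "http://localhost:30000") -> str:
    """Send one chat completion to a running SGLang server and return
    the generated text. Assumes the server was launched as above."""
    payload = {
        "model": "default",  # placeholder; the server routes to the loaded model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat_once("Say hello in one word."))
```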
Note
If you meet issues on Blackwell, refer to #7227.
Modifications
Accuracy Test
Benchmark & Profiling
Checklist