Support gpt-oss #22259
Conversation
Commits (simon-mo):
- simon env adjustment
- fix tokenizer and able to start up
- e2e runnable
- shuffled version working in unit test
- bug fix
- have fp8 support in unit test
- add mxfp4 quant method; implementation uses fp8 until better kernel understanding
- change config name to be compatible with hf config, model runnable
- mxfp4 class working, everything still in bf16
- preliminary mxfp4 tests
- Revert "change config name to be compatible with hf config, model runnable" (reverts commit 736bf907fc2f7b028b171b402c19034c0e43c6e8)
- integrate into model, still bf16
- mxfp4 kernel works somehow
- clean up assertions
- experimental mxfp4 kernel in vllm
- clean up intermediate tensors so the model can run with tp=8
- use exact dtype when loading
- update tests
- add swizzle padding in test
- implement padding to enable hbm_swizzling
- move quantization to weight loader
- remove activation padding, only pad the weight
- remove preallocated tensor in mxfp4 moe method; model can be run with tp=1
- move bias post-processing after loading to save memory
- move bias addition to rank 0
- formatting
- verified working
- weight loading cleanup
- rename oai -> openaimoe for HF compat
- format
- finished rebase
- Reduce weight padding since it is handled inside the convert_layout function in triton_kernels; code refactor; works on single-GPU inference now

Commits (Chen Zhang):
- hf format
- better qkv concat

Commits (Hongxia Yang):
- fix padding for perf
- simplify and refactor where to do hidden_size padding based on feedback
- clean up

Signed-off-by: simon-mo <[email protected]>, Chen Zhang <[email protected]>, Hongxia Yang <[email protected]>, Yongye Zhu <[email protected]>
```python
        was executed.
        """

        # Profiler Start and Stop
```
Can you check if this conflicts with the profile call in EngineCore (passed to the model executor)? https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/core.py#L340
When I tried something similar (#21794), it could cause the engine core process to throw, so let's make sure we don't break the existing one.
@huachenheli Thanks for reporting it. We are merging the PR step by step and will not include this part of the code; it's only used for our debugging.
Hi all, thanks for the quick work in getting this out! Unfortunately I wanted to report that I'm running into issues even with the standard uv pip install suggestion:

```
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
```

```
  × No solution found when resolving dependencies:
  ╰─▶ Because torchaudio==2.8.0.dev20250804+cu128 depends on torch==2.9.0.dev20250804 and vllm==0.10.1+gptoss depends on
      torch==2.9.0.dev20250804+cu128, we can conclude that torchaudio==2.8.0.dev20250804+cu128 and vllm==0.10.1+gptoss are incompatible.
      And because vllm==0.10.1+gptoss depends on torchaudio==2.8.0.dev20250804+cu128 and you require vllm==0.10.1+gptoss, we can conclude
      that the requirements are unsatisfiable.
```

Using --no-cache fails in the same manner; I am trying to install from scratch next:

```
$ uv pip install --no-cache --pre vllm==0.10.1+gptoss --extra-index-url https://wheels.vllm.ai/gpt-oss/ --extra-index-url https://download.pytorch.org/whl/nightly/cu128 --index-strategy unsafe-best-match
  × No solution found when resolving dependencies:
  ╰─▶ Because torchaudio==2.8.0.dev20250804+cu128 depends on torch==2.9.0.dev20250804 and vllm==0.10.1+gptoss depends on torch==2.9.0.dev20250804+cu128, we can conclude that
      torchaudio==2.8.0.dev20250804+cu128 and vllm==0.10.1+gptoss are incompatible.
      And because vllm==0.10.1+gptoss depends on torchaudio==2.8.0.dev20250804+cu128 and you require vllm==0.10.1+gptoss, we can conclude that the requirements are unsatisfiable.
(gptoss) ahmedah@skampere2:~/code/gptoss$
```
```python
                      return_success: bool = False) -> Optional[bool]:
        # if expert_id is None, then
        # all the experts are loaded at the same time
        if not expert_id and self.quant_config.get_name() == "mxfp4":
```
I tried with the BF16 weights from https://huggingface.co/unsloth/gpt-oss-20b-BF16/ and it raised an error when loading weights. It looks like the common load logic for the bias is missing in FusedMoE.weight_loader.
Hi all, thanks for all the comments/reviews on this PR. Given the massive code changes, we decided to split this into smaller PRs. Please bear with us and report issues after we finish most of the sub-PRs. Thanks for being patient with us. Also, we (the vLLM team) have only tested this PR on a limited set of hardware settings that are known to work. We haven't tested on any other hardware yet, nor on other models with this PR, so please give us some more time to work on this. Great thanks! vLLM Team
```python
            logical_replica_count=logical_replica_count,
        )

        return self.fused_experts(
```
[Bug report]
I encountered this issue when I enabled EP (--enable-expert-parallel) while serving gpt-oss-120b:
AttributeError: 'Mxfp4MoEMethod' object has no attribute 'fused_experts'. Did you mean: 'num_experts'?
EP is not currently supported. Running TP is the best method for this PR.
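For reference, a minimal sketch of running with tensor parallelism via vLLM's offline LLM API (the model name and TP degree below are example values, not settings taken from this PR):

```python
from vllm import LLM, SamplingParams

# Tensor parallelism instead of expert parallelism; pick a TP degree that
# matches your GPU count (8 here is only an example).
llm = LLM(model="openai/gpt-oss-120b", tensor_parallel_size=8)

outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```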
WoosukKwon
left a comment
We split this PR into smaller PRs and merged a considerable portion of them into the main branch.
The Responses API and MXFP4 integration are still pending; we will finish them tomorrow.
Any plans on adding Ampere support? (e.g

@zyongye Hello, I tried using vLLM to run inference with gpt-oss, but looking at the final input given to the model, even though I specified "Reasoning: high" in my input, the model still seems to behave as if "Reasoning: medium" is active. The aime25 (avg@32) score is only 76.67. I just want to confirm whether this is intended.
@dsingal0 I encountered the same error. This happens because the openai_harmony library tries to automatically download a special encoding file (e.g., o200k_base.tiktoken) from the internet, but if it cannot (due to firewalls, no internet, or a missing file), it fails.
- Add GPT-OSS model implementation (GptOssForCausalLM)
- Add MXFP4 quantization support for efficient inference
- Add Harmony utilities for reasoning capabilities
- Add MCP tool server integration with demo tools
- Add CLI argument for tool server configuration
- Add example script for serving GPT-OSS with vLLM
- Update model registry to include GPT-OSS
- Add openai-harmony dependency for GPT-OSS features

Key components:
* GPT-OSS model with SwiGLU activation and RMSNorm
* MXFP4 quantization method for 4-bit weights
* Tool server with MCP protocol support
* Harmony encoding for reasoning tokens
* Example usage script with reasoning and tools

This is the first part of implementing GPT-OSS support from vllm-project#22259
Major additions:
- Extended OpenAI protocol with reasoning support
- Added include_reasoning parameter to ChatCompletionRequest
- Enhanced UsageInfo with reasoning_tokens tracking
- Added reasoning field to ChatCompletionResponse
- Model Context Protocol (MCP) implementation
- Comprehensive test suite for GPT-OSS functionality
- Production-ready example with configuration guide

Features:
- Full reasoning content parsing and streaming
- Tool integration with MCP protocol
- Token usage tracking for reasoning vs final content
- Backward compatible API extensions
- Complete end-to-end GPT-OSS workflow example

This completes the core GPT-OSS implementation from PR vllm-project#22259
…ocumentation
- Fixed missing imports (torch, time, json, re) in protocol.py
- Added comprehensive implementation status documentation
- Updated test and example files with latest enhancements
- Ready for production deployment

This completes the full GPT-OSS integration from vLLM PR vllm-project#22259
```python
        w1_bias: Optional[torch.Tensor],
        w2_bias: Optional[torch.Tensor],
```
How are these biases different from the zero points? Or, why not just use the existing _zp arguments?
Good point. They are added at the end of the matmul, per expert. I am not sure _zp has exactly that functionality, and it could apply across other experts. Either way, I have no plan to add these lines in main.
Afaik, the current _zp args are only used by the int4/int8 triton moe implementation in fused_moe.py.
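To illustrate the distinction being discussed, here is a hedged sketch with made-up shapes (not the actual kernel signature): a zero point shifts the quantized weights before scaling, while the gpt-oss w1_bias/w2_bias are added to each expert's matmul output.

```python
import torch

E, K, N = 4, 8, 16                       # experts, input dim, output dim (illustrative)
x = torch.randn(K)                       # activation routed to one expert
q_w = torch.randint(0, 16, (E, N, K))    # int4-style quantized weights
zp = torch.full((E, N, 1), 8)            # zero point: used while dequantizing weights
scale = torch.full((E, N, 1), 0.01)
bias = torch.randn(E, N)                 # per-expert bias, the w1_bias/w2_bias role

e = 2                                    # expert chosen by the router
w = (q_w[e] - zp[e]) * scale[e]          # the zero point acts on the weights
y = w @ x + bias[e]                      # the bias is added after the matmul, per expert
```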
@andresC98 Yes. Our current plan is to gradually roll out hardware support down the stack (Blackwell -> Hopper -> Ampere). We already have all the kernels integrated to run this on Ampere; we just haven't tested it end-to-end yet.
Wheel updated: #22290 (comment)
This is what helped: building my own cache after reading the harmony code. Basically, I downloaded the encoding file and set an env variable during the Docker build to avoid the runtime download.
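A minimal sketch of that build-time workaround. The encoding URL is the standard public tiktoken host; the environment-variable name below is a placeholder for illustration only, so check the openai_harmony source for the exact variable it reads.

```python
import os
import urllib.request

# Prefetch the encoding at image-build time so nothing is downloaded at runtime.
CACHE_DIR = "/opt/tiktoken_cache"
ENCODING_URL = ("https://openaipublic.blob.core.windows.net/encodings/"
                "o200k_base.tiktoken")

os.makedirs(CACHE_DIR, exist_ok=True)
urllib.request.urlretrieve(
    ENCODING_URL, os.path.join(CACHE_DIR, "o200k_base.tiktoken"))

# HYPOTHETICAL variable name -- openai_harmony reads some cache/offline setting,
# but verify the exact name in its source before relying on this.
os.environ["TIKTOKEN_ENCODINGS_CACHE"] = CACHE_DIR
```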
```python
        w2_weight = torch.nn.Parameter(torch.zeros(
            num_experts,
            hidden_size,
            intermediate_size_per_partition_after_pad // 2,
```
Questions:

- Why are w13_weight, w13_weight_scale, and w2_weight padded to create bigger tensors than the dimensions after normal TP sharding?
- Why isn't w2_bias padded too?

Thanks for the clarification.
These are for creating the weight tensors. Padding is added to 1) satisfy kernel requirements and 2) boost performance. Inside mxfp4, only the intermediate-size padding is needed, and w2 doesn't have that parameter. The hidden-size padding is calculated in a different place, here:
vllm/vllm/model_executor/layers/fused_moe/layer.py (lines 750 to 753 at 4678503):

```python
if (current_platform.is_rocm()
        or envs.VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8
        or envs.VLLM_USE_FLASHINFER_MOE_MXFP4_BF16):
    hidden_size = round_up(hidden_size, 256)
```
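As a sketch of what that branch computes: the round_up arithmetic below is the usual align-up helper, and the 2880 hidden size is gpt-oss's, used here only as an example input.

```python
def round_up(x: int, multiple: int) -> int:
    # Align x up to the next multiple of `multiple`.
    return ((x + multiple - 1) // multiple) * multiple

hidden_size = 2880                 # gpt-oss hidden size, as an example
print(round_up(hidden_size, 256))  # -> 3072, the padded leading dimension
```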
```python
            return
        from triton_kernels.matmul_ogs import FlexCtx, PrecisionConfig

        w13_bias = layer.w13_bias.to(torch.float32)
```
For the triton kernel path, why do we promote the bias to float32?
This is required by the triton kernel; the upcast is only to comply with its requirement.
```python
                weight = weight[ep_rank_start:ep_rank_end, ...]
            else:
                # (only load on rank 0 to avoid duplication)
                if tp_rank != 0:
```
For tp_rank = 0, we load the full w2_bias; for tp_rank != 0, we just set it to zero. Won't this cause problems in the computation? Or do we broadcast from rank 0 to the other ranks after loading? At runtime, I indeed found these values are zero for rank != 0. Thanks for the clarification.
The MoE layer performs an allreduce at the very end, so if we loaded w2_bias on every rank it would be added multiple times, which would be incorrect.
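A small numerical sketch (illustrative only, not vLLM's actual code) of why the bias must live on a single rank when the partial outputs are summed by the all-reduce:

```python
import torch

tp_size, hidden = 4, 8
partials = [torch.randn(hidden) for _ in range(tp_size)]  # per-rank partial matmul results
w2_bias = torch.randn(hidden)

# If every rank added the bias, the all-reduce (a sum) would count it tp_size times.
wrong = sum(p + w2_bias for p in partials)

# Loading the bias only on rank 0 (zeros elsewhere) keeps it counted exactly once.
right = sum(p + (w2_bias if rank == 0 else torch.zeros(hidden))
            for rank, p in enumerate(partials))

assert torch.allclose(wrong - right, (tp_size - 1) * w2_bias, atol=1e-5)
```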
```python
    ):
        super().__init__()
        self.vllm_config = vllm_config
        self.model_config = vllm_config.model_config.hf_config
```
Here model_config is used instead of config as in other models. Would this affect LoRA compatibility? For example, in qwen3 (vllm/vllm/model_executor/models/qwen3.py, line 283 at 1891a26):

```python
self.config = config
```
We haven't explored adding LoRA yet.
Hi, does this work with RTX Pro 6000 Blackwell or 50-series Blackwell cards? I think they are both SM120.
Closing this issue since we have merged all of it for release 0.10.1.
This doc is constantly updated
(Please READ!!) If you want to run this model out of the box, please follow our recipes or install a custom wheel. This guide is for users who want to build the environment from scratch and customize the model. We will slowly merge this PR until we think it is mostly compatible with existing dependencies. Note that the current commit branch is only tested against the gpt-oss model; other models may have unexpected behavior.
To download this commit from the wheel:

```
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
```

Installation Steps
- uv venv
- source .venv/bin/activate
- uv pip install mcp
- pytorch-triton
- uv pip install "transformers[torch]"
- vllm serve

Known Issue
Commit 663e04e8e3ebed7ee3230a1a7320142689795106 should contain all the features needed to run this model while staying compatible with any PyTorch nightly.

Dependency track
- v4.55 (#21931)