
Conversation

@zyongye (Member) commented Aug 5, 2025

This doc is constantly updated

(Please READ!!) If you want to run this model out of the box, please follow our recipes or install a custom wheel. This guide is for users who want to build the environment from scratch and customize the model. We will merge this PR gradually until we think it is mostly compatible with existing dependencies. Note that the current branch has only been tested against the gpt-oss models; other models may have unexpected behavior.

To install this commit from a prebuilt wheel:

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

Installation Steps

  1. Create a new virtualenv using whatever mechanism you prefer:
uv venv
source .venv/bin/activate
uv pip install mcp
  2. Install PyTorch nightly (optional for Blackwell users)
    • After installation, be sure to uninstall pytorch-triton
uv pip uninstall pytorch-triton
  3. Install the newest Hugging Face Transformers wheel
uv pip install "transformers[torch]"
  4. Clone OpenAI Triton and install it along with triton_kernels (optional for Blackwell users)
git clone https://github.com/openai/triton.git
pushd triton
uv pip install -r python/requirements.txt
uv pip install -e . --verbose --no-build-isolation
uv pip install -e python/triton_kernels --no-deps
popd
  5. Install the new FlashInfer release (mandatory for Blackwell users, optional for Hopper users)
uv pip install flashinfer-python==0.2.10
  6. Clone the vLLM repo, check out this PR's commit, and build
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 6a70830065701b163e36a86fd331b41b5feac401
python use_existing_torch.py
uv pip install -r requirements/build.txt
uv pip install -U -e . --verbose --no-build-isolation
  7. Run vllm serve
# On NVIDIA Hopper
vllm serve openai/gpt-oss-120b -tp 2 --async-scheduling
# On NVIDIA Blackwell
VLLM_USE_TRTLLM_ATTENTION=1 \
VLLM_USE_TRTLLM_DECODE_ATTENTION=1 \
VLLM_USE_TRTLLM_CONTEXT_ATTENTION=1 \
VLLM_USE_FLASHINFER_MXFP4_MOE=1 \
vllm serve openai/gpt-oss-120b -tp 2 --async-scheduling
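
Once the server is up, a quick sanity check against the OpenAI-compatible endpoint looks like this (a minimal example, assuming the default port 8000; adjust the prompt, port, and model name as needed):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
    }'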

Known Issues

  1. PyTorch and Triton incompatibility
    • This build relies on Triton and PyTorch features that are not in any stable release yet, so these two packages may conflict. More specifically, PyTorch nightly may not work with the Triton main branch. If that's the case, please revert Triton to an earlier commit until you find one that works (a sketch follows this list). From our experience, Triton commit 663e04e8e3ebed7ee3230a1a7320142689795106 contains all the features needed to run this model while remaining compatible with any PyTorch nightly.
  2. The default memory utilization and batch size will cause CUDA OOM for tp1 on H100. Please increase GPU memory utilization or lower the batch size:
vllm serve openai/gpt-oss-120b --gpu-memory-utilization 0.95 --max-num-batched-tokens 512
  3. On H100 with tp2, keep GPU memory utilization from being too high (0.95 will cause OOM).
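
If you hit the Triton/PyTorch incompatibility from issue 1, a sketch of pinning Triton to the commit mentioned above, assuming you installed Triton from source as in step 4 (re-running the editable installs picks up the older commit):

pushd triton
git checkout 663e04e8e3ebed7ee3230a1a7320142689795106
uv pip install -e . --verbose --no-build-isolation
uv pip install -e python/triton_kernels --no-deps
popd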

Dependency track

LiuXiaoxuanPKU and others added 30 commits July 25, 2025 16:27
simon env adjustment

Signed-off-by: simon-mo <[email protected]>

fix tokenizer and able to startup

Signed-off-by: simon-mo <[email protected]>

e2e runnable

Signed-off-by: simon-mo <[email protected]>
shuffled version working in unit test

bug fix

have fp8 support on unit test

add mxfp4 quant method, implementation is using fp8 until better kernel understanding

change config name to be compatible with hf config, model runnable

mxfp4 class working, everything still in bf16

preliminary mxfp4 tests

Revert "change config name to be compatible with hf config, model runnable"

This reverts commit 736bf907fc2f7b028b171b402c19034c0e43c6e8.

integrate into model, still bf16

mxfp4 kernel works somehow

clean up assertions

experimental mxfp4 kernel in vllm

cleanup intermediate tensor os the model can run with tp=8

use exact dtype when loading

update tests

adding swizzle padding in test

implement padding to enable hbm_swizzling

move quantization to weight loader

remove activation padding, only pad the weight

remove preallocated tensor in mxfp4 moe method, model can be run with tp=1

move bias post processing after loading to save memory

move bias addition to rank 0

formatting

verified working
Signed-off-by: simon-mo <[email protected]>

weight loading cleanup

Signed-off-by: simon-mo <[email protected]>

rename oai -> openaimoe for HF compat

Signed-off-by: simon-mo <[email protected]>

format

Signed-off-by: simon-mo <[email protected]>

finished rebase

Signed-off-by: simon-mo <[email protected]>
Reduce weight padding since it is handled inside convert_layout function in triton_kernels
code refactor
works on single GPU inference now
Signed-off-by: simon-mo <[email protected]>
* hf format

Signed-off-by: Chen Zhang <[email protected]>

* better qkv concat

Signed-off-by: Chen Zhang <[email protected]>

---------

Signed-off-by: Chen Zhang <[email protected]>
* fix padding for perf

Signed-off-by: Hongxia Yang <[email protected]>

* simplify and refactor where to do hidden_size padding based on feedback

Signed-off-by: Hongxia Yang <[email protected]>

* clean up

Signed-off-by: Hongxia Yang <[email protected]>

---------

Signed-off-by: Hongxia Yang <[email protected]>
Signed-off-by: Yongye Zhu <[email protected]>
# Profiler Start and Stop
@huachenheli (Contributor) commented Aug 6, 2025

Can you check whether this conflicts with the profile call in EngineCore (passed to the model executor)? https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/core.py#L340

When I tried something similar (#21794), it could cause the engine core process to throw, so let's make sure we don't break the existing one.

Collaborator:

@huachenheli Thanks for reporting it. We are merging the PR step by step and will not include this part of the code; it's only used for our debugging.

@ahmeda14960 commented Aug 6, 2025

Hi all,

Thanks for the quick work in getting this out! Unfortunately, I wanted to report that I'm running into issues even with the standard uv pip install suggestion:

uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
  × No solution found when resolving dependencies:
  ╰─▶ Because torchaudio==2.8.0.dev20250804+cu128 depends on torch==2.9.0.dev20250804 and vllm==0.10.1+gptoss depends on
      torch==2.9.0.dev20250804+cu128, we can conclude that torchaudio==2.8.0.dev20250804+cu128 and vllm==0.10.1+gptoss are incompatible.
      And because vllm==0.10.1+gptoss depends on torchaudio==2.8.0.dev20250804+cu128 and you require vllm==0.10.1+gptoss, we can conclude
      that the requirements are unsatisfiable.

Using --no-cache fails in the same manner (identical resolution error). I am trying an install from scratch next.

return_success: bool = False) -> Optional[bool]:
# if expert_id is None, then
# all the experts are loaded at the same time
if not expert_id and self.quant_config.get_name() == "mxfp4":
Contributor:

I tried the BF16 weights at https://huggingface.co/unsloth/gpt-oss-20b-BF16/ and it raised an error when loading the weights. It looks like the common load logic for the bias is missing in FusedMOE.weight_loader.

@zyongye (Member, Author) commented Aug 6, 2025

Hi all, thanks for all the comments/reviews on this PR. Given the massive code changes, we decided to split this into small PRs. Please bear with us and report issues after we finish most of the sub-PRs. Thanks for being patient with us.

Also, the hardware settings we (the vLLM team) have tested with this PR and know to work:
NVIDIA H100, H200, B200 with TP 1, 2, 4, 8
(EP is currently not supported, or at least not tested.)

We haven't tested any other hardware yet, nor other models with this PR, so please give us some more time to work on this. Great thanks!!!!

vLLM Team

logical_replica_count=logical_replica_count,
)

return self.fused_experts(


[Bug report]
I encountered this issue when I enabled EP while serving oss-120b with --enable-expert-parallel:
AttributeError: 'Mxfp4MoEMethod' object has no attribute 'fused_experts'. Did you mean: 'num_experts'?

Member Author:

EP is not currently supported. Running TP is the best method for this PR.

@WoosukKwon (Collaborator) left a comment

We have split this PR into smaller PRs and merged a considerable portion of them into the main branch.
The Responses API and MXFP4 integration are still pending; we will finish them tomorrow.

@andresC98 commented Aug 6, 2025

Any plans on adding Ampere support? (e.g NVIDIA A100 gpus)
#22290

@dongZheX commented Aug 7, 2025

@zyongye Hello, I'm trying to use vLLM to run inference with gpt-oss.
I sent a message containing a system prompt to control the reasoning effort:

{
    "model": "gpt-oss-120b",
    "messages":[
        {
            "role": "system",
            "content": [{"type": "text", "text": "Reasoning: high"}]
        },
        {
            "role": "user", 
            "content": [
                {"type": "text", "text": "Let $ABCDEF$ be a convex equilateral hexagon in which all pairs of opposite sides are parallel. The triangle whose sides are extensions of segments $\\overline{AB}$ , $\\overline{CD}$ , and $\\overline{EF}$ has side lengths $200, 240,$ and $300$ . Find the side length of the hexagon.\n\nPlease reason step by step, and put your final answer within \\boxed{}."}
            ]
        }
    ],
    "temperature": 0.0001,
    "max_tokens": 1,
    "top_p": 0.001,
    "logprobs": true,
    "echo": true
}

But the final input given to the model is:

system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-07

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|><|end|><|start|>system<|message|>Reasoning: high<|end|><|start|>user<|message|>Let $ABCDEF$ be a convex equilateral hexagon in which all pairs of opposite sides are parallel. The triangle whose sides are extensions of segments $\overline{AB}$ , $\overline{CD}$ , and $\overline{EF}$ has side lengths $200, 240,$ and $300$ . Find the side length of the hexagon.

Please reason step by step, and put your final answer within \boxed{}.<|end|><|start|>assistant                                                                                                   

So even though I specified "Reasoning: high" in my input, the model still seems to behave as if "Reasoning: medium" is active, and the AIME25 score (avg@32) is only 76.67.

Just want to confirm whether this is intended.

@fanjikang commented:

@dsingal0 I encountered the same error. This happens because the openai_harmony library tries to automatically download a special encoding file (e.g., o200k_base.tiktoken) from the internet; if it cannot (due to firewalls, no internet access, or a missing file), it fails.
  1. Manually download the tokenizer:
wget https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
mv o200k_base.tiktoken fb374d419588a4632f3f557e76b4b70aebbca790
(The new filename is the SHA1 hash that tiktoken/openai_harmony expects for this encoding.)
  2. Set the cache directory for tiktoken/openai_harmony:
export TIKTOKEN_RS_CACHE_DIR=/your/path/
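
To confirm the cache is actually picked up without network access, a small check can help. This is a sketch: load_harmony_encoding, stop_tokens_for_assistant_actions, and TIKTOKEN_RS_CACHE_DIR appear elsewhere in this thread, but the HarmonyEncodingName.HARMONY_GPT_OSS enum name is an assumption about the openai_harmony API.

import os
os.environ.setdefault("TIKTOKEN_RS_CACHE_DIR", "/your/path/")  # same directory as above

from openai_harmony import HarmonyEncodingName, load_harmony_encoding

# Should read the cached o200k_base file instead of downloading the vocab file.
enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
print(enc.stop_tokens_for_assistant_actions())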

jthakurH added a commit to HabanaAI/vllm-fork that referenced this pull request Aug 7, 2025
- Add GPT-OSS model implementation (GptOssForCausalLM)
- Add MXFP4 quantization support for efficient inference
- Add Harmony utilities for reasoning capabilities
- Add MCP tool server integration with demo tools
- Add CLI argument for tool server configuration
- Add example script for serving GPT-OSS with vLLM
- Update model registry to include GPT-OSS
- Add openai-harmony dependency for GPT-OSS features

Key components:
* GPT-OSS model with SwiGLU activation and RMSNorm
* MXFP4 quantization method for 4-bit weights
* Tool server with MCP protocol support
* Harmony encoding for reasoning tokens
* Example usage script with reasoning and tools

This is the first part of implementing GPT-OSS support from
vllm-project#22259
jthakurH added a commit to HabanaAI/vllm-fork that referenced this pull request Aug 7, 2025
Major additions:
- Extended OpenAI protocol with reasoning support
- Added include_reasoning parameter to ChatCompletionRequest
- Enhanced UsageInfo with reasoning_tokens tracking
- Added reasoning field to ChatCompletionResponse
- Model Context Protocol (MCP) implementation
- Comprehensive test suite for GPT-OSS functionality
- Production-ready example with configuration guide

Features:
- Full reasoning content parsing and streaming
- Tool integration with MCP protocol
- Token usage tracking for reasoning vs final content
- Backward compatible API extensions
- Complete end-to-end GPT-OSS workflow example

This completes the core GPT-OSS implementation from PR vllm-project#22259
jthakurH added a commit to HabanaAI/vllm-fork that referenced this pull request Aug 7, 2025
…ocumentation

- Fixed missing imports (torch, time, json, re) in protocol.py
- Added comprehensive implementation status documentation
- Updated test and example files with latest enhancements
- Ready for production deployment

This completes the full GPT-OSS integration from vLLM PR vllm-project#22259
Comment on lines +1846 to +1847
w1_bias: Optional[torch.Tensor],
w2_bias: Optional[torch.Tensor],
@bnellnm (Collaborator) commented Aug 7, 2025

How are these biases different from the zero points? Or, why not just use the existing _zp arguments?

Member Author:

Good point. They are added at the end of the matmul, per expert. I am not sure _zp has exactly this functionality, and it may apply across other experts. Either way, I have no plan to add these lines to main.

Collaborator:

Afaik, the current _zp args are only used by the int4/int8 triton moe implementation in fused_moe.py.

@zyongye (Member, Author) commented Aug 7, 2025

Any plans on adding Ampere support? (e.g NVIDIA A100 gpus) #22290

@andresC98 Yes. Our current plan is to gradually roll out hardware support (Blackwell -> Hopper -> Ampere). We already have all the kernels integrated to run this on Ampere; we just haven't tested it end-to-end yet.

@LucasWilkinson (Collaborator) commented:

Any plans on adding Ampere support? (e.g NVIDIA A100 gpus) #22290

Wheel updated: #22290 (comment)

@rohitharkhani commented:

Using the vllm/vllm-openai:gptoss image gives the following error:

(APIServer pid=16)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_responses.py", line 130, in __init__
(APIServer pid=16)     get_stop_tokens_for_assistant_actions())
(APIServer pid=16)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=16)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/harmony_utils.py", line 187, in get_stop_tokens_for_assistant_actions
(APIServer pid=16)     return get_encoding().stop_tokens_for_assistant_actions()
(APIServer pid=16)            ^^^^^^^^^^^^^^
(APIServer pid=16)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/harmony_utils.py", line 37, in get_encoding
(APIServer pid=16)     _harmony_encoding = load_harmony_encoding(
(APIServer pid=16)                         ^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=16)   File "/usr/local/lib/python3.12/dist-packages/openai_harmony/__init__.py", line 674, in load_harmony_encoding
(APIServer pid=16)     inner: _PyHarmonyEncoding = _load_harmony_encoding(name)
(APIServer pid=16)                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=16) openai_harmony.HarmonyError: error downloading or loading vocab file: failed to download or load vocab file

This is because the openai-harmony Python module used in vLLM attempts to download the tiktoken file from the internet.

Tracing OpenAI's Harmony, it ultimately executes the following Rust code: https://github.com/openai/harmony/blob/9528c7b4a00a3307fd9685fc1328aee11c3d9c90/src/tiktoken_ext/public_encodings.rs#L417

Looking at the code above, you can see that if the file is cached, it is not downloaded but read from the cache. As a temporary workaround, I downloaded the tiktoken file on a Linux machine with internet access and copied the cache directory over to resolve the issue.

This is what helped: building my own cache after reading the Harmony code.

Basically, I downloaded the files and set an environment variable during the Docker build to avoid the runtime download:

RUN mkdir /vllm-workspace/tiktoken
RUN wget -O /vllm-workspace/tiktoken/o200k_base.tiktoken https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
RUN wget -O /vllm-workspace/tiktoken/cl100k_base.tiktoken https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
ENV TIKTOKEN_ENCODINGS_BASE /vllm-workspace/tiktoken/

@mergify bot added the gpt-oss (Related to GPT-OSS models) label Aug 11, 2025
w2_weight = torch.nn.Parameter(torch.zeros(
num_experts,
hidden_size,
intermediate_size_per_partition_after_pad // 2,


Questions:

  1. Why are w13_weight, w13_weight_scale, and w2_weight padded to create tensors larger than the dimension you would normally get after TP sharding?
  2. Why isn't w2_bias padded too?

Thanks for the clarification.

Member Author:

These are for creating the weight tensors. Padding is added 1) to meet kernel requirements and 2) to boost performance. Inside mxfp4, only intermediate-size padding is needed, and w2 doesn't have that parameter. The hidden-size padding is calculated in a different place, like here:

if (current_platform.is_rocm()
        or envs.VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8
        or envs.VLLM_USE_FLASHINFER_MOE_MXFP4_BF16):
    hidden_size = round_up(hidden_size, 256)
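
For intuition, a minimal sketch of what that round_up call does (round_up is written out here for illustration and mirrors the helper used above; treating 2880 as gpt-oss's hidden size is an assumption for the example):

def round_up(x: int, multiple: int) -> int:
    # Round x up to the nearest multiple of `multiple`.
    return ((x + multiple - 1) // multiple) * multiple

# e.g. a hidden size of 2880 gets padded to 3072 on these code paths.
assert round_up(2880, 256) == 3072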

return
from triton_kernels.matmul_ogs import FlexCtx, PrecisionConfig

w13_bias = layer.w13_bias.to(torch.float32)


For the Triton kernel path, why do we promote the bias to float32?

Member Author:

This is required by the Triton kernel; the upcast is only to comply with its requirement.

weight = weight[ep_rank_start:ep_rank_end, ...]
else:
# (only load on rank 0 to avoid duplication)
if tp_rank != 0:


For tp_rank == 0, we load the full w2_bias; for tp_rank != 0, we just set it to zero. Won't this cause problems in the computation? Or do we broadcast from rank 0 to the other ranks after loading? From the runtime, I found that these values for rank != 0 are indeed zeros. Thanks for the clarification.

Member Author:

The MoE layer performs an all-reduce at the very end, so if we loaded w2_bias on every rank it would be added multiple times, which is not correct.
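
A minimal standalone sketch of why this matters (not vLLM code; tp_size and the tensor shapes are made up): with the down-projection sharded across ranks, each rank produces a partial output and the final all-reduce sums them, so a bias loaded on every rank would be counted tp_size times.

import torch

tp_size, hidden = 2, 4
partials = [torch.randn(hidden) for _ in range(tp_size)]  # per-rank partial MoE outputs
bias = torch.randn(hidden)

# Wrong: every rank adds the bias before the (summing) all-reduce.
wrong = sum(x + bias for x in partials)

# Right: only rank 0 holds the real bias; the other ranks keep zeros.
right = sum(x + (bias if rank == 0 else torch.zeros(hidden))
            for rank, x in enumerate(partials))

# The wrong version over-counts the bias by (tp_size - 1) copies.
assert torch.allclose(wrong, right + (tp_size - 1) * bias)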

):
super().__init__()
self.vllm_config = vllm_config
self.model_config = vllm_config.model_config.hf_config

Here model_config is used instead of config, unlike in other models. Would this affect LoRA compatibility?
For example, in qwen3:

self.config = config

Member Author:

We haven't explored adding LoRA yet.

@fernandaspets commented:

Hi, does this work with RTX PRO 6000 Blackwell cards, or 50-series Blackwell cards? I think they are both SM120.

@zyongye (Member, Author) commented Aug 25, 2025

Closing this since we have merged all of it for release 0.10.1.

@zyongye closed this Aug 25, 2025

Labels: ci/build, documentation (Improvements or additions to documentation), frontend, gpt-oss (Related to GPT-OSS models), needs-rebase, new-model (Requests to new models), rocm (Related to AMD ROCm), v1
