[bugfix] /chat/completion doesn't read extra_body for diffusion model #2042
Gaohan123 merged 11 commits into vllm-project:main from
Conversation
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a59b1d511
```python
extra_body = request.model_extra or {}

# Parse size if provided (supports "1024x1024" format)
height = extra_body.get("height")
```
Preserve nested `extra_body` payloads for raw HTTP callers
This assumes custom generation args are always flattened into `request.model_extra`, but our documented non-SDK clients still send a nested `{"extra_body": {...}}` object (for example docs/user_guide/examples/online_serving/text_to_image.md:58-71 and examples/online_serving/glm_image/openai_chat_client.py:68-79). Because `ChatCompletionRequest` accepts arbitrary extra fields, that shape arrives as `request.extra_body` / `request.model_extra["extra_body"]`, so the following `.get("height")`, `.get("num_inference_steps")`, etc. all return None. After this change, curl / `requests.post(json=payload)` callers silently lose their generation parameters in this diffusion path and in the identical block above.
False positive. The reason is already explained in the code comment: `requests.post(json=payload)` callers only receive the correct params after this PR.
Pull request overview
Fixes online chat/diffusion serving not honoring extra_body parameters by reading Pydantic “extra” fields after vLLM’s request preprocessing removes the nested extra_body.
Changes:
- Switches parameter extraction from `request.extra_body` to `request.model_extra` for online serving (see the sketch below).
- Adds inline comments explaining the vLLM preprocessing and Pydantic behavior behind the issue.
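For context, a rough sketch of the two payload shapes involved (the field names follow this PR's diffusion parameters; the exact payloads are illustrative):

```python
# SDK-flattened: the OpenAI Python client merges extra_body into the payload root,
# so the server sees the generation args as top-level extra fields.
sdk_payload = {
    "model": "Qwen-Image-Edit-2509",
    "messages": [{"role": "user", "content": "draw a cat"}],
    "num_inference_steps": 2,
    "height": 1024,
    "width": 1024,
}

# Raw HTTP (curl / requests.post): extra_body stays nested, so the server instead
# sees a single extra field named "extra_body" holding the whole dict.
raw_payload = {
    "model": "Qwen-Image-Edit-2509",
    "messages": [{"role": "user", "content": "draw a cat"}],
    "extra_body": {"num_inference_steps": 2, "height": 1024, "width": 1024},
}
```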
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
It is critical. Can I trigger the CI to check whether the error of #2036 is fixed? Change it to the ready-merge tag?
Yeah, please go ahead. Meanwhile I am working on attaching more logs.
More logs

Diffusion

Adding the following snippet to …:

```python
logger.info(f"@@@@@@@@@@ request is {request}")
logger.info(f"@@@@@@@@@@ raw request is {await raw_request.json()}")
logger.info(f"@@@@@@@@@@ correct extra_body from request.model_extra is {request.model_extra}")
logger.info(f"@@@@@@@@@@ wrong extra_body from `getattr(request, 'extra_body') or {{}}` is {getattr(request, 'extra_body') or {}}")
logger.info(f'\n@@@@@\n'.join(f"{name}: {x}" for x, name in zip([
    num_inference_steps,
    guidance_scale,
    true_cfg_scale,
    seed,
    negative_prompt,
    num_outputs_per_prompt,
    num_frames,
    guidance_scale_2,
    lora_body,
    layers,
    resolution,
], [
    "num_inference_steps",
    "guidance_scale",
    "true_cfg_scale",
    "seed",
    "negative_prompt",
    "num_outputs_per_prompt",
    "num_frames",
    "guidance_scale_2",
    "lora_body",
    "layers",
    "resolution",
])))
```

and we get … Note that …

Omni (llm)

Adding the following snippet to … and launch Bagel, for example. Send request … and we get the following … In this case, the new way to get extra_body is actually wrong (because there will be a nested …).

Conclusion

Only make the change to the diffusion /chat/completion handler.
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
```python
# In addition, these extra attrs are hidden as the default behavior of Pydantic `BaseModel`
# (which `ChatCompletionRequest` inherits from, and these fields are not explicitly defined).
# They are ONLY accessible via the model_extra property. Cannot get via getattr(request, "num_inference_steps").
extra_body = request.model_extra or {}
```
Why do we have a model_extra field here, which is not present in the curl request?
It is an attribute of Pydantic `BaseModel`, which is a base class of vLLM's `ChatCompletionRequest`, which is the type of the `request` parameter.
(Pydantic offers an enhanced dataclass with runtime type validation.)
https://docs.pydantic.dev/latest/api/base_model/#pydantic.BaseModel.model_extra
This attribute stores all the fields that are provided during the init of a Pydantic model but are not explicitly defined on the class.
The content of the extra_body dict itself is somehow extracted and then merged into the kwargs that initialize this request object (at least on some occasions), so the values go there. A minimal sketch of this behavior is below.
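A minimal sketch of that Pydantic behavior, using a hypothetical `DemoRequest` class; it only assumes the request class is configured to allow extra fields, as the thread above indicates for `ChatCompletionRequest`:

```python
from pydantic import BaseModel, ConfigDict


class DemoRequest(BaseModel):
    # Stand-in for ChatCompletionRequest: keep undeclared init kwargs as extras.
    model_config = ConfigDict(extra="allow")
    model: str


req = DemoRequest(model="demo", num_inference_steps=2, seed=42)

# Fields not declared on the class are collected by Pydantic and exposed
# through the model_extra property rather than the regular schema fields.
print(req.model_extra)                  # {'num_inference_steps': 2, 'seed': 42}
print(DemoRequest.model_fields.keys())  # dict_keys(['model'])
```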
I see. So the reasonable fix would be to handle the curl nested-dict format first with `getattr(request, "extra_body", None)`; if that is None, check the SDK-flattened params stored in `request.model_extra` (see the sketch below).
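A rough sketch of that fallback order (the helper name `_resolve_extra_body` is hypothetical, not the actual handler code):

```python
def _resolve_extra_body(request) -> dict:
    # Raw-HTTP callers (curl / requests) keep a nested "extra_body" object.
    nested = getattr(request, "extra_body", None)
    if isinstance(nested, dict) and nested:
        return nested
    # OpenAI-SDK callers arrive flattened, so fall back to the Pydantic extras.
    return request.model_extra or {}


# e.g. num_inference_steps = _resolve_extra_body(request).get("num_inference_steps")
```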
I tested with the following request, and num_inference_steps in extra_body can be parsed correctly. Output: correct, using 2 steps.
Hmm... 🤔 Then maybe I need to investigate this matter further. Maybe there are some additional conditions that trigger this data preprocessing of the …
Signed-off-by: Samit <285365963@qq.com>
The code expects users to run curl with "extra_body" as a nested JSON key. I think I can …
Figured it out. It's because I used curl with "extra_body" as a nested JSON key, while you used the SDK-flattened format. The latest commits make curl-nested, SDK-flattened, AND top-level params all work transparently. Btw, I think we need to update the online serving docs to reduce users' confusion on this part. I will do that in #2051.
Signed-off-by: Gao Han <hgaoaf@connect.ust.hk>
Ohh, so the flattening happens on the client side, using the OpenAI client, not on the server side. That's why I didn't see relevant code. I will also take a look at your commit above. Thanks for the clarification. (A short sketch of that client-side flattening follows.)
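For reference, a minimal sketch of that behavior (base URL, model name, and parameter values are placeholders): the OpenAI Python SDK merges the keys of `extra_body` into the JSON request body at the top level, so the server never receives a nested `extra_body` key.

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:38205/v1", api_key="EMPTY")

# extra_body is an SDK-side convenience: its keys are merged into the request
# body root, e.g. {"model": ..., "messages": ..., "num_inference_steps": 2, ...}.
resp = client.chat.completions.create(
    model="Qwen-Image-Edit-2509",
    messages=[{"role": "user", "content": "draw a cat"}],
    extra_body={"num_inference_steps": 2, "height": 1024, "width": 1024},
)
```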
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
That also explains why I couldn't reproduce the same behavior when testing the BAGEL model late yesterday: I was sending curl requests. So I also added the same logic to the Omni version of the API handler just now, together with updated comments to match this discovery.
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
```python
# [NOTE] When sending a request from the OpenAI client Python library (.chat.completions.create),
# the `extra_body` argument is flattened and merged into the payload's root,
# and there is no longer an `extra_body` attribute on the request.
# Since these fields are not declared in
# vllm.entrypoints.openai.chat_completion.protocol.ChatCompletionRequest, and this is a
# Pydantic `BaseModel` with runtime validation, these fields are not directly accessible as
# `request` attributes, but only via the `model_extra` property.
# When sending a raw request with curl, this flattening does not occur,
# so we directly get the `extra_body` dict.
```
Can we make these comments more concise?
Done, please check again~
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
The nightly test result is at https://buildkite.com/vllm/vllm-omni/builds/4618/steps/canvas?sid=019d0ea3-9a38-4e34-8bd9-7d3c613b902c. Temporary CI modifications are reverted.
Purpose
The online endpoint currently fails to read extra_body when the request routes to the diffusion mode. The Omni mode, however, works as expected.
For some reason, the output image's width and height are still correct... but at least the number of inference steps is not respected. Other params were not investigated, but this is sufficient to report the issue.
This implicitly causes a more severe problem after PR #1979 was merged. Whether to step a cache backend now depends on the number of inference steps, which is strictly read from user input with no default value. Since the online mode cannot read this property, Qwen-Image-Edit-2509 crashes with TeaCache (#2036).
The cause is noted in comments in the source code. It is jointly caused by vLLM's data preprocessing at the API layer and the implicit Pydantic behavior on top of it.
Test Plan
Baseline: Without this PR
To reproduce #2036, run this test:

```bash
pytest tests/e2e/online_serving/test_qwen_image_edit_expansion.py -k 'test_qwen_image_edit_2509[single_card_001]' -s -v
```

It should fail, as noted in #2036.
To also see the number of inference steps:
Serve the model with:

```bash
vllm serve /home/models/Qwen/Qwen-Image-Edit-2509 --omni --host 127.0.0.1 --port 38205 --stage-init-timeout 120
```

(without a cache backend), then send a request with extra_body, for example the sketch below.
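A minimal sketch of such a raw request using `requests` (endpoint path, prompt, and parameter values are illustrative):

```python
import requests

payload = {
    "model": "/home/models/Qwen/Qwen-Image-Edit-2509",
    "messages": [{"role": "user", "content": "change the sky to sunset"}],
    # Raw HTTP keeps extra_body nested; the server must unwrap it itself.
    "extra_body": {"num_inference_steps": 2, "size": "1024x1024"},
}
resp = requests.post(
    "http://127.0.0.1:38205/v1/chat/completions", json=payload, timeout=600
)
print(resp.status_code)
```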
You should see:
Still 50 steps.
Plus, if you add an additional `logger.info(f"@@@@@@@@ extra body is {extra_body}")` in `_create_diffusion_chat_completion`, you will see it prints None.

Test Result
After this PR, the above pytest case passes on my side. Plus, the inference progress bar now shows …
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.