
[bugfix] /chat/completion doesn't read extra_body for diffusion model#2042

Merged
Gaohan123 merged 11 commits into vllm-project:main from fhfuih:fix-online-extra-body
Mar 21, 2026

Conversation

@fhfuih
Contributor

@fhfuih fhfuih commented Mar 20, 2026

Purpose

The online endpoints currently fail to read extra_body when a request routes to the diffusion mode. The Omni mode, however, behaves normally.

For some reason, the output image's width and height are still correct... but at least the number of inference steps is not respected. Other params were not investigated, but this is sufficient to flag the issue.

This implicitly caused a more severe problem after PR #1979 merged. Whether to step a cache backend now depends on the number of inference steps, which is strictly read from user input with no default value. Since the online mode cannot read this property, Qwen-Image-Edit-2509 ended up crashing with TeaCache (#2036).

  • I haven't investigated why Qwen-Image-Edit-2509 runs well with cachedit, and why Qwen-Image-Edit and some other models run fine with both cache backends... This is also peculiar.

The cause is documented in comments in the source code. It is jointly caused by vLLM's data preprocessing at the API layer and implicit Pydantic behavior on top of it.
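
For illustration, the two request shapes at issue look like this (a sketch in Python; values arbitrary, keys taken from the test commands below):

# Shape sent by raw HTTP clients (curl): extra_body arrives as a nested object.
curl_payload = {
    "messages": [{"role": "user", "content": "A beautiful landscape painting"}],
    "extra_body": {"height": 1024, "width": 1024, "num_inference_steps": 2},
}

# Shape sent by the OpenAI Python SDK: extra_body is flattened into the root,
# so the extra keys surface as undeclared top-level fields on the server side.
sdk_payload = {
    "messages": [{"role": "user", "content": "A beautiful landscape painting"}],
    "height": 1024, "width": 1024, "num_inference_steps": 2,
}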

Test Plan

Baseline: Without this PR

To reproduce #2036

Run this test: pytest tests/e2e/online_serving/test_qwen_image_edit_expansion.py -k 'test_qwen_image_edit_2509[single_card_001]' -s -v

It should fail, as noted in #2036.

To also see the number of inference steps:

Serve the model with vllm serve /home/models/Qwen/Qwen-Image-Edit-2509 --omni --host 127.0.0.1 --port 38205 --stage-init-timeout 120 (without a cache backend).

Then send a request with extra_body:

curl -s http://localhost:38205/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "A beautiful landscape painting"}
    ],
    "extra_body": {
      "height": 1024,
      "width": 1024,
      "num_inference_steps": 2,
      "true_cfg_scale": 4.0,
      "seed": 42
    }
  }' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2- | base64 -d > output.png

You should see:

(There will be a dummy-run progress bar, always 1/1.)
...
(APIServer pid=14519) INFO 03-20 08:48:52 [launcher.py:47] Route: /v1/videos, Methods: POST
(APIServer pid=14519) INFO 03-20 08:48:52 [launcher.py:47] Route: /v1/videos, Methods: GET
(APIServer pid=14519) INFO 03-20 08:48:52 [launcher.py:47] Route: /v1/videos/{video_id}, Methods: GET
(APIServer pid=14519) INFO 03-20 08:48:52 [launcher.py:47] Route: /v1/videos/{video_id}, Methods: DELETE
(APIServer pid=14519) INFO 03-20 08:48:52 [launcher.py:47] Route: /v1/videos/{video_id}/content, Methods: GET
(APIServer pid=14519) INFO 03-20 08:48:52 [launcher.py:58] Route: /v1/audio/speech/stream, Endpoint: streaming_speech
(APIServer pid=14519) INFO:     Started server process [14519]
(APIServer pid=14519) INFO:     Waiting for application startup.
(APIServer pid=14519) INFO:     Application startup complete.
Server ready on 127.0.0.1:54525
OmniServer started successfully
...
(request received)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:30<00:00,  1.64it/s]

Still 50 steps.

Plus, if you add an extra logger.info(f"@@@@@@@@ extra body is {extra_body}") in _create_diffusion_chat_completion, you will see it print None.

Test Result

After this PR, the above pytest case passes on my side. Plus, the inference progress bar now shows:

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.67s/it]


Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 20, 2026 09:23
@fhfuih fhfuih requested a review from hsliuustc0106 as a code owner March 20, 2026 09:23

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a59b1d511


Comment on lines 2051 to 2054
extra_body = request.model_extra or {}

# Parse size if provided (supports "1024x1024" format)
height = extra_body.get("height")

P1: Preserve nested extra_body payloads for raw HTTP callers

This assumes custom generation args are always flattened into request.model_extra, but our documented non-SDK clients still send a nested {"extra_body": {...}} object (for example docs/user_guide/examples/online_serving/text_to_image.md:58-71 and examples/online_serving/glm_image/openai_chat_client.py:68-79). Because ChatCompletionRequest accepts arbitrary extra fields, that shape arrives as request.extra_body / request.model_extra["extra_body"], so the following .get("height"), .get("num_inference_steps"), etc. all return None. After this change, curl/requests.post(json=payload) callers silently lose their generation parameters in this diffusion path and in the identical block above.


Contributor Author


False positive; the reason is already explained in the code comment. requests.post(json=payload) callers only receive the correct params after this PR.

Contributor

Copilot AI left a comment


Pull request overview

Fixes online chat/diffusion serving not honoring extra_body parameters by reading Pydantic “extra” fields after vLLM’s request preprocessing removes the nested extra_body.

Changes:

  • Switches parameter extraction from request.extra_body to request.model_extra for online serving.
  • Adds inline comments explaining vLLM preprocessing + Pydantic behavior behind the issue.


Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
@wtomin
Collaborator

wtomin commented Mar 20, 2026

This is critical. Can I trigger the CI to check whether the error in #2036 is fixed? Shall I change it to the ready-merge tag?

@fhfuih
Contributor Author

fhfuih commented Mar 20, 2026

This is critical. Can I trigger the CI to check whether the error in #2036 is fixed? Shall I change it to the ready-merge tag?

Yeah, please go ahead. Meanwhile I am working on attaching more logs.

@fhfuih
Contributor Author

fhfuih commented Mar 20, 2026

More logs

Diffusion

Adding the following snippet to async def _create_diffusion_chat_completion (vllm_omni/entrypoints/openai/serving_chat.py, around line 2020):

            logger.info(f"@@@@@@@@@@ request is {request}")
            logger.info(f"@@@@@@@@@@ raw request is {await raw_request.json()}")
            logger.info(f"@@@@@@@@@@ correct extra_body from request.model_extra is {request.model_extra}")
            logger.info(f"@@@@@@@@@@ wrong extra_body from `getattr(request, 'extra_body') or {}` is {getattr(request, 'extra_body') or {}}")
            logger.info(f'\n@@@@@\n'.join(f"{name}: {x}" for x, name in zip([
                num_inference_steps,
                guidance_scale,
                true_cfg_scale,
                seed,
                negative_prompt,
                num_outputs_per_prompt,
                num_frames,
                guidance_scale_2,
                lora_body,
                layers,
                resolution,
            ], [
                "num_inference_steps",
                "guidance_scale",
                "true_cfg_scale",
                "seed",
                "negative_prompt",
                "num_outputs_per_prompt",
                "num_frames",
                "guidance_scale_2",
                "lora_body",
                "layers",
                "resolution",
            ])))

and we get:

(APIServer pid=27658) INFO 03-20 09:52:04 [serving_chat.py:2082] @@@@@@@@@@ request is messages=... model='/home/models/Qwen/Qwen-Image-Edit-2509' frequency_penalty=0.0 logit_bias=None logprobs=False top_logprobs=0 max_tokens=None max_completion_tokens=None n=1 presence_penalty=0.0 response_format=None seed=42 stop=[] stream=False stream_options=None temperature=None top_p=None tools=None tool_choice='none' reasoning_effort=None include_reasoning=True parallel_tool_calls=True user=None use_beam_search=False top_k=None min_p=None repetition_penalty=None length_penalty=1.0 stop_token_ids=[] include_stop_str_in_output=False ignore_eos=False min_tokens=0 skip_special_tokens=True spaces_between_special_tokens=True truncate_prompt_tokens=None prompt_logprobs=None allowed_token_ids=None bad_words=[] echo=False add_generation_prompt=True continue_final_message=False add_special_tokens=False documents=None chat_template=None chat_template_kwargs=None mm_processor_kwargs=None structured_outputs=None priority=0 request_id='86aa19c604379ef7' return_tokens_as_token_ids=None return_token_ids=None cache_salt=None kv_transfer_params=None vllm_xargs=None repetition_detection=None height=512 width=512 num_inference_steps=2 negative_prompt='blurry, low quality, modern, geometrist' true_cfg_scale=4.0
(APIServer pid=27658) INFO 03-20 09:52:04 [serving_chat.py:2083] @@@@@@@@@@ raw request is {'messages': ..., 'model': '/home/models/Qwen/Qwen-Image-Edit-2509', 'height': 512, 'width': 512, 'num_inference_steps': 2, 'negative_prompt': 'blurry, low quality, modern, geometrist', 'true_cfg_scale': 4.0, 'seed': 42}
(APIServer pid=27658) INFO 03-20 09:52:04 [serving_chat.py:2084] @@@@@@@@@@ correct extra_body from request.model_extra is {'height': 512, 'width': 512, 'num_inference_steps': 2, 'negative_prompt': 'blurry, low quality, modern, geometrist', 'true_cfg_scale': 4.0}
(APIServer pid=27658) INFO 03-20 09:52:04 [serving_chat.py:2085] @@@@@@@@@@ wrong extra_body from `getattr(request, 'extra_body', None) or {}` is {}

Note that request, being a subclass of a vLLM/Pydantic class, has a custom __str__ format, so it does print all fields as if they were top-level. But if you try getattr directly on the extra args, it still fails; they have to be accessed via the model_extra property.


Omni (llm)

⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️ Note that here it is completely the other way around.

Adding the following snippet to async def create_chat_completion (vllm_omni/entrypoints/openai/serving_chat.py, around line 312):

                logger.info(f"~~~~~~~~~~~~ request is {request}")
                logger.info(f"~~~~~~~~~~~~ raw request is {await raw_request.json()}")
                logger.info(f"~~~~~~~~~~~~ correct extra_body from request.model_extra is {request.model_extra}")
                logger.info(f"~~~~~~~~~~~~ wrong extra_body from `getattr(request, 'extra_body', None) or {}` is {getattr(request, 'extra_body', None) or {}}")
                logger.info(f'\n~~~~~~~~~~~~\n'.join(f"{name}: {x}" for x, name in zip([
                    height,
                    width,
                    negative_prompt,
                ], [
                    "height",
                    "width",
                    "negative_prompt",
                ])))

Then launch Bagel, for example:

vllm serve /home/models/ByteDance-Seed/BAGEL-7B-MoT --omni --host 127.0.0.1 --port 44791 --stage-init-timeout 300

Send a request:

curl -s http://localhost:44791/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "A beautiful landscape painting"}
    ], "modalities": ["image"],
    "extra_body": {
      "height": 256,
      "width": 256,
      "num_inference_steps": 2,
      "true_cfg_scale": 4.0,
      "seed": 42
    }
  }' | tee /dev/stderr | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2- | base64 -d > output.png

and we get the following:

(APIServer pid=34558) INFO 03-20 10:37:15 [serving_chat.py:315] ~~~~~~~~~~~~ raw request is {'messages': [{'role': 'user', 'content': 'A beautiful landscape painting'}], 'modalities': ['image'], 'extra_body': {'height': 256, 'width': 256, 'num_inference_steps': 2, 'true_cfg_scale': 4.0, 'seed': 42}}
(APIServer pid=34558) INFO 03-20 10:37:15 [serving_chat.py:316] ~~~~~~~~~~~~ correct extra_body from request.model_extra is {'modalities': ['image'], 'extra_body': {'height': 256, 'width': 256, 'num_inference_steps': 2, 'true_cfg_scale': 4.0, 'seed': 42}}
(APIServer pid=34558) INFO 03-20 10:37:15 [serving_chat.py:317] ~~~~~~~~~~~~ wrong extra_body from `getattr(request, 'extra_body', None) or {}` is {'height': 256, 'width': 256, 'num_inference_steps': 2, 'true_cfg_scale': 4.0, 'seed': 42}

In this case, the new way to get extra_body is actually wrong (because there is a nested extra_body dict). I haven't found out why 😂😂

Conclusion

Only make the change to the diffusion /chat/completion handler.

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
@wtomin wtomin added the ready label (label to trigger buildkite CI) Mar 20, 2026
fhfuih added 2 commits March 20, 2026 18:46
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
@fhfuih fhfuih changed the title from "bugfix: online serving doesn't read extra_body" to "bugfix: /chat/completion doesn't read extra_body for diffusion model" Mar 20, 2026
# In addition, these extra attrs are hidden as the default behavior of Pydantic `BaseModel`
# (which `ChatCompletionRequest` inherits from, and these fields are not explicitly defined).
# They are ONLY accessible via model_extra property. Cannot get via getattr(request, "num_inference_steps")
extra_body = request.model_extra or {}
Collaborator


Why do we have a model_extra field here, which is not present in the curl request?

Contributor Author


It is an attribute of Pydantic BaseModel, which is a base class of vLLM's ChatCompletionRequest, which is the type of the request parameter.

(Pydantic offers an enhanced dataclass with runtime type validation)

https://docs.pydantic.dev/latest/api/base_model/#pydantic.BaseModel.model_extra

This attribute stores all the fields that are provided during the init of a Pydantic dataclass and not explicitly defined inside the dataclass.

The content of the extra_body dict itself is somehow extracted and then merged into the kwargs that initialize this request dataclass (at least on some occasions), so the values go there.
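
A minimal standalone Pydantic sketch of that behavior (assuming extra fields are allowed, as they are for ChatCompletionRequest; this is not vLLM code):

from pydantic import BaseModel, ConfigDict

class Req(BaseModel):
    model_config = ConfigDict(extra="allow")  # keep undeclared fields instead of dropping them
    prompt: str

r = Req(prompt="hi", num_inference_steps=2)  # undeclared field passed at init
print(r.model_extra)  # {'num_inference_steps': 2}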

Collaborator


I see. So the reasonable fix would be to first handle the curl nested-dict format via getattr(request, "extra_body", None); if that is None, check the SDK-flattened params stored in request.model_extra.
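
A minimal sketch of that fallback (illustrative only; get_extra_body is a hypothetical helper name, not the committed code):

def get_extra_body(request) -> dict:
    # curl clients send a nested {"extra_body": {...}} object, which survives
    # as an undeclared extra field on the request
    nested = getattr(request, "extra_body", None)
    if isinstance(nested, dict):
        return nested
    # the OpenAI SDK flattens extra_body into the payload root, so the keys
    # land in Pydantic's model_extra instead
    return request.model_extra or {}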

@SamitHuang
Collaborator

SamitHuang commented Mar 20, 2026

I tested with the following request; num_inference_steps in extra_body is parsed correctly.

vllm serve "${MODEL:-Qwen/Qwen-Image-Edit}" --omni --port "${PORT:-8092}"
#!/usr/bin/env bash
# POST /v1/chat/completions with image + text for Qwen-Image-Edit (text first, then image).
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
INPUT_IMG="${1:-$SCRIPT_DIR/rabbit.png}"
PROMPT="${2:-Convert this image to watercolor painting style}"
URL="${QIE_URL:-http://localhost:8092/v1/chat/completions}"

if [[ ! -f "$INPUT_IMG" ]]; then
  echo "Image not found: $INPUT_IMG" >&2
  exit 1
fi

base64 -w0 "$INPUT_IMG" | jq -Rs --arg model "Qwen/Qwen-Image-Edit" --arg prompt "$PROMPT" '{
  model: $model,
  messages: [{
    role: "user",
    content: [
      {type: "text", text: $prompt},
      {type: "image_url", image_url: {url: ("data:image/png;base64," + .)}}
    ]
  }],
  extra_body: {
    num_inference_steps: 2,
    guidance_scale: 1.0,
    true_cfg_scale: 4.0,
    seed: 0
  }
}' | curl -sS -X POST "$URL" \
  -H "Content-Type: application/json" \
  -d @- \
  --max-time 600

Output: correct, using 2 steps:

(APIServer pid=170411) WARNING 03-20 11:16:28 [protocol.py:51] The following fields were present in the request but ignored: {'extra_body'}
(APIServer pid=170411) INFO 03-20 11:16:28 [serving_chat.py:2073] Diffusion chat request chatcmpl-5d4faf419b3e4259: prompt='Convert this image to watercolor painting style', ref_images=1, params={'num_inference_steps': 2, 'guidance_scale': 1.0, 'true_cfg_scale': 4.0, 'seed': 0}
(APIServer pid=170411) INFO 03-20 11:16:28 [orchestrator.py:577] [Orchestrator] _handle_add_request: stage=0 req=chatcmpl-5d4faf419b3e4259 prompt_type=dict original_prompt_type=dict final_stage=0 num_sampling_params=1
(APIServer pid=170411) INFO 03-20 11:16:28 [diffusion_engine.py:87] Pre-processing completed in 0.0478 seconds
INFO 03-20 11:16:28 [manager.py:608] Deactivating all adapters: 0 layers
WARNING 03-20 11:16:28 [kv_transfer_manager.py:381] No connector available for receiving KV cache
WARNING 03-20 11:16:28 [pipeline_qwen_image_edit.py:640] negative_prompt is not set. The official Qwen-Image-Edit model may produce lower-quality results without a negative_prompt. Qwen official repository recommends to use whitespace string as negative_prompt. Note: some distilled variants may not be affected by this.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.86it/s]
WARNING 03-20 11:16:29 [diffusion_worker.py:385] SHM pack failed, falling back to raw enqueue: Got unsupported ScalarType BFloat16
(APIServer pid=170411) INFO 03-20 11:16:29 [diffusion_engine.py:95] Generation completed successfully.
(APIServer pid=170411) INFO 03-20 11:16:29 [diffusion_engine.py:128] Post-processing completed in 0.0479 seconds
(APIServer pid=170411) INFO 03-20 11:16:29 [diffusion_engine.py:131] DiffusionEngine.step breakdown: preprocess=47.78 ms, add_req_and_wait=1011.50 ms, postprocess=47.86 ms, total=1107.69 ms
(APIServer pid=170411) INFO 03-20 11:16:29 [omni_base.py:154] [Summary] {}
(APIServer pid=170411) INFO 03-20 11:16:30 [serving_chat.py:2238] Diffusion chat completed for request chatcmpl-5d4faf419b3e4259: 1 images

@fhfuih
Contributor Author

fhfuih commented Mar 20, 2026

I tested with the following request; num_inference_steps in extra_body is parsed correctly.

Hmm... 🤔 Then maybe I need to investigate this matter further. Maybe there are some additional conditions that trigger this data preprocessing of the request object. (And that's maybe why Qwen-Image-Edit didn't break in the nightly test after #1979 was merged.)


Signed-off-by: Samit <285365963@qq.com>
@SamitHuang
Collaborator

SamitHuang commented Mar 20, 2026

I tested with the following request; num_inference_steps in extra_body is parsed correctly.

Hmm... 🤔 Then maybe I need to investigate this matter further. Maybe there are some additional conditions that trigger this data preprocessing of the request object. (And that's maybe why Qwen-Image-Edit didn't break in the nightly test after #1979 was merged.)


Figured it out. It's because I used curl with "extra_body" as a nested JSON key, while you used the SDK-flattened format. The latest commits make curl-nested, SDK-flattened, AND top-level params all work transparently.

Btw, I think we need to update the online serving docs to reduce users' confusion on this part. I will do that in #2051.
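
For reference, a rough sketch of the merged behavior described above (resolve_generation_params is a hypothetical helper name, not the literal diff):

def resolve_generation_params(request) -> dict:
    # start from SDK-flattened / top-level extras captured by Pydantic
    params = dict(request.model_extra or {})
    # fold in curl's nested {"extra_body": {...}} shape, if present
    nested = params.pop("extra_body", None)
    if isinstance(nested, dict):
        params.update(nested)
    return params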

Signed-off-by: Gao Han <hgaoaf@connect.ust.hk>
@fhfuih
Contributor Author

fhfuih commented Mar 21, 2026

Ohh, so the flattening happens on the client side, in the OpenAI client library, not on the server side. That's why I didn't see any relevant server code. I will also take a look at your commit above. Thanks for the clarification.
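
For reference, a minimal client-side sketch of that flattening (the endpoint and model path reuse the earlier test setup; the api_key placeholder is an assumption):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:38205/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="/home/models/Qwen/Qwen-Image-Edit-2509",
    messages=[{"role": "user", "content": "A beautiful landscape painting"}],
    # The SDK merges these keys into the JSON body root before sending,
    # so the server never sees a nested "extra_body" object.
    extra_body={"height": 1024, "width": 1024, "num_inference_steps": 2, "seed": 42},
)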

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
@fhfuih
Contributor Author

fhfuih commented Mar 21, 2026

That also explains why I couldn't reproduce the same behavior when testing the BAGEL model late yesterday: I was sending curl requests.

So I have also just added the same logic to the Omni version of the API handler, together with updated comments reflecting this discovery.

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Collaborator

@SamitHuang SamitHuang left a comment


LGTM now

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Comment on lines +2050 to +2058
# [NOTE] When sending a request from the openai client Python library (.chat.completions.create),
# the `extra_body` argument is flattened and merged into the payload's root,
# and there is no longer an `extra_body` attribute in the request.
# Since these fields are not declared in
# vllm.entrypoints.openai.chat_completion.protocol.ChatCompletionRequest, and this is a
# Pydantic `BaseModel` with runtime validation, these fields are not directly accessible as
# `request` attributes, but only via the `model_extra` property.
# When sending a raw request with curl, this flattening does not occur,
# and we directly get the `extra_body` dict.
Collaborator


Can we make these comments more concise?

Contributor Author


Done, please check again~

Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
@fhfuih
Contributor Author

fhfuih commented Mar 21, 2026

The nightly test result is at https://buildkite.com/vllm/vllm-omni/builds/4618/steps/canvas?sid=019d0ea3-9a38-4e34-8bd9-7d3c613b902c

Temporary CI modifications have been reverted.

@Gaohan123 Gaohan123 changed the title from "bugfix: /chat/completion doesn't read extra_body for diffusion model" to "[bugfix] /chat/completion doesn't read extra_body for diffusion model" Mar 21, 2026
@Gaohan123 Gaohan123 merged commit da21e99 into vllm-project:main Mar 21, 2026
7 of 8 checks passed
hsliuustc0106 added a commit to hsliuustc0106/vllm-omni-skills that referenced this pull request Mar 22, 2026
@fhfuih fhfuih deleted the fix-online-extra-body branch March 23, 2026 02:59