[bugfix] /chat/completion doesn't read extra_body for diffusion model #2042
Gaohan123 merged 11 commits into vllm-project:main from
Conversation
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a59b1d511
```python
extra_body = request.model_extra or {}

# Parse size if provided (supports "1024x1024" format)
height = extra_body.get("height")
```
Preserve nested `extra_body` payloads for raw HTTP callers
This assumes custom generation args are always flattened into `request.model_extra`, but our documented non-SDK clients still send a nested `{"extra_body": {...}}` object (for example docs/user_guide/examples/online_serving/text_to_image.md:58-71 and examples/online_serving/glm_image/openai_chat_client.py:68-79). Because `ChatCompletionRequest` accepts arbitrary extra fields, that shape arrives as `request.extra_body` / `request.model_extra["extra_body"]`, so the following `.get("height")`, `.get("num_inference_steps")`, etc. all return None. After this change, curl / `requests.post(json=payload)` callers silently lose their generation parameters in this diffusion path and in the identical block above.
False positive. The reason is already explained in the code comment: `requests.post(json=payload)` callers only receive the correct params after this PR.
Pull request overview
Fixes online chat/diffusion serving not honoring extra_body parameters by reading Pydantic “extra” fields after vLLM’s request preprocessing removes the nested extra_body.
Changes:
- Switches parameter extraction from `request.extra_body` to `request.model_extra` for online serving (see the sketch below).
- Adds inline comments explaining the vLLM preprocessing and Pydantic behavior behind the issue.
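For context, a rough sketch of the two payload shapes involved (the field names follow this PR's diffusion parameters; the exact payloads are illustrative):

```python
# SDK-flattened: the OpenAI Python client merges extra_body into the payload root,
# so the server sees the generation args as top-level extra fields.
sdk_payload = {
    "model": "Qwen-Image-Edit-2509",
    "messages": [{"role": "user", "content": "draw a cat"}],
    "num_inference_steps": 2,
    "height": 1024,
    "width": 1024,
}

# Raw HTTP (curl / requests.post): extra_body stays nested, so the server instead
# sees a single extra field named "extra_body" holding the whole dict.
raw_payload = {
    "model": "Qwen-Image-Edit-2509",
    "messages": [{"role": "user", "content": "draw a cat"}],
    "extra_body": {"num_inference_steps": 2, "height": 1024, "width": 1024},
}
```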
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
It is critical. Can I trigger the CI to check whether the error of #2036 is fixed? Change it to the ready-merge tag?
Yeah, please go ahead. Meanwhile I am working on attaching more logs.
More logs

Diffusion

Adding the following snippet to …:

```python
logger.info(f"@@@@@@@@@@ request is {request}")
logger.info(f"@@@@@@@@@@ raw request is {await raw_request.json()}")
logger.info(f"@@@@@@@@@@ correct extra_body from request.model_extra is {request.model_extra}")
logger.info(f"@@@@@@@@@@ wrong extra_body from `getattr(request, 'extra_body') or {{}}` is {getattr(request, 'extra_body') or {}}")
logger.info(f'\n@@@@@\n'.join(f"{name}: {x}" for x, name in zip([
    num_inference_steps,
    guidance_scale,
    true_cfg_scale,
    seed,
    negative_prompt,
    num_outputs_per_prompt,
    num_frames,
    guidance_scale_2,
    lora_body,
    layers,
    resolution,
], [
    "num_inference_steps",
    "guidance_scale",
    "true_cfg_scale",
    "seed",
    "negative_prompt",
    "num_outputs_per_prompt",
    "num_frames",
    "guidance_scale_2",
    "lora_body",
    "layers",
    "resolution",
])))
```

and we get … Note that …

Omni (llm)

Adding the following snippet to … and launch Bagel, for example. Send request … and we get the following … In this case, the new way to get extra_body is actually wrong (because there will be a nested …).

Conclusion

Only make the change to the diffusion /chat/completion handler.
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
```python
# In addition, these extra attrs are hidden as the default behavior of Pydantic `BaseModel`
# (which `ChatCompletionRequest` inherits from, and these fields are not explicitly defined).
# They are ONLY accessible via the model_extra property. Cannot get via getattr(request, "num_inference_steps").
extra_body = request.model_extra or {}
```
Why do we have a model_extra field here, which is not present in the curl request?
It is an attribute of Pydantic `BaseModel`, which is a base class of vLLM's `ChatCompletionRequest`, which is the type of the `request` parameter.
(Pydantic offers an enhanced dataclass with runtime type validation.)
https://docs.pydantic.dev/latest/api/base_model/#pydantic.BaseModel.model_extra
This attribute stores all the fields that are provided during the init of a Pydantic model but are not explicitly defined on the class.
The content of the extra_body dict itself is somehow extracted and then merged into the kwargs that initialize this request object (at least on some occasions), so the values go there. A minimal sketch of this behavior is below.
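A minimal sketch of that Pydantic behavior, using a hypothetical `DemoRequest` class; it only assumes the request class is configured to allow extra fields, as the thread above indicates for `ChatCompletionRequest`:

```python
from pydantic import BaseModel, ConfigDict


class DemoRequest(BaseModel):
    # Stand-in for ChatCompletionRequest: keep undeclared init kwargs as extras.
    model_config = ConfigDict(extra="allow")
    model: str


req = DemoRequest(model="demo", num_inference_steps=2, seed=42)

# Fields not declared on the class are collected by Pydantic and exposed
# through the model_extra property rather than the regular schema fields.
print(req.model_extra)                  # {'num_inference_steps': 2, 'seed': 42}
print(DemoRequest.model_fields.keys())  # dict_keys(['model'])
```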
I see. So the reasonable fix would be to handle the curl nested-dict format first with `getattr(request, "extra_body", None)`; if that is None, check the SDK-flattened params stored in `request.model_extra` (see the sketch below).
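A rough sketch of that fallback order (the helper name `_resolve_extra_body` is hypothetical, not the actual handler code):

```python
def _resolve_extra_body(request) -> dict:
    # Raw-HTTP callers (curl / requests) keep a nested "extra_body" object.
    nested = getattr(request, "extra_body", None)
    if isinstance(nested, dict) and nested:
        return nested
    # OpenAI-SDK callers arrive flattened, so fall back to the Pydantic extras.
    return request.model_extra or {}


# e.g. num_inference_steps = _resolve_extra_body(request).get("num_inference_steps")
```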
I tested with the following request, and num_inference_steps in extra_body can be parsed correctly. Output: correct, using 2 steps.
Hmm... 🤔 Then maybe I need to investigate this matter further. Maybe there are some additional conditions that trigger this data preprocessing of the …
Signed-off-by: Samit <285365963@qq.com>
The code expects users to run curl with "extra_body" as a nested JSON key. I think I can …
Figured it out. It's because I used curl with "extra_body" as a nested JSON key, while you used the SDK-flattened format. The latest commits make curl-nested, SDK-flattened, AND top-level params all work transparently. Btw, I think we need to update the online serving docs to reduce users' confusion on this part. I will do that in #2051.
Signed-off-by: Gao Han <hgaoaf@connect.ust.hk>
Ohh, so the flattening happens on the client side, using the OpenAI client, not on the server side. That's why I didn't see relevant code. I will also take a look at your commit above. Thanks for the clarification. (A short sketch of that client-side flattening follows.)
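For reference, a minimal sketch of that behavior (base URL, model name, and parameter values are placeholders): the OpenAI Python SDK merges the keys of `extra_body` into the JSON request body at the top level, so the server never receives a nested `extra_body` key.

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:38205/v1", api_key="EMPTY")

# extra_body is an SDK-side convenience: its keys are merged into the request
# body root, e.g. {"model": ..., "messages": ..., "num_inference_steps": 2, ...}.
resp = client.chat.completions.create(
    model="Qwen-Image-Edit-2509",
    messages=[{"role": "user", "content": "draw a cat"}],
    extra_body={"num_inference_steps": 2, "height": 1024, "width": 1024},
)
```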
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
That also explains why I couldn't reproduce the same behavior when testing the BAGEL model late yesterday: I was sending curl requests. So I also added the same logic to the Omni version of the API handler just now, together with updated comments to match this discovery.
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
```python
# [NOTE] When sending a request from the OpenAI client Python library (.chat.completions.create),
# the `extra_body` argument is flattened and merged into the payload's root,
# and there is no longer an `extra_body` attribute on the request.
# Since these fields are not declared in
# vllm.entrypoints.openai.chat_completion.protocol.ChatCompletionRequest, and this is a
# Pydantic `BaseModel` with runtime validation, these fields are not directly accessible as
# `request` attributes, but only via the `model_extra` property.
# When sending a raw request with curl, this flattening does not occur,
# so we directly get the `extra_body` dict.
```
Can we make these comments more concise?
Done, please check again~
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
The nightly test result is at https://buildkite.com/vllm/vllm-omni/builds/4618/steps/canvas?sid=019d0ea3-9a38-4e34-8bd9-7d3c613b902c. Temporary CI modifications are reverted.
Purpose
The online endpoint currently fails to read extra_body when the request routes to the diffusion mode. The Omni mode, however, works as expected.
For some reason, the output image's width and height are still correct... but at least the number of inference steps is not respected. Other params were not investigated, but this is sufficient to report the issue.
This implicitly causes a more severe problem after PR #1979 was merged. Whether to step a cache backend now depends on the number of inference steps, which is strictly read from user input with no default value. Since the online mode cannot read this property, Qwen-Image-Edit-2509 crashes with TeaCache (#2036).
The cause is noted in comments in the source code. It is jointly caused by vLLM's data preprocessing at the API layer and the implicit Pydantic behavior on top of it.
Test Plan
Baseline: Without this PR
To reproduce #2036, run this test:

```bash
pytest tests/e2e/online_serving/test_qwen_image_edit_expansion.py -k 'test_qwen_image_edit_2509[single_card_001]' -s -v
```

It should fail, as noted in #2036.
To also see the number of inference steps:
Serve the model with:

```bash
vllm serve /home/models/Qwen/Qwen-Image-Edit-2509 --omni --host 127.0.0.1 --port 38205 --stage-init-timeout 120
```

(without a cache backend), then send a request with extra_body, for example the sketch below.
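A minimal sketch of such a raw request using `requests` (endpoint path, prompt, and parameter values are illustrative):

```python
import requests

payload = {
    "model": "/home/models/Qwen/Qwen-Image-Edit-2509",
    "messages": [{"role": "user", "content": "change the sky to sunset"}],
    # Raw HTTP keeps extra_body nested; the server must unwrap it itself.
    "extra_body": {"num_inference_steps": 2, "size": "1024x1024"},
}
resp = requests.post(
    "http://127.0.0.1:38205/v1/chat/completions", json=payload, timeout=600
)
print(resp.status_code)
```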
You should see:
Still 50 steps.
Plus, if you add an additional `logger.info(f"@@@@@@@@ extra body is {extra_body}")` in `_create_diffusion_chat_completion`, you will see it prints None.

Test Result
After this PR, the above pytest case passes on my side. Plus, the inference progress bar now shows …
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.