[Bugfix] Align Offline and Online Inference#3506

Open
skf-1999 wants to merge 3 commits into
vllm-project:mainfrom
skf-1999:offline-online

@skf-1999
Contributor

@skf-1999 skf-1999 commented May 11, 2026

Purpose

Align offline and online inference

Test Plan

t2i:

PROMPT='A cinematic medium shot captures a single Asian woman seated on a chair within a dimly lit room, creating an intimate and theatrical atmosphere. The composition is focused on the subject, rendered with rich colors and intricate textures that evoke a nostalgic and moody feeling.

The primary subject is a young Asian woman with a thoughtful and expressive countenance, her gaze directed slightly away from the camera. She is seated in a relaxed yet elegant posture on an ornate, vintage armchair. The chair is upholstered in a deep red velvet, its fabric showing detailed, intricate textures and slight signs of wear. She wears a simple, elegant dress in a dark teal hue, the material catching the light in a way that reveals its fine-woven texture. Her skin has a soft, matte quality, and the light delicately models the contours of her face and arms.

The surrounding room is characterized by its vintage decor, which contributes to the historic and evocative mood. In the immediate background, partially blurred due to a shallow depth of field consistent with a f/2.8 aperture, the wall is covered with wallpaper featuring a subtle, damask pattern. The overall color palette is a carefully balanced interplay of deep teal and rich red hues, creating a visually compelling and cohesive environment. The entire scene is detailed, from the fibers of the upholstery to the subtle patterns on the wall.

The lighting is highly dramatic and artistic, defined by high contrast and pronounced shadow play. A single key light source, positioned off-camera, projects gobo lighting patterns onto the scene, casting intricate shapes of light and shadow across the woman and the back wall. These dramatic shadows create a strong scense of depth and a theatrical quality. While some shadows are deep and defined, others remain soft, gently wrapping around the subject and preventing the loss of detail in darker areas. The soft focus on the background enhances the intimate feeling, drawing all attention to the expressive subject. The overall image presents a cinematic, photorealistic photography style.'

Offline:

python3 examples/offline_inference/hunyuan_image3/end2end.py \
  --modality text2img \
  --model /data/HunyuanImage-3.0-Instruct \
  --prompts "$PROMPT" \
  --bot-task think \
  --sys-type en_unified \
  --seed 42 \
  --steps 50 \
  --output ./out_t2i_think_unified \
  --deploy-config vllm_omni/deploy/hunyuan_image3_dit.yaml

Online:

curl -X POST http://localhost:8091/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A cinematic medium shot captures a single Asian woman seated on a chair within a dimly lit room, creating an intimate and theatrical atmosphere. The composition is focused on the subject, rendered with rich colors and intricate textures that evoke a nostalgic and moody feeling. The primary subject is a young Asian woman with a thoughtful and expressive countenance, her gaze directed slightly away from the camera. She is seated in a relaxed yet elegant posture on an ornate, vintage armchair. The chair is upholstered in a deep red velvet, its fabric showing detailed, intricate textures and slight signs of wear. She wears a simple, elegant dress in a dark teal hue, the material catching the light in a way that reveals its fine-woven texture. Her skin has a soft, matte quality, and the light delicately models the contours of her face and arms. The surrounding room is characterized by its vintage decor, which contributes to the historic and evocative mood. In the immediate background, partially blurred due to a shallow depth of field consistent with a f/2.8 aperture, the wall is covered with wallpaper featuring a subtle, damask pattern. The overall color palette is a carefully balanced interplay of deep teal and rich red hues, creating a visually compelling and cohesive environment. The entire scene is detailed, from the fibers of the upholstery to the subtle patterns on the wall. The lighting is highly dramatic and artistic, defined by high contrast and pronounced shadow play. A single key light source, positioned off-camera, projects gobo lighting patterns onto the scene, casting intricate shapes of light and shadow across the woman and the back wall. These dramatic shadows create a strong scense of depth and a theatrical quality. While some shadows are deep and defined, others remain soft, gently wrapping around the subject and preventing the loss of detail in darker areas. The soft focus on the background enhances the intimate feeling, drawing all attention to the expressive subject. The overall image presents a cinematic, photorealistic photography style.",
    "use_system_prompt": "en_unified",
    "bot_task": "think",
    "num_inference_steps": 50,
    "n": 4,
    "seed": 42
  }' | jq -r '.data[0].b64_json' | base64 -d > t2i_think_unified.png
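
For readers who prefer to script the online check, the curl call above can be mirrored in Python. This is a minimal sketch, assuming the server from the test plan is listening on localhost:8091 and that the response carries base64 image data under data[i].b64_json, as the jq filter implies. The helper names build_t2i_payload, decode_images, and generate are hypothetical and not part of vllm-omni; only the standard library is used.

```python
import base64
import json
import urllib.request

# Endpoint taken from the curl command in the test plan.
URL = "http://localhost:8091/v1/images/generations"


def build_t2i_payload(prompt: str, seed: int = 42, steps: int = 50, n: int = 4) -> dict:
    """Hypothetical helper: same fields as the curl request body."""
    return {
        "prompt": prompt,
        "use_system_prompt": "en_unified",
        "bot_task": "think",
        "num_inference_steps": steps,
        "n": n,
        "seed": seed,
    }


def decode_images(response: dict) -> list[bytes]:
    """Decode every data[i].b64_json entry into raw image bytes."""
    return [base64.b64decode(item["b64_json"]) for item in response.get("data", [])]


def generate(prompt: str) -> list[bytes]:
    """POST the payload and return the decoded images (needs a running server)."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_t2i_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return decode_images(json.load(resp))
```

Unlike the curl pipeline, which saves only data[0], decode_images returns all n images, which may make it easier to compare each one against the offline output.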

This PR also adds online support for returning the AR stage's CoT text (cot_text).
To enable it, the AR stage in the stage YAML needs final_output: true and final_output_type: text:

stages:
  - stage_id: 0
    final_output: true
    final_output_type: text
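
A quick way to check a deploy config before starting the server is to look for an AR stage that declares a text final output. The sketch below assumes the YAML has already been parsed into a dict shaped like the fragment above (for example with a YAML loader). The function name, and the assumption that stage 0 is the AR stage, come from this example only.

```python
def ar_stage_emits_text(config: dict, ar_stage_id: int = 0) -> bool:
    """Return True if the given stage sets final_output: true and final_output_type: text."""
    for stage in config.get("stages", []):
        if stage.get("stage_id") == ar_stage_id:
            return (
                stage.get("final_output") is True
                and stage.get("final_output_type") == "text"
            )
    return False


# Same shape as the YAML fragment above, written as a parsed dict.
config = {"stages": [{"stage_id": 0, "final_output": True, "final_output_type": "text"}]}
print(ar_stage_emits_text(config))
```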

Online request:

curl -X POST http://localhost:8091/v1/images/edits \
  -F "image=@/data/edit_dog.png" \
  -F "prompt=新年宠物海报,Q版圆润的可爱标题\"新年快乐汪\",副标题\"HAPPY NEW YEAR\"。 鱼眼镜头,背景是房间门口,近景,上传的主体歪头笑,围着红色围巾,戴着红色毛线帽,高清,绒毛细节,面部特写。 宝丽莱相纸,超现实主义,写实主义,胶片摄影,打印颗粒感肌理。肌理,超写实,复古感。" \
  -F "use_system_prompt=en_unified" \
  -F "bot_task=think" \
  -F "n=1" \
  -F "num_inference_steps=50" \
  -F "guidance_scale=2.5" \
  -F "seed=42" \
  | jq '.cot_output'
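
The jq filter above reads a top-level cot_output field from the response. A minimal Python equivalent is sketched here, assuming the response body is JSON and that the field may simply be absent when the AR stage is not configured to emit text (that fallback is an assumption, not documented behavior). The function name extract_cot is hypothetical.

```python
import json


def extract_cot(response_body: str):
    """Return the top-level cot_output field, or None if it is missing."""
    return json.loads(response_body).get("cot_output")


# Illustrative response bodies, not real server output.
print(extract_cot('{"data": [], "cot_output": "example reasoning"}'))
print(extract_cot('{"data": []}'))
```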

Bot task type. Use full task names or simplified aliases: think, recaption, vanilla.

Test Result

[Side-by-side comparison images: Offline | Online]


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81f0e2772a

images = getattr(result.request_output, "images", [])
stage_durations = result.stage_durations
peak_memory_mb = result.peak_memory_mb
logger.info(f"[DEBUG] all_outputs length={len(all_outputs)}")
[P1] Guard CoT extraction for non-AsyncOmni engines

In generate_diffusion_images, all_outputs is only initialized inside the if isinstance(engine, AsyncOmni) branch, but it is used unconditionally afterward; when engine is not AsyncOmni, this raises UnboundLocalError before the function can return images. This breaks single-stage/non-Async diffusion serving paths that rely on the else: result = await engine.generate(...) branch.
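
One way to avoid the UnboundLocalError described above is to bind all_outputs before the branch. The sketch below is illustrative only: AsyncOmni and SingleStageEngine are stand-in classes written for this example, collect_outputs is a hypothetical reduction of generate_diffusion_images, and the real function's signature and CoT extraction logic are not reproduced here.

```python
import asyncio


class AsyncOmni:
    """Stand-in for the multi-stage engine; not the real vllm-omni class."""

    async def generate(self, prompt):
        return ["ar-stage-output", "dit-stage-output"]


class SingleStageEngine:
    """Stand-in for any engine that is not AsyncOmni."""

    async def generate(self, prompt):
        return "single-result"


async def collect_outputs(engine, prompt):
    all_outputs = []  # initialized before the branch, so it is always bound
    if isinstance(engine, AsyncOmni):
        all_outputs = await engine.generate(prompt)
        result = all_outputs[-1]
    else:
        result = await engine.generate(prompt)
    # Safe on both paths: the list is empty for non-AsyncOmni engines.
    cot_candidates = [out for out in all_outputs if out != result]
    return result, cot_candidates


print(asyncio.run(collect_outputs(SingleStageEngine(), "a cat")))
print(asyncio.run(collect_outputs(AsyncOmni(), "a cat")))
```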

flat_images.append(item)

- return flat_images, stage_durations, peak_memory_mb
+ return flat_images, stage_durations, peak_memory_mb, cot_output
[P1] Keep return tuple compatible with existing image-edit caller

This function now returns four values (flat_images, stage_durations, peak_memory_mb, cot_output), but /v1/images/edits still unpacks only three (images, _, _ = generation_result in api_server.py), which causes ValueError: too many values to unpack on that multi-stage edit path. Either update all callers or return a backward-compatible shape.
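
The two options named in the review (update all callers, or return a backward-compatible shape) can be sketched as follows. GenerationResult and generate_stub are hypothetical names introduced for illustration; the field names follow the return statement in the diff, and the values are placeholders rather than real output.

```python
from typing import NamedTuple, Optional


class GenerationResult(NamedTuple):
    """Hypothetical result type; field names follow the return statement in the diff."""

    flat_images: list
    stage_durations: dict
    peak_memory_mb: float
    cot_output: Optional[str] = None


def generate_stub() -> GenerationResult:
    # Placeholder values for illustration only.
    return GenerationResult(["img0"], {"dit": 1.0}, 1024.0, "reasoning text")


result = generate_stub()

# The old three-name unpacking fails once a fourth value is returned.
try:
    images, _, _ = result
except ValueError as exc:
    print(f"old caller breaks: {exc}")

# Option A: update the caller so it tolerates extra trailing values.
images, _, _, *_ = result

# Option B: read fields by name, which is unaffected by added fields.
images, cot = result.flat_images, result.cot_output
```

Reading fields by name keeps callers stable if further values are added later, at the cost of touching each call site once.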

Contributor Author

/v1/images/edits is now supported.

@skf-1999
Contributor Author

skf-1999 commented May 11, 2026

Online CoT output isn't ready yet; I will clean up the redundant code, all pictures, and add CoT support later.

@hsliuustc0106 hsliuustc0106 added the frontend (code related to entrypoint) label May 11, 2026
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Merge conflict (CONFLICTING status). Please rebase before review.

@skf-1999 skf-1999 force-pushed the offline-online branch 2 times, most recently from dd48197 to 610f591, May 12, 2026 08:06
@skf-1999
Contributor Author

> Online CoT output isn't ready yet; I will clean up the redundant code, all pictures, and add CoT support later.

CoT support has been added.

@skf-1999 skf-1999 changed the title from "[Bugfix]Aligning Offline and Online Text-to-Image (t2i) Inference" to "[Bugfix] Align offline and online inference", May 12, 2026
@skf-1999 skf-1999 changed the title from "[Bugfix] Align offline and online inference" to "[Bugfix] Align Offline and Online Text-to-Image (t2i) Inference", May 12, 2026
@skf-1999
Contributor Author

> Merge conflict (CONFLICTING status). Please rebase before review.

Resolved merge conflicts.

@skf-1999 skf-1999 force-pushed the offline-online branch 3 times, most recently from 74b45c3 to c966702, May 13, 2026 03:35
@skf-1999 skf-1999 changed the title from "[Bugfix] Align Offline and Online Text-to-Image (t2i) Inference" to "[Bugfix] Align Offline and Online Inference", May 13, 2026
@hsliuustc0106 hsliuustc0106 added the ready (label to trigger buildkite CI) label May 14, 2026
@Gaohan123 Gaohan123 added this to the v0.22.0 milestone May 14, 2026
Collaborator

@Gaohan123 Gaohan123 left a comment

Thanks for the contribution. Here are two suggestions:

  1. Please add unit tests (UT) for this change.
  2. Please check PR #3444 to confirm the list of supported bot tasks, since the two PRs are currently inconsistent.

@skf-1999 skf-1999 force-pushed the offline-online branch 2 times, most recently from 74cd091 to 1939c90, May 18, 2026 07:06
@skf-1999
Contributor Author

The buildkite/vllm-omni-intel-ci failure in test_mix_to_audio is a pre-existing environment issue: the Intel CI image is missing the eSpeak system dependency required by pyttsx3. It is unrelated to the changes in this PR.

@skf-1999
Contributor Author

> Thanks for the contribution. Here are two suggestions:
>
>   1. Please add unit tests (UT) for this change.
>   2. Please check PR #3444 ([Feature] HunyuanImage-3.0 IT2I: multi-image input + prompt API cleanup) to confirm the list of supported bot tasks, since the two PRs are currently inconsistent.

Unit tests are still pending and will be added. The bot_task values have been unified to follow PR #3444, and the conflicts have been resolved.

@TaffyOfficial
Contributor

P1: The use_system_prompt parameter for /v1/images/generations does not actually take effect. api_server.py (line 1538) passes extra_body["use_system_prompt"], while serving_chat.py (line 2251) only reads sys_type. As a result, the use_system_prompt="custom" or other modes passed by users in the Images Generation API are silently ignored, and the multi-stage pipeline fails to work as per the declared schema.
P1/P2: The bot_task schema is inconsistent with the actual contract of the Hunyuan prompt builder. images.py (line 116) allows legacy composite tasks such as t2i_think and it2i_recaption, but the bot_task in prompt_utils.py (line 67) only accepts think, recaption, think_recaption, vanilla, and None in practice. The values allowed by the schema will trigger an "Unknown bot_task" error in the builder when passed to serving_chat.py (line 2279). Meanwhile, the schema omits think_recaption, which is actually supported.
P2: all_outputs is still initialized only in the AsyncOmni branch. all_outputs is initialized in serving_chat.py (line 2501), but the code unconditionally iterates over it in serving_chat.py (line 2523). The current service initialization almost always follows the AsyncOmni path, so this does not cause immediate failure on the main path; however, the code compatibility branch is unstable.
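
The first two points above can be illustrated with a small sketch. It is not the fix applied in this PR: resolve_sys_type and validate_bot_task are hypothetical helpers, the precedence between the two keys is an assumption, and the accepted set is copied from the list of values the review says prompt_utils.py accepts in practice.

```python
from typing import Optional

# Values the review says the prompt builder accepts in practice.
ACCEPTED_BOT_TASKS = {"think", "recaption", "think_recaption", "vanilla", None}


def resolve_sys_type(extra_body: dict) -> Optional[str]:
    """Read the system-prompt mode from either key, preferring use_system_prompt."""
    if extra_body.get("use_system_prompt") is not None:
        return extra_body["use_system_prompt"]
    return extra_body.get("sys_type")


def validate_bot_task(bot_task: Optional[str]) -> Optional[str]:
    """Reject values the builder would not understand before they reach it."""
    if bot_task not in ACCEPTED_BOT_TASKS:
        raise ValueError(f"Unknown bot_task: {bot_task!r}")
    return bot_task
```

Validating at the API boundary would surface a schema mismatch as a request error instead of a failure deep inside the prompt builder; whether a legacy name such as t2i_think should be rejected or mapped to a simplified alias is left open here.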

@TaffyOfficial
Contributor

P2: Online stop tokens must not be hardcoded; the stop tokens provided by the framework should be used instead.

@skf-1999
Contributor Author

> P1: The use_system_prompt parameter for /v1/images/generations does not actually take effect. api_server.py (line 1538) passes extra_body["use_system_prompt"], while serving_chat.py (line 2251) only reads sys_type. As a result, the use_system_prompt="custom" or other modes passed by users in the Images Generation API are silently ignored, and the multi-stage pipeline fails to work as per the declared schema.
> P1/P2: The bot_task schema is inconsistent with the actual contract of the Hunyuan prompt builder. images.py (line 116) allows legacy composite tasks such as t2i_think and it2i_recaption, but the bot_task in prompt_utils.py (line 67) only accepts think, recaption, think_recaption, vanilla, and None in practice. The values allowed by the schema will trigger an "Unknown bot_task" error in the builder when passed to serving_chat.py (line 2279). Meanwhile, the schema omits think_recaption, which is actually supported.
> P2: all_outputs is still initialized only in the AsyncOmni branch. all_outputs is initialized in serving_chat.py (line 2501), but the code unconditionally iterates over it in serving_chat.py (line 2523). The current service initialization almost always follows the AsyncOmni path, so this does not cause immediate failure on the main path; however, the code compatibility branch is unstable.

Will handle all of these in the next commit, along with unit tests.

Signed-off-by: skf1999 <13234016272@163.com>
skf-1999 added 2 commits May 19, 2026 11:40
Signed-off-by: skf1999 <13234016272@163.com>
Signed-off-by: skf1999 <13234016272@163.com>

Labels

frontend (code related to entrypoint)
ready (label to trigger buildkite CI)


4 participants