[Model] Add Ming-flash-omni-2.0 Image Generation (Diffusion) Stage by ZhengWG · Pull Request #6 · ZhengWG/vllm-omni

ZhengWG · 2026-04-26T14:38:35Z

Purpose

Adds the image generation (diffusion) stage for inclusionAI/Ming-flash-omni-2.0, completing the model's text-to-image and image-edit (img2img) capabilities on top of the already-merged thinker (vllm-project#1822) and talker / TTS (vllm-project#2890) stages. The diffusion stage runs MingImagePipeline (Qwen2 connector + ZImage DiT + VAE) on top of the AR thinker's hidden states, exposed end-to-end through the OpenAI /v1/chat/completions endpoint with modalities: ["image"].

This PR has been rebased / merged onto current main (the prior development branch was based on a much older commit before vllm-project#1822 / vllm-project#2890 / [Config Refactor][2/N] landed). It supersedes the previous draft and is intentionally not linked to it.

Key conflict-resolution highlights from the merge:

vllm_omni/diffusion/data.py::OmniDiffusionConfig.enrich_config: re-fold the BailingMM2NativeForConditionalGeneration / ming_flash_omni* -> MingImagePipeline mapping into the new try/else structure that already handles the diffusers adapter and nextstep.
vllm_omni/entrypoints/openai/serving_chat.py: keep the upstream OpenAI flow unchanged but reintroduce the Ming-specific <IMAGE> placeholder injection for img2img (so the thinker's ref-image substitution still fires when the multimodal cache is warm) and the is_image_gen mm_processor_kwargs flag (used by Ming's MM processor to expand query tokens).
vllm_omni/model_executor/models/ming_flash_omni/ming_flash_omni.py: keep the multi-stage load_weights (drops imagegen.* weights — they're loaded by MingImagePipeline from its own subfolders — and reports query_tokens_dict.* as loaded since they're pre-loaded inside MingFlashOmniThinker.__init__).
vllm_omni/model_executor/models/ming_flash_omni/prompt_utils.py: combine the image-gen query-token helpers with the TTS caption builder.
vllm_omni/model_executor/stage_input_processors/ming_flash_omni.py: combine expand_cfg_prompts / thinker2imagegen (image-gen) with thinker2talker (TTS) into one processor module so all Ming stage YAMLs (text-to-image dual, TTS, omni-speech) keep working.
vllm_omni/transformers_utils/configs/ming_flash_omni.py: retain MingImageGenConfig, register it with AutoConfig, and expose it alongside MingFlashOmniTalkerConfig in MingFlashOmniConfig.sub_configs.
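
The load_weights behaviour described above (skip imagegen.* entries, report query_tokens_dict.* as already loaded) can be illustrated with a minimal sketch. This is not the code in ming_flash_omni.py; the function name `split_checkpoint` and the plain `(name, tensor)` iterable are assumptions made for illustration only.

```python
from typing import Any, Iterable, List, Set, Tuple

# Prefixes taken from the PR description; everything else here is hypothetical.
IMAGEGEN_PREFIX = "imagegen."            # owned by MingImagePipeline, not the thinker
PRELOADED_PREFIX = "query_tokens_dict."  # populated inside the thinker's __init__


def split_checkpoint(
    weights: Iterable[Tuple[str, Any]],
) -> Tuple[List[Tuple[str, Any]], Set[str]]:
    """Return (weights the thinker should load, names to report as loaded)."""
    to_load: List[Tuple[str, Any]] = []
    preloaded: Set[str] = set()
    for name, tensor in weights:
        if name.startswith(IMAGEGEN_PREFIX):
            # Dropped here: the diffusion stage reads these from its own subfolders.
            continue
        if name.startswith(PRELOADED_PREFIX):
            # Not loaded again, but reported so that a strict
            # "all weights accounted for" check would not flag them.
            preloaded.add(name)
            continue
        to_load.append((name, tensor))
    return to_load, preloaded
```

Reporting the pre-loaded names rather than silently skipping them is the part that matters: a loader that verifies every checkpoint key was consumed would otherwise treat them as missing.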

Usage

Launch the dual-stage server (AR thinker on TP=4 across GPUs 0-3 + diffusion stage on GPU 4 by default; tweak devices in the YAML for your hardware):

vllm serve Jonathan1909/Ming-flash-omni-2.0 \
    --omni \
    --stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni_dual.yaml \
    --trust-remote-code \
    --port 8188

Text-to-image:

curl http://127.0.0.1:8188/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Jonathan1909/Ming-flash-omni-2.0",
        "messages": [{"role": "user", "content": "Please draw a cute cat."}],
        "modalities": ["image"]
    }' -o /tmp/ming_response.json

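python -c "
import base64, json

# Load the saved chat-completion response.
with open('/tmp/ming_response.json') as f:
    r = json.load(f)
# The image arrives as a base64 data URL in the first content item.
url = r['choices'][0]['message']['content'][0]['image_url']['url']
png = base64.b64decode(url.split(',', 1)[1])
with open('/tmp/ming_cat.png', 'wb') as f:
    f.write(png)
print('PNG bytes:', len(png))
"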

Image edit (img2img) — include a reference image as the first content item; the chat endpoint detects the reference image, prepends <IMAGE> to the prompt so the thinker's ref-image substitution still fires when the multimodal cache is warm, and routes through MingImagePipeline in img2img mode.
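
As a rough sketch of what an img2img request body might look like, the helper below builds a messages list with the reference image as the first content item, followed by the edit instruction. It uses the standard OpenAI chat content-part shape; passing the image as a base64 data URL is an assumption, and `build_img2img_messages` is a hypothetical name. The `<IMAGE>` placeholder is deliberately not added on the client side, since the description above says the endpoint prepends it.

```python
import base64
from typing import Any, Dict, List


def build_img2img_messages(
    image_bytes: bytes, instruction: str, mime: str = "image/png"
) -> List[Dict[str, Any]]:
    """Build a single user message: reference image first, then the edit text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return [
        {
            "role": "user",
            "content": [
                # Reference image goes first so the endpoint can detect it.
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": instruction},
            ],
        }
    ]
```

The returned list can be placed under "messages" in the same JSON body used for text-to-image, keeping "modalities": ["image"].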

Optional generation knobs (extra_body):

| Field | Default | Notes |
| --- | --- | --- |
| `height`, `width` | 1024×1024 | Target image resolution. |
| `num_inference_steps` | 30 | DiT denoise step count. |
| `negative_prompt` | None | Triggers the CFG companion via `expand_cfg_prompts`. |
| `cfg_text_scale` | 2.0 | Classifier-free-guidance scale. |
| `seed` | request seed | Deterministic generation. |
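
A minimal sketch of assembling a request with these knobs follows. It assumes the knobs sit at the top level of the JSON body, which is where the OpenAI Python client places `extra_body` fields; whether the server accepts them in exactly this position is not confirmed here. `build_image_request` and `ALLOWED_KNOBS` are hypothetical names.

```python
from typing import Any, Dict

# Field names copied from the knobs listed above.
ALLOWED_KNOBS = {
    "height",
    "width",
    "num_inference_steps",
    "negative_prompt",
    "cfg_text_scale",
    "seed",
}


def build_image_request(model: str, prompt: str, **knobs: Any) -> Dict[str, Any]:
    """Build a chat-completions body for image output with optional knobs."""
    unknown = set(knobs) - ALLOWED_KNOBS
    if unknown:
        raise ValueError(f"unsupported knobs: {sorted(unknown)}")
    payload: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "modalities": ["image"],
    }
    # Omit unset knobs so the server-side defaults apply.
    payload.update({k: v for k, v in knobs.items() if v is not None})
    return payload
```

For example, `build_image_request("Jonathan1909/Ming-flash-omni-2.0", "Please draw a cute cat.", seed=7, negative_prompt="blurry")` yields a body that can be serialized with `json.dumps` and sent with curl or any HTTP client.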

What's added

Implementation:

vllm_omni/diffusion/models/ming_flash_omni/: MingImagePipeline (pipeline_ming.py), Qwen2 connector + RMSNorm (condition_encoder.py), ZImage transformer with ref-image fusion hooks (ming_zimage_transformer.py), optional ByT5 glyph encoder (byte5_encoder.py + t5_block_mapper.py), plus the standard __init__.py / registry wiring.
vllm_omni/diffusion/registry.py: MingImagePipeline and its post-process hook.
vllm_omni/transformers_utils/configs/ming_flash_omni.py: MingImageGenConfig (subfolder layout, img_gen_scales, defaults from Ming's cookbook) registered with AutoConfig.
vllm_omni/model_executor/stage_configs/ming_flash_omni_dual.yaml: the dual-stage runtime YAML wiring AR thinker -> shared-memory connector -> diffusion stage with expand_cfg_prompts / thinker2imagegen.
vllm_omni/model_executor/stage_input_processors/ming_flash_omni.py: expand_cfg_prompts (CFG companion expansion when negative_prompt is set) and thinker2imagegen (slice final_hidden_states at <imagePatch> positions, pack into the diffusion prompt, and emit a paired negative_thinker_hidden_states extra when a CFG companion output is present).
vllm_omni/entrypoints/openai/serving_chat.py: <IMAGE> placeholder injection for img2img + is_image_gen mm_processor_kwargs flag.
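
The two stage-input processors listed above can be pictured with the simplified sketch below. It is not the code in stage_input_processors/ming_flash_omni.py: the dict-based prompt shape, the `is_cfg_companion` flag, and the list-of-rows hidden states are all assumptions chosen to keep the example dependency-free. Only the function names `expand_cfg_prompts` and the idea of slicing at image-patch positions come from the description.

```python
from typing import Any, Dict, List, Sequence


def expand_cfg_prompts(prompt: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return [prompt], or [prompt, companion] when negative_prompt is set."""
    negative = prompt.get("negative_prompt")
    if not negative:
        return [prompt]
    # The companion reuses the request settings but carries the negative text.
    companion = dict(prompt, text=negative, is_cfg_companion=True)
    companion.pop("negative_prompt")
    return [prompt, companion]


def slice_image_patch_states(
    token_ids: Sequence[int],
    hidden_states: Sequence[Sequence[float]],
    image_patch_id: int,
) -> List[Sequence[float]]:
    """Keep only the hidden-state rows at image-patch token positions."""
    if len(token_ids) != len(hidden_states):
        raise ValueError("token_ids and hidden_states must have equal length")
    return [h for t, h in zip(token_ids, hidden_states) if t == image_patch_id]
```

In this picture, the diffusion stage would receive the sliced rows for the positive prompt and, when a companion output exists, a second set of sliced rows as the negative conditioning.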

Examples + docs (per Diffusion Model Requirements):

examples/offline_inference/ming_flash_omni/README.md and examples/online_serving/ming_flash_omni/README.md: new "Image generation (thinker + diffusion)" sections with launch commands, text-to-image / image-edit curl snippets, and the full optional-knobs table.
recipes/inclusionAI/Ming-flash-omni-2.0.md: register a fourth deployment mode (thinker + diffusion) with reference 5×H100 hardware setup, full launch command, verification curl, and notes about CFG companion expansion + img2img placeholder.
docs/models/supported_models.md: register Ming-flash-omni-2.0 with its three architectures (MingFlashOmniForConditionalGeneration, MingFlashOmniTalkerForConditionalGeneration, MingImagePipeline).

Test Plan

L4 functionality tests added per docs/contributing/ci/test_examples/l4_functionality_tests.inc.md (Ming is a normal priority model, so we use a small focused matrix):

tests/e2e/online_serving/test_ming_flash_omni_imagegen_expansion.py (@pytest.mark.full_model @pytest.mark.diffusion @pytest.mark.omni):
- ming_imagegen_default: dual-stage default + CFG companion via negative_prompt.
- ming_imagegen_cache_dit: same + --cache-backend cache_dit.
- ming_imagegen_cfg_parallel_2: same + --cfg-parallel-size 2.
- ming_imagegen_img2img_default: image edit with a synthetic ref image, validates the <IMAGE> placeholder injection path end-to-end.
  Validation is delegated to assert_diffusion_response (height/width/payload shape).
tests/e2e/offline_inference/test_ming_flash_omni_imagegen.py: drives the same dual-stage pipeline through the public Omni entrypoint and asserts a PIL image is emitted; at run_level={advanced_model, full_model} it also asserts the native 1024×1024 resolution.
tests/e2e/stage_configs/bailingmm_moe_v2_lite_imagegen_ci.yaml: CI stage config (with load_format: dummy by default; the dummy format is stripped automatically at higher run_levels by tests/helpers/fixtures/runtime.py).
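
The "dummy by default, stripped at higher run levels" behaviour can be sketched as follows. The real logic lives in tests/helpers/fixtures/runtime.py and may differ; the `stages` / `engine_args` layout and the name `resolve_stage_config` are hypothetical, and only the run-level names `advanced_model` and `full_model` come from the test plan above.

```python
import copy
from typing import Any, Dict

# Run levels at which real weights are expected (names from the test plan).
HIGH_RUN_LEVELS = {"advanced_model", "full_model"}


def resolve_stage_config(config: Dict[str, Any], run_level: str) -> Dict[str, Any]:
    """Return a copy of the config with load_format: dummy removed when needed."""
    resolved = copy.deepcopy(config)  # never mutate the parsed YAML in place
    if run_level in HIGH_RUN_LEVELS:
        for stage in resolved.get("stages", []):
            args = stage.get("engine_args", {})
            if args.get("load_format") == "dummy":
                del args["load_format"]
    return resolved
```

Keeping the dummy format in the checked-in YAML means the cheapest CI level needs no checkpoint download, while higher levels fall back to the loader's default format.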

To run locally:

# Online serving — full L4 matrix
pytest -xvs tests/e2e/online_serving/test_ming_flash_omni_imagegen_expansion.py

# Offline E2E
pytest -xvs tests/e2e/offline_inference/test_ming_flash_omni_imagegen.py

# Existing thinker + TTS coverage continues to run
pytest -xvs tests/e2e/offline_inference/test_ming_flash_omni.py
pytest -xvs tests/e2e/online_serving/test_ming_flash_omni.py

Test Result

Manual verification (text-to-image, single H100 cluster, 30 steps, 1024×1024, prompt "Please draw a cute cat") produces a coherent generated image; see the original PR description for a sample output. Image edit (img2img) end-to-end was validated against a synthetic 512×512 reference image with the watercolour-painting edit instruction. The L4 tests above exercise the runtime wiring with load_format: dummy for cheap CI runs and the full pixel path at higher run levels.
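
For a quick local spot-check of the output resolution without extra dependencies, the sketch below reads width and height from a PNG's IHDR chunk using only the standard library. It is a convenience helper under the assumption that the response payload is a PNG (as in the usage example); it is not how assert_diffusion_response is implemented, and `png_size` is a hypothetical name.

```python
import struct
from typing import Tuple

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> Tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then big-endian 4-byte width and 4-byte height.
    if len(data) < 24 or data[:8] != PNG_MAGIC or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

Applied to the bytes decoded in the text-to-image example, a default request would be expected to report 1024 by 1024.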

cc @yuanheng-zhao @hsliuustc0106

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
The test results. Please paste the results comparison before and after, or the e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
(Optional) Release notes update. If your change is user-facing, please update the release notes draft.

Signed-off-by: ZhengWG <zwg0606@gmail.com>

Merges main (which now contains Ming-flash-omni-2.0 thinker (vllm-project#1822) and talker/TTS (vllm-project#2890) stages) into the image-generation feature branch.

Conflict resolution highlights:
- vllm_omni/diffusion/data.py::OmniDiffusionConfig.enrich_config: preserve the BailingMM2NativeForConditionalGeneration / ming_flash_omni branch that maps to MingImagePipeline (the diffusion-stage class for Ming image generation), and fold it into the upstream try/else structure that now also handles the diffusers adapter and nextstep.
- vllm_omni/entrypoints/openai/serving_chat.py: keep the upstream OpenAI flow unchanged but reintroduce the Ming-specific <IMAGE> placeholder injection for img2img and the is_image_gen mm_processor flag (used by Ming's MM processor to expand query tokens).
- vllm_omni/model_executor/models/ming_flash_omni/ming_flash_omni.py: drop the no-longer-needed talker NotImplementedError (talker is loaded directly via MingFlashOmniTalkerForConditionalGeneration on main), and document that imagegen runs as a separate diffusion stage. Keep the multi-stage load_weights with imagegen-prefix drop and query_tokens_dict tracking.
- vllm_omni/model_executor/models/ming_flash_omni/prompt_utils.py: combine the image-gen query-token helpers (this branch) and the TTS caption builder (main) into a single module with a shared __all__.
- vllm_omni/model_executor/stage_input_processors/ming_flash_omni.py: combine expand_cfg_prompts / thinker2imagegen (this branch) and thinker2talker (main) into a single processor module so all Ming stage YAMLs (text-to-image dual, TTS, omni-speech) keep working.
- vllm_omni/transformers_utils/configs/ming_flash_omni.py: retain both MingImageGenConfig (this branch) and MingFlashOmniTalkerConfig (main); register both with AutoConfig and expose both in MingFlashOmniConfig.sub_configs.

Signed-off-by: cursor-agent <cursor-agent@cursor.sh>
Co-authored-by: Zheng Wengang <zwg0606@gmail.com>

Document the thinker + diffusion (image generation) deployment path now that the image-gen stage merges back from the dual-stage YAML, per the Diffusion Model Requirements:
- examples/offline_inference/ming_flash_omni/README.md and examples/online_serving/ming_flash_omni/README.md: add an 'Image generation (thinker + diffusion)' section with launch commands, text-to-image curl, image-edit (img2img) curl, and the full set of optional knobs accepted via extra_body (height/width, num_inference_steps, negative_prompt, cfg_text_scale, seed).
- recipes/inclusionAI/Ming-flash-omni-2.0.md: register a fourth deployment mode (thinker + diffusion) with reference 5xH100 hardware recipe, full launch command, verification curl snippets, and notes about CFG companion expansion + img2img placeholder.
- docs/models/supported_models.md: register Ming-flash-omni-2.0 with its three architectures (thinker AR, talker, MingImagePipeline) in the supported models table (CUDA only for now).
- vllm_omni/diffusion/models/ming_flash_omni/__init__.py: restore the module-level docstring so 'import vllm_omni.diffusion.models.ming_flash_omni' is a regular package (not a namespace package).

Signed-off-by: cursor-agent <cursor-agent@cursor.sh>
Co-authored-by: Zheng Wengang <zwg0606@gmail.com>

…ation

Per the Diffusion Model Requirements, every newly added diffusion model must ship at least an L4 *functionality* test (see docs/contributing/ci/test_examples/l4_functionality_tests.inc.md). Ming is a *normal priority* model in that matrix, so the new tests cover the end-to-end thinker -> diffusion image-gen path with a small set of distinct runtime configurations:
- tests/e2e/online_serving/test_ming_flash_omni_imagegen_expansion.py: L4 expansion test parametrised over three online configs that each exercise a different feature toggle on top of the dual-stage ming_flash_omni_dual.yaml runtime:
  * default (CFG companion via negative_prompt)
  * cache_dit cache backend
  * cfg_parallel_size=2
  Plus a separate img2img case that validates the chat endpoint's reference-image -> <IMAGE> placeholder injection and routing through MingImagePipeline in img2img mode. Validation is delegated to assert_diffusion_response (height/width/payload shape).
- tests/e2e/offline_inference/test_ming_flash_omni_imagegen.py: Offline counterpart that drives the same dual-stage pipeline through the public Omni entrypoint, asserts the diffusion stage emits a PIL image, and at run_level=advanced/full additionally checks the native 1024x1024 resolution.
- tests/e2e/stage_configs/bailingmm_moe_v2_lite_imagegen_ci.yaml: CI stage config (load_format: dummy by default; stripped at higher run levels) wiring the AR thinker on TP=4 + the MingImagePipeline diffusion stage with shared_memory_connector cross-stage transport and the expand_cfg_prompts / thinker2imagegen processors.

Signed-off-by: cursor-agent <cursor-agent@cursor.sh>
Co-authored-by: Zheng Wengang <zwg0606@gmail.com>

ZhengWG and others added 17 commits April 19, 2026 16:08

naive support dit

3ae8678

Signed-off-by: ZhengWG <zwg0606@gmail.com>

feat: naive support text2img for Ming

a3c6d71

Signed-off-by: ZhengWG <zwg0606@gmail.com>

feat: support mult-statges

004f053

Signed-off-by: ZhengWG <zwg0606@gmail.com>

fix: fix ming import error

eb1cde3

Signed-off-by: ZhengWG <zwg0606@gmail.com>

clean code

1e71f08

Signed-off-by: ZhengWG <zwg0606@gmail.com>

refactor: del useless code

1a71dbf

Signed-off-by: ZhengWG <zwg0606@gmail.com>

fix: add prompt util for Ming

a93880b

Signed-off-by: ZhengWG <zwg0606@gmail.com>

fix: support sample-params for Ming

3c4cd6e

Signed-off-by: ZhengWG <zwg0606@gmail.com>

feat: support negative-prompt

8b5f4ae

Signed-off-by: ZhengWG <zwg0606@gmail.com>

feat: support ref-image

b6dce22

Signed-off-by: ZhengWG <zwg0606@gmail.com>

feat: add byte5 support for Ming

8aef154

Signed-off-by: ZhengWG <zwg0606@gmail.com>

fix: fix sampling_pars support

769ed0d

Signed-off-by: ZhengWG <zwg0606@gmail.com>

fix: fix ref image-edit

4b46c01

Signed-off-by: ZhengWG <zwg0606@gmail.com>

fix: fix byt5 encoder for Ming

c055de9

Signed-off-by: ZhengWG <zwg0606@gmail.com>

ZhengWG force-pushed the main branch from a6cbeb2 to 9a60e11 · May 12, 2026 06:12

ZhengWG wants to merge 17 commits into main from cursor/ming-omni-image-generation-6d83
