
[Model] Add Ming-flash-omni-2.0 Image Generation (Diffusion) Stage#6

Draft
ZhengWG wants to merge 17 commits into main from cursor/ming-omni-image-generation-6d83



@ZhengWG ZhengWG commented Apr 26, 2026

Purpose

Adds the image generation (diffusion) stage for inclusionAI/Ming-flash-omni-2.0, completing the model's text-to-image and image-edit (img2img) capabilities on top of the already-merged thinker (vllm-project#1822) and talker / TTS (vllm-project#2890) stages. The diffusion stage runs MingImagePipeline (Qwen2 connector + ZImage DiT + VAE) on top of the AR thinker's hidden states, exposed end-to-end through the OpenAI /v1/chat/completions endpoint with modalities: ["image"].

This PR has been rebased / merged onto current main. The prior development branch was based on a much older commit, from before vllm-project#1822, vllm-project#2890, and [Config Refactor][2/N] landed. It supersedes the previous draft and is intentionally not linked to it.

Key conflict-resolution highlights from the merge:

  • vllm_omni/diffusion/data.py::OmniDiffusionConfig.enrich_config: re-fold the BailingMM2NativeForConditionalGeneration / ming_flash_omni* -> MingImagePipeline mapping into the new try/else structure that already handles the diffusers adapter and nextstep.
  • vllm_omni/entrypoints/openai/serving_chat.py: keep the upstream OpenAI flow unchanged but reintroduce the Ming-specific <IMAGE> placeholder injection for img2img (so the thinker's ref-image substitution still fires when the multimodal cache is warm) and the is_image_gen mm_processor_kwargs flag (used by Ming's MM processor to expand query tokens).
  • vllm_omni/model_executor/models/ming_flash_omni/ming_flash_omni.py: keep the multi-stage load_weights, which drops imagegen.* weights (MingImagePipeline loads them from its own subfolders) and reports query_tokens_dict.* as loaded, since they're pre-loaded inside MingFlashOmniThinker.__init__.
  • vllm_omni/model_executor/models/ming_flash_omni/prompt_utils.py: combine the image-gen query-token helpers with the TTS caption builder.
  • vllm_omni/model_executor/stage_input_processors/ming_flash_omni.py: combine expand_cfg_prompts / thinker2imagegen (image-gen) with thinker2talker (TTS) into one processor module so all Ming stage YAMLs (text-to-image dual, TTS, omni-speech) keep working.
  • vllm_omni/transformers_utils/configs/ming_flash_omni.py: retain MingImageGenConfig, register it with AutoConfig, and expose it alongside MingFlashOmniTalkerConfig in MingFlashOmniConfig.sub_configs.
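
The load_weights behaviour described above for ming_flash_omni.py can be pictured with a minimal sketch. The function name split_thinker_weights and the (name, tensor) pair representation are hypothetical and only illustrate the prefix handling; the method in this PR may be structured differently.

```python
from typing import Iterable, List, Set, Tuple

# Prefixes taken from the description above.
IMAGEGEN_PREFIX = "imagegen."
QUERY_TOKENS_PREFIX = "query_tokens_dict."


def split_thinker_weights(
    weights: Iterable[Tuple[str, object]],
) -> Tuple[List[Tuple[str, object]], Set[str]]:
    """Return (weights the thinker should load, names to report as already loaded)."""
    to_load: List[Tuple[str, object]] = []
    preloaded: Set[str] = set()
    for name, tensor in weights:
        if name.startswith(IMAGEGEN_PREFIX):
            # Dropped here: the diffusion stage loads these from its own subfolders.
            continue
        if name.startswith(QUERY_TOKENS_PREFIX):
            # Reported as loaded without copying: already set up in the thinker's __init__.
            preloaded.add(name)
            continue
        to_load.append((name, tensor))
    return to_load, preloaded
```

Reporting the query-token names as loaded, rather than silently skipping them, is what would keep a strict "all checkpoint keys accounted for" check from failing.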

Usage

Launch the dual-stage server (AR thinker on TP=4 across GPUs 0-3 + diffusion stage on GPU 4 by default; tweak devices in the YAML for your hardware):

vllm serve Jonathan1909/Ming-flash-omni-2.0 \
    --omni \
    --stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni_dual.yaml \
    --trust-remote-code \
    --port 8188

Text-to-image:

curl http://127.0.0.1:8188/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Jonathan1909/Ming-flash-omni-2.0",
        "messages": [{"role": "user", "content": "Please draw a cute cat."}],
        "modalities": ["image"]
    }' -o /tmp/ming_response.json

python -c "
import base64, json
r = json.load(open('/tmp/ming_response.json'))
url = r['choices'][0]['message']['content'][0]['image_url']['url']
png = base64.b64decode(url.split(',')[1])
open('/tmp/ming_cat.png', 'wb').write(png)
print('PNG bytes:', len(png))
"

Image edit (img2img): include a reference image as the first content item. The chat endpoint detects the reference image, prepends <IMAGE> to the prompt so that the thinker's ref-image substitution still fires when the multimodal cache is warm, and routes the request through MingImagePipeline in img2img mode.
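
A request body for image edit might be built as follows. This is a sketch that assumes the endpoint accepts OpenAI-style image_url content parts carrying a base64 data URL; the helper name build_img2img_body and the example instruction are placeholders, not part of this PR.

```python
import base64


def build_img2img_body(model: str, image_bytes: bytes, instruction: str) -> dict:
    """Build a chat-completions body with the reference image as the first content item."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Reference image first, then the edit instruction.
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {"type": "text", "text": instruction},
                ],
            }
        ],
        "modalities": ["image"],
    }
```

The resulting dict can be serialised with json.dumps and posted to /v1/chat/completions in the same way as the text-to-image curl call above, for example with an instruction such as "Turn this into a watercolour painting."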

Optional generation knobs (extra_body):

| Field | Default | Notes |
|---|---|---|
| height, width | 1024×1024 | Target image resolution. |
| num_inference_steps | 30 | DiT denoise step count. |
| negative_prompt | None | Triggers the CFG companion via expand_cfg_prompts. |
| cfg_text_scale | 2.0 | Classifier-free-guidance scale. |
| seed | request seed | Deterministic generation. |
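
With the OpenAI Python client, extra_body fields are merged into the top level of the JSON request, so a raw HTTP client would place the knobs next to messages. The sketch below assumes exactly that placement; the helper name with_generation_knobs is hypothetical.

```python
# Knob names as listed in the table above.
KNOWN_KNOBS = {
    "height",
    "width",
    "num_inference_steps",
    "negative_prompt",
    "cfg_text_scale",
    "seed",
}


def with_generation_knobs(body: dict, **knobs) -> dict:
    """Return a copy of `body` with explicitly set knobs added at the top level."""
    unknown = set(knobs) - KNOWN_KNOBS
    if unknown:
        raise ValueError(f"unknown knobs: {sorted(unknown)}")
    merged = dict(body)
    # Only send values the caller actually set, so server-side defaults still apply.
    merged.update({k: v for k, v in knobs.items() if v is not None})
    return merged
```

For example, with_generation_knobs(body, height=512, width=512, negative_prompt="blurry") leaves num_inference_steps and cfg_text_scale at their server defaults.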

What's added

Implementation:

  • vllm_omni/diffusion/models/ming_flash_omni/: MingImagePipeline (pipeline_ming.py), Qwen2 connector + RMSNorm (condition_encoder.py), ZImage transformer with ref-image fusion hooks (ming_zimage_transformer.py), optional ByT5 glyph encoder (byte5_encoder.py + t5_block_mapper.py), plus the standard __init__.py / registry wiring.
  • vllm_omni/diffusion/registry.py: MingImagePipeline and its post-process hook.
  • vllm_omni/transformers_utils/configs/ming_flash_omni.py: MingImageGenConfig (subfolder layout, img_gen_scales, defaults from Ming's cookbook) registered with AutoConfig.
  • vllm_omni/model_executor/stage_configs/ming_flash_omni_dual.yaml: the dual-stage runtime YAML wiring AR thinker -> shared-memory connector -> diffusion stage with expand_cfg_prompts / thinker2imagegen.
  • vllm_omni/model_executor/stage_input_processors/ming_flash_omni.py: expand_cfg_prompts (CFG companion expansion when negative_prompt is set) and thinker2imagegen (slice final_hidden_states at <imagePatch> positions, pack into the diffusion prompt, and emit a paired negative_thinker_hidden_states extra when a CFG companion output is present).
  • vllm_omni/entrypoints/openai/serving_chat.py: <IMAGE> placeholder injection for img2img + is_image_gen mm_processor_kwargs flag.
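
The two stage-input processors listed above can be pictured with a simplified sketch. It uses plain Python lists instead of tensors, and every function name, the thinker_hidden_states key, and the idea that the companion prompt is simply the negative prompt are illustrative assumptions; only negative_thinker_hidden_states is a name taken from this description.

```python
from typing import List, Optional, Sequence


def expand_cfg_prompts_sketch(prompt: str, negative_prompt: Optional[str]) -> List[str]:
    """Return [prompt], plus a CFG companion prompt when negative_prompt is set."""
    if negative_prompt is None:
        return [prompt]
    return [prompt, negative_prompt]


def slice_at_image_patches(
    token_ids: Sequence[int],
    hidden_states: Sequence[Sequence[float]],
    image_patch_id: int,
) -> List[Sequence[float]]:
    """Keep only the hidden-state rows whose token is the <imagePatch> placeholder."""
    if len(token_ids) != len(hidden_states):
        raise ValueError("token_ids and hidden_states must have the same length")
    return [h for t, h in zip(token_ids, hidden_states) if t == image_patch_id]


def build_diffusion_inputs(
    token_ids: Sequence[int],
    hidden_states: Sequence[Sequence[float]],
    image_patch_id: int,
    neg_token_ids: Optional[Sequence[int]] = None,
    neg_hidden_states: Optional[Sequence[Sequence[float]]] = None,
) -> dict:
    """Pack the sliced states, adding the negative pair only when a companion exists."""
    out = {
        "thinker_hidden_states": slice_at_image_patches(
            token_ids, hidden_states, image_patch_id
        )
    }
    if neg_token_ids is not None and neg_hidden_states is not None:
        out["negative_thinker_hidden_states"] = slice_at_image_patches(
            neg_token_ids, neg_hidden_states, image_patch_id
        )
    return out
```

Keeping the negative states as an optional extra means the diffusion stage can run without classifier-free guidance when no negative_prompt was supplied.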

Examples + docs (per Diffusion Model Requirements):

  • examples/offline_inference/ming_flash_omni/README.md and examples/online_serving/ming_flash_omni/README.md: new "Image generation (thinker + diffusion)" sections with launch commands, text-to-image / image-edit curl snippets, and the full optional-knobs table.
  • recipes/inclusionAI/Ming-flash-omni-2.0.md: register a fourth deployment mode (thinker + diffusion) with reference 5×H100 hardware setup, full launch command, verification curl, and notes about CFG companion expansion + img2img placeholder.
  • docs/models/supported_models.md: register Ming-flash-omni-2.0 with its three architectures (MingFlashOmniForConditionalGeneration, MingFlashOmniTalkerForConditionalGeneration, MingImagePipeline).

Test Plan

L4 functionality tests added per docs/contributing/ci/test_examples/l4_functionality_tests.inc.md (Ming is a normal-priority model, so we use a small, focused matrix):

  • tests/e2e/online_serving/test_ming_flash_omni_imagegen_expansion.py (@pytest.mark.full_model @pytest.mark.diffusion @pytest.mark.omni):
    • ming_imagegen_default: dual-stage default + CFG companion via negative_prompt.
    • ming_imagegen_cache_dit: same + --cache-backend cache_dit.
    • ming_imagegen_cfg_parallel_2: same + --cfg-parallel-size 2.
    • ming_imagegen_img2img_default: image edit with a synthetic ref image, validates the <IMAGE> placeholder injection path end-to-end.
      Validation is delegated to assert_diffusion_response (height/width/payload shape).
  • tests/e2e/offline_inference/test_ming_flash_omni_imagegen.py: drives the same dual-stage pipeline through the public Omni entrypoint and asserts a PIL image is emitted; at run_level={advanced_model, full_model} it also asserts the native 1024×1024 resolution.
  • tests/e2e/stage_configs/bailingmm_moe_v2_lite_imagegen_ci.yaml: CI stage config (with load_format: dummy by default; the dummy format is stripped automatically at higher run_levels by tests/helpers/fixtures/runtime.py).
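
A height/width check of the kind mentioned above can be done without an image library by reading the PNG header. The sketch below is not the repository's assert_diffusion_response; it is a hypothetical stand-alone helper that assumes the response carries a base64 PNG data URL, as in the decoding snippet under Usage.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size_from_data_url(url: str) -> tuple:
    """Decode a base64 data URL and read (width, height) from the PNG IHDR chunk."""
    png = base64.b64decode(url.split(",", 1)[1])
    # Bytes 0-7: signature; 8-11: chunk length; 12-15: chunk type, which must be IHDR.
    if png[:8] != PNG_SIGNATURE or png[12:16] != b"IHDR":
        raise ValueError("payload is not a PNG")
    # IHDR data starts at byte 16: width then height, both big-endian uint32.
    width, height = struct.unpack(">II", png[16:24])
    return width, height
```

A test could then assert that png_size_from_data_url(url) equals the requested (width, height) pair.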

To run locally:

# Online serving — full L4 matrix
pytest -xvs tests/e2e/online_serving/test_ming_flash_omni_imagegen_expansion.py

# Offline E2E
pytest -xvs tests/e2e/offline_inference/test_ming_flash_omni_imagegen.py

# Existing thinker + TTS coverage continues to run
pytest -xvs tests/e2e/offline_inference/test_ming_flash_omni.py
pytest -xvs tests/e2e/online_serving/test_ming_flash_omni.py

Test Result

Manual verification (text-to-image, single H100 cluster, 30 steps, 1024×1024, prompt "Please draw a cute cat") produces a coherent generated image; see the original PR description for a sample output. The image-edit (img2img) path was validated end to end against a synthetic 512×512 reference image with the watercolour-painting edit instruction. The L4 tests above exercise the runtime wiring with load_format: dummy for cheap CI runs, and the full pixel path at higher run levels.

cc @yuanheng-zhao @hsliuustc0106


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

ZhengWG and others added 17 commits April 19, 2026 16:08
Signed-off-by: ZhengWG <zwg0606@gmail.com>
Merges main (which now contains Ming-flash-omni-2.0 thinker (vllm-project#1822) and
talker/TTS (vllm-project#2890) stages) into the image-generation feature branch.

Conflict resolution highlights:
- vllm_omni/diffusion/data.py::OmniDiffusionConfig.enrich_config:
  preserve the BailingMM2NativeForConditionalGeneration / ming_flash_omni
  branch that maps to MingImagePipeline (the diffusion-stage class for
  Ming image generation), and fold it into the upstream try/else
  structure that now also handles the diffusers adapter and nextstep.
- vllm_omni/entrypoints/openai/serving_chat.py: keep the upstream
  OpenAI flow unchanged but reintroduce the Ming-specific <IMAGE>
  placeholder injection for img2img and the is_image_gen mm_processor
  flag (used by Ming's MM processor to expand query tokens).
- vllm_omni/model_executor/models/ming_flash_omni/ming_flash_omni.py:
  drop the no-longer-needed talker NotImplementedError (talker is
  loaded directly via MingFlashOmniTalkerForConditionalGeneration on
  main), and document that imagegen runs as a separate diffusion
  stage. Keep the multi-stage load_weights with imagegen-prefix drop
  and query_tokens_dict tracking.
- vllm_omni/model_executor/models/ming_flash_omni/prompt_utils.py:
  combine the image-gen query-token helpers (this branch) and the TTS
  caption builder (main) into a single module with a shared __all__.
- vllm_omni/model_executor/stage_input_processors/ming_flash_omni.py:
  combine expand_cfg_prompts / thinker2imagegen (this branch) and
  thinker2talker (main) into a single processor module so all Ming
  stage YAMLs (text-to-image dual, TTS, omni-speech) keep working.
- vllm_omni/transformers_utils/configs/ming_flash_omni.py:
  retain both MingImageGenConfig (this branch) and
  MingFlashOmniTalkerConfig (main); register both with AutoConfig
  and expose both in MingFlashOmniConfig.sub_configs.

Signed-off-by: cursor-agent <cursor-agent@cursor.sh>

Co-authored-by: Zheng Wengang <zwg0606@gmail.com>
Document the thinker + diffusion (image generation) deployment path now
that the image-gen stage merges back from the dual-stage YAML, per the
Diffusion Model Requirements:

- examples/offline_inference/ming_flash_omni/README.md and
  examples/online_serving/ming_flash_omni/README.md: add an
  'Image generation (thinker + diffusion)' section with launch
  commands, text-to-image curl, image-edit (img2img) curl, and the
  full set of optional knobs accepted via extra_body
  (height/width, num_inference_steps, negative_prompt,
  cfg_text_scale, seed).
- recipes/inclusionAI/Ming-flash-omni-2.0.md: register a fourth
  deployment mode (thinker + diffusion) with reference 5xH100 hardware
  recipe, full launch command, verification curl snippets, and
  notes about CFG companion expansion + img2img placeholder.
- docs/models/supported_models.md: register Ming-flash-omni-2.0 with
  its three architectures (thinker AR, talker, MingImagePipeline)
  in the supported models table (CUDA only for now).
- vllm_omni/diffusion/models/ming_flash_omni/__init__.py: restore the
  module-level docstring so 'import vllm_omni.diffusion.models.ming_flash_omni'
  is a regular package (not a namespace package).

Signed-off-by: cursor-agent <cursor-agent@cursor.sh>

Co-authored-by: Zheng Wengang <zwg0606@gmail.com>
…ation

Per the Diffusion Model Requirements, every newly added diffusion model
must ship at least an L4 *functionality* test (see
docs/contributing/ci/test_examples/l4_functionality_tests.inc.md). Ming
is a *normal priority* model in that matrix, so the new tests cover the
end-to-end thinker -> diffusion image-gen path with a small set of
distinct runtime configurations:

- tests/e2e/online_serving/test_ming_flash_omni_imagegen_expansion.py:
  L4 expansion test parametrised over three online configs that each
  exercise a different feature toggle on top of the dual-stage
  ming_flash_omni_dual.yaml runtime:
    * default (CFG companion via negative_prompt)
    * cache_dit cache backend
    * cfg_parallel_size=2
  Plus a separate img2img case that validates the chat endpoint's
  reference-image -> <IMAGE> placeholder injection and routing
  through MingImagePipeline in img2img mode. Validation is delegated
  to assert_diffusion_response (height/width/payload shape).
- tests/e2e/offline_inference/test_ming_flash_omni_imagegen.py:
  Offline counterpart that drives the same dual-stage pipeline through
  the public Omni entrypoint, asserts the diffusion stage emits a PIL
  image, and at run_level=advanced/full additionally checks the native
  1024x1024 resolution.
- tests/e2e/stage_configs/bailingmm_moe_v2_lite_imagegen_ci.yaml: CI
  stage config (load_format: dummy by default; stripped at higher run
  levels) wiring the AR thinker on TP=4 + the MingImagePipeline
  diffusion stage with shared_memory_connector cross-stage transport
  and the expand_cfg_prompts / thinker2imagegen processors.

Signed-off-by: cursor-agent <cursor-agent@cursor.sh>

Co-authored-by: Zheng Wengang <zwg0606@gmail.com>