[Model] Add Ming-flash-omni-2.0 Image Generation (Diffusion) Stage#6
Draft
ZhengWG wants to merge 17 commits into
Signed-off-by: ZhengWG <zwg0606@gmail.com>
Merges main (which now contains Ming-flash-omni-2.0 thinker (vllm-project#1822) and talker/TTS (vllm-project#2890) stages) into the image-generation feature branch.

Conflict resolution highlights:
- vllm_omni/diffusion/data.py::OmniDiffusionConfig.enrich_config:
  preserve the BailingMM2NativeForConditionalGeneration /
  ming_flash_omni branch that maps to MingImagePipeline (the
  diffusion-stage class for Ming image generation), and fold it into
  the upstream try/else structure that now also handles the diffusers
  adapter and nextstep.
- vllm_omni/entrypoints/openai/serving_chat.py: keep the upstream
  OpenAI flow unchanged but reintroduce the Ming-specific <IMAGE>
  placeholder injection for img2img and the is_image_gen mm_processor
  flag (used by Ming's MM processor to expand query tokens).
- vllm_omni/model_executor/models/ming_flash_omni/ming_flash_omni.py:
  drop the no-longer-needed talker NotImplementedError (talker is
  loaded directly via MingFlashOmniTalkerForConditionalGeneration on
  main), and document that imagegen runs as a separate diffusion
  stage. Keep the multi-stage load_weights with imagegen-prefix drop
  and query_tokens_dict tracking.
- vllm_omni/model_executor/models/ming_flash_omni/prompt_utils.py:
  combine the image-gen query-token helpers (this branch) and the TTS
  caption builder (main) into a single module with a shared __all__.
- vllm_omni/model_executor/stage_input_processors/ming_flash_omni.py:
  combine expand_cfg_prompts / thinker2imagegen (this branch) and
  thinker2talker (main) into a single processor module so all Ming
  stage YAMLs (text-to-image dual, TTS, omni-speech) keep working.
- vllm_omni/transformers_utils/configs/ming_flash_omni.py: retain both
  MingImageGenConfig (this branch) and MingFlashOmniTalkerConfig
  (main); register both with AutoConfig and expose both in
  MingFlashOmniConfig.sub_configs.
Signed-off-by: cursor-agent <cursor-agent@cursor.sh>
Co-authored-by: Zheng Wengang <zwg0606@gmail.com>
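The load_weights behaviour kept by this merge (dropping imagegen-prefixed weights and tracking query_tokens_dict entries) can be illustrated with a minimal sketch. It is not the code in ming_flash_omni.py: the function name partition_thinker_weights and the load_fn callback are hypothetical, and only the two prefixes come from this PR.

```python
from typing import Any, Callable, Iterable, Set, Tuple

# The two prefixes are named in this PR; every other name is hypothetical.
IMAGEGEN_PREFIX = "imagegen."
QUERY_TOKENS_PREFIX = "query_tokens_dict."


def partition_thinker_weights(
    weights: Iterable[Tuple[str, Any]],
    load_fn: Callable[[str, Any], None],
) -> Set[str]:
    """Return the weight names the thinker stage reports as loaded."""
    loaded: Set[str] = set()
    for name, tensor in weights:
        if name.startswith(IMAGEGEN_PREFIX):
            # Dropped here: per the PR, the diffusion stage loads these
            # from its own subfolders.
            continue
        if name.startswith(QUERY_TOKENS_PREFIX):
            # Per the PR, these are initialised elsewhere, so they are
            # reported as loaded without being copied again.
            loaded.add(name)
            continue
        load_fn(name, tensor)
        loaded.add(name)
    return loaded
```

Reporting the query-token names as loaded, rather than silently skipping them, is what would let a strict "every checkpoint key was consumed" check pass in a loader that performs one; whether the real loader does so is an assumption here.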
Document the thinker + diffusion (image generation) deployment path now that the image-gen stage merges back from the dual-stage YAML, per the Diffusion Model Requirements:
- examples/offline_inference/ming_flash_omni/README.md and
  examples/online_serving/ming_flash_omni/README.md: add an 'Image
  generation (thinker + diffusion)' section with launch commands,
  text-to-image curl, image-edit (img2img) curl, and the full set of
  optional knobs accepted via extra_body (height/width,
  num_inference_steps, negative_prompt, cfg_text_scale, seed).
- recipes/inclusionAI/Ming-flash-omni-2.0.md: register a fourth
  deployment mode (thinker + diffusion) with reference 5xH100 hardware
  recipe, full launch command, verification curl snippets, and notes
  about CFG companion expansion + img2img placeholder.
- docs/models/supported_models.md: register Ming-flash-omni-2.0 with
  its three architectures (thinker AR, talker, MingImagePipeline) in
  the supported models table (CUDA only for now).
- vllm_omni/diffusion/models/ming_flash_omni/__init__.py: restore the
  module-level docstring so
  'import vllm_omni.diffusion.models.ming_flash_omni' is a regular
  package (not a namespace package).
Signed-off-by: cursor-agent <cursor-agent@cursor.sh>
Co-authored-by: Zheng Wengang <zwg0606@gmail.com>
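The request shapes documented by this commit (text-to-image, image edit with a reference image first, and the optional knobs) can be sketched as a small payload builder. build_image_request is a hypothetical helper, not part of vllm_omni. Two assumptions are baked in: the reference image travels as a standard OpenAI image_url content part carrying a base64 data URL, and knobs passed through extra_body end up as top-level fields of the JSON body.

```python
import base64
from typing import Any, Dict, Optional

# Knob names are the ones listed in this PR.
ALLOWED_KNOBS = {
    "height", "width", "num_inference_steps",
    "negative_prompt", "cfg_text_scale", "seed",
}


def build_image_request(
    prompt: str,
    model: str = "Jonathan1909/Ming-flash-omni-2.0",
    reference_png: Optional[bytes] = None,
    **knobs: Any,
) -> Dict[str, Any]:
    """Build a /v1/chat/completions body that requests image output."""
    unknown = set(knobs) - ALLOWED_KNOBS
    if unknown:
        raise ValueError(f"unsupported knobs: {sorted(unknown)}")
    if reference_png is None:
        # Text-to-image: plain string content.
        content: Any = prompt
    else:
        # Image edit: reference image first, then the edit instruction.
        encoded = base64.b64encode(reference_png).decode("ascii")
        content = [
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64," + encoded}},
            {"type": "text", "text": prompt},
        ]
    body: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "modalities": ["image"],
    }
    # Assumption: extra_body knobs are merged into the top-level body.
    body.update(knobs)
    return body
```

Serialising the result with json.dumps gives a body that could be passed to curl via -d, for example build_image_request("Please draw a cute cat.", seed=0, num_inference_steps=30).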
…ation
Per the Diffusion Model Requirements, every newly added diffusion model
must ship at least an L4 *functionality* test (see
docs/contributing/ci/test_examples/l4_functionality_tests.inc.md). Ming
is a *normal priority* model in that matrix, so the new tests cover the
end-to-end thinker -> diffusion image-gen path with a small set of
distinct runtime configurations:
- tests/e2e/online_serving/test_ming_flash_omni_imagegen_expansion.py:
L4 expansion test parametrised over three online configs that each
exercise a different feature toggle on top of the dual-stage
ming_flash_omni_dual.yaml runtime:
* default (CFG companion via negative_prompt)
* cache_dit cache backend
* cfg_parallel_size=2
Plus a separate img2img case that validates the chat endpoint's
reference-image -> <IMAGE> placeholder injection and routing
through MingImagePipeline in img2img mode. Validation is delegated
to assert_diffusion_response (height/width/payload shape).
- tests/e2e/offline_inference/test_ming_flash_omni_imagegen.py:
Offline counterpart that drives the same dual-stage pipeline through
the public Omni entrypoint, asserts the diffusion stage emits a PIL
image, and at run_level=advanced/full additionally checks the native
1024x1024 resolution.
- tests/e2e/stage_configs/bailingmm_moe_v2_lite_imagegen_ci.yaml: CI
stage config (load_format: dummy by default; stripped at higher run
levels) wiring the AR thinker on TP=4 + the MingImagePipeline
diffusion stage with shared_memory_connector cross-stage transport
and the expand_cfg_prompts / thinker2imagegen processors.
Signed-off-by: cursor-agent <cursor-agent@cursor.sh>
Co-authored-by: Zheng Wengang <zwg0606@gmail.com>
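The kind of height/width/payload-shape check that these tests delegate to assert_diffusion_response can be sketched with the standard library alone. This is not that helper's implementation: png_size and check_image_response are hypothetical names, the payload is assumed to be a PNG, and the response layout (choices[0].message.content[0].image_url.url holding a base64 data URL) is taken from the text-to-image example in the PR description.

```python
import base64
import struct
from typing import Any, Dict, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(png: bytes) -> Tuple[int, int]:
    """Read (width, height) from the IHDR chunk of a PNG byte string."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type,
    # then big-endian 32-bit width and height.
    if png[:8] != PNG_SIGNATURE or png[12:16] != b"IHDR":
        raise ValueError("not a PNG payload")
    width, height = struct.unpack(">II", png[16:24])
    return width, height


def check_image_response(response: Dict[str, Any], width: int, height: int) -> None:
    """Assert the response carries an image data URL of the expected size."""
    content = response["choices"][0]["message"]["content"]
    assert isinstance(content, list) and content, "expected a non-empty content list"
    url = content[0]["image_url"]["url"]
    header, _, payload = url.partition(",")
    assert header.startswith("data:image/"), "expected an image data URL"
    assert png_size(base64.b64decode(payload)) == (width, height)
```

Reading only the IHDR header keeps the check cheap and avoids a PIL dependency, at the cost of not verifying that the pixel data itself decodes.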
Purpose
Adds the image generation (diffusion) stage for inclusionAI/Ming-flash-omni-2.0, completing the model's text-to-image and image-edit (img2img) capabilities on top of the already-merged thinker (vllm-project#1822) and talker / TTS (vllm-project#2890) stages. The diffusion stage runs MingImagePipeline (Qwen2 connector + ZImage DiT + VAE) on top of the AR thinker's hidden states, exposed end-to-end through the OpenAI /v1/chat/completions endpoint with modalities: ["image"].

This PR has been rebased / merged onto current main (the prior development branch was based on a much older commit before vllm-project#1822 / vllm-project#2890 / [Config Refactor][2/N] landed). It supersedes the previous draft and is intentionally not linked to it.

Key conflict-resolution highlights from the merge:
- vllm_omni/diffusion/data.py::OmniDiffusionConfig.enrich_config: re-fold the BailingMM2NativeForConditionalGeneration / ming_flash_omni* -> MingImagePipeline mapping into the new try/else structure that already handles the diffusers adapter and nextstep.
- vllm_omni/entrypoints/openai/serving_chat.py: keep the upstream OpenAI flow unchanged but reintroduce the Ming-specific <IMAGE> placeholder injection for img2img (so the thinker's ref-image substitution still fires when the multimodal cache is warm) and the is_image_gen mm_processor_kwargs flag (used by Ming's MM processor to expand query tokens).
- vllm_omni/model_executor/models/ming_flash_omni/ming_flash_omni.py: keep the multi-stage load_weights (drops imagegen.* weights, which are loaded by MingImagePipeline from its own subfolders, and reports query_tokens_dict.* as loaded since they're pre-loaded inside MingFlashOmniThinker.__init__).
- vllm_omni/model_executor/models/ming_flash_omni/prompt_utils.py: combine the image-gen query-token helpers with the TTS caption builder.
- vllm_omni/model_executor/stage_input_processors/ming_flash_omni.py: combine expand_cfg_prompts / thinker2imagegen (image-gen) with thinker2talker (TTS) into one processor module so all Ming stage YAMLs (text-to-image dual, TTS, omni-speech) keep working.
- vllm_omni/transformers_utils/configs/ming_flash_omni.py: retain MingImageGenConfig, register it with AutoConfig, and expose it alongside MingFlashOmniTalkerConfig in MingFlashOmniConfig.sub_configs.

Usage
Launch the dual-stage server (AR thinker on TP=4 across GPUs 0-3 + diffusion stage on GPU 4 by default; tweak devices in the YAML for your hardware):

    vllm serve Jonathan1909/Ming-flash-omni-2.0 \
      --omni \
      --stage-configs-path vllm_omni/model_executor/stage_configs/ming_flash_omni_dual.yaml \
      --trust-remote-code \
      --port 8188

Text-to-image:

    curl http://127.0.0.1:8188/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Jonathan1909/Ming-flash-omni-2.0",
        "messages": [{"role": "user", "content": "Please draw a cute cat."}],
        "modalities": ["image"]
      }' -o /tmp/ming_response.json

    python -c "
    import base64, json
    r = json.load(open('/tmp/ming_response.json'))
    url = r['choices'][0]['message']['content'][0]['image_url']['url']
    png = base64.b64decode(url.split(',')[1])
    open('/tmp/ming_cat.png', 'wb').write(png)
    print('PNG bytes:', len(png))
    "

Image edit (img2img): include a reference image as the first content item. The chat endpoint detects the reference image, prepends <IMAGE> to the prompt so the thinker's ref-image substitution still fires when the multimodal cache is warm, and routes through MingImagePipeline in img2img mode.

Optional generation knobs (extra_body):
- height, width
- num_inference_steps
- negative_prompt (triggers CFG companion expansion via expand_cfg_prompts)
- cfg_text_scale
- seed

What's added
Implementation:
- vllm_omni/diffusion/models/ming_flash_omni/: MingImagePipeline (pipeline_ming.py), Qwen2 connector + RMSNorm (condition_encoder.py), ZImage transformer with ref-image fusion hooks (ming_zimage_transformer.py), optional ByT5 glyph encoder (byte5_encoder.py + t5_block_mapper.py), plus the standard __init__.py / registry wiring.
- vllm_omni/diffusion/registry.py: MingImagePipeline and its post-process hook.
- vllm_omni/transformers_utils/configs/ming_flash_omni.py: MingImageGenConfig (subfolder layout, img_gen_scales, defaults from Ming's cookbook) registered with AutoConfig.
- vllm_omni/model_executor/stage_configs/ming_flash_omni_dual.yaml: the dual-stage runtime YAML wiring AR thinker -> shared-memory connector -> diffusion stage with expand_cfg_prompts / thinker2imagegen.
- vllm_omni/model_executor/stage_input_processors/ming_flash_omni.py: expand_cfg_prompts (CFG companion expansion when negative_prompt is set) and thinker2imagegen (slice final_hidden_states at <imagePatch> positions, pack into the diffusion prompt, and emit a paired negative_thinker_hidden_states extra when a CFG companion output is present).
- vllm_omni/entrypoints/openai/serving_chat.py: <IMAGE> placeholder injection for img2img + is_image_gen mm_processor_kwargs flag.

Examples + docs (per Diffusion Model Requirements):
- examples/offline_inference/ming_flash_omni/README.md and examples/online_serving/ming_flash_omni/README.md: new "Image generation (thinker + diffusion)" sections with launch commands, text-to-image / image-edit curl snippets, and the full optional-knobs table.
- recipes/inclusionAI/Ming-flash-omni-2.0.md: register a fourth deployment mode (thinker + diffusion) with reference 5×H100 hardware setup, full launch command, verification curl, and notes about CFG companion expansion + img2img placeholder.
- docs/models/supported_models.md: register Ming-flash-omni-2.0 with its three architectures (MingFlashOmniForConditionalGeneration, MingFlashOmniTalkerForConditionalGeneration, MingImagePipeline).

Test Plan
L4 functionality tests added per docs/contributing/ci/test_examples/l4_functionality_tests.inc.md (Ming is a normal priority model, so we use a small focused matrix):
- tests/e2e/online_serving/test_ming_flash_omni_imagegen_expansion.py (@pytest.mark.full_model @pytest.mark.diffusion @pytest.mark.omni):
  * ming_imagegen_default: dual-stage default + CFG companion via negative_prompt.
  * ming_imagegen_cache_dit: same + --cache-backend cache_dit.
  * ming_imagegen_cfg_parallel_2: same + --cfg-parallel-size 2.
  * ming_imagegen_img2img_default: image edit with a synthetic ref image, validates the <IMAGE> placeholder injection path end-to-end.
  Validation is delegated to assert_diffusion_response (height/width/payload shape).
- tests/e2e/offline_inference/test_ming_flash_omni_imagegen.py: drives the same dual-stage pipeline through the public Omni entrypoint and asserts a PIL image is emitted; at run_level={advanced_model, full_model} it also asserts the native 1024×1024 resolution.
- tests/e2e/stage_configs/bailingmm_moe_v2_lite_imagegen_ci.yaml: CI stage config (with load_format: dummy by default; the dummy format is stripped automatically at higher run_levels by tests/helpers/fixtures/runtime.py).

To run locally:
Test Result
Manual verification (text-to-image, single H100 cluster, 30 steps, 1024×1024, prompt "Please draw a cute cat") produces a coherent generated image; see the original PR description for a sample output. Image edit (img2img) end-to-end was validated against a synthetic 512×512 reference image with the watercolour-painting edit instruction. The L4 tests above exercise the runtime wiring with load_format: dummy for cheap CI runs and the full pixel path at higher run levels.

cc @yuanheng-zhao @hsliuustc0106
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.