[Model] Extend NPU support for HunyuanImage3 Diffusion Model #1689

Merged
gcanlin merged 6 commits into vllm-project:main from Semmer2:HunyuanImage3_npu_0.16.0_and_ep on Mar 12, 2026

Contributor

@ElleElleWu ElleElleWu commented Mar 5, 2026

Co-authored-by: skf1999 13234016272@163.com
Co-authored-by: Just-it 1161406585@qq.com
Co-authored-by: Semmer2 semmer@live.cn

Purpose

Support HunyuanImage as a DiT model on both GPU and NPU.

Test Result

1. Test Environment

GPU

CUDA         Version: 12.9
torch        Version: 2.9.1
vllm         Version: 0.16.0
vllm-omni    Version: 0.16.0

NPU

torch             Version: 2.9.0
torch_npu         Version: 2.9.0
vllm              Version: 0.16.0
vllm-ascend       Version: 0.16.0
vllm-omni         Version: 0.16.0

vllm-ascend: As there is no official release available for 0.16.0 yet, we have pinned the dependency to commit c7fd7a2 ([Doc][Misc] Fix msprobe_guide.md documentation issues (#6965)).

2. Offline inference

- CMD

/usr/local/python3.11.14/bin/python -u vllm-omni/examples/offline_inference/text_to_image/text_to_image.py \
    --model /mnt/share/HunyuanImage-3.0/ \
    --prompt "A brown and white dog is running on the grass" \
    --output output_image.png \
    --num-inference-steps 50 \
    --tensor-parallel-size 4 \
    --seed 1234 \
    --enable-expert-parallel 2>&1

- Execution Result Output

[Stage-0] INFO 03-05 12:03:34 [diffusion_engine.py:80] Generation completed successfully.
[Stage-0] INFO 03-05 12:03:34 [diffusion_engine.py:98] Post-processing completed in 0.0000 seconds


Processed prompts: 100%|██████████| 1/1 [00:28<00:00, 28.10s/img, est. speed stage-0 img/s: 0.00, avg e2e_lat: 0.0ms]

Adding requests:   0%|          | 0/1 [00:28<?, ?it/s]
Total generation time: 28.1028 seconds (28102.85 ms)
INFO 03-05 12:03:34 [text_to_image.py:407] Outputs: [OmniRequestOutput(request_id='', finished=True, stage_id=0, final_output_type='image', request_output=[OmniRequestOutput(request_id='0_0dc5fda8-537c-4420-887f-46e0ede5a511', finished=True, stage_id=None, final_output_type='image', request_output=None, images=[1 PIL Images], prompt={'prompt': 'A brown and white dog is running on the grass', 'negative_prompt': None, 'additional_information': {'global_request_id': ['0_0dc5fda8-537c-4420-887f-46e0ede5a511']}}, latents=None, metrics={'image_num': 1, 'resolution': 640, 'postprocess_time_ms': 0.002384185791015625}, multimodal_output={})], images=[], prompt=None, latents=None, metrics={}, multimodal_output={})]
Saved generated image to output_image.png
[Stage-0] INFO 03-05 12:03:34 [omni_stage.py:870] Received shutdown signal

off

3. Online Inference

- command

vllm serve "/data/HunyuanImage-3.0/" --omni --port "8091" --tensor_parallel_size 8 --enable-expert-parallel

- Online Request

curl -X POST http://localhost:8091/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A brown and white dog is running on the grass",
    "num_inference_steps": 50,
    "n": 4,
    "size": "1024x1024",
    "seed": 123
  }' | jq -r '.data[0].b64_json' | base64 -d > dragon.png
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2098k  100 2098k  100   152  82732      5  0:00:30  0:00:25  0:00:05  560k
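
For reference, the same request can be issued from Python. This is a minimal sketch, assuming the response has the shape implied by the jq filter above (a `data` list whose items carry a `b64_json` field). The helper names `build_payload`, `decode_images`, and `request_images` are illustrative and not part of vllm-omni, and the server address is left as a parameter. Unlike the curl pipeline, which keeps only `data[0]`, it decodes every returned image.

```python
import base64
import json
import urllib.request


def build_payload(prompt: str, n: int = 4, size: str = "1024x1024",
                  steps: int = 50, seed: int = 123) -> dict:
    """Build the JSON body used in the curl request above."""
    return {
        "prompt": prompt,
        "num_inference_steps": steps,
        "n": n,
        "size": size,
        "seed": seed,
    }


def decode_images(response: dict) -> list[bytes]:
    """Decode every base64-encoded image in the response's `data` list."""
    return [base64.b64decode(item["b64_json"]) for item in response.get("data", [])]


def request_images(base_url: str, payload: dict, timeout: float = 300.0) -> list[bytes]:
    """POST the payload to <base_url>/v1/images/generations (needs a running server)."""
    req = urllib.request.Request(
        f"{base_url}/v1/images/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return decode_images(json.load(resp))
```

Writing each element of the returned list to its own file makes it easy to confirm that all `n` requested images were actually produced.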

- Execution Result Output

(APIServer pid=1177212) INFO 03-05 12:06:56 [api_server.py:1038] Generating 4 image(s) 1024x1024
(APIServer pid=1177212) INFO 03-05 12:06:56 [async_omni.py:345] [AsyncOrchestrator] Entering scheduling loop: stages=1, final_stage=0
[Stage-0] INFO 03-05 12:06:56 [manager.py:592] Deactivating all adapters: 0 layers
[Stage-0] INFO 03-05 12:06:56 [manager.py:592] Deactivating all adapters: 0 layers
[Stage-0] INFO 03-05 12:06:56 [manager.py:592] Deactivating all adapters: 0 layers
[Stage-0] INFO 03-05 12:06:56 [manager.py:592] Deactivating all adapters: 0 layers
[Stage-0] WARNING 03-05 12:06:56 [kv_transfer_manager.py:421] No connector available for receiving KV cache
[Stage-0] WARNING 03-05 12:06:56 [kv_transfer_manager.py:421] No connector available for receiving KV cache
[Stage-0] WARNING 03-05 12:06:56 [kv_transfer_manager.py:421] No connector available for receiving KV cache
[Stage-0] WARNING 03-05 12:06:56 [kv_transfer_manager.py:421] No connector available for receiving KV cache
  2%|███▏                                                                                                                                                         | 1/50 [00:00<00:25,  1.96it/s][rank0]:[W305 12:06:57.969502890 compiler_depend.ts:4658] Warning: The current allgather operator has a defect in handling different tensor shape,         the work event forces a wait operation, and the allgather wait on the python side would be fake (function operator())
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:24<00:00,  2.03it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:24<00:00,  2.03it/s]

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:24<00:00,  2.03it/s]
[Stage-0] INFO 03-05 12:07:21 [diffusion_engine.py:80] Generation completed successfully.
[Stage-0] INFO 03-05 12:07:21 [diffusion_engine.py:98] Post-processing completed in 0.0000 seconds
(APIServer pid=1177212) INFO 03-05 12:07:21 [api_server.py:1058] Successfully generated 1 image(s)
(APIServer pid=1177212) INFO:     127.0.0.1:52494 - "POST /v1/images/generations HTTP/1.1" 200 OK

on



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b619c2d1f7

Comment thread vllm_omni/diffusion/distributed/parallel_state.py Outdated
Comment thread vllm_omni/diffusion/models/hunyuan_image_3/hunyuan_fused_moe.py Outdated
@hsliuustc0106
Collaborator

Can you edit the title of this PR? We already support DiT inference for HYImage3, so this should be an enhancement.

Comment thread vllm_omni/diffusion/data.py Outdated
Comment thread vllm_omni/diffusion/models/hunyuan_image_3/hunyuan_fused_moe.py Outdated
Collaborator

@gcanlin gcanlin left a comment

Thanks for the contribution! BTW, the online test appears to have forgotten to enable EP; it would be better to cover it.

Feel free to ping me again if you're blocked on the hardware dispatch design. I will try to help when I have more bandwidth.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Review

Rating: 7/10 | Verdict: ⚠️ Changes Requested

Summary

Solid implementation adding HunyuanImage3 support with GPU and NPU compatibility. However, critical documentation and tests are missing, and code quality issues in the MoE implementation need addressing.

Issues

  1. Missing documentation: No update to supported_models.md or example configs for the new model.

  2. No unit tests: 263 lines of new code without any tests. At minimum, tests are needed for:

    • is_moe property in OmniDiffusionConfig
    • Expert parallel initialization error cases
    • FusedMoE wrapper behavior

  3. Memory requirements undocumented: HunyuanImage3 requires significant VRAM (40GB per skill docs), but the PR doesn't specify minimum requirements or recommended configurations.

  4. Stage config missing: No YAML config file for stage setup, which is required for new model support per PR checklist.

Highlights

  • ✅ Comprehensive GPU and NPU implementation
  • ✅ Excellent test coverage in PR description (offline + online)
  • ✅ Proper expert parallelism support for MoE layers
  • ✅ Good error handling for non-MoE models with EP enabled
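
The last highlight, rejecting expert parallelism for non-MoE models, might look roughly like the following. This is a hypothetical sketch: the function name `validate_expert_parallel`, its parameters, and the error message are assumptions for illustration and are not taken from the PR's code.

```python
def validate_expert_parallel(enable_expert_parallel: bool, is_moe: bool) -> None:
    """Raise early if expert parallelism is requested for a dense model.

    Hypothetical helper: the real check in vllm-omni may live elsewhere
    and use different names.
    """
    if enable_expert_parallel and not is_moe:
        raise ValueError(
            "Expert parallelism was requested, but the model has no MoE layers."
        )
```

Failing at configuration time keeps the error close to the flag that caused it, rather than surfacing later during distributed initialization.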

Recommendation

Address documentation and test gaps before merge. Code implementation is solid but needs supporting artifacts.


Reviewed by OpenClaw with vllm-omni-skills 🦐

Comment thread vllm_omni/diffusion/models/hunyuan_image_3/hunyuan_fused_moe.py
Comment thread vllm_omni/diffusion/models/hunyuan_image_3/hunyuan_fused_moe.py Outdated
Comment thread vllm_omni/diffusion/distributed/parallel_state.py
Comment thread vllm_omni/diffusion/data.py
Comment thread examples/offline_inference/text_to_image/text_to_image.py
@ElleElleWu ElleElleWu changed the title from "[Model] Support HunyuanImage3 Diffusion Model in for GPU and NPU" to "[Model] Futher Support HunyuanImage3 Diffusion Model in NPU" Mar 6, 2026
@ElleElleWu ElleElleWu changed the title from "[Model] Futher Support HunyuanImage3 Diffusion Model in NPU" to "[Model] Extend NPU support for HunyuanImage3 Diffusion Model" Mar 6, 2026
@ElleElleWu
Contributor Author

> Can you edit the title of this PR? We already support DiT inference for HYImage3, so this should be an enhancement.

Solved, please check.

@ElleElleWu ElleElleWu force-pushed the HunyuanImage3_npu_0.16.0_and_ep branch 6 times, most recently from e44311f to 373cf76 Compare March 9, 2026 03:02
@ElleElleWu ElleElleWu requested a review from hsliuustc0106 March 9, 2026 03:05
Comment thread vllm_omni/diffusion/models/hunyuan_image_3/hunyuan_fused_moe.py Outdated
@ElleElleWu ElleElleWu force-pushed the HunyuanImage3_npu_0.16.0_and_ep branch from 373cf76 to 6603fc7 Compare March 12, 2026 01:42
@ElleElleWu ElleElleWu requested a review from gcanlin March 12, 2026 02:06
ElleElleWu and others added 5 commits March 12, 2026 17:30
Co-authored-by: skf1999 <13234016272@163.com>
Co-authored-by: Just-it <1161406585@qq.com>
Co-authored-by: Semmer2 <semmer@live.cn>
Signed-off-by: ElleElleWu <1608928702@qq.com>
Signed-off-by: ElleElleWu <1608928702@qq.com>
Signed-off-by: ElleElleWu <1608928702@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Collaborator

gcanlin commented Mar 12, 2026

@xuechendi @hsliuustc0106 I pushed a micro-refactor for clearer hardware dispatch in the HunYuanFusedMoE layer. PTAL.

Then we'll wait for @ElleElleWu to test again. Thanks!

@gcanlin gcanlin added the ready label to trigger buildkite CI label Mar 12, 2026
Collaborator

@gcanlin gcanlin left a comment

I only modified the dispatch; the rest of the logic looks good to me. Thanks for contributing!

@ElleElleWu
Contributor Author

> @xuechendi @hsliuustc0106 I pushed a micro-refactor for clearer hardware dispatch in the HunYuanFusedMoE layer. PTAL.
>
> Then we'll wait for @ElleElleWu to test again. Thanks!

Thanks for the update! I've finished testing, and the results are consistent with the previous version. Looks good to me.

@gcanlin gcanlin enabled auto-merge (squash) March 12, 2026 11:41
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Review Summary

CI Gate: ✅ All gates passed (DCO, pre-commit, build, mergeable)


Highlights

  1. Clean Platform Dispatch Pattern — Factory pattern with get_diffusion_model_impl_qualname() hook avoids hardcoded GPU/NPU branches in model code.

  2. Comprehensive Test Coverage — New unit tests for HunyuanFusedMoE platform dispatch and is_moe property.

  3. Documentation Updated: supported_models.md includes HunyuanImage3.
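
The dispatch pattern in the first highlight can be sketched as follows. Only the hook name `get_diffusion_model_impl_qualname()` comes from the review; its signature, the `Platform` classes, the `resolve_qualname` and `build_layer` helpers, and the returned dotted paths are assumptions made for illustration. Standard-library classes stand in for real model implementations so that the snippet runs anywhere.

```python
import importlib


def resolve_qualname(qualname: str):
    """Import 'package.module.ClassName' and return the named attribute."""
    module_name, _, attr = qualname.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


class Platform:
    """Default platform: returns the stand-in for the generic implementation."""

    def get_diffusion_model_impl_qualname(self, op_name: str) -> str:
        # Assumed signature; stdlib classes are placeholders for real layers.
        return {"fused_moe": "collections.OrderedDict"}[op_name]


class NpuPlatform(Platform):
    """Platform override: points the same op at a different implementation."""

    def get_diffusion_model_impl_qualname(self, op_name: str) -> str:
        return {"fused_moe": "collections.ChainMap"}[op_name]


def build_layer(platform: Platform, op_name: str, *args, **kwargs):
    """Model code asks the platform for a class instead of branching on hardware."""
    impl_cls = resolve_qualname(platform.get_diffusion_model_impl_qualname(op_name))
    return impl_cls(*args, **kwargs)
```

Because the platform returns a dotted name rather than a class object, the hardware-specific module is imported only when that platform is active.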


Suggestions

1. is_moe Threshold Change: > 1 → > 0

# Before
return num_experts > 1

# After
return num_experts > 0

Question: Should num_experts = 1 really be considered MoE? Typically MoE requires multiple experts for the routing mechanism to make sense.

If this is intentional (e.g., single-expert models use the same infrastructure), a brief comment explaining the rationale would help future readers.
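
To make the question concrete, here is a small sketch comparing the two thresholds. The `DiffusionConfigSketch` class and its `num_experts` field are hypothetical stand-ins, not the actual `OmniDiffusionConfig`.

```python
from dataclasses import dataclass


@dataclass
class DiffusionConfigSketch:
    """Hypothetical stand-in for a config exposing an expert count."""

    num_experts: int = 0

    @property
    def is_moe_before(self) -> bool:
        # Old rule: at least two experts are required.
        return self.num_experts > 1

    @property
    def is_moe_after(self) -> bool:
        # New rule: any positive expert count is treated as MoE.
        return self.num_experts > 0


for n in (0, 1, 8):
    cfg = DiffusionConfigSketch(num_experts=n)
    print(n, cfg.is_moe_before, cfg.is_moe_after)
```

The two rules disagree only for `num_experts = 1`, which is exactly the case the question above asks about.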


2. Global State Cleanup in __del__

def __del__(self):
    if vllm_ascend_parallel_state._MC2:
        vllm_ascend_parallel_state._MC2.destroy()
    vllm_ascend_parallel_state._MC2 = None

Potential Issue: If multiple AscendHunyuanFusedMoE instances exist, the first one to be garbage-collected will destroy the shared _MC2 group, potentially causing crashes in remaining instances.

Suggestion: Consider reference counting or ownership model for the MC2 group lifecycle.
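
One way to express the reference-counting idea is sketched below. Everything here is hypothetical: `SharedGroup`, `acquire`, and `release` are illustrative names, and `FakeGroup` stands in for whatever object `_MC2` holds. Only the `destroy()` call mirrors the snippet above.

```python
import threading


class SharedGroup:
    """Reference-counted owner of a shared resource with a destroy() method."""

    def __init__(self, factory):
        self._factory = factory      # creates the underlying group lazily
        self._group = None
        self._refs = 0
        self._lock = threading.Lock()

    def acquire(self):
        """Register one user and return the shared group, creating it if needed."""
        with self._lock:
            if self._group is None:
                self._group = self._factory()
            self._refs += 1
            return self._group

    def release(self):
        """Drop one user; destroy the group only when the last user leaves."""
        with self._lock:
            if self._refs == 0:
                return  # nothing to release
            self._refs -= 1
            if self._refs == 0 and self._group is not None:
                self._group.destroy()
                self._group = None


class FakeGroup:
    """Stand-in for a communication group; records whether it was destroyed."""

    def __init__(self):
        self.destroyed = False

    def destroy(self):
        self.destroyed = True
```

In this sketch each layer would call `acquire()` when it is constructed and `release()` from an explicit teardown path, so the group outlives every instance that still uses it.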


Summary

| Aspect | Rating |
| --- | --- |
| Architecture | 8/10 |
| Code Quality | 8/10 |
| Testing | 9/10 |
| Documentation | 8/10 |

Verdict: ✅ Ready to merge after Buildkite CI passes. The suggestions above are non-blocking quality-of-life improvements.

Thanks for the NPU support contribution! 🚀

@hsliuustc0106 hsliuustc0106 self-requested a review March 12, 2026 12:49
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

lgtm

@gcanlin gcanlin merged commit 28dd1a6 into vllm-project:main Mar 12, 2026
7 checks passed
Fishermanykx pushed a commit to Fishermanykx/vllm-omni that referenced this pull request Mar 13, 2026
…oject#1689)

Signed-off-by: ElleElleWu <1608928702@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: skf1999 <13234016272@163.com>
Co-authored-by: Just-it <1161406585@qq.com>
Co-authored-by: Semmer2 <semmer@live.cn>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: KexiongYu <yukexiong1@huawei.com>
yiliu30 pushed a commit to yiliu30/vllm-omni-fork that referenced this pull request Mar 20, 2026
…oject#1689)

Signed-off-by: ElleElleWu <1608928702@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: skf1999 <13234016272@163.com>
Co-authored-by: Just-it <1161406585@qq.com>
Co-authored-by: Semmer2 <semmer@live.cn>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>

Signed-off-by: yiliu30 <yi4.liu@intel.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…oject#1689)

Signed-off-by: ElleElleWu <1608928702@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: skf1999 <13234016272@163.com>
Co-authored-by: Just-it <1161406585@qq.com>
Co-authored-by: Semmer2 <semmer@live.cn>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>