[Feat] Support T5 Tensor Parallelism by yuanheng-zhao · Pull Request #1881 · vllm-project/vllm-omni

yuanheng-zhao · 2026-03-13T15:58:30Z

Purpose

Support tensor parallelism (TP) for T5 so that the T5EncoderModel text encoder is sharded across devices instead of being replicated on each one when a model runs on multiple devices.
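As a rough illustration of why sharding avoids the replicated copy, here is a minimal pure-Python sketch of a column-parallel / row-parallel split of a feed-forward block. It is not the code from this PR: the ReLU feed-forward shape, the function names, and the loop that stands in for separate devices are all assumptions made for illustration.

```python
import random


def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def relu(v):
    return [max(0.0, a) for a in v]


def ffn_reference(x, wi, wo):
    """Unsharded feed-forward: every device would hold all of wi and wo."""
    return matmul(relu(matmul(x, wi)), wo)


def ffn_tensor_parallel(x, wi, wo, tp_size):
    """Each simulated rank holds 1/tp_size of wi (columns) and of wo (rows)."""
    d_ff = len(wi[0])
    assert d_ff % tp_size == 0, "hidden size must divide evenly across ranks"
    shard = d_ff // tp_size
    partials = []
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        wi_shard = [row[lo:hi] for row in wi]  # column-parallel: split output columns
        wo_shard = wo[lo:hi]                   # row-parallel: split input rows
        hidden = relu(matmul(x, wi_shard))     # ReLU is elementwise, so it stays local
        partials.append(matmul(hidden, wo_shard))
    # Summing the partial outputs plays the role of the all-reduce across ranks.
    return [sum(vals) for vals in zip(*partials)]


random.seed(0)
d_model, d_ff = 4, 8
x = [random.uniform(-1, 1) for _ in range(d_model)]
wi = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
wo = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]

ref = ffn_reference(x, wi, wo)
tp2 = ffn_tensor_parallel(x, wi, wo, 2)
tp4 = ffn_tensor_parallel(x, wi, wo, 4)
print(max(abs(a - b) for a, b in zip(ref, tp2)))
print(max(abs(a - b) for a, b in zip(ref, tp4)))
```

The sharded result matches the reference up to floating-point summation order, while each rank stores only its slice of the two weight matrices.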

Test Plan

Added unit tests.
Run the offline example with different TP sizes on FLUX.1-dev, compare the outputs, and check GPU memory usage.

Run DiT model black-forest-labs/FLUX.1-dev (single diffusion stage, no AR)

Note that the HF repo is gated, so access must first be granted via:

hf auth login
# input HF access token

Run offline e2e with 2 devices

cd examples/offline_inference/text_to_image

python text_to_image.py \
  --model black-forest-labs/FLUX.1-dev \
  --prompt "a chocolate cupcake on the table" \
  --seed 42 \
  --cfg-scale 4.0 \
  --num-images-per-prompt 1 \
  --num-inference-steps 50 \
  --height 1024 \
  --width 1024 \
  --tensor-parallel-size 2 \
  --output output_flux1dev_tp2.png
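To repeat the run for several TP sizes, the command above can be generated per size. The sketch below only builds and prints the command lines; it does not launch anything. The output file names for sizes other than 2, and the choice to omit --tensor-parallel-size for the non-TP run, are assumptions rather than something stated in this PR.

```python
import shlex


def build_command(tp_size: int) -> list[str]:
    """Build the text_to_image.py invocation used in the test plan for one TP size."""
    cmd = [
        "python", "text_to_image.py",
        "--model", "black-forest-labs/FLUX.1-dev",
        "--prompt", "a chocolate cupcake on the table",
        "--seed", "42",
        "--cfg-scale", "4.0",
        "--num-images-per-prompt", "1",
        "--num-inference-steps", "50",
        "--height", "1024",
        "--width", "1024",
    ]
    if tp_size > 1:
        # Assumption: the non-TP baseline is run without this flag.
        cmd += ["--tensor-parallel-size", str(tp_size)]
    # Assumption: output naming follows the tp2 example in the test plan.
    cmd += ["--output", f"output_flux1dev_tp{tp_size}.png"]
    return cmd


commands = {tp: build_command(tp) for tp in (1, 2, 4)}
for tp, cmd in commands.items():
    print(shlex.join(cmd))
```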

Test Result

pytest tests/diffusion/models/t5_encoder/test_t5_encoder_tp.py

...
========== 13 passed, 17 warnings in 20.87s =========

yuanheng-zhao · 2026-03-14T16:21:10Z

Partial logs indicating memory usage

Memory usage can be read from the "Model loading took ..." and "Process-scoped GPU memory after model loading ..." lines in the logs.

Non-TP

[Stage-0] INFO 03-14 16:11:44 [diffusers_loader.py:315] Loading weights took 15.81 seconds
[Stage-0] INFO 03-14 16:11:45 [diffusion_model_runner.py:133] Model loading took 31.4327 GiB and 20.820389 seconds
[Stage-0] INFO 03-14 16:11:45 [diffusion_model_runner.py:138] Model runner: Model loaded successfully.
[Stage-0] INFO 03-14 16:11:45 [diffusion_model_runner.py:78] Model runner: transformer compiled with torch.compile.
[Stage-0] INFO 03-14 16:11:45 [diffusion_model_runner.py:172] Model runner: Initialization complete.
[Stage-0] INFO 03-14 16:11:45 [diffusion_worker.py:148] Worker 0: Process-scoped GPU memory after model loading: 32.05 GiB.

TP size 2

with --tensor-parallel-size 2

[Stage-0] INFO 03-14 16:09:19 [diffusers_loader.py:315] Loading weights took 17.60 seconds
[Stage-0] INFO 03-14 16:09:19 [diffusion_model_runner.py:133] Model loading took 22.0661 GiB and 22.648527 seconds
[Stage-0] INFO 03-14 16:09:19 [diffusion_model_runner.py:138] Model runner: Model loaded successfully.
[Stage-0] INFO 03-14 16:09:19 [diffusion_model_runner.py:78] Model runner: transformer compiled with torch.compile.
[Stage-0] INFO 03-14 16:09:19 [diffusion_model_runner.py:172] Model runner: Initialization complete.
[Stage-0] INFO 03-14 16:09:19 [diffusion_worker.py:148] Worker 1: Process-scoped GPU memory after model loading: 22.71 GiB.

TP size 4

with --tensor-parallel-size 4

[Stage-0] INFO 03-14 16:07:14 [diffusers_loader.py:315] Loading weights took 21.98 seconds
[Stage-0] INFO 03-14 16:07:14 [diffusion_model_runner.py:133] Model loading took 17.2878 GiB and 27.025690 seconds
[Stage-0] INFO 03-14 16:07:14 [diffusion_model_runner.py:138] Model runner: Model loaded successfully.
[Stage-0] INFO 03-14 16:07:14 [diffusion_model_runner.py:78] Model runner: transformer compiled with torch.compile.
[Stage-0] INFO 03-14 16:07:14 [diffusion_model_runner.py:172] Model runner: Initialization complete.
[Stage-0] INFO 03-14 16:07:14 [diffusion_worker.py:148] Worker 2: Process-scoped GPU memory after model loading: 18.01 GiB.
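The three "Model loading took ..." figures above can be cross-checked with a small calculation. The sketch below assumes a simple model in which per-device memory is a replicated part plus a part split evenly across TP ranks; that model is an assumption for illustration, and the sharded part it estimates is not necessarily only T5.

```python
import re

# "Model loading took ..." lines copied from the logs above, keyed by TP size.
LOG_LINES = {
    1: "[Stage-0] INFO 03-14 16:11:45 [diffusion_model_runner.py:133] "
       "Model loading took 31.4327 GiB and 20.820389 seconds",
    2: "[Stage-0] INFO 03-14 16:09:19 [diffusion_model_runner.py:133] "
       "Model loading took 22.0661 GiB and 22.648527 seconds",
    4: "[Stage-0] INFO 03-14 16:07:14 [diffusion_model_runner.py:133] "
       "Model loading took 17.2878 GiB and 27.025690 seconds",
}
PATTERN = re.compile(r"Model loading took ([0-9.]+) GiB")


def loaded_gib(line: str) -> float:
    """Extract the GiB figure from a 'Model loading took' log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"no memory figure in line: {line!r}")
    return float(match.group(1))


mem = {tp: loaded_gib(line) for tp, line in LOG_LINES.items()}

# Assumed model: mem(tp) = replicated + sharded / tp.
# Fit the two unknowns from the TP=1 and TP=2 measurements.
sharded = 2 * (mem[1] - mem[2])
replicated = mem[1] - sharded

# Use the fit to predict TP=4 and compare with the measured value.
predicted_tp4 = replicated + sharded / 4
print(f"sharded ~ {sharded:.4f} GiB, replicated ~ {replicated:.4f} GiB")
print(f"TP=4 predicted {predicted_tp4:.4f} GiB, measured {mem[4]:.4f} GiB")
```

Under this assumption roughly 18.7 GiB behaves as sharded and 12.7 GiB as replicated, and the TP=4 prediction lands within about 0.1 GiB of the logged value.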

yuanheng-zhao · 2026-03-14T16:28:40Z

Output images

[Images: output without TP, with TP size 2, and with TP size 4]

yuanheng-zhao · 2026-03-15T06:50:19Z

cc @princepride , @gcanlin

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c9ab3a36eb

princepride · 2026-03-15T07:59:13Z

+from vllm.model_executor.model_loader.weight_utils import default_weight_loader
+
+
+class T5LayerNorm(nn.Module):


Can we directly use vLLM's RMSNorm: https://github.com/vllm-project/vllm/blob/143e4dccdfd8293c70c76f8d32a60ce23ecc23ea/vllm/model_executor/layers/layernorm.py#L130

vLLM RMSNorm has precision discrepancies compared with the original T5 implementation because of its optimized kernel. The previous T5LayerNorm casts to bf16 before multiplying by the weight, while vLLM RMSNorm keeps fp32 through the weight multiply and only casts at the very end.

vLLM RMSNorm is expected to be more numerically accurate and stable than the previous T5 implementation; however, in testing on FLUX.1-dev, it produces lower-quality images (weird).

with vLLM RMSNorm (tp1, tp2)

Interesting, so we had better revert the change

Reverted in d8a9823: use T5 layernorm for now
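The cast-order difference described in this thread can be illustrated with a small standard-library sketch. It emulates bfloat16 rounding in pure Python and applies the weight with each cast order; it is not the T5LayerNorm or vLLM RMSNorm code, the helper names are hypothetical, and Python's float64 stands in for the fp32 path.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even); NaN/inf not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the dropped bits
    bits = (((bits + bias) & 0xFFFFFFFF) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def scale_early_cast(normed: float, weight: float) -> float:
    """Cast the normalized value to bf16 first, then multiply by the weight."""
    return to_bf16(to_bf16(normed) * weight)


def scale_late_cast(normed: float, weight: float) -> float:
    """Multiply in higher precision and cast to bf16 only at the end."""
    return to_bf16(normed * weight)


def rms_norm(x, weight, scale):
    """RMS-normalize x, then apply the weight using the given cast order."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + 1e-6)
    return [scale(v * inv_rms, w) for v, w in zip(x, weight)]


x = [0.3, -1.7, 2.4, 0.9]
weight = [to_bf16(w) for w in (1.5, 0.75, 1.25, 2.0)]  # weights stored in bf16
early = rms_norm(x, weight, scale_early_cast)
late = rms_norm(x, weight, scale_late_cast)
for e, l in zip(early, late):
    print(e, l, abs(e - l))
```

For a single element, a normalized value of 1.00390625 with weight 1.5 gives 1.5 under the early cast and 1.5078125 under the late cast, so the two orders can differ by one bf16 step. Whether differences of this size explain the image-quality change reported above is not established here.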

princepride · 2026-03-15T08:05:07Z

@congw729 I'm not sure whether we need unit test for this module? I remember in vLLM, we don't have unit test for some specific module like siglip.

I think it's okay to have this test.

Thanks, could you please double-check to make sure this unit test meets the specifications?

LGTM. The test cases cover most scenarios, the marks are correctly labeled, and the time cost is also within tolerance.

congw729 · 2026-03-16T06:44:05Z

Do we need to modify the related doc?

yuanheng-zhao · 2026-03-16T08:36:05Z

Do we need to modify the related doc?

@congw729 It seems the tensor_parallel feature design doc only lists complete model pipelines as examples:
https://github.com/vllm-project/vllm-omni/blob/main/docs/design/feature/tensor_parallel.md?plain=1#L260-L266

Since T5EncoderModel is usually just one component of a model, it seems unnecessary to note that T5 supports TP by itself. We could create a TP layers section later, once more custom TP layers are supported.

princepride

LGTM

congw729 · 2026-03-16T08:37:50Z

Good to know

congw729

LGTM

yuanheng-zhao · 2026-03-18T15:05:12Z

cc @hsliuustc0106

gcanlin · 2026-03-19T12:12:56Z

@yuanheng-zhao Could you fix the conflicts please?

yuanheng-zhao · 2026-03-19T13:05:05Z

@yuanheng-zhao Could you fix the conflicts please?

@gcanlin Conflicts have been fixed.

gcanlin

LGTM, thanks!

gcanlin · 2026-03-19T13:08:40Z

QQ: can other models enable T5 TP by reusing T5EncoderModel?

yuanheng-zhao · 2026-03-19T13:31:02Z

QQ: can other models enable T5 TP by reusing T5EncoderModel?

@gcanlin Yes, models within the diffusion loader/runner scope of vllm-omni can directly use the TP version of T5.

For example, transformers.T5EncoderModel in vllm_omni/diffusion/models/glm_image/pipeline_glm_image.py can be replaced. (This is not applied for now because the upstream dependency versions needed to run GLM-Image are mismatched, i.e. the vllm and transformers versions conflict: one has to install a different version of transformers and re-install vllm-omni.)

gcanlin · 2026-03-20T00:48:56Z

I think it deserves a doc so that other model developers can be aware of how to reuse it. Will merge this PR first. Thanks.

yuanheng-zhao force-pushed the model/t5-tp branch from 03071d0 to bb47030 on March 14, 2026 16:03

yuanheng-zhao changed the title ~~[WIP][Feat] Support T5 Tensor Parallelism~~ [Feat] Support T5 Tensor Parallelism Mar 15, 2026

yuanheng-zhao marked this pull request as ready for review March 15, 2026 06:48

yuanheng-zhao requested a review from hsliuustc0106 as a code owner March 15, 2026 06:48

chatgpt-codex-connector Bot reviewed Mar 15, 2026

princepride requested changes Mar 15, 2026

yuanheng-zhao force-pushed the model/t5-tp branch from 4a9e7a6 to d8a9823 on March 16, 2026 07:16

princepride approved these changes Mar 16, 2026

congw729 approved these changes Mar 16, 2026

yuanheng-zhao added 9 commits March 18, 2026 20:47

add tp T5

3353237

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>

apply in pipelien flux

f6d6210

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>

upd

b8b75e5

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>

upd T5 model

e6f5f30

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>

upd usage

db4d3ec

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com>

add unit tests for T5 tp

96d9739

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>

revert glm-image T5-tp usage

a224b96

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>

apply model level load_weights

1bf1912

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>

use vllm RMSNorm

49dfcce

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>

yuanheng-zhao force-pushed the model/t5-tp branch from d8a9823 to 49dfcce on March 18, 2026 12:50

Merge main

955fbb3

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>

gcanlin added the ready label to trigger buildkite CI label Mar 19, 2026

gcanlin approved these changes Mar 19, 2026

trim L1 tests

e5400f6

Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>

gcanlin merged commit 6ac7ba6 into vllm-project:main Mar 20, 2026
7 checks passed

zhumingjue138 pushed a commit to zhumingjue138/vllm-omni that referenced this pull request Mar 20, 2026

[Feat] Support T5 Tensor Parallelism (vllm-project#1881)

2af3a4b

Signed-off-by: yuanheng <jonathan.zhaoyh@gmail.com> Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
