Enable Deepspeed Ulysses for Wan by mengker33 · Pull Request #353 · HabanaAI/optimum-habana-fork

mengker33 · 2025-10-23T06:26:34Z

No description provided.

Wei-Lin-Intel · 2025-10-31T10:52:39Z

+        start = sp_seq_len * parallel_state.get_sequence_parallel_rank()
+        end = sp_seq_len * (parallel_state.get_sequence_parallel_rank() + 1)
+
+        hidden_states = hidden_states[:, start:end, :]


Running SDPA on a padded sequence without a mask causes an accuracy issue. A mask needs to be added here when seq_len is padded. Using a bool tensor for the mask saves some memory.
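
A minimal sketch of the index arithmetic behind the slice in the diff and the boolean mask suggested in this comment, written in plain Python so it runs without a device. The helper names (`padded_len`, `shard_bounds`, `key_padding_mask`) are hypothetical and not taken from the PR; `sp_size` and `rank` stand in for the values that `parallel_state` would return.

```python
def padded_len(seq_len: int, sp_size: int) -> int:
    """Round seq_len up to the next multiple of sp_size."""
    return -(-seq_len // sp_size) * sp_size


def shard_bounds(seq_len: int, sp_size: int, rank: int) -> tuple[int, int]:
    """Return the [start, end) slice of the padded sequence owned by `rank`."""
    sp_seq_len = padded_len(seq_len, sp_size) // sp_size
    return sp_seq_len * rank, sp_seq_len * (rank + 1)


def key_padding_mask(seq_len: int, sp_size: int) -> list[bool]:
    """True marks a real token, False marks a padded position."""
    total = padded_len(seq_len, sp_size)
    return [i < seq_len for i in range(total)]
```

For example, with `seq_len=10` and `sp_size=4` the sequence is padded to 12, rank 3 owns positions 9 to 11, and the last two entries of the mask are `False`. In a real attention call the list would become a bool tensor, which is the memory saving the comment refers to.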

huijuanzh · 2025-11-05T03:06:15Z

LGTM. It would be good to add performance data comparing 1, 2, and 4 cards.

mengker33 · 2025-11-05T06:46:05Z

> LGTM. It would be good to add performance data comparing 1, 2, and 4 cards.

Linwei suggested using classic SP (sequence parallelism) to reduce memory consumption. I will include SP in this PR too and provide perf data for both DeepSpeed Ulysses and SP soon.

The uneven-head sequence parallelism bug has been fixed in upstream DeepSpeed but is not included in Habana DeepSpeed 1.22. See the upstream fix commit: deepspeedai/DeepSpeed@ecb4bf3

When seq_len is not divisible by cp_size, seq_len is padded, and an attn_mask helps reduce the impact of the padding on accuracy. Because the padding is usually very small relative to seq_len in the diffusers case, the attn_mask is not applied by default. It can be enabled by setting the environment variable CP_USE_MASK.

Co-authored-by: Wei Lin <wei2.lin@intel.com>
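
A small sketch of why the mask matters and how an environment switch such as CP_USE_MASK could gate it. This is plain-Python softmax over one query's key scores, not the FusedSDPA kernel used in the PR. The way the variable is parsed in `use_mask` is an assumption, and `attention_weights` is a hypothetical helper.

```python
import math
import os


def use_mask() -> bool:
    """Assumed parsing: unset, "", "0" or "false" leave the mask disabled."""
    return os.environ.get("CP_USE_MASK", "0").lower() not in ("", "0", "false")


def attention_weights(scores, mask=None):
    """Softmax over key scores; positions where mask is False get zero weight."""
    if mask is None:
        mask = [True] * len(scores)
    top = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Three real keys followed by one zero-padded key, whose score is 0.0.
scores = [2.0, 1.0, 0.5, 0.0]
mask = [True, True, True, False]

unmasked = attention_weights(scores)
masked = attention_weights(scores, mask)
leaked = unmasked[-1]  # weight taken by the padded key when no mask is applied
```

In this toy case the padded key takes about 0.078 of the attention weight without the mask and exactly 0 with it. With only a few padded positions among thousands of tokens the leaked share is far smaller, which fits the reasoning for leaving the mask off by default.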

mengker33 · 2025-11-06T05:57:45Z

num_cards	SP generation time	DS (DeepSpeed Ulysses) generation time
1	245s	245s
2	Avg: 145s	Avg: 141s
4	Avg: 74.75s	Avg: 71s
8	Avg: 59s	Avg: 55s
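
To read the table as scaling figures, the sketch below derives speedup and parallel efficiency from the reported times. The numbers are copied from the table; the `speedup` and `efficiency` helpers are illustrative and not part of the PR.

```python
# Generation times in seconds, copied from the table (averages where given).
times = {
    1: {"SP": 245.0, "DS": 245.0},
    2: {"SP": 145.0, "DS": 141.0},
    4: {"SP": 74.75, "DS": 71.0},
    8: {"SP": 59.0, "DS": 55.0},
}


def speedup(cards: int, mode: str) -> float:
    """Single-card time divided by the time on `cards` cards."""
    return times[1][mode] / times[cards][mode]


def efficiency(cards: int, mode: str) -> float:
    """Speedup divided by card count; 1.0 would be perfect linear scaling."""
    return speedup(cards, mode) / cards


for cards in (2, 4, 8):
    for mode in ("SP", "DS"):
        print(f"{cards} cards {mode}: {speedup(cards, mode):.2f}x, "
              f"efficiency {efficiency(cards, mode):.0%}")
```

On these numbers DS reaches roughly 4.45x on 8 cards against roughly 4.15x for SP, and efficiency falls from above 80% at 4 cards to between 50% and 60% at 8 cards for both modes.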

@Wei-Lin-Intel @huijuanzh Please review again, thank you :)

Wei-Lin-Intel

LGTM

huijuanzh · 2025-11-06T06:24:46Z

-            is_casual=False,
-            scale=None,
-            softmax_mode=softmax_mode,
+            query.transpose(1, 2).contiguous(),


Why does CogVideoXAttnProcessorGaudi need to change? Since you moved the q/k/v transpose into ModuleFusedSDPA, please check the other callers that do not use SP or Ulysses.

huijuanzh · 2025-11-06T07:03:51Z

LGTM

ikurtchen reviewed Oct 31, 2025

Wei-Lin-Intel reviewed Oct 31, 2025

mengker33 force-pushed the oh_fork_wan_enable_cp branch 4 times, most recently from 556f5ef to ef2b847 · November 4, 2025 07:24

yingjie-han reviewed Nov 4, 2025

Comment thread optimum/habana/diffusers/models/attention_processor.py Outdated

mengker33 force-pushed the oh_fork_wan_enable_cp branch from ef2b847 to 9e27dc2 · November 4, 2025 08:00

Wei-Lin-Intel reviewed Nov 5, 2025

Comment thread optimum/habana/diffusers/models/attention_processor.py Outdated

Comment thread optimum/habana/diffusers/models/attention_processor.py Outdated

yingjie-han reviewed Nov 5, 2025

Comment thread optimum/habana/diffusers/models/wan_transformer_3d.py

mengker33 force-pushed the oh_fork_wan_enable_cp branch from 9e27dc2 to d1c12df · November 5, 2025 10:20

mengker33 and others added 9 commits November 6, 2025 04:10

Wan: Enable deepspeed ulysses for ti2v pipeline

d1dcfdd

Deepspeed: Fix uneven head sequence parallelism bug

1fd46b8

This bug has been fixed in upstream DeepSpeed but is not included in Habana DeepSpeed 1.22. See the upstream fix commit: deepspeedai/DeepSpeed@ecb4bf3

README: Add Wan i2v example using deepspeed ulysses

21f1007

Enable deepspeed ulysses for Wan t2v pipeline

9a5fc12

README: Add Wan2.2 t2v example

86a67e5

Wan: Add traditional SP in wan attention

6c46e38

Co-authored-by: Wei Lin <wei2.lin@intel.com>

Wan: Replace vae WanBlockAttention with FusedSDPA

8372385

README: Update readme after adding SP support

c5aeffc

mengker33 force-pushed the oh_fork_wan_enable_cp branch from d1c12df to c5aeffc · November 6, 2025 04:22

mengker33 changed the title from "Enable Deepspeed Ulysses for Wan i2v" to "Enable Deepspeed Ulysses for Wan" · Nov 6, 2025

Wei-Lin-Intel approved these changes Nov 6, 2025

huijuanzh reviewed Nov 6, 2025

Wei-Lin-Intel merged commit 4107752 into HabanaAI:aice/v1.22.0 Nov 6, 2025

Wei-Lin-Intel merged 9 commits into HabanaAI:aice/v1.22.0 from mengker33:oh_fork_wan_enable_cp

5 participants
