Enable Deepspeed Ulysses for Wan #353

Merged
Wei-Lin-Intel merged 9 commits into HabanaAI:aice/v1.22.0 from mengker33:oh_fork_wan_enable_cp
Nov 6, 2025

Conversation

@mengker33

No description provided.

Comment thread DeepSpeed/deepspeed/sequence/layer.py
Comment thread examples/stable-diffusion/image_to_video_generation.py
Comment thread optimum/habana/diffusers/models/attention_processor.py Outdated
Comment thread optimum/habana/diffusers/models/attention_processor.py
Comment thread optimum/habana/diffusers/pipelines/wan/pipeline_wan_i2v.py Outdated
Comment thread optimum/habana/diffusers/pipelines/wan/pipeline_wan_i2v.py Outdated
start = sp_seq_len * parallel_state.get_sequence_parallel_rank()
end = sp_seq_len * (parallel_state.get_sequence_parallel_rank() + 1)

hidden_states = hidden_states[:, start:end, :]

Running SDPA on a padded sequence without a mask causes an accuracy issue. A mask must be added here whenever seq_len is padded. Using a bool tensor for the mask saves some memory.
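
To illustrate the concern, here is a minimal pure-Python sketch, not the code from this PR. It pads a sequence length up to a multiple of the sequence-parallel size, derives per-rank bounds the same way as the `start`/`end` lines above, and builds a boolean mask over the padded length. The function names are hypothetical, and a plain list of booleans stands in for a bool tensor. The mask uses the `torch.nn.functional.scaled_dot_product_attention` convention, where `True` means the position takes part in attention.

```python
from __future__ import annotations


def padded_length(seq_len: int, sp_size: int) -> int:
    """Smallest multiple of sp_size that is >= seq_len (hypothetical helper)."""
    return -(-seq_len // sp_size) * sp_size


def shard_bounds(padded_len: int, sp_size: int, rank: int) -> tuple[int, int]:
    """Start/end of the slice owned by `rank`, as in the start/end lines above."""
    sp_seq_len = padded_len // sp_size
    return sp_seq_len * rank, sp_seq_len * (rank + 1)


def key_padding_mask(seq_len: int, padded_len: int) -> list[bool]:
    """True for real tokens, False for padded positions."""
    return [i < seq_len for i in range(padded_len)]


# Illustrative numbers only: 10 tokens split across 4 ranks.
seq_len, sp_size = 10, 4
padded = padded_length(seq_len, sp_size)
bounds = [shard_bounds(padded, sp_size, rank) for rank in range(sp_size)]
mask = key_padding_mask(seq_len, padded)
```

In this toy case the last rank's slice contains the padded positions, which is where an unmasked softmax would otherwise give weight to padding.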

@mengker33 mengker33 force-pushed the oh_fork_wan_enable_cp branch 4 times, most recently from 556f5ef to ef2b847 Compare November 4, 2025 07:24
Comment thread optimum/habana/diffusers/models/attention_processor.py Outdated
@mengker33 mengker33 force-pushed the oh_fork_wan_enable_cp branch from ef2b847 to 9e27dc2 Compare November 4, 2025 08:00
Comment thread optimum/habana/diffusers/models/attention_processor.py Outdated
Comment thread optimum/habana/diffusers/models/attention_processor.py Outdated
@huijuanzh

LGTM. It would be good to add performance data comparing 1, 2, and 4 cards.

Comment thread optimum/habana/diffusers/models/wan_transformer_3d.py
@mengker33
Author

> LGTM. It would be good to add performance data comparing 1, 2, and 4 cards.

Linwei suggested using classic SP (sequence parallelism) to reduce memory consumption. I will include SP in this PR as well and provide performance data for both DeepSpeed Ulysses and SP soon.

@mengker33 mengker33 force-pushed the oh_fork_wan_enable_cp branch from 9e27dc2 to d1c12df Compare November 5, 2025 10:20
mengker33 and others added 9 commits November 6, 2025 04:10
This bug has been fixed in upstream deepspeed but not included in habana
deepspeed 1.22. See upstream fix commit:
deepspeedai/DeepSpeed@ecb4bf3
When seq_len is not divisible by cp_size, seq_len is padded,
and an attn_mask helps reduce the impact of the padding on accuracy.
Because the padding is usually very small compared to seq_len in the
diffusers case, the attn_mask is not applied by default.
It can be enabled by setting the env var CP_USE_MASK.
Co-authored-by: Wei Lin <wei2.lin@intel.com>
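
The trade-off in the commit message above can be sketched as follows. This is an assumption-laden illustration rather than the PR's code. `CP_USE_MASK` is the env var named in the commit message, but how its value is parsed is assumed here, and both function names and the example sequence length are hypothetical.

```python
import os


def cp_mask_enabled(env=None) -> bool:
    """Assumed parsing: unset, empty, "0" or "false" leave the mask off."""
    env = os.environ if env is None else env
    return env.get("CP_USE_MASK", "0").strip().lower() not in ("", "0", "false")


def padding_fraction(seq_len: int, cp_size: int) -> float:
    """Share of the padded sequence that is padding."""
    pad = (-seq_len) % cp_size  # tokens needed to reach a multiple of cp_size
    return pad / (seq_len + pad)


# Hypothetical sequence length, chosen only to show the order of magnitude.
fraction = padding_fraction(32761, 8)
print(f"mask enabled: {cp_mask_enabled()}, padding share: {fraction:.4%}")
```

Padding is always fewer than cp_size tokens, so for sequences of tens of thousands of tokens it is a tiny share of the total. That is the reasoning behind leaving the mask off by default.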
@mengker33 mengker33 force-pushed the oh_fork_wan_enable_cp branch from d1c12df to c5aeffc Compare November 6, 2025 04:22
@mengker33 mengker33 changed the title Enable Deepspeed Ulysses for Wan i2v Enable Deepspeed Ulysses for Wan Nov 6, 2025
@mengker33
Author

mengker33 commented Nov 6, 2025

| num_card | SP: generated_time | DS: generated_time |
|---|---|---|
| 1 | 245s | 245s |
| 2 | Avg: 145s | Avg: 141s |
| 4 | Avg: 74.75s | Avg: 71s |
| 8 | Avg: 59s | Avg: 55s |

@Wei-Lin-Intel @huijuanzh Please help to review again, thank you:)
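
For readers who want scaling figures rather than raw times, the short sketch below derives speedup and parallel efficiency from the numbers reported above. It takes the 1-card time of 245s as the baseline for both columns. The variable and function names are illustrative, and "DS" is assumed to mean DeepSpeed Ulysses, as in the earlier comment.

```python
BASELINE_S = 245.0  # 1-card generation time reported above

# Average generation times in seconds, keyed by number of cards.
timings = {
    "SP": {2: 145.0, 4: 74.75, 8: 59.0},
    "DS": {2: 141.0, 4: 71.0, 8: 55.0},
}


def scaling(baseline: float, per_card: dict) -> dict:
    """Map card count to (speedup, parallel efficiency)."""
    return {n: (baseline / t, baseline / t / n) for n, t in per_card.items()}


results = {name: scaling(BASELINE_S, per_card) for name, per_card in timings.items()}

for name, rows in results.items():
    for n, (speedup, efficiency) in rows.items():
        print(f"{name} {n} cards: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

On these numbers both variants scale sublinearly, reaching roughly 4.2x (SP) and 4.5x (DS) on 8 cards, with DS slightly faster than SP at every card count.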


@Wei-Lin-Intel Wei-Lin-Intel left a comment

LGTM

is_casual=False,
scale=None,
softmax_mode=softmax_mode,
query.transpose(1, 2).contiguous(),

Why does CogVideoXAttnProcessorGaudi need to change? Since the q/k/v transpose has been moved into ModuleFusedSDPA, the other functions that do not use SP or Ulysses need to be checked as well.

@huijuanzh

LGTM

@Wei-Lin-Intel Wei-Lin-Intel merged commit 4107752 into HabanaAI:aice/v1.22.0 Nov 6, 2025