Enable Deepspeed Ulysses for Wan #353
start = sp_seq_len * parallel_state.get_sequence_parallel_rank()
end = sp_seq_len * (parallel_state.get_sequence_parallel_rank() + 1)

hidden_states = hidden_states[:, start:end, :]
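For readers following the diff, the per-rank slicing above can be sketched in pure Python. This is only an illustration: the helper name `shard_bounds` and the explicit `rank` argument are hypothetical (in the PR the rank comes from `parallel_state.get_sequence_parallel_rank()`), and the sketch assumes the sequence has already been padded to a multiple of the sequence-parallel world size.

```python
def shard_bounds(padded_seq_len: int, sp_size: int, rank: int) -> tuple[int, int]:
    """Return the [start, end) token range owned by one sequence-parallel rank.

    Hypothetical helper: assumes padded_seq_len is already a multiple of sp_size.
    """
    if padded_seq_len % sp_size != 0:
        raise ValueError("sequence must be padded to a multiple of sp_size")
    if not 0 <= rank < sp_size:
        raise ValueError("rank out of range")
    sp_seq_len = padded_seq_len // sp_size
    start = sp_seq_len * rank
    end = sp_seq_len * (rank + 1)
    return start, end


# Four ranks over 12 tokens: each rank keeps a contiguous slice of 3 tokens.
tokens = list(range(12))
shards = [tokens[slice(*shard_bounds(len(tokens), 4, r))] for r in range(4)]
```

Concatenating the shards in rank order gives back the original sequence, which is what a gather after attention relies on.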
Running SDPA on a padded sequence without a mask causes an accuracy issue. A mask needs to be added here when seq_len is padded. Using a bool tensor for the mask can save some memory.
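To make the concern concrete, here is a toy single-query attention in pure Python. It is not the PR's code and does not call FusedSDPA; the function `attend` and all values are made up for illustration. The boolean mask uses the PyTorch `scaled_dot_product_attention` convention, where True means the position takes part in attention.

```python
import math


def attend(query, keys, values, keep_mask=None):
    """Single-query scaled dot-product attention over lists of floats.

    keep_mask[i] is True for real tokens and False for padding.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    if keep_mask is not None:
        # Masked positions get -inf so their softmax weight is exactly zero.
        scores = [s if keep else float("-inf") for s, keep in zip(scores, keep_mask)]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

# Pad the sequence from 2 to 4 tokens with zero vectors.
pad = [0.0, 0.0]
padded_keys = keys + [pad, pad]
padded_values = values + [pad, pad]
keep_mask = [True, True, False, False]

reference = attend(query, keys, values)
unmasked = attend(query, padded_keys, padded_values)
masked = attend(query, padded_keys, padded_values, keep_mask)
```

Zero-padded keys still score 0 before the softmax, so without a mask they take a share of the attention weight and pull `unmasked` away from `reference`; with the mask, `masked` matches `reference`. On the memory point, a `torch.bool` mask stores one byte per element, versus two for a bf16 additive mask.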
Force-pushed from 556f5ef to ef2b847.
Force-pushed from ef2b847 to 9e27dc2.
LGTM. It would be good to add some performance data comparing 1, 2, and 4 cards.
Linwei suggested using classic SP to reduce memory consumption. I will include SP in this PR too and provide perf data for both DeepSpeed Ulysses and SP soon.
Force-pushed from 9e27dc2 to d1c12df.
This bug has been fixed in upstream DeepSpeed, but the fix is not included in Habana DeepSpeed 1.22. See the upstream fix commit: deepspeedai/DeepSpeed@ecb4bf3
When seq_len is not divisible by cp_size, seq_len is padded, and an attn_mask helps reduce the impact of the padding on accuracy. Because the padding is usually very small compared to seq_len in the diffuser case, the attn_mask is not applied by default. It can be enabled by setting the environment variable CP_USE_MASK.
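A rough sketch of the two pieces described here, the padding length and the opt-in switch, might look like the following. The helper names, the accepted values for CP_USE_MASK, and the example numbers are assumptions for illustration; the commit message only says the mask is activated by setting the variable.

```python
import os


def padding_len(seq_len: int, cp_size: int) -> int:
    """Tokens to append so that seq_len becomes a multiple of cp_size."""
    return (-seq_len) % cp_size


def cp_mask_enabled(environ=None) -> bool:
    """Hypothetical parsing of CP_USE_MASK; the PR's exact rule is not shown."""
    env = os.environ if environ is None else environ
    return env.get("CP_USE_MASK", "0").strip().lower() in {"1", "true", "yes", "on"}


seq_len, cp_size = 1001, 8  # example numbers only, not taken from the PR
pad = padding_len(seq_len, cp_size)
pad_fraction = pad / (seq_len + pad)
```

With these example numbers the padding is 7 tokens out of 1008, under 1% of the sequence, which illustrates why the mask is left off by default.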
Co-authored-by: Wei Lin <wei2.lin@intel.com>
Force-pushed from d1c12df to c5aeffc.
@Wei-Lin-Intel @huijuanzh Please help review again, thank you :)
is_casual=False,
scale=None,
softmax_mode=softmax_mode,
query.transpose(1, 2).contiguous(),
Why does CogVideoXAttnProcessorGaudi need to change? Since you moved the qkv transpose into ModuleFusedSDPA, please check the other functions that do not use SP or Ulysses.
LGTM

No description provided.