[Performance]: Reevaluate padding requirement for Sequence Parallelism #29136

@ProExpertProg

Description
Proposal to improve performance

Currently, when Sequence Parallelism (SP) is enabled, we pad num_tokens up to a multiple of the TP size. We should benchmark the performance cost of this padding and compare it against two alternatives: (1) manually padding by -num_tokens % tp_size tokens only around the sequence-parallel section, or (2) doing uneven work across TP ranks by manipulating the sizes returned by reduce_scatter (more complicated, but theoretically the best performance).
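To make the two padding strategies concrete, here is a minimal sketch (helper names are hypothetical, not vLLM APIs): the first function computes the filler-token count the -num_tokens % tp_size padding would add, and the second computes the uneven per-rank chunk sizes that a size-manipulated reduce_scatter would need so that no padding is required at all.

```python
def sp_pad_amount(num_tokens: int, tp_size: int) -> int:
    """Filler tokens needed so num_tokens divides evenly across tp_size ranks."""
    return -num_tokens % tp_size


def uneven_sp_splits(num_tokens: int, tp_size: int) -> list[int]:
    """Per-rank chunk sizes for an unpadded split: the first
    (num_tokens % tp_size) ranks each take one extra token."""
    base, rem = divmod(num_tokens, tp_size)
    return [base + (1 if rank < rem else 0) for rank in range(tp_size)]


# Example: 10 tokens on TP=4. Padding adds 2 filler tokens (10 -> 12),
# while the uneven-split approach covers all 10 with no filler.
print(sp_pad_amount(10, 4))     # 2
print(uneven_sp_splits(10, 4))  # [3, 3, 2, 2]
```

The benchmark question is essentially whether the wasted compute on the 2 filler tokens (option 1) outweighs the bookkeeping cost of carrying uneven chunk sizes through the sequence-parallel section (option 2).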
