[Attention] Tune CUTLASS MLA num_splits #26846
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Code Review
This pull request introduces a new heuristic for determining num_splits in CUTLASS MLA to improve performance. The new logic is based on the ratio of sequence length to batch size. While this is a reasonable approach for performance tuning, my review has identified a critical concern. The change removes a safeguard that was in place to prevent kernel hangs when the batch size is greater than one. Reintroducing this hang would be a critical issue, and it's not clear from the pull request description if the underlying problem has been resolved. I have left a comment detailing this concern.
LucasWilkinson left a comment
LGTM! thanks for doing this!
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Purpose
Tune the num_splits heuristic for CUTLASS_MLA to achieve some speedup now that #26026 has fixed the hang. Based on experiments performed using the tools introduced in #26835, this is the optimal num_splits policy:
Following the optimal policy would yield this speedup:
As a simpler alternative, we implement a heuristic yielding the following policy:
This results in the following speedup:
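For readers without access to the plots, here is a minimal Python sketch of what a ratio-based num_splits heuristic could look like. It only illustrates the idea described in the review summary (choosing num_splits from the ratio of sequence length to batch size). The function name pick_num_splits, the threshold table, and the max_splits cap are all hypothetical placeholders, not the values or code merged in this PR.

```python
def pick_num_splits(seq_len: int, batch_size: int, max_splits: int = 16) -> int:
    """Pick a KV split count from the seq_len / batch_size ratio.

    Hypothetical sketch: the thresholds below are illustrative
    placeholders, NOT the tuned values from this PR.
    """
    if seq_len <= 0 or batch_size <= 0:
        raise ValueError("seq_len and batch_size must be positive")

    # Long sequences with small batches leave SMs idle, so they benefit
    # from more splits; large batches already fill the GPU.
    ratio = seq_len / batch_size

    # (exclusive upper bound on ratio, num_splits) -- assumed values.
    thresholds = [(1024, 1), (4096, 2), (16384, 4), (65536, 8)]
    for upper, splits in thresholds:
        if ratio < upper:
            return min(splits, max_splits)
    return max_splits


if __name__ == "__main__":
    for seq_len, batch_size in [(512, 1), (8192, 1), (8192, 64), (131072, 1)]:
        print(seq_len, batch_size, pick_num_splits(seq_len, batch_size))
```

A table of this shape is cheap to evaluate and easy to retune if the optimal policy shifts on different hardware; the actual heuristic should be read from the diff itself.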