[Perf][2/n] Eliminate GPU<->CPU syncs in pooling code #41433
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Code Review
This pull request optimizes pooling operations by performing index calculations and metadata processing on the CPU to minimize GPU-CPU synchronizations. Key changes include generating segment IDs on the CPU for sequence-wise pooling and using CPU-based token IDs for masking in token-wise pooling. However, the refactoring of the token-wise pooling logic introduces a regression: replacing the explicit slicing and group offset logic with a direct torch.split on the full hidden_states tensor will likely cause runtime errors when the tensor contains tokens from multiple pooling groups or when the split sizes are empty.
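To make the two points in the review summary concrete, here is a minimal, framework-free sketch in plain Python. It is not the code from this PR: `segment_ids_from_lengths` and `split_by_sizes` are hypothetical helper names, and Python lists stand in for tensors. The first helper shows why segment IDs can be built without touching the GPU when the per-sequence lengths are already known on the host. The second mirrors the constraint that `torch.split` enforces on a list of split sizes (they must sum to the length of the dimension being split), which is the condition the review is worried about when `hidden_states` holds tokens from more than one pooling group.

```python
from itertools import accumulate


def segment_ids_from_lengths(lengths):
    """Return one segment ID per token, built purely from host-side lengths.

    Hypothetical helper: because `lengths` already lives on the CPU, no
    device-to-host readback is needed to produce the IDs.
    """
    ids = []
    for seq_index, num_tokens in enumerate(lengths):
        ids.extend([seq_index] * num_tokens)
    return ids


def split_by_sizes(rows, sizes):
    """Split `rows` into consecutive chunks of the given sizes.

    Hypothetical helper: like torch.split with a list of sizes, it rejects
    sizes that do not cover `rows` exactly.
    """
    if sum(sizes) != len(rows):
        raise ValueError(
            f"split sizes sum to {sum(sizes)}, but there are {len(rows)} rows"
        )
    ends = list(accumulate(sizes))
    starts = [0] + ends[:-1]
    return [rows[start:end] for start, end in zip(starts, ends)]


# Three sequences of 2, 0 and 3 tokens; the empty one contributes no IDs.
print(segment_ids_from_lengths([2, 0, 3]))

# Sizes that cover all rows split cleanly.
print(split_by_sizes(list(range(5)), [2, 3]))

# Sizes for only one group of a larger tensor do not cover all rows.
try:
    split_by_sizes(list(range(5)), [2])
except ValueError as err:
    print("rejected:", err)
```

Whether the regression described above actually occurs depends on what `hidden_states` contains at the call site, which this sketch does not model; it only illustrates the size constraint itself.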
Second batch of unnecessary GPU/CPU syncs to remove, found via #40561.