Skip to content

Exclude IndexSelectOp from inlineMost#4266

Merged
jjsjann123 merged 10 commits intomainfrom
jjsjann123/fixing_vectorized_load_for_index_select
Apr 30, 2025
Merged

Exclude IndexSelectOp from inlineMost#4266
jjsjann123 merged 10 commits intomainfrom
jjsjann123/fixing_vectorized_load_for_index_select

Conversation

@jjsjann123
Copy link
Collaborator

@jjsjann123 jjsjann123 commented Apr 17, 2025

This is to restore a performance regression for pointwise kernel containing IndexSelectOp.

IndexSelectOp supports vectorized load. Excluding it from inlineMost allows a codegen change from

array<vector_factor> buffer;
for unroll:
    vec_load(buffer)  // IndexSelectOp
    ... // computation consuming value of buffer

to

array<vector_factor * unroll> buffer;
for unroll:
    vec_load(&buffer[i])  // IndexSelectOp
for unroll:
    ... // computation consuming value of buffer

tasks:

  • code comment
  • add benchmark numbers in PR

@github-actions
Copy link

github-actions bot commented Apr 17, 2025

Review updated until commit 222bc69

Description

  • Exclude IndexSelectOp from inlineMost for performance

  • Aggregate allocation of manual unroll ID and inner ID

  • Add comment explaining the change


Changes walkthrough 📝

Relevant files
Enhancement
pointwise.cpp
Exclude IndexSelectOp from inlineMost                                       

csrc/scheduler/pointwise.cpp

  • Add logic to exclude IndexSelectOp output from inner_most_tensors
  • Add comment explaining the rationale behind the change
  • +8/-0     

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review

    Performance Impact

    The change excludes IndexSelectOp from inlineMost, which may have implications on performance. Ensure that the performance metrics provided in the PR description accurately reflect the intended performance gains and that there are no unintended regressions.

    // IndexSelectOp reads lookup tv without cache. Because pointwise scheduler
    // doesn't use ParallelType::Unroll, we need to exclude consumer of fusion
    // inputs to be inlineMost. This allows us to aggregate the allocation of
    // manual unroll ID and its inner ID.
    for (auto idx_sel : ir_utils::getOpsOfType<IndexSelectOp>(fusion)) {
      inner_most_tensors.erase(idx_sel->output(0)->as<TensorView>());
    }

    @jjsjann123
    Copy link
    Collaborator Author

    image

    Comparing the performance across the embedding fwd python benchmark. This is the speedup we get from this PR.

    @jjsjann123 jjsjann123 changed the title Jjsjann123/fixing vectorized load for index select Exclude IndexSelectOp from inlineMost Apr 18, 2025
    @jjsjann123 jjsjann123 marked this pull request as ready for review April 18, 2025 17:29
    @jjsjann123
    Copy link
    Collaborator Author

    !test

    @jjsjann123 jjsjann123 requested a review from naoyam April 22, 2025 18:24
    @jjsjann123
    Copy link
    Collaborator Author

    !test

    @jjsjann123 jjsjann123 requested a review from liqiangxl April 30, 2025 09:48
    @jjsjann123
    Copy link
    Collaborator Author

    !test

    @jjsjann123
    Copy link
    Collaborator Author

    @naoyam updated the comment, this one should be good to go.

    Copy link
    Collaborator

    @naoyam naoyam left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    LGTM

    @jjsjann123 jjsjann123 merged commit a8ad0bc into main Apr 30, 2025
    52 of 53 checks passed
    @jjsjann123 jjsjann123 deleted the jjsjann123/fixing_vectorized_load_for_index_select branch April 30, 2025 20:57
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    3 participants