Do not promote broadcast only groups #4154

Merged
naoyam merged 4 commits into main on Apr 1, 2025

Conversation

naoyam (Collaborator) commented Mar 27, 2025

This PR avoids promoting a loop group when it consists only of broadcast IDs. For example, consider the fusion shown below:

  auto tv0 = makeContigConcreteTensor({-1, 1});
  fusion.addInput(tv0);
  auto tv1 = makeContigTensor(2);
  fusion.addInput(tv1);

  auto tv2 = set(tv0);
  auto tv3 = add(tv2, tv1);
  fusion.addOutput(tv3);

  for (auto tv : fusion.allTvs()) {
    tv->split(1, 1, false);
    tv->reorder({{0, 1}, {1, 0}});
  }

  for (auto tv : fusion.allTvs()) {
    tv->inlineAt(2);
  }
Inputs:
  T0_g_float[bS8{1}, iS0{i0}, bS9{1}]
  T1_g_float[iS12{1}, iS2{i4}, iS13{i5}]
Outputs:
  T3_g_float[iS14{1}, iS6{i0}, iS15{i5}] ca_pos( 2 ) produce_pos( 2 )

%kernel {
T2_l_float[bS10{1}, iS4{i0}, bS11{1}] ca_pos( 2 )
   = Set( T0_g_float[bS8{1}, iS0{i0}, bS9{1}], cache_op=Streaming )
T3_g_float[iS14{1}, iS6{i0}, iS15{i5}] ca_pos( 2 ) produce_pos( 2 )
   = T2_l_float[bS10{1}, iS4{i0}, bS11{1}] ca_pos( 2 )
   + T1_g_float[iS12{1}, iS2{i4}, iS13{i5}];

TransformPrinter :
T0_g_float[bS8{1}, iS0{i0}, bS9{1}]
 logical domain : (iS0{i0}, bS1{1})
 contiguity: t n
  Outer split: bS1{1} by factor 1 -> bS8{1}, bS9{1}
 loop domain : (bS8{1}, iS0{i0}, bS9{1})
T2_l_float[bS10{1}, iS4{i0}, bS11{1}] ca_pos( 2 )
 logical domain : (iS4{i0}, bS5{1})
 contiguity: t n
  Outer split: bS5{1} by factor 1 -> bS10{1}, bS11{1}
 loop domain : (bS10{1}, iS4{i0}, bS11{1})
T1_g_float[iS12{1}, iS2{i4}, iS13{i5}]
 logical domain : (iS2{i4}, iS3{i5})
 contiguity: t t
  Outer split: iS3{i5} by factor 1 -> iS12{1}, iS13{i5}
 loop domain : (iS12{1}, iS2{i4}, iS13{i5})
T3_g_float[iS14{1}, iS6{i0}, iS15{i5}] ca_pos( 2 ) produce_pos( 2 )
 logical domain : (iS6{i0}, iS7{i5})
 contiguity: t t
  Outer split: iS7{i5} by factor 1 -> iS14{1}, iS15{i5}
 loop domain : (iS14{1}, iS6{i0}, iS15{i5})
} // %kernel

Here, the interesting part is the innermost loop ID of T2, bS11{1}. Because bS5 is promoted to iS3 (or iS7), bS11 is also promoted to a non-broadcast ID that is exact mapped with iS13 and iS15. However, in this case, bS11 doesn't really need to be promoted. More specifically, as long as a loop group only consists of broadcast IDs, the group should not need to be promoted.

Currently, the generated CUDA kernel with NVFUSER_ENABLE=id_model(all) looks like below:

__global__ void CUDAGeneratedKernel(Tensor<float, 2, 2> T0, Tensor<float, 2, 2> T1, Tensor<float, 2, 2> T3) {
  #pragma unroll 1
  for(nvfuser_index_t i0 = 0LL; i0 < T0.logical_size[0LL]; ++i0) {
    nvfuser_index_t i1;
    i1 = T1.logical_size[1LL] * i0;
    Array<float, 1LL, 1> T2;
    #pragma unroll 1
    for(nvfuser_index_t i2 = 0LL; i2 < T1.logical_size[1LL]; ++i2) {
      T2[0LL]
         = T0[i0];
    }
    #pragma unroll 1
    for(nvfuser_index_t i3 = 0LL; i3 < T1.logical_size[1LL]; ++i3) {
      nvfuser_index_t i4;
      i4 = i1 + i3;
      T3[i4]
        = T2[0LL]
        + T1[i4];
    }
  }
}

The code is not incorrect, but T2 is redundantly defined over the loop of T1.logical_size[1] because of the promotion of bS11. Note that the allocation of T2 is not affected because broadcast IDs are excluded before promotion.

In this PR, for loop groups that only consist of broadcast IDs, promotion to non-broadcast is reverted. With the change, the above fusion results in the kernel below:

__global__ void CUDAGeneratedKernel(Tensor<float, 2, 2> T0, Tensor<float, 2, 2> T1, Tensor<float, 2, 2> T3) {
  #pragma unroll 1
  for(nvfuser_index_t i0 = 0LL; i0 < T0.logical_size[0LL]; ++i0) {
    nvfuser_index_t i1;
    i1 = T1.logical_size[1LL] * i0;
    Array<float, 1LL, 1> T2;
    T2[0LL]
       = T0[i0];
    #pragma unroll 1
    for(nvfuser_index_t i2 = 0LL; i2 < T1.logical_size[1LL]; ++i2) {
      nvfuser_index_t i3;
      i3 = i1 + i2;
      T3[i3]
        = T2[0LL]
        + T1[i3];
    }
  }
}

naoyam (Collaborator, Author) commented Mar 27, 2025

!test

github-actions bot commented Mar 27, 2025

Review updated until commit 3db4f2a

Description

  • Avoid promoting loop groups consisting only of broadcast IDs

  • Add test case for broadcast-only loop groups

  • Use std::ranges::any_of for more concise code


Changes walkthrough 📝

Relevant files:

Enhancement

csrc/id_model/loop_promotion.cpp — Avoid promoting broadcast-only loop groups (+79/-25)
  • Track IDs of broadcast-only loop groups
  • Revert promotions for broadcast-only loop groups
  • Use std::ranges::any_of for more concise code

csrc/id_model/loop_promotion.h — Declare method and member for broadcast-only loop groups (+6/-0)
  • Declare revertBroadcastOnlyLoopGroups method
  • Add broadcast_only_loop_group_ids_ member variable

Tests

tests/cpp/test_id_model.cpp — Add test for broadcast-only loop groups (+45/-0)
  • Add test case for broadcast-only loop groups

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Performance Impact

The changes might affect the performance of the loop promotion logic. It's important to verify that the performance goals are met and that there are no regressions.

    std::unordered_map<ValGroup, IterDomain*> LoopPromotionMapBuilder::build() {
      // Some quick shortcut conditions to skip the full loop promotion
      // analysis. These are not comprehensive. Should add more conditions
      // if necessary.
      if (!force_full_loop_promotion_analysis_ && isLoopGraphUniform(id_model_)) {
        return buildWithNoBroadcast();
      }
    
      // Keep track of IDs whose loop groups only have broadcast
  // IDs. These IDs should not need to be promoted to non-broadcast
      // IDs. Note that we can't just remember these loop ValGroups as
      // they might be replaced during the following analysis.
      for (const auto& loop_group :
           idGraph(IdMappingMode::LOOP).disjointValSets().disjointSets()) {
        if (std::ranges::any_of(*loop_group, [](Val* val) {
              return !val->as<IterDomain>()->isBroadcast();
            })) {
          continue;
        }
    
        // Currently, only exact-mapped loop groups are considered. This
        // condition is required as we are going to replace promotion IDs
        // with an arbitrary member ID.
        if (idGraph(IdMappingMode::EXACT).toGroups(*loop_group).size() != 1) {
          continue;
        }
    
        broadcast_only_loop_group_ids_.insert(
            loop_group->begin(), loop_group->end());
      }
    
      // Make an intersection of the exact and loop map. This will group together
Code Clarity

The new function revertBroadcastOnlyLoopGroups is introduced but lacks detailed comments explaining its purpose and logic. Adding more comments would improve code readability.

        map.emplace(loop_group, promotion);
      }
    
      return map;
    }
    
    void LoopPromotionMapBuilder::revertBroadcastOnlyLoopGroups(
        std::unordered_map<ValGroup, IterDomain*>& loop_promotion_map) const {
      // If a loop group originally only consisted of broadcast IDs
      // and now is promoted to a non-broadcast ID, it should not need to
      // be promoted.
      for (auto& [loop_group, current_promotion_id] : loop_promotion_map) {
        if (current_promotion_id->isBroadcast()) {
          continue;
        }
    
        // As long as there's a single ID marked as broadcast only, this
        // group originally consisted of broadcast IDs only. Note that,
        // since new IDs were added as part of the promotion analysis, not
        // all of the IDs are included in the broadcast only set.
        IterDomain* original_broadcast_id = nullptr;
        for (auto val : *loop_group) {
          if (broadcast_only_loop_group_ids_.contains(val)) {
            original_broadcast_id = val->as<IterDomain>();
            break;
          }
        }
        if (original_broadcast_id == nullptr) {
          continue;
        }
    
        // Note that this promotion should be valid for the existing
        // IDs that originate from the fusion, but the loop group also
        // contains other non-broadcast IDs for loop promotion, e.g.,
        // current_promotion_id. This replacement means those
        // non-broadcast IDs are also promoted to the broadcast ID, which
        // does not make sense. For example, in the case of
        // IdModelTest.BroadcastOnlyNoLoopPromotion, the innermost loop ID
        // of tv2 has no mapping in the original fusion, but its loop
        // group gets additional IDs, iS17 and iS19, both of which are
        // exact mapped with the innermost loop IDs of tv1 and tv3.
        //
        // TODO: Consider cleaning up the unused non-broadcast IDs.
        current_promotion_id = original_broadcast_id;
      }
    }
Test Coverage

While a new test BroadcastOnlyNoLoopPromotion is added, it's important to ensure that it covers all edge cases and scenarios where loop groups consist only of broadcast IDs.

      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto outputs = executor_cache.runFusionWithInputs(inputs);
      testValidate(&fusion, outputs, inputs, __LINE__, __FILE__);
    }
    
    // When a loop group only includes broadcast IDs, the group should not
    // need to be promoted
    TEST_F(IdModelTest, BroadcastOnlyNoLoopPromotion) {
      auto fusion_ptr = std::make_unique<Fusion>();
      auto& fusion = *fusion_ptr;
      FusionGuard fg(fusion_ptr.get());
    
      auto tv0 = makeContigConcreteTensor({-1, 1});
      fusion.addInput(tv0);
      auto tv1 = makeContigTensor(2);
      fusion.addInput(tv1);
    
      auto tv2 = set(tv0);
      auto tv3 = add(tv2, tv1);
      fusion.addOutput(tv3);
    
      for (auto tv : fusion.allTvs()) {
        tv->split(1, 1, false);
        tv->reorder({{0, 1}, {1, 0}});
      }
    
      for (auto tv : fusion.allTvs()) {
        tv->inlineAt(2);
      }
    
      // T2_l_float[bS10{1}, iS4{i0}, bS11{1}] ca_pos( 2 )
      // = Set( T0_g_float[bS8{1}, iS0{i0}, bS9{1}], cache_op=Streaming )
      // T3_g_float[iS14{1}, iS6{i0}, iS15{i5}] ca_pos( 2 ) produce_pos( 2 )
      // = T2_l_float[bS10{1}, iS4{i0}, bS11{1}] ca_pos( 2 )
      // + T1_g_float[iS12{1}, iS2{i4}, iS13{i5}];
    
      // In this fusion, the innermost loop ID of tv2 is broadcast and is
      // not inlined. While its producer ID is promoted to the concrete
      // logical ID of tv3, it should not need to promote the loop ID as
      // it's just a broadcast.
    
      IdModel id_model(&fusion, /*build_graphs=*/true);
    
      auto promotion_id = id_model.loopPromotionMap().at(
          id_model.idGraph(IdMappingMode::LOOP).toGroup(tv2->axis(-1)));
      EXPECT_TRUE(promotion_id->isBroadcast())
          << "Should not be promoted to a non-broadcast ID: "
          << promotion_id->toString();
    }

naoyam marked this pull request as ready for review March 27, 2025 14:58
naoyam requested a review from zasdfgbnm March 27, 2025 14:59

naoyam (Collaborator, Author) commented Mar 27, 2025

!test --diff

naoyam (Collaborator, Author) commented Apr 1, 2025

!test

naoyam merged commit 360b232 into main on Apr 1, 2025
48 of 49 checks passed
naoyam deleted the dont_promote_broadcast_only_loop_groups branch April 1, 2025 19:41