Fix bugs with HostIr integration into FusionExecutorCache #4136

Open: nsarka wants to merge 13 commits into main from nsarka/hostir-integration-4

Member

@nsarka nsarka commented Mar 25, 2025

This PR fixes some bugs with HostIr integration into FusionExecutorCache. The fixes are:

  1. Expressions within the SegmentedGroup were out of order. Normally they get sorted during GPU lowering, so I now topologically sort them after adding them to the HostIrContainer.

  2. CompileFusionParallel launches compileKernel on multiple threads from a pool. However, the main thread does not wait for them to complete before continuing. I added a fix so that it waits. (Edit: Jingyue pointed out the wait was already there. I moved the HostIr additions from previous PRs to after the wait.)

  3. During the preseg pass stage, Sets are added between expressions with a different sharding. These sets needed lowering into a communication. I added the missing lowering. (Edit: this is going into a separate PR)
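
To illustrate the reordering in item 1, here is a minimal, self-contained sketch of a topological sort (Kahn's algorithm) over expressions. `ToyExpr` and `topologicallySort` are hypothetical names invented for this example, not nvFuser's classes or the code in this PR; the sketch only assumes that each expression lists the values it reads and writes.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an expression: it reads some named values and
// writes others. This is NOT nvFuser's Expr class.
struct ToyExpr {
  std::string name;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// Returns the expressions reordered so that each one appears after every
// expression that produces one of its inputs (Kahn's algorithm). Values with
// no producer in `exprs` are treated as external inputs. If the dependencies
// contain a cycle, the result has fewer elements than `exprs`.
std::vector<ToyExpr> topologicallySort(const std::vector<ToyExpr>& exprs) {
  // Map each value to the index of the expression that produces it.
  std::unordered_map<std::string, std::size_t> producer;
  for (std::size_t i = 0; i < exprs.size(); ++i) {
    for (const auto& out : exprs[i].outputs) {
      producer[out] = i;
    }
  }

  // Build producer -> consumer edges and count unmet dependencies.
  std::vector<std::vector<std::size_t>> consumers(exprs.size());
  std::vector<std::size_t> pending(exprs.size(), 0);
  for (std::size_t i = 0; i < exprs.size(); ++i) {
    for (const auto& in : exprs[i].inputs) {
      auto it = producer.find(in);
      if (it != producer.end()) {
        consumers[it->second].push_back(i);
        ++pending[i];
      }
    }
  }

  // Start from expressions whose inputs are all external.
  std::queue<std::size_t> ready;
  for (std::size_t i = 0; i < exprs.size(); ++i) {
    if (pending[i] == 0) {
      ready.push(i);
    }
  }

  std::vector<ToyExpr> sorted;
  while (!ready.empty()) {
    const std::size_t idx = ready.front();
    ready.pop();
    sorted.push_back(exprs[idx]);
    // Emitting this expression satisfies one dependency of each consumer.
    for (std::size_t consumer : consumers[idx]) {
      if (--pending[consumer] == 0) {
        ready.push(consumer);
      }
    }
  }
  return sorted;
}
```

Calling `topologicallySort` on a list stored consumer-first returns it producer-first, which is the property the executor needs before it runs the expressions in sequence.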

With these changes, many (most?) of the MultiDevice tests and the DistributedTransformer tests pass. The remaining failures appear to be caused by an issue with loop splitting. Here is an example:

TEST_F(MultiDeviceTest, nick_hostir_5) {
  Fusion fusion, fusion_cloned;
  FusionGuard fg(&fusion);
  TensorView* x = makeContigTensor(4);
  x->outer_split(0, 2);
  fusion.addInput(x);
  std::cout << "x transforms: " << std::endl;
  x->printTransforms();
  fg.setCurFusion(&fusion_cloned);
  IrCloner ir_cloner(&fusion_cloned);
  auto in_clone = ir_cloner.clone(fusion.inputs());
  std::cout << "cloned_x transforms: " << std::endl;
  in_clone[0]->as<TensorView>()->printTransforms();
/*
x transforms:
 logical domain : (iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4})
 contiguity: t t t t
  Outer split: iS0{i0} by factor 2 -> iS4{2}, iS5{( ceilDiv(i0, 2) )}          <----- outer split
 loop domain : (iS4{2}, iS5{( ceilDiv(i0, 2) )}, iS1{i2}, iS2{i3}, iS3{i4})
cloned_x transforms:
 logical domain : (iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4})
 contiguity: t t t t                                                                                       <----- no outer split, cloner does not propagate it
 loop domain : (iS4{2}, iS5{i6}, iS1{i2}, iS2{i3}, iS3{i4})
*/
}

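After cloning exprs into the HostIr container, splits are not propagated: the cloned loop domain still lists iS4 and iS5, but the outer split that defines them is missing, and the extent of iS5 is a fresh symbol (i6) instead of ceilDiv(i0, 2).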

@nsarka nsarka requested review from wujingyue and samnordmann March 25, 2025 17:14
@nsarka nsarka self-assigned this Mar 25, 2025
@nsarka nsarka changed the title from "HostIr Integration #4" to "Fix bugs with HostIr integration into FusionExecutorCache" Mar 25, 2025
@nsarka nsarka force-pushed the nsarka/hostir-integration-4 branch 2 times, most recently from 3ef189d to 7c7b48e Compare March 25, 2025 17:44
Collaborator

@wujingyue wujingyue left a comment

It's great to hear many tests are passing behind the scenes. Is it possible to add some tests to this PR? For example, these could be existing tests with host IR lowering also enabled (cf. TEST_P), or new tests in test_host_ir_integration to catch potential regressions.

@nsarka nsarka force-pushed the nsarka/hostir-integration-4 branch from 58d6217 to 1381a3c Compare March 25, 2025 21:15
Member Author

nsarka commented Mar 25, 2025

> It's great to hear many tests are passing behind the scenes. Is it possible to add some tests to this PR? For example, these could be existing tests with host IR lowering also enabled (cf. TEST_P), or new tests in test_host_ir_integration to catch potential regressions.

Updated with TEST_P on tests in tests/cpp/test_multidevice_lower_communication.cpp. All of these but the allgather loop split test (which I disabled) pass.

@nsarka nsarka force-pushed the nsarka/hostir-integration-4 branch from 39e427a to d62498b Compare March 25, 2025 22:00
Collaborator

@samnordmann samnordmann left a comment

LGTM!

@nsarka nsarka force-pushed the nsarka/hostir-integration-4 branch 2 times, most recently from 01bf472 to bc886fd Compare March 26, 2025 16:59
Member Author

nsarka commented Mar 26, 2025

@wujingyue I removed the host IR lowering code and replaced it with a TODO. I also updated the tests to pass false for enabling host IR lowering. In the next PR I can re-enable them and re-add the host IR lowering code.

@nsarka nsarka force-pushed the nsarka/hostir-integration-4 branch from bc886fd to f08db67 Compare March 26, 2025 21:58
Member Author

nsarka commented Mar 26, 2025

!test

Collaborator

@wujingyue wujingyue left a comment

LGTM with comments

public testing::WithParamInterface<InOutMesh> {};
class LowerGatherTest
: public MultiDeviceTest,
public testing::WithParamInterface<std::tuple<InOutMesh, bool>> {};
Collaborator

Suggested change
- public testing::WithParamInterface<std::tuple<InOutMesh, bool>> {};
+ public testing::WithParamInterface<std::tuple<DeviceMesh, DeviceMesh, bool>> {};

I don't see a good reason for nesting. This will probably simplify the structured bindings below.

Member Author

I think this is here because we want only certain pairs of DeviceMesh, not every pair that Combine(InMesh, OutMesh) would give.
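
The trade-off can be sketched without GoogleTest, assuming C++17. Everything below (`ToyMesh`, `ToyInOutMesh`, `crossPairsWithFlag`, `crossEverything`, `describe`) is a hypothetical stand-in made up for illustration, not the test file's actual types. Crossing a curated list of mesh pairs with a bool yields the nested tuple, whereas a flat three-way Cartesian product (what `testing::Combine` over separate in and out mesh lists would produce) also includes pairs nobody asked for.

```cpp
#include <cassert>
#include <initializer_list>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-ins; the real tests use DeviceMesh and InOutMesh.
using ToyMesh = std::string;
using ToyInOutMesh = std::pair<ToyMesh, ToyMesh>;
using ToyParam = std::tuple<ToyInOutMesh, bool>;

// Crosses a curated list of (input mesh, output mesh) pairs with a boolean
// flag. Only the listed pairs ever appear in the result.
std::vector<ToyParam> crossPairsWithFlag(
    const std::vector<ToyInOutMesh>& pairs) {
  std::vector<ToyParam> params;
  for (const auto& pair : pairs) {
    for (bool flag : {false, true}) {
      params.emplace_back(pair, flag);
    }
  }
  return params;
}

// Full Cartesian product of input meshes, output meshes, and the flag.
// Every input mesh is combined with every output mesh.
std::vector<std::tuple<ToyMesh, ToyMesh, bool>> crossEverything(
    const std::vector<ToyMesh>& ins,
    const std::vector<ToyMesh>& outs) {
  std::vector<std::tuple<ToyMesh, ToyMesh, bool>> params;
  for (const auto& in : ins) {
    for (const auto& out : outs) {
      for (bool flag : {false, true}) {
        params.emplace_back(in, out, flag);
      }
    }
  }
  return params;
}

// Unpacking the nested form takes two structured bindings instead of one.
std::string describe(const ToyParam& param) {
  const auto& [meshes, enable_flag] = param;
  const auto& [in_mesh, out_mesh] = meshes;
  return in_mesh + "->" + out_mesh + (enable_flag ? " on" : " off");
}
```

With two curated pairs, `crossPairsWithFlag` produces four parameters, while `crossEverything` over the same two input and two output meshes produces eight. The cost of the nested form is the extra structured binding shown in `describe`.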

@nsarka nsarka force-pushed the nsarka/hostir-integration-4 branch 2 times, most recently from 5983947 to 36c8804 Compare April 1, 2025 17:56
@nsarka nsarka force-pushed the nsarka/hostir-integration-4 branch from c3a186e to b9abb73 Compare April 1, 2025 21:06