[HostIR] Print index definitions in `HostIrContainer::print` by samnordmann · Pull Request #5327 · NVIDIA/Fuser

samnordmann · 2025-10-06T17:05:57Z

Improve printing of HostIrContainer by also printing the index computations, which are not explicitly part of the topLevelExprs.
Example from #5259

%HostIrContainer { (T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
  T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=1048576, zero_init=false, resets_to_zero=false)
  GetCurrentStream into Stream 0
  FOR streamIdx in istreamIdx10{8}:
    SetCurrentStream to Stream ( streamIdx % numberOfStreams )
    Synchronize Stream 0
  FOR streamIdx in istreamIdx10{8}:
    SetCurrentStream to Stream ( streamIdx % numberOfStreams )
    T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = HirAliasSelect( T4_g___bfloat[istreamIdx10{8}, iS11{128}, iS12{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx10{8}, index = i84 )
    IF Manual ( ( ( 8 + ( rank - streamIdx ) ) % 8 ) == rank ):
      T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
         = HirAliasSelect( T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = ideviceIdx.x0{8}, index = 0 )
      T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
         = Set( T5_l___bfloat[iS13{128}, iS14{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), cache_op=Streaming )
    ELSE:
      ShareMemHandles(P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA), P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA),
      P2PCommunication 38 (type=send, buffer=T0_g___bfloat[ideviceIdx.x0{8}, iS1{128}, iS2{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i90, backend=CUDA)
      P2PCommunication 37 (type=recv, buffer=T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=i84, backend=CUDA)
      Wait Communication 38
      Wait Communication 37
    T7_l___bfloat[iS17{128}, iS18{1024}, rS19{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = HirAliasSelect( T3_g___bfloat[istreamIdx6{8}, iS7{128}, iS8{1024}, rS9{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = istreamIdx6{8}, index = i84 )
    T7_l___bfloat[iS17{128}, iS18{1024}, rS19{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})
       = linear(T6_l___bfloat[iS15{128}, iS16{1024}] (DeviceMesh{0 1 2 3 4 5 6 7}),
                T1_g___bfloat[iS3{1024}, iS4{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})      ,
          T2_g___bfloat[iS5{1024}] (DeviceMesh{0 1 2 3 4 5 6 7})      )
    SetCurrentStream to Stream 0
    Synchronize Stream ( streamIdx % numberOfStreams )
} // %HostIrContainer

Index definitions:
  i111 = streamIdx % numberOfStreams;
  i90 = i88 % 8;
  i32 = i30 * 1024;
  i30 = 8 * 128;
  i86 = rank - streamIdx;
  i82 = rank + streamIdx;
  i74 = 8 * 128;
  i76 = i74 * 1024;
  i84 = i82 % 8;
  i88 = 8 + i86;
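
To make the index definitions above easier to read, here is a minimal sketch that substitutes them back into closed form. It assumes the constant 8 is the size of the device mesh shown in the dump; the function names (`recvPeer`, `sendPeer`, `takesLocalCopyBranch`) are hypothetical and not part of nvFuser.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 8 is the device-mesh size (DeviceMesh{0 1 2 3 4 5 6 7}).
constexpr int64_t kNumDevices = 8;

// i82 = rank + streamIdx; i84 = i82 % 8;
// i84 is printed as the recv peer and as the HirAliasSelect index.
int64_t recvPeer(int64_t rank, int64_t stream_idx) {
  assert(rank >= 0 && rank < kNumDevices);
  assert(stream_idx >= 0 && stream_idx < kNumDevices);
  return (rank + stream_idx) % kNumDevices;
}

// i86 = rank - streamIdx; i88 = 8 + i86; i90 = i88 % 8;
// i90 is printed as the send peer. Adding 8 keeps the dividend non-negative.
int64_t sendPeer(int64_t rank, int64_t stream_idx) {
  assert(rank >= 0 && rank < kNumDevices);
  assert(stream_idx >= 0 && stream_idx < kNumDevices);
  return (kNumDevices + (rank - stream_idx)) % kNumDevices;
}

// Condition printed on the `IF Manual` line:
// ( ( 8 + ( rank - streamIdx ) ) % 8 ) == rank
bool takesLocalCopyBranch(int64_t rank, int64_t stream_idx) {
  return sendPeer(rank, stream_idx) == rank;
}
```

For rank 3 and streamIdx 5, `recvPeer` gives 0 and `sendPeer` gives 6. With both values in the range 0 to 7, the local-copy branch is taken only when streamIdx is 0.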

github-actions · 2025-10-06T17:06:58Z

Review updated until commit 8f4be43

Description

Print index definitions in HostIrContainer print output
Add conditional debug printing for index values
Improve debug visibility of scalar index computations
Enhance IR debugging with structured index info

Changes walkthrough 📝

Relevant files

Enhancement

printer.cpp: Print index definitions in HostIrContainer
csrc/ir/printer.cpp (+14/-0)
Added printing of index definitions in HostIrContainer
Only prints when debug option 'indices' is enabled
Filters for scalar index-type values with definitions
Increases indentation for better output structure

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests

⚡ Recommended focus areas for review

Debug Output Control

The debug print logic for index definitions is gated by a debug dump argument, but there is no clear indication of how this affects existing logging behavior or whether it could produce excessive output in certain configurations.

// Print the definitions of the indices that are used in the host_ir_container
if (hasDebugDumpArgument(DebugDumpOption::HostIr, "indices")) {
  os() << "Index definitions:\n";
  indent_size_++;
  for (Val* val : host_ir_container->vals()) {
    if (val->isScalar() && val->definition() != nullptr &&
        val->dtype() == DataType::Index) {
      os() << val->definition()->toString(indent_size_);
    }
  }
  indent_size_--;
  os() << "\n";
}

Val Filtering Logic

The filtering of Vals to print index definitions relies on type checks and definition presence, but does not verify if the values are actually used in the host IR container, potentially leading to irrelevant or redundant output.

for (Val* val : host_ir_container->vals()) {
  if (val->isScalar() && val->definition() != nullptr &&
      val->dtype() == DataType::Index) {
    os() << val->definition()->toString(indent_size_);
  }
}
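
One way to address the observation above would be to print only the definitions reachable from indices that the printed expressions reference. The sketch below is a toy model under stated assumptions: `DefMap`, `reachableDefinitions`, and `exampleDefinitions` are hypothetical names, definitions are reduced to "name reads these names", and none of this is nvFuser's API or what the PR implements.

```cpp
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for scalar definitions: each defined name maps to the
// names it reads. Constants are omitted from the operand lists.
using DefMap = std::unordered_map<std::string, std::vector<std::string>>;

// Returns every defined name reachable from `roots` by following operands.
std::set<std::string> reachableDefinitions(
    const DefMap& defs,
    const std::vector<std::string>& roots) {
  std::set<std::string> visited;
  std::vector<std::string> stack(roots.begin(), roots.end());
  while (!stack.empty()) {
    const std::string name = stack.back();
    stack.pop_back();
    auto it = defs.find(name);
    if (it == defs.end()) {
      continue;  // leaf without a definition, e.g. `rank` or `streamIdx`
    }
    if (!visited.insert(name).second) {
      continue;  // already handled
    }
    for (const std::string& operand : it->second) {
      stack.push_back(operand);
    }
  }
  return visited;
}

// The ten definitions from the example output in the PR description.
DefMap exampleDefinitions() {
  return {
      {"i111", {"streamIdx", "numberOfStreams"}},
      {"i90", {"i88"}},
      {"i32", {"i30"}},
      {"i30", {}},
      {"i86", {"rank", "streamIdx"}},
      {"i82", {"rank", "streamIdx"}},
      {"i74", {}},
      {"i76", {"i74"}},
      {"i84", {"i82"}},
      {"i88", {"i86"}},
  };
}
```

Starting from `i84` and `i90`, the two names that appear literally in the printed expressions, the walk keeps `i82`, `i84`, `i86`, `i88`, and `i90`. The other definitions may still be used in ways the dump prints as already-evaluated values, so a real filter would need the container's actual use information rather than names.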

samnordmann · 2025-10-06T17:09:46Z

!test

wujingyue · 2025-10-06T17:57:02Z

which are not explicitly part of the topLevelExprs

Can you remind me why they aren't part of topLevelExprs? Analogously, imagine you write a C++ loop

for (int i = 0; i < 10; i++) {
  int j = i * 2;
  a[j] = ...
}

j is in the same loop scope as a[j] = ... even though it's merely a scalar operation.

samnordmann · 2025-10-07T09:32:39Z

I am not sure I understand your comment correctly. Index computations are not part of topLevelExprs just because they don't need to be. When we call ExpressionEvaluator::evaluate, the evaluation might involve some computation, which the ExpressionEvaluator performs at runtime, but that computation does not explicitly appear in the HostIrContainer's top_level_exprs_.

The example you wrote looks good to me, but I am not sure I understand what you are suggesting with it.

The example in the PR description illustrates the use case for this patch well.
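
The point about on-demand evaluation can be illustrated with a toy evaluator. This is a sketch only: `ToyDefinition`, `ToyEvaluator`, and `makeExample` are hypothetical and do not mirror nvFuser's ExpressionEvaluator interface. It shows how a value such as `i84` can be computed by walking its definition when requested, without that computation appearing in any list of top-level expressions.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Toy model: a scalar is either bound to a concrete value or defined by a
// binary operation on two other scalars.
struct ToyDefinition {
  char op;  // '+', '-', or '%'
  std::string lhs;
  std::string rhs;
};

struct ToyEvaluator {
  std::unordered_map<std::string, int64_t> bound;
  std::unordered_map<std::string, ToyDefinition> definitions;

  // Computes `name` on demand by recursing through its definition.
  int64_t evaluate(const std::string& name) const {
    if (auto it = bound.find(name); it != bound.end()) {
      return it->second;
    }
    auto def = definitions.find(name);
    if (def == definitions.end()) {
      throw std::runtime_error("unbound scalar: " + name);
    }
    const int64_t a = evaluate(def->second.lhs);
    const int64_t b = evaluate(def->second.rhs);
    switch (def->second.op) {
      case '+':
        return a + b;
      case '-':
        return a - b;
      case '%':
        return a % b;
      default:
        throw std::runtime_error("unsupported op");
    }
  }
};

// Binds the inputs and registers i82 = rank + streamIdx; i84 = i82 % 8;
ToyEvaluator makeExample(int64_t rank, int64_t stream_idx) {
  ToyEvaluator e;
  e.bound = {{"rank", rank}, {"streamIdx", stream_idx}, {"8", 8}};
  e.definitions = {
      {"i82", {'+', "rank", "streamIdx"}},
      {"i84", {'%', "i82", "8"}},
  };
  return e;
}
```

Calling `makeExample(3, 5).evaluate("i84")` walks `i84` to `i82` to the bound inputs and yields 0, even though neither definition was ever listed as a statement to execute.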

wujingyue · 2025-10-08T06:41:02Z

just because they don't need to

That's fair enough and LGTM. I recall that for MultiDeviceExecutor we also have to find and ExpressionEvaluator::invalidate index calculations that depend on the loop index so they can get different values in different iterations.

When Hanlin worked on host IR JIT, we realized that finding what indices to invalidate at "run" time creates problems for host latency. So, for the FusionExecutorCache integration, I let host IR lowering find these index calculations at "compile" time and put them in the scope of the for loop. This is done on a separate code path so it doesn't affect MultiDeviceExecutor. I didn't get a chance to check with you -- hence my question earlier.
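
Why loop-dependent indices need invalidating can be shown with a small sketch. It assumes an evaluator that caches computed values, which is an assumption made for illustration rather than a description of ExpressionEvaluator's internals; `CachedPeerIndex` and `runLoop` are hypothetical names.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Toy model of one cached index: i84 = (rank + streamIdx) % 8.
class CachedPeerIndex {
 public:
  explicit CachedPeerIndex(int64_t rank) : rank_(rank) {}

  void bindStreamIdx(int64_t stream_idx) { stream_idx_ = stream_idx; }

  // Drops the cached value so the next evaluate() recomputes it.
  void invalidate() { cached_.reset(); }

  int64_t evaluate() {
    if (!cached_.has_value()) {
      cached_ = (rank_ + stream_idx_) % 8;
    }
    return *cached_;
  }

 private:
  int64_t rank_;
  int64_t stream_idx_ = 0;
  std::optional<int64_t> cached_;
};

// Evaluates the index once per iteration of an 8-iteration stream loop.
std::vector<int64_t> runLoop(int64_t rank, bool invalidate_each_iteration) {
  CachedPeerIndex index(rank);
  std::vector<int64_t> seen;
  for (int64_t stream_idx = 0; stream_idx < 8; ++stream_idx) {
    index.bindStreamIdx(stream_idx);
    if (invalidate_each_iteration) {
      index.invalidate();
    }
    seen.push_back(index.evaluate());
  }
  return seen;
}
```

With invalidation, rank 3 sees 3, 4, 5, 6, 7, 0, 1, 2 across the loop; without it, every iteration reuses the stale value 3. Placing the index computation inside the loop scope at lowering time, as described above, avoids having to discover at run time which cached values depend on the loop index.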

csrc/ir/printer.cpp

samnordmann · 2025-10-08T07:59:55Z

Ok, now I understand. I like the idea of what was done for host IR JIT. Do you have a pointer to the PR? It would make sense to do the same in MultiDeviceExecutor.

samnordmann · 2025-10-08T10:41:54Z

!test

Print index definitions in host Ir container

b7e3704

samnordmann requested review from nsarka and wujingyue October 6, 2025 17:06

wujingyue approved these changes Oct 8, 2025

minor review

8f4be43

samnordmann merged commit ef5a717 into main Oct 8, 2025
64 of 65 checks passed

samnordmann deleted the host_ir_print_index branch October 8, 2025 14:12

samnordmann merged 2 commits into main from host_ir_print_index
