
max_persistent_buffer_size may be smaller than total_reduction_numel #4075

Open
naoyam opened this issue Mar 13, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@naoyam
Collaborator

naoyam commented Mar 13, 2025

Repro:

TEST_F(PersistentBufferTest, BroadcastDivByZero) {
  auto fusion_ptr = std::make_unique<Fusion>();
  auto& fusion = *fusion_ptr;
  FusionGuard fg(fusion_ptr.get());

  // tv0 (1D) is used both before the reduction (through a broadcast) and
  // after it, so it becomes the persistent buffer.
  auto tv0 = makeSymbolicTensor(1, DataType::BFloat16);
  fusion.addInput(tv0);
  auto tv1 = makeSymbolicTensor(2, DataType::BFloat16);
  fusion.addInput(tv1);

  // Broadcast tv0 to 2D; the broadcast ID is concretized by the add with tv4.
  auto tv2 = broadcast(tv0, {false, true});
  auto tv3 = castOp(DataType::Float, tv2);
  auto tv4 = castOp(DataType::Float, tv1);
  auto tv5 = add(tv3, tv4);
  // Reduce over both IDs: total_reduction_numel = 64 * 16 at runtime.
  auto tv6 = sum(tv5, {0, 1});
  auto tv7 = broadcast(tv6, {true});

  // Second, post-reduction use of tv0.
  auto tv8 = castOp(DataType::Float, tv0);
  auto tv9 = add(tv8, tv7);
  auto tv10 = castOp(DataType::BFloat16, tv9);
  fusion.addOutput(tv10);

  fusion.printMath();

  auto options = at::TensorOptions().dtype(at::kBFloat16).device(at::kCUDA, 0);
  auto t0 = at::randn({64}, options);
  auto t1 = at::randn({64, 16}, options);
  SchedulerRuntimeInfo runtime_info(fusion_ptr.get(), {t0, t1});
  ASSERT_TRUE(Schedule::canSchedule(
      SchedulerType::InnerPersistent, fusion_ptr.get(), runtime_info));
  auto scheduler =
      SchedulerEntry::makeSchedulerInstance(SchedulerType::InnerPersistent);
  // Computing the inner-persistent heuristics is what hits the division
  // by zero.
  auto heuristic_params =
      scheduler->computeHeuristics(fusion_ptr.get(), runtime_info);
  scheduler->schedule(fusion_ptr.get(), heuristic_params.get());
}

This results in a division-by-zero error:

0x0000555556a951d6 in nvfuser::scheduler_utils::safeDiv (x=224, y=0) at /home/nmaruyama/nvfuser/debug1/csrc/scheduler/utils.h:104
104       return std::max(x / y, (int64_t)1);
(gdb) bt
#0  0x0000555556a951d6 in nvfuser::scheduler_utils::safeDiv (x=224, y=0) at /home/nmaruyama/nvfuser/debug1/csrc/scheduler/utils.h:104
#1  0x0000555556a8f485 in nvfuser::(anonymous namespace)::getMaxPersistentBatch (buffer_bytes_per_batch=0, target_threads_per_sm=896, register_overhead=16, is_high_bandwidth_flops_ratio=false)
    at /home/nmaruyama/nvfuser/debug1/csrc/scheduler/normalization_inner.cpp:147
#2  0x0000555556a8dde8 in nvfuser::(anonymous namespace)::innerPersistentHeuristic2D (properties=..., rparams=0x55555b573f90) at /home/nmaruyama/nvfuser/debug1/csrc/scheduler/normalization_inner.cpp:414

This happens because the persistent buffer, t0, has fewer elements (64) than the total number of reduction elements (16 * 64).
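
For concreteness, here is a minimal standalone sketch of the arithmetic, assuming buffer_bytes_per_batch is derived by integer-dividing the persistent buffer size by the total reduction numel (the names mirror the backtrace; the actual computation in normalization_inner.cpp may differ):

#include <algorithm>
#include <cstdint>
#include <iostream>

// Copied from the backtrace (csrc/scheduler/utils.h:104): safeDiv guards
// against a result of 0, but not against y itself being 0.
int64_t safeDiv(int64_t x, int64_t y) {
  return std::max(x / y, (int64_t)1); // SIGFPE when y == 0
}

int main() {
  // t0 is the persistent buffer: 64 BFloat16 elements = 128 bytes.
  int64_t persistent_buffer_bytes = 64 * 2;
  // The reduction covers the concretized broadcast domain: 64 * 16 elements.
  int64_t total_reduction_numel = 64 * 16;
  // Integer division truncates to 0 because the buffer is smaller than
  // the reduction domain.
  int64_t buffer_bytes_per_batch =
      persistent_buffer_bytes / total_reduction_numel; // 128 / 1024 == 0
  std::cout << buffer_bytes_per_batch << std::endl;    // prints 0
  // getMaxPersistentBatch then divides by this value, reproducing the crash.
  safeDiv(224, buffer_bytes_per_batch); // division by zero
  return 0;
}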

naoyam added the bug (Something isn't working) label on Mar 13, 2025
@naoyam
Collaborator Author

naoyam commented Mar 13, 2025

Note that just using ceilDiv here won't work. For example, suppose t0 is [i(4), r(2)], meaning it has a non-reduction ID of extent 4 and a reduction ID of extent 2. Clearly, the number of non-reduction elements per reduction element is 2. However, if the tensor is broadcast to, say, extent 8, and the broadcast is concretized and reduced, the total number of reduction elements becomes 16, so a ceilDiv would yield 1.

It seems that the underlying problem is that we are dividing the number of elements of a pre-broadcast tensor by the number of elements of a post-broadcast tensor; see the sketch below.
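
A quick standalone sketch of why ceilDiv merely masks the mismatch rather than fixing it (ceilDiv here is ordinary round-up division; the extents are from the hypothetical [i(4), r(2)] case above):

#include <cassert>
#include <cstdint>

// Ordinary round-up division.
int64_t ceilDiv(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

int main() {
  // Pre-broadcast buffer t0 = [i(4), r(2)]: 8 elements in total.
  int64_t buffer_numel = 4 * 2;
  // After a broadcast of extent 8 is concretized and reduced, the
  // reduction domain has 2 * 8 = 16 elements.
  int64_t total_reduction_numel = 2 * 8;
  // Dividing the pre-broadcast buffer size by the post-broadcast reduction
  // size no longer crashes, but the result (1) does not reflect the
  // buffer's actual shape.
  assert(ceilDiv(buffer_numel, total_reduction_numel) == 1);
  return 0;
}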

@naoyam
Collaborator Author

naoyam commented Mar 13, 2025

This may be related to #4074. We should first address #4074.
