[torchbench] moco fails to run. #6083

Closed
ysiraichi opened this issue Dec 9, 2023 · 8 comments · Fixed by #7257

@ysiraichi (Collaborator)

🐛 Bug

Running the upstreamed benchmarking scripts with the following command results in an unexpected error.

python xla/benchmarks/experiment_runner.py \
       --suite-name torchbench \
       --accelerator cuda \
       --xla PJRT --xla None \
       --dynamo openxla --dynamo None \
       --test eval --test train \
       --repeat 30 --iterations-per-run 5 \
       --print-subprocess \
       --no-resume -k moco
Traceback (most recent call last):
  File "xla/benchmarks/experiment_runner.py", line 601, in <module>
    main()
  File "xla/benchmarks/experiment_runner.py", line 597, in main
    runner.run()
  File "xla/benchmarks/experiment_runner.py", line 65, in run
    self.run_single_experiment(experiment_config, model_config)
  File "xla/benchmarks/experiment_runner.py", line 161, in run_single_experiment
    run_metrics, output = self.timed_run(benchmark_experiment,
  File "xla/benchmarks/experiment_runner.py", line 328, in timed_run
    output = loop()
  File "xla/benchmarks/experiment_runner.py", line 310, in loop
    output = benchmark_model.model_iter_fn(
  File "xla/benchmarks/benchmark_model.py", line 154, in eval
    pred = self.module(*inputs)
  File "torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "torch/nn/parallel/distributed.py", line 1519, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "torch/nn/parallel/distributed.py", line 1420, in _pre_forward
    self._sync_buffers()
  File "torch/nn/parallel/distributed.py", line 2040, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "torch/nn/parallel/distributed.py", line 2044, in _sync_module_buffers
    self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
  File "torch/nn/parallel/distributed.py", line 2066, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
  File "torch/nn/parallel/distributed.py", line 1981, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: No backend type associated with device type cpu

Environment

@lezcano (Collaborator) commented Jan 29, 2024

Is this one still relevant?

@ysiraichi (Collaborator, Author)

Yes. I got it in the last benchmark run.

@ysiraichi (Collaborator, Author)

I think I finally understood what was going wrong here.

By setting TORCH_SHOW_DISPATCH_TRACE=1, we can see that the error is raised while dispatching the c10d::broadcast_ distributed operation. Not only that, but we can also see that it is falling back to CPU.

 [call] op=[c10d::broadcast_], key=[AutogradXLA]
  [redispatchBoxed] op=[c10d::broadcast_], key=[Functionalize]
   [callBoxed] op=[c10d::broadcast_], key=[AutogradXLA]
    [redispatchBoxed] op=[c10d::broadcast_], key=[XLA]
     [call] op=[aten::_to_cpu], key=[AutogradXLA]
      [redispatchBoxed] op=[aten::_to_cpu], key=[XLA]
       [call] op=[aten::empty.memory_format], key=[BackendSelect]
        [redispatch] op=[aten::empty.memory_format], key=[CPU]
     [call] op=[aten::_to_cpu], key=[Undefined]
     [redispatchBoxed] op=[c10d::broadcast_], key=[CPU]

At first, I thought that there was no dispatch registered for c10d::broadcast_. However, the error is actually triggered inside the CPU dispatch registered for c10d::broadcast_, specifically when trying to get the (distributed) backend for CPU [1, 2]. My guess is that this error happens because we haven't initialized a ProcessGroup for CPU [3].
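
For reference, here is a minimal standalone sketch (not from the benchmark itself) that should reproduce the same error, assuming the ProcessGroup was initialized for CUDA only (presumably via the NCCL backend, per [3]), so that no distributed backend is associated with CPU tensors:

import os
import torch
import torch.distributed as dist

# Single-process group with only the NCCL backend, i.e. only CUDA tensors
# have an associated distributed backend.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)

dist.broadcast(torch.zeros(4, device="cuda"), src=0)  # OK: NCCL handles CUDA tensors
dist.broadcast(torch.zeros(4), src=0)  # RuntimeError: No backend type associated with device type cpu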

I think the way to make it work is to run the fallback on CUDA instead, since that is the device we initialized the ProcessGroup for [3].

@lezcano @JackCaoG
Let me know what you think.

@lezcano (Collaborator) commented Jun 4, 2024

This sounds reasonable to me. We can try to land fallbacks to CUDA first, and then revisit this one after that.

@JackCaoG (Collaborator) commented Jun 4, 2024

Hmm, why does this model actually use c10d? Is it doing multi-device training?

@ysiraichi (Collaborator, Author)

I guess that's a fair question. I'm not sure why it was implemented like that, since they set world_size=1.
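
For what it's worth, even with world_size=1, just wrapping the module in DistributedDataParallel is enough to hit c10d: DDP's pre-forward hook broadcasts module buffers, which is exactly the _sync_buffers path in the traceback above. An illustrative single-process sketch (not the actual moco benchmark code):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)

# Any module with buffers (e.g. BatchNorm running stats) will do.
model = torch.nn.BatchNorm1d(8).to("cuda")
ddp_model = DDP(model, device_ids=[0])

# Even with a single rank, DDP's pre-forward hook calls c10d::broadcast_
# to sync buffers, which is the call that fails on XLA in this issue.
ddp_model(torch.randn(2, 8, device="cuda"))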

@ysiraichi (Collaborator, Author)

On second thought, maybe this is a case where we should change the benchmark code to initialize the ProcessGroup for XLA [1]. What do you think?
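
A sketch of what that could look like, assuming the "xla" c10d backend registered by torch_xla.distributed.xla_backend and the xla:// init method (the exact incantation may differ across torch_xla versions):

import torch.distributed as dist
import torch_xla.distributed.xla_backend  # registers the "xla" c10d backend

# Initialize the process group on the XLA backend instead of NCCL, so that
# c10d::broadcast_ dispatches to a backend that understands XLA tensors.
dist.init_process_group("xla", init_method="xla://")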

@JackCaoG (Collaborator) commented Jun 5, 2024

Yeah, that's worth a try, though my impression is still that torchbench only does single-process benchmarking, which makes me wonder why there is even distributed stuff involved.
