[torchbench] moco fails to run. #6083

Closed
ysiraichi opened this issue Dec 9, 2023 · 8 comments · Fixed by #7257

@ysiraichi (Collaborator)

🐛 Bug

Running the upstreamed benchmarking scripts with the following command results in an unexpected error.

python xla/benchmarks/experiment_runner.py \
       --suite-name torchbench \
       --accelerator cuda \
       --xla PJRT --xla None \
       --dynamo openxla --dynamo None \
       --test eval --test train \
       --repeat 30 --iterations-per-run 5 \
       --print-subprocess \
       --no-resume -k moco
Traceback (most recent call last):
  File "xla/benchmarks/experiment_runner.py", line 601, in <module>
    main()
  File "xla/benchmarks/experiment_runner.py", line 597, in main
    runner.run()
  File "xla/benchmarks/experiment_runner.py", line 65, in run
    self.run_single_experiment(experiment_config, model_config)
  File "xla/benchmarks/experiment_runner.py", line 161, in run_single_experiment
    run_metrics, output = self.timed_run(benchmark_experiment,
  File "xla/benchmarks/experiment_runner.py", line 328, in timed_run
    output = loop()
  File "xla/benchmarks/experiment_runner.py", line 310, in loop
    output = benchmark_model.model_iter_fn(
  File "xla/benchmarks/benchmark_model.py", line 154, in eval
    pred = self.module(*inputs)
  File "torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "torch/nn/parallel/distributed.py", line 1519, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "torch/nn/parallel/distributed.py", line 1420, in _pre_forward
    self._sync_buffers()
  File "torch/nn/parallel/distributed.py", line 2040, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "torch/nn/parallel/distributed.py", line 2044, in _sync_module_buffers
    self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
  File "torch/nn/parallel/distributed.py", line 2066, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
  File "torch/nn/parallel/distributed.py", line 1981, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: No backend type associated with device type cpu

Environment

@lezcano (Collaborator) commented Jan 29, 2024

Is this one still relevant?

@ysiraichi (Collaborator, Author)

Yes. I got it in the last benchmark run.

@ysiraichi (Collaborator, Author)

I think I finally understood what was going wrong here.

By setting TORCH_SHOW_DISPATCH_TRACE=1, we can see that the error is raised while dispatching the c10d::broadcast_ distributed operation. Not only that, but we can also see that it is falling back to CPU.

 [call] op=[c10d::broadcast_], key=[AutogradXLA]
  [redispatchBoxed] op=[c10d::broadcast_], key=[Functionalize]
   [callBoxed] op=[c10d::broadcast_], key=[AutogradXLA]
    [redispatchBoxed] op=[c10d::broadcast_], key=[XLA]
     [call] op=[aten::_to_cpu], key=[AutogradXLA]
      [redispatchBoxed] op=[aten::_to_cpu], key=[XLA]
       [call] op=[aten::empty.memory_format], key=[BackendSelect]
        [redispatch] op=[aten::empty.memory_format], key=[CPU]
     [call] op=[aten::_to_cpu], key=[Undefined]
     [redispatchBoxed] op=[c10d::broadcast_], key=[CPU]

At first, I thought that there was no dispatch registered for c10d::broadcast_. However, the error is actually triggered inside the CPU dispatch registered for c10d::broadcast_, specifically when trying to get the (distributed) backend for CPU [1, 2]. My guess is that this error happens because we haven't initialized a ProcessGroup for CPU [3].
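
For reference, here is a minimal standalone sketch (not from the benchmark itself) that should reproduce the same error, assuming the ProcessGroup was initialized for CUDA only (presumably via the NCCL backend, per [3]), so that no distributed backend is associated with CPU tensors:

import os
import torch
import torch.distributed as dist

# Single-process group with only the NCCL backend, i.e. only CUDA tensors
# have an associated distributed backend.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)

dist.broadcast(torch.zeros(4, device="cuda"), src=0)  # OK: NCCL handles CUDA tensors
dist.broadcast(torch.zeros(4), src=0)  # RuntimeError: No backend type associated with device type cpu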

I think the way to make it work is to run the fallback on CUDA instead, since that is the device we initialized the ProcessGroup for [3].

@lezcano @JackCaoG
Let me know what you think.

@lezcano (Collaborator) commented Jun 4, 2024

This sounds reasonable to me. We can try to land fallbacks to CUDA first, and then revisit this one after that.

@JackCaoG (Collaborator) commented Jun 4, 2024

Hmm, why does this model actually use c10d? Is it doing multi-device training?

@ysiraichi (Collaborator, Author)

I guess that's a fair question. I'm not sure why it was implemented like that, since they set world_size=1.
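
For what it's worth, even with world_size=1, just wrapping the module in DistributedDataParallel is enough to hit c10d: DDP's pre-forward hook broadcasts module buffers, which is exactly the _sync_buffers path in the traceback above. An illustrative single-process sketch (not the actual moco benchmark code):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)

# Any module with buffers (e.g. BatchNorm running stats) will do.
model = torch.nn.BatchNorm1d(8).to("cuda")
ddp_model = DDP(model, device_ids=[0])

# Even with a single rank, DDP's pre-forward hook calls c10d::broadcast_
# to sync buffers, which is the call that fails on XLA in this issue.
ddp_model(torch.randn(2, 8, device="cuda"))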

@ysiraichi (Collaborator, Author)

On second thought, maybe this is a case where we should change the benchmark code to initialize the ProcessGroup for XLA [1]. What do you think?
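
A sketch of what that could look like, assuming the "xla" c10d backend registered by torch_xla.distributed.xla_backend and the xla:// init method (the exact incantation may differ across torch_xla versions):

import torch.distributed as dist
import torch_xla.distributed.xla_backend  # registers the "xla" c10d backend

# Initialize the process group on the XLA backend instead of NCCL, so that
# c10d::broadcast_ dispatches to a backend that understands XLA tensors.
dist.init_process_group("xla", init_method="xla://")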

@JackCaoG (Collaborator) commented Jun 5, 2024

Yeah, that's worth a try, though my impression is still that torchbench only does single-process benchmarking, which makes me wonder why there is even distributed stuff involved.
