@lessw2020
Contributor

This PR adds torch.compile to torchtrain.
1 - Control of the compile option is added to the main config toml.
2 - A config dir with config utils has been made to centralize config loading (get_config() returns the project config file).
3 - Profile.py and train.py have been updated to both load and use the get_config() function.
4 - If the use_compile option is on (default = true), torch.compile is applied to the model, and a log message tells the user that compilation is running.

Testing:
Verified torch.compile on and off, with and without profiling.

General comments:
I have not added the same compile option to the command-line args yet; I think it is better to make a master class that handles both args and configs and produces the final settings for the user.
I also named the config folder tt_config (torch train config) to avoid confusion with other config dirs, but we could just revert to a generic config name.

Used ruff for formatting and linting.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 24, 2024
@lessw2020
Contributor Author

PR maps to issue #8

Collaborator

@wanchaol wanchaol left a comment

I think we should separate the config-related changes from the torch.compile change; please see the inline comments.

@lessw2020
Contributor Author

Updated the PR with feedback.
1 - Compile is now controlled via the --compile flag.
2 - Note that torch.compile complains about being unable to lower complex operators. I have a compile-friendly implementation of rotary embeddings, so we may want to switch to that.
3 - Will do the config work (args/dataclasses/toml) in a separate PR.
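For item 1, a rough sketch of what a `--compile` switch could look like with `argparse`. The thread only says the flag exists; treating it as a boolean `store_true` flag, and the `build_parser` and `maybe_compile` names, are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a single boolean --compile switch (off by default here)."""
    parser = argparse.ArgumentParser(description="training entry point (sketch)")
    parser.add_argument(
        "--compile",
        action="store_true",
        help="wrap the model with torch.compile",
    )
    return parser


def maybe_compile(model, enabled: bool):
    """Return torch.compile(model) when enabled, otherwise the model unchanged."""
    if not enabled:
        return model
    import torch  # imported lazily so argument parsing works without torch installed

    return torch.compile(model)
```

Usage would be along the lines of `args = build_parser().parse_args()` followed by `model = maybe_compile(model, args.compile)`.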

(Attached screenshot: compiler_warnings)
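On the complex-operator warning in item 2: rotary embeddings are often written as a complex multiplication, and the same rotation can be expressed with real cos/sin arithmetic only. The sketch below shows that equivalence on a single pair of values in plain Python. It is an illustration of the idea, not the compile-friendly implementation mentioned above, and the function names are made up for this example.

```python
import cmath
import math


def rotate_complex(x0: float, x1: float, theta: float) -> tuple[float, float]:
    """Rotate the pair (x0, x1) by theta using complex multiplication."""
    z = complex(x0, x1) * cmath.exp(1j * theta)
    return z.real, z.imag


def rotate_real(x0: float, x1: float, theta: float) -> tuple[float, float]:
    """Same rotation using only real-valued cos/sin arithmetic."""
    c, s = math.cos(theta), math.sin(theta)
    # (x0 + i*x1) * (c + i*s) = (x0*c - x1*s) + i*(x0*s + x1*c)
    return x0 * c - x1 * s, x0 * s + x1 * c
```

A tensor version would apply the same two real-valued formulas elementwise to paired feature dimensions, which avoids complex dtypes entirely.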

Collaborator

@wanchaol wanchaol left a comment

lgtm! have some minor suggestions and questions

@lessw2020 lessw2020 merged commit 940eb0d into pytorch:main Jan 25, 2024
@lessw2020 lessw2020 deleted the add_compiler branch January 25, 2024 04:21
lessw2020 added a commit that referenced this pull request Apr 18, 2024
jinsun-yoo pushed a commit to jinsun-yoo/torchtitan that referenced this pull request Oct 30, 2024
wconstab added a commit that referenced this pull request Nov 21, 2025
recomputing MoE during backward

```
[rank4]: (Triggered internally at /data/users/whc/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
[rank4]:  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:Exception in thread Thread-2 (run_backward):
[rank4]:Traceback (most recent call last):
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 384, in stage_backward
[rank4]:    torch.autograd.backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/__init__.py", line 364, in backward
[rank4]:    _engine_run_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/graph.py", line 865, in _engine_run_backward
[rank4]:    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank4]:Exception raised from throw_nccl_error at /data/users/whc/pytorch/torch/csrc/cuda/nccl.cpp:259 (most recent call first):
[rank4]:C++ CapturedTraceback:
[rank4]:#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank4]:#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
[rank4]:#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
[rank4]:#7 torch::cuda::nccl::detail::throw_nccl_error(torch::cuda::nccl::ncclResult) from ??:0
[rank4]:#8 torch::cuda::nccl::detail::NCCL_CHECK_TIMEOUT(torch::cuda::nccl::ncclResult, void*) from nccl.cpp:0
[rank4]:#9 torch::cuda::nccl::all2all_single_unequal_split(void*, unsigned long const*, unsigned long const*, void*, unsigned long const*, unsigned long const*, unsigned long, c10::ScalarType, void*, c10::cuda::CUDAStream&) from ??:0
[rank4]:#10 c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from ??:0
[rank4]:#11 c10d::ops::(anonymous namespace)::alltoall_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from Ops.cpp:0
[rank4]:#12 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank4]:#13 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:#14 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from :0
[rank4]:#15 c10d::ProcessGroup::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from :0
[rank4]:#16 c10d::all_to_all_single(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
[rank4]:#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank4]:#18 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:#19 c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from :0
[rank4]:#20 std::vector<at::Tensor, std::allocator<at::Tensor> > torch::autograd::CppNode_apply_functional<(anonymous namespace)::AllToAllSingle>(std::vector<at::Tensor, std::allocator<at::Tensor> >&&, torch::autograd::AutogradContext&, std::vector<bool, std::allocator<bool> > const&, std::vector<torch::autograd::VariableInfo, std::allocator<torch::autograd::VariableInfo> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from Functional.cpp:0
[rank4]:#21 torch::autograd::CppNode<(anonymous namespace)::AllToAllSingle>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from Functional.cpp:0
[rank4]:#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
[rank4]:#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) from ??:0
[rank4]:#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
[rank4]:#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
[rank4]:#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
[rank4]:#27 std::error_code::default_error_condition() const from ??:0
[rank4]:#28 start_thread from ??:0
[rank4]:#29 __clone3 from :0
[rank4]:
[rank4]:
[rank4]:The above exception was the direct cause of the following exception:
[rank4]:
[rank4]:Traceback (most recent call last):
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[rank4]:    self.run()
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 953, in run
[rank4]:    self._target(*self._args, **self._kwargs)
[rank4]:  File "/data/users/whc/torchtitan/torchtitan/distributed/dual_pipe_v.py", line 254, in run_backward
[rank4]:    backward_stage.backward_one_chunk(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 799, in backward_one_chunk
[rank4]:    grads_input, _ = self.backward_maybe_with_nosync(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 653, in backward_maybe_with_nosync
[rank4]:    result = perform_backward(backward_type)()
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 607, in <lambda>
[rank4]:    stage_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 425, in stage_backward
[rank4]:    raise RuntimeError(exc_msg) from e
[rank4]:RuntimeError:
[rank4]:        Failed to run stage backward:
[rank4]:        Stage output: ('Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)',)
[rank4]:        Output gradient: ('Tensor(torch.Size([1, 4096, 2048]), grad=False, dtype=torch.bfloat16)',)
[rank4]:        Input: ['Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)']
```
H-Huang pushed a commit to H-Huang/torchtitan that referenced this pull request Dec 2, 2025
Labels

CLA Signed This label is managed by the Meta Open Source bot.

3 participants