Skip to content

Conversation

@tianyu-l
Copy link
Contributor

@tianyu-l tianyu-l commented Jan 30, 2024

Stack from ghstack (oldest at bottom):

As titled. Only enable the gradient scaler if mixed precision training is used with parameter dtype being fp16.

tianyu-l added a commit that referenced this pull request Jan 30, 2024
ghstack-source-id: 3d250d1
Pull Request resolved: #25
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 30, 2024
Copy link
Contributor

@lessw2020 lessw2020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm but one side issue.
We only want the grad scaler to be active if the param dtype is fp16 - otherwise we are checking for infs/nans that won't exist and adding overhead.
Cleanest imo is check that and then create a no op scaler via the enabled flag in ShardedGradScaler init set to False, or uglier, if/else the scaler code.
We are currently setting the dtype in the fsdp config, but it's not exposed to the user yet, but you could add a quick dtype check:
https://github.com/lessw2020/torchtrain/blob/2c7e48bbab123eefea2a6819320d5d2f81cec438/torchtrain/parallelisms/parallelize_llama.py#L57

@tianyu-l
Copy link
Contributor Author

lgtm but one side issue. We only want the grad scaler to be active if the param dtype is fp16 - otherwise we are checking for infs/nans that won't exist and adding overhead. Cleanest imo is check that and then create a no op scaler via the enabled flag in ShardedGradScaler init set to False, or uglier, if/else the scaler code. We are currently setting the dtype in the fsdp config, but it's not exposed to the user yet, but you could add a quick dtype check: https://github.com/lessw2020/torchtrain/blob/2c7e48bbab123eefea2a6819320d5d2f81cec438/torchtrain/parallelisms/parallelize_llama.py#L57

Thanks for the note! I'm new to gradient scaling, and not sure if grad scaler should be used only when the param dtype is fp16.

According to the GradScaler doc (https://github.com/pytorch/pytorch/blob/main/torch/amp/grad_scaler.py#L90), a larger scale should be used to minimize gradient underflow, and only for fp16 this could cause gradient overflow. This seems to implicitly suggest that gradient scaling should be used in general and only for fp16 the Inf/NaN check is needed.

I might interpret it wrong, so it would great if you can share some paper/notes on grad scaling!

@tianyu-l tianyu-l requested a review from wanchaol January 31, 2024 00:42
@lessw2020
Copy link
Contributor

the larger scaling factor they reference though is for fp16 - you can both underflow and overflow with FP16. Amp is inherently using FP16, they just don't expressly mention it in that paragraph.

For BF16 and FP32 have the same dynamic range (only fp16 has the limited range issue) so no reason to scale/use ShardedGradeScaler for those dtypes.
I did double check with Andrew to make sure (though we would have been telling people the incorrect advise for 2 years if not correct lol) and he concurred that there is no need to use ShardedGradScaler except for fp16 case.

@lessw2020
Copy link
Contributor

There's a good overview on fp16 and scaling here:
https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#hpformat

tianyu-l added a commit that referenced this pull request Jan 31, 2024
ghstack-source-id: a5a2170
Pull Request resolved: #25
@tianyu-l
Copy link
Contributor Author

There's a good overview on fp16 and scaling here: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#hpformat

Just realized I read the paper Mixed Precision Training long ago but forgot almost everything. Good to learn it again :)

@tianyu-l tianyu-l linked an issue Jan 31, 2024 that may be closed by this pull request
Copy link
Contributor

@lessw2020 lessw2020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! Thanks for updating to check for fp16.
Let's see if Wanchao has any feedback as well.

As titled. Only enable the gradient scaler if mixed precision training is used with parameter dtype being fp16.

[ghstack-poisoned]
Copy link
Collaborator

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!


def build_grad_scaler(model):
# apply gradient scaling if mixed precision training is enabled with fp16 param dtype
if model.mixed_precision.param_dtype == torch.float16:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can probably later control the param_dtype as a TrainOption so that we don't rely this model be a FSDP instance. right now it looks good

@wanchaol
Copy link
Collaborator

Looks like you are using ghstack. Please note that for ghstack, you need to use ghstack land in cmdline instead of clicking squash and merge button, the latter will not merge the changes to main and to the base branch instead

@tianyu-l tianyu-l merged commit 0591139 into gh/tianyu-l/1/base Jan 31, 2024
tianyu-l added a commit that referenced this pull request Jan 31, 2024
ghstack-source-id: 4cbd0b9
Pull Request resolved: #25
@tianyu-l tianyu-l deleted the gh/tianyu-l/1/head branch January 31, 2024 23:05
lessw2020 pushed a commit that referenced this pull request Apr 18, 2024
ghstack-source-id: 4cbd0b9
Pull Request resolved: #25
payoto pushed a commit to graphcore-research/torchtitan-fork that referenced this pull request Feb 7, 2025
…logger-api

Adding custom metrics logger API.
wconstab added a commit that referenced this pull request Nov 21, 2025
recomputing MoE during backward

```
[rank4]: (Triggered internally at /data/users/whc/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
[rank4]:  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:Exception in thread Thread-2 (run_backward):
[rank4]:Traceback (most recent call last):
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 384, in stage_backward
[rank4]:    torch.autograd.backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/__init__.py", line 364, in backward
[rank4]:    _engine_run_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/graph.py", line 865, in _engine_run_backward
[rank4]:    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank4]:Exception raised from throw_nccl_error at /data/users/whc/pytorch/torch/csrc/cuda/nccl.cpp:259 (most recent call first):
[rank4]:C++ CapturedTraceback:
[rank4]:#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank4]:#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
[rank4]:#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
[rank4]:#7 torch::cuda::nccl::detail::throw_nccl_error(torch::cuda::nccl::ncclResult) from ??:0
[rank4]:#8 torch::cuda::nccl::detail::NCCL_CHECK_TIMEOUT(torch::cuda::nccl::ncclResult, void*) from nccl.cpp:0
[rank4]:#9 torch::cuda::nccl::all2all_single_unequal_split(void*, unsigned long const*, unsigned long const*, void*, unsigned long const*, unsigned long const*, unsigned long, c10::ScalarType, void*
, c10::cuda::CUDAStream&) from ??:0
[rank4]:#10 c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from ?
?:0
[rank4]:#11 c10d::ops::(anonymous namespace)::alltoall_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
> const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from Ops.cpp:0
[rank4]:#12 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >
 (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vec
tor<long, std::allocator<long> >, bool, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c
10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >
, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank4]:#13 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:#14 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGrou
p, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), void>::call(c10::Box
edKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
 > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from :0
[rank4]:#15 c10d::ProcessGroup::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from :0
[rank4]:#16 c10d::all_to_all_single(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ?
?:0
[rank4]:#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, st
d::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, s
td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::
allocator<c10::IValue> >*) from :0
[rank4]:#18 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:#19 c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat
or<char> >), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_stri
ng<char, std::char_traits<char>, std::allocator<char> >) from :0
[rank4]:#20 std::vector<at::Tensor, std::allocator<at::Tensor> > torch::autograd::CppNode_apply_functional<(anonymous namespace)::AllToAllSingle>(std::vector<at::Tensor, std::allocator<at::Tensor> >
&&, torch::autograd::AutogradContext&, std::vector<bool, std::allocator<bool> > const&, std::vector<torch::autograd::VariableInfo, std::allocator<torch::autograd::VariableInfo> > const&, std::__cxx1
1::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from Functional.cpp:0
[rank4]:#21 torch::autograd::CppNode<(anonymous namespace)::AllToAllSingle>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from Functional.cpp:0
[rank4]:#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
[rank4]:#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueu
e> const&) from ??:0
[rank4]:#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
[rank4]:#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
[rank4]:#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
[rank4]:#27 std::error_code::default_error_condition() const from ??:0
[rank4]:#28 start_thread from ??:0
[rank4]:#29 __clone3 from :0
[rank4]:
[rank4]:
[rank4]:The above exception was the direct cause of the following exception:
[rank4]:
[rank4]:Traceback (most recent call last):
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[rank4]:    self.run()
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 953, in run
[rank4]:    self._target(*self._args, **self._kwargs)
[rank4]:  File "/data/users/whc/torchtitan/torchtitan/distributed/dual_pipe_v.py", line 254, in run_backward
[rank4]:    backward_stage.backward_one_chunk(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 799, in backward_one_chunk
[rank4]:    grads_input, _ = self.backward_maybe_with_nosync(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 653, in backward_maybe_with_nosync
[rank4]:    result = perform_backward(backward_type)()
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 607, in <lambda>
[rank4]:    stage_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 425, in stage_backward
[rank4]:    raise RuntimeError(exc_msg) from e
[rank4]:RuntimeError:
[rank4]:        Failed to run stage backward:
[rank4]:        Stage output: ('Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)',)
[rank4]:        Output gradient: ('Tensor(torch.Size([1, 4096, 2048]), grad=False, dtype=torch.bfloat16)',)
[rank4]:        Input: ['Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)']
```
H-Huang pushed a commit to H-Huang/torchtitan that referenced this pull request Dec 2, 2025
recomputing MoE during backward

```
[rank4]: (Triggered internally at /data/users/whc/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
[rank4]:  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:Exception in thread Thread-2 (run_backward):
[rank4]:Traceback (most recent call last):
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 384, in stage_backward
[rank4]:    torch.autograd.backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/__init__.py", line 364, in backward
[rank4]:    _engine_run_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/graph.py", line 865, in _engine_run_backward
[rank4]:    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank4]:Exception raised from throw_nccl_error at /data/users/whc/pytorch/torch/csrc/cuda/nccl.cpp:259 (most recent call first):
[rank4]:C++ CapturedTraceback:
[rank4]:pytorch#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()pytorch#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank4]:pytorch#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
[rank4]:pytorch#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
[rank4]:pytorch#7 torch::cuda::nccl::detail::throw_nccl_error(torch::cuda::nccl::ncclResult) from ??:0
[rank4]:pytorch#8 torch::cuda::nccl::detail::NCCL_CHECK_TIMEOUT(torch::cuda::nccl::ncclResult, void*) from nccl.cpp:0
[rank4]:pytorch#9 torch::cuda::nccl::all2all_single_unequal_split(void*, unsigned long const*, unsigned long const*, void*, unsigned long const*, unsigned long const*, unsigned long, c10::ScalarType, void*
, c10::cuda::CUDAStream&) from ??:0
[rank4]:pytorch#10 c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from ?
?:0
[rank4]:pytorch#11 c10d::ops::(anonymous namespace)::alltoall_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
> const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from Ops.cpp:0
[rank4]:pytorch#12 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >
 (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vec
tor<long, std::allocator<long> >, bool, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c
10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >
, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank4]:pytorch#13 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:pytorch#14 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGrou
p, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), void>::call(c10::Box
edKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
 > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from :0
[rank4]:pytorch#15 c10d::ProcessGroup::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from :0
[rank4]:pytorch#16 c10d::all_to_all_single(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ?
?:0
[rank4]:pytorch#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, st
d::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, s
td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::
allocator<c10::IValue> >*) from :0
[rank4]:pytorch#18 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:pytorch#19 c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat
or<char> >), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_stri
ng<char, std::char_traits<char>, std::allocator<char> >) from :0
[rank4]:pytorch#20 std::vector<at::Tensor, std::allocator<at::Tensor> > torch::autograd::CppNode_apply_functional<(anonymous namespace)::AllToAllSingle>(std::vector<at::Tensor, std::allocator<at::Tensor> >
&&, torch::autograd::AutogradContext&, std::vector<bool, std::allocator<bool> > const&, std::vector<torch::autograd::VariableInfo, std::allocator<torch::autograd::VariableInfo> > const&, std::__cxx1
1::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from Functional.cpp:0
[rank4]:pytorch#21 torch::autograd::CppNode<(anonymous namespace)::AllToAllSingle>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from Functional.cpp:0
[rank4]:pytorch#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
[rank4]:pytorch#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueu
e> const&) from ??:0
[rank4]:pytorch#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
[rank4]:pytorch#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
[rank4]:pytorch#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
[rank4]:pytorch#27 std::error_code::default_error_condition() const from ??:0
[rank4]:pytorch#28 start_thread from ??:0
[rank4]:pytorch#29 __clone3 from :0
[rank4]:
[rank4]:
[rank4]:The above exception was the direct cause of the following exception:
[rank4]:
[rank4]:Traceback (most recent call last):
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[rank4]:    self.run()
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 953, in run
[rank4]:    self._target(*self._args, **self._kwargs)
[rank4]:  File "/data/users/whc/torchtitan/torchtitan/distributed/dual_pipe_v.py", line 254, in run_backward
[rank4]:    backward_stage.backward_one_chunk(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 799, in backward_one_chunk
[rank4]:    grads_input, _ = self.backward_maybe_with_nosync(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 653, in backward_maybe_with_nosync
[rank4]:    result = perform_backward(backward_type)()
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 607, in <lambda>
[rank4]:    stage_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 425, in stage_backward
[rank4]:    raise RuntimeError(exc_msg) from e
[rank4]:RuntimeError:
[rank4]:        Failed to run stage backward:
[rank4]:        Stage output: ('Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)',)
[rank4]:        Output gradient: ('Tensor(torch.Size([1, 4096, 2048]), grad=False, dtype=torch.bfloat16)',)
[rank4]:        Input: ['Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)']
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add FSDP grad scaler to the train loop

5 participants