-
Notifications
You must be signed in to change notification settings - Fork 626
[torchtrain] add gradient scaler #25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
lessw2020
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm but one side issue.
We only want the grad scaler to be active if the param dtype is fp16 - otherwise we are checking for infs/nans that won't exist and adding overhead.
Cleanest imo is check that and then create a no op scaler via the enabled flag in ShardedGradScaler init set to False, or uglier, if/else the scaler code.
We are currently setting the dtype in the fsdp config, but it's not exposed to the user yet, but you could add a quick dtype check:
https://github.com/lessw2020/torchtrain/blob/2c7e48bbab123eefea2a6819320d5d2f81cec438/torchtrain/parallelisms/parallelize_llama.py#L57
Thanks for the note! I'm new to gradient scaling, and not sure if grad scaler should be used only when the param dtype is fp16. According to the GradScaler doc (https://github.com/pytorch/pytorch/blob/main/torch/amp/grad_scaler.py#L90), a larger scale should be used to minimize gradient underflow, and only for fp16 this could cause gradient overflow. This seems to implicitly suggest that gradient scaling should be used in general and only for fp16 the Inf/NaN check is needed. I might interpret it wrong, so it would great if you can share some paper/notes on grad scaling! |
|
the larger scaling factor they reference though is for fp16 - you can both underflow and overflow with FP16. Amp is inherently using FP16, they just don't expressly mention it in that paragraph. For BF16 and FP32 have the same dynamic range (only fp16 has the limited range issue) so no reason to scale/use ShardedGradeScaler for those dtypes. |
|
There's a good overview on fp16 and scaling here: |
[ghstack-poisoned]
Just realized I read the paper Mixed Precision Training long ago but forgot almost everything. Good to learn it again :) |
lessw2020
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm! Thanks for updating to check for fp16.
Let's see if Wanchao has any feedback as well.
As titled. Only enable the gradient scaler if mixed precision training is used with parameter dtype being fp16. [ghstack-poisoned]
wanchaol
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
|
|
||
| def build_grad_scaler(model): | ||
| # apply gradient scaling if mixed precision training is enabled with fp16 param dtype | ||
| if model.mixed_precision.param_dtype == torch.float16: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can probably later control the param_dtype as a TrainOption so that we don't rely this model be a FSDP instance. right now it looks good
|
Looks like you are using |
…logger-api Adding custom metrics logger API.
recomputing MoE during backward ``` [rank4]: (Triggered internally at /data/users/whc/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.) [rank4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [rank4]:Exception in thread Thread-2 (run_backward): [rank4]:Traceback (most recent call last): [rank4]: File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 384, in stage_backward [rank4]: torch.autograd.backward( [rank4]: File "/data/users/whc/pytorch/torch/autograd/__init__.py", line 364, in backward [rank4]: _engine_run_backward( [rank4]: File "/data/users/whc/pytorch/torch/autograd/graph.py", line 865, in _engine_run_backward [rank4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [rank4]:RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details) [rank4]:Exception raised from throw_nccl_error at /data/users/whc/pytorch/torch/csrc/cuda/nccl.cpp:259 (most recent call first): [rank4]:C++ CapturedTraceback: [rank4]:#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function< std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 [rank4]:#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 [rank4]:#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0 [rank4]:#7 torch::cuda::nccl::detail::throw_nccl_error(torch::cuda::nccl::ncclResult) from ??:0 [rank4]:#8 torch::cuda::nccl::detail::NCCL_CHECK_TIMEOUT(torch::cuda::nccl::ncclResult, void*) from nccl.cpp:0 [rank4]:#9 torch::cuda::nccl::all2all_single_unequal_split(void*, unsigned long const*, unsigned long const*, void*, unsigned long const*, unsigned long const*, unsigned long, c10::ScalarType, void* , c10::cuda::CUDAStream&) from ??:0 [rank4]:#10 c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from ? ?:0 [rank4]:#11 c10d::ops::(anonymous namespace)::alltoall_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from Ops.cpp:0 [rank4]:#12 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vec tor<long, std::allocator<long> >, bool, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c 10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> > , bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0 [rank4]:#13 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c 10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0 [rank4]:#14 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGrou p, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), void>::call(c10::Box edKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from :0 [rank4]:#15 c10d::ProcessGroup::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from :0 [rank4]:#16 c10d::all_to_all_single(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ? ?:0 [rank4]:#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, st d::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, s td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std:: allocator<c10::IValue> >*) from :0 [rank4]:#18 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c 10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0 [rank4]:#19 c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat or<char> >), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_stri ng<char, std::char_traits<char>, std::allocator<char> >) from :0 [rank4]:#20 std::vector<at::Tensor, std::allocator<at::Tensor> > torch::autograd::CppNode_apply_functional<(anonymous namespace)::AllToAllSingle>(std::vector<at::Tensor, std::allocator<at::Tensor> > &&, torch::autograd::AutogradContext&, std::vector<bool, std::allocator<bool> > const&, std::vector<torch::autograd::VariableInfo, std::allocator<torch::autograd::VariableInfo> > const&, std::__cxx1 1::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from Functional.cpp:0 [rank4]:#21 torch::autograd::CppNode<(anonymous namespace)::AllToAllSingle>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from Functional.cpp:0 [rank4]:#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0 [rank4]:#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueu e> const&) from ??:0 [rank4]:#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0 [rank4]:#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0 [rank4]:#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0 [rank4]:#27 std::error_code::default_error_condition() const from ??:0 [rank4]:#28 start_thread from ??:0 [rank4]:#29 __clone3 from :0 [rank4]: [rank4]: [rank4]:The above exception was the direct cause of the following exception: [rank4]: [rank4]:Traceback (most recent call last): [rank4]: File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner [rank4]: self.run() [rank4]: File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 953, in run [rank4]: self._target(*self._args, **self._kwargs) [rank4]: File "/data/users/whc/torchtitan/torchtitan/distributed/dual_pipe_v.py", line 254, in run_backward [rank4]: backward_stage.backward_one_chunk( [rank4]: File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 799, in backward_one_chunk [rank4]: grads_input, _ = self.backward_maybe_with_nosync( [rank4]: File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 653, in backward_maybe_with_nosync [rank4]: result = perform_backward(backward_type)() [rank4]: File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 607, in <lambda> [rank4]: stage_backward( [rank4]: File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 425, in stage_backward [rank4]: raise RuntimeError(exc_msg) from e [rank4]:RuntimeError: [rank4]: Failed to run stage backward: [rank4]: Stage output: ('Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)',) [rank4]: Output gradient: ('Tensor(torch.Size([1, 4096, 2048]), grad=False, dtype=torch.bfloat16)',) [rank4]: Input: ['Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)'] ```
recomputing MoE during backward ``` [rank4]: (Triggered internally at /data/users/whc/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.) [rank4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [rank4]:Exception in thread Thread-2 (run_backward): [rank4]:Traceback (most recent call last): [rank4]: File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 384, in stage_backward [rank4]: torch.autograd.backward( [rank4]: File "/data/users/whc/pytorch/torch/autograd/__init__.py", line 364, in backward [rank4]: _engine_run_backward( [rank4]: File "/data/users/whc/pytorch/torch/autograd/graph.py", line 865, in _engine_run_backward [rank4]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [rank4]:RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details) [rank4]:Exception raised from throw_nccl_error at /data/users/whc/pytorch/torch/csrc/cuda/nccl.cpp:259 (most recent call first): [rank4]:C++ CapturedTraceback: [rank4]:pytorch#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function< std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()pytorch#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 [rank4]:pytorch#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 [rank4]:pytorch#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0 [rank4]:pytorch#7 torch::cuda::nccl::detail::throw_nccl_error(torch::cuda::nccl::ncclResult) from ??:0 [rank4]:pytorch#8 torch::cuda::nccl::detail::NCCL_CHECK_TIMEOUT(torch::cuda::nccl::ncclResult, void*) from nccl.cpp:0 [rank4]:pytorch#9 torch::cuda::nccl::all2all_single_unequal_split(void*, unsigned long const*, unsigned long const*, void*, unsigned long const*, unsigned long const*, unsigned long, c10::ScalarType, void* , c10::cuda::CUDAStream&) from ??:0 [rank4]:pytorch#10 c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from ? ?:0 [rank4]:pytorch#11 c10d::ops::(anonymous namespace)::alltoall_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from Ops.cpp:0 [rank4]:pytorch#12 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vec tor<long, std::allocator<long> >, bool, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c 10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> > , bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0 [rank4]:pytorch#13 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c 10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0 [rank4]:pytorch#14 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGrou p, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), void>::call(c10::Box edKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from :0 [rank4]:pytorch#15 c10d::ProcessGroup::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from :0 [rank4]:pytorch#16 c10d::all_to_all_single(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ? ?:0 [rank4]:pytorch#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, st d::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, s td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std:: allocator<c10::IValue> >*) from :0 [rank4]:pytorch#18 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c 10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0 [rank4]:pytorch#19 c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat or<char> >), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_stri ng<char, std::char_traits<char>, std::allocator<char> >) from :0 [rank4]:pytorch#20 std::vector<at::Tensor, std::allocator<at::Tensor> > torch::autograd::CppNode_apply_functional<(anonymous namespace)::AllToAllSingle>(std::vector<at::Tensor, std::allocator<at::Tensor> > &&, torch::autograd::AutogradContext&, std::vector<bool, std::allocator<bool> > const&, std::vector<torch::autograd::VariableInfo, std::allocator<torch::autograd::VariableInfo> > const&, std::__cxx1 1::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from Functional.cpp:0 [rank4]:pytorch#21 torch::autograd::CppNode<(anonymous namespace)::AllToAllSingle>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from Functional.cpp:0 [rank4]:pytorch#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0 [rank4]:pytorch#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueu e> const&) from ??:0 [rank4]:pytorch#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0 [rank4]:pytorch#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0 [rank4]:pytorch#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0 [rank4]:pytorch#27 std::error_code::default_error_condition() const from ??:0 [rank4]:pytorch#28 start_thread from ??:0 [rank4]:pytorch#29 __clone3 from :0 [rank4]: [rank4]: [rank4]:The above exception was the direct cause of the following exception: [rank4]: [rank4]:Traceback (most recent call last): [rank4]: File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner [rank4]: self.run() [rank4]: File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 953, in run [rank4]: self._target(*self._args, **self._kwargs) [rank4]: File "/data/users/whc/torchtitan/torchtitan/distributed/dual_pipe_v.py", line 254, in run_backward [rank4]: backward_stage.backward_one_chunk( [rank4]: File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 799, in backward_one_chunk [rank4]: grads_input, _ = self.backward_maybe_with_nosync( [rank4]: File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 653, in backward_maybe_with_nosync [rank4]: result = perform_backward(backward_type)() [rank4]: File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 607, in <lambda> [rank4]: stage_backward( [rank4]: File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 425, in stage_backward [rank4]: raise RuntimeError(exc_msg) from e [rank4]:RuntimeError: [rank4]: Failed to run stage backward: [rank4]: Stage output: ('Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)',) [rank4]: Output gradient: ('Tensor(torch.Size([1, 4096, 2048]), grad=False, dtype=torch.bfloat16)',) [rank4]: Input: ['Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)'] ```
Stack from ghstack (oldest at bottom):
As titled. Only enable the gradient scaler if mixed precision training is used with parameter dtype being fp16.