Skip to content

Conversation

@lessw2020
Copy link
Contributor

this PR adds a working meta init option, via the file meta_init.py. This is to allow extremely large models to init without overflowing cpu memory.

1 - Meta init is controlled via the cmd line arg, --meta_init
2 - If true, then a context wrapper is applied during model instantiation such that params are routed to meta device.
2a - meta_init status is conveyed along with model config to allow for parallelization to proceed as appropriate (i.e. to('cuda') or not.

3 - during FSDP init, the param_init_fn is fired which calls back into the meta init context and weights are then materialized onto each cuda device.
4 - Importantly, weights are designated as randn to enforce a uniform distribution to allow training.

If meta_init is not used, then init is done as normally.

Tested with and without meta_init.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 27, 2024
@lessw2020 lessw2020 requested a review from wanchaol January 28, 2024 01:40
Copy link
Collaborator

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see inline comments

# train loop
model.train()

# use fsdp
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove this?

model = model_cls.from_model_args(model_config)
# meta initialization
_use_meta_init = args.meta_init # todo - add to toml
model_config.use_meta_init = _use_meta_init # append this to model config
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why we save the meta init flag to the model config? I think it's sth orthogonal to the model arch?

max_batch_size: int = 32
max_seq_len: int = 32768

use_meta_init: Optional[bool] = False # controlled via global settings
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is more like a TrainOptions instead of a model specific config, let's not include it here.

@torch.no_grad()
def meta_to_real_init_fn(module: nn.Module):
for submodule in module.modules():
for param_name, param in submodule.named_parameters(recurse=False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a TODO here and link the issue. I think doing random init might not be good enough fore pretraining and we should resolve the layer depth init functions.



@contextmanager
def meta_model_init():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I don't think we need this, a simple init like this should work:

with torch.device("meta"):
    model = Model.from_args(...)



@torch.no_grad()
def meta_to_real_init_fn(module: nn.Module):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: so we don't define reset_parameters on the nn.Module and instead we are using this function, is it because the reset_parameters does not work with FSDP?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reset_parameters() should work (now) with FSDP as long as reset_parameters() only initializes the module's directly owned parameters/buffers and not any of children.

"--compile", action="store_true", help="Whether to compile the model."
)

parser.add_argument(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm wondering if we should even make this optional. It might be cleaner if we just always do meta init. then we apply various parallelisms, then we materialize. thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a great idea imo...not just for being cleaner but b/c meta-init has been a bit of a second class citizen last year, yet we see lots of partners struggling with larger model training due to OOM on loading and unclear how to leverage meta init.
Thus, always using meta init here also would ensure full support (i.e. any new feature that breaks with it) and provide an always working, proper code example.

@wconstab
Copy link
Contributor

wconstab commented Feb 2, 2024

could you try rebasing before you merge? you'll pick up the linter CI that way and itll force you to lint your new files. Hopefully not too many conflicts to resolve since most of the hairy linter changes were in model.py, but apologies in advance

@lessw2020
Copy link
Contributor Author

could you try rebasing before you merge? you'll pick up the linter CI that way and itll force you to lint your new files. Hopefully not too many conflicts to resolve since most of the hairy linter changes were in model.py, but apologies in advance

Yes no problem. I didn't merge this yet b/c reset parameters should be working now so I wanted to update this PR to use that, but will also synch up with the linter etc.

@lessw2020
Copy link
Contributor Author

moved to new PR to avoid too many merge conflicts.

@lessw2020 lessw2020 closed this Mar 6, 2024
wconstab added a commit that referenced this pull request Nov 21, 2025
recomputing MoE during backward

```
[rank4]: (Triggered internally at /data/users/whc/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
[rank4]:  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:Exception in thread Thread-2 (run_backward):
[rank4]:Traceback (most recent call last):
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 384, in stage_backward
[rank4]:    torch.autograd.backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/__init__.py", line 364, in backward
[rank4]:    _engine_run_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/graph.py", line 865, in _engine_run_backward
[rank4]:    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank4]:Exception raised from throw_nccl_error at /data/users/whc/pytorch/torch/csrc/cuda/nccl.cpp:259 (most recent call first):
[rank4]:C++ CapturedTraceback:
[rank4]:#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank4]:#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
[rank4]:#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
[rank4]:#7 torch::cuda::nccl::detail::throw_nccl_error(torch::cuda::nccl::ncclResult) from ??:0
[rank4]:#8 torch::cuda::nccl::detail::NCCL_CHECK_TIMEOUT(torch::cuda::nccl::ncclResult, void*) from nccl.cpp:0
[rank4]:#9 torch::cuda::nccl::all2all_single_unequal_split(void*, unsigned long const*, unsigned long const*, void*, unsigned long const*, unsigned long const*, unsigned long, c10::ScalarType, void*
, c10::cuda::CUDAStream&) from ??:0
[rank4]:#10 c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from ?
?:0
[rank4]:#11 c10d::ops::(anonymous namespace)::alltoall_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
> const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from Ops.cpp:0
[rank4]:#12 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >
 (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vec
tor<long, std::allocator<long> >, bool, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c
10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >
, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank4]:#13 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:#14 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGrou
p, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), void>::call(c10::Box
edKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
 > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from :0
[rank4]:#15 c10d::ProcessGroup::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from :0
[rank4]:#16 c10d::all_to_all_single(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ?
?:0
[rank4]:#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, st
d::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, s
td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::
allocator<c10::IValue> >*) from :0
[rank4]:#18 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:#19 c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat
or<char> >), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_stri
ng<char, std::char_traits<char>, std::allocator<char> >) from :0
[rank4]:#20 std::vector<at::Tensor, std::allocator<at::Tensor> > torch::autograd::CppNode_apply_functional<(anonymous namespace)::AllToAllSingle>(std::vector<at::Tensor, std::allocator<at::Tensor> >
&&, torch::autograd::AutogradContext&, std::vector<bool, std::allocator<bool> > const&, std::vector<torch::autograd::VariableInfo, std::allocator<torch::autograd::VariableInfo> > const&, std::__cxx1
1::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from Functional.cpp:0
[rank4]:#21 torch::autograd::CppNode<(anonymous namespace)::AllToAllSingle>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from Functional.cpp:0
[rank4]:#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
[rank4]:#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueu
e> const&) from ??:0
[rank4]:#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
[rank4]:#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
[rank4]:#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
[rank4]:#27 std::error_code::default_error_condition() const from ??:0
[rank4]:#28 start_thread from ??:0
[rank4]:#29 __clone3 from :0
[rank4]:
[rank4]:
[rank4]:The above exception was the direct cause of the following exception:
[rank4]:
[rank4]:Traceback (most recent call last):
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[rank4]:    self.run()
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 953, in run
[rank4]:    self._target(*self._args, **self._kwargs)
[rank4]:  File "/data/users/whc/torchtitan/torchtitan/distributed/dual_pipe_v.py", line 254, in run_backward
[rank4]:    backward_stage.backward_one_chunk(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 799, in backward_one_chunk
[rank4]:    grads_input, _ = self.backward_maybe_with_nosync(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 653, in backward_maybe_with_nosync
[rank4]:    result = perform_backward(backward_type)()
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 607, in <lambda>
[rank4]:    stage_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 425, in stage_backward
[rank4]:    raise RuntimeError(exc_msg) from e
[rank4]:RuntimeError:
[rank4]:        Failed to run stage backward:
[rank4]:        Stage output: ('Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)',)
[rank4]:        Output gradient: ('Tensor(torch.Size([1, 4096, 2048]), grad=False, dtype=torch.bfloat16)',)
[rank4]:        Input: ['Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)']
```
H-Huang pushed a commit to H-Huang/torchtitan that referenced this pull request Dec 2, 2025
recomputing MoE during backward

```
[rank4]: (Triggered internally at /data/users/whc/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
[rank4]:  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:Exception in thread Thread-2 (run_backward):
[rank4]:Traceback (most recent call last):
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 384, in stage_backward
[rank4]:    torch.autograd.backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/__init__.py", line 364, in backward
[rank4]:    _engine_run_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/graph.py", line 865, in _engine_run_backward
[rank4]:    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank4]:Exception raised from throw_nccl_error at /data/users/whc/pytorch/torch/csrc/cuda/nccl.cpp:259 (most recent call first):
[rank4]:C++ CapturedTraceback:
[rank4]:pytorch#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()pytorch#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank4]:pytorch#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
[rank4]:pytorch#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
[rank4]:pytorch#7 torch::cuda::nccl::detail::throw_nccl_error(torch::cuda::nccl::ncclResult) from ??:0
[rank4]:pytorch#8 torch::cuda::nccl::detail::NCCL_CHECK_TIMEOUT(torch::cuda::nccl::ncclResult, void*) from nccl.cpp:0
[rank4]:pytorch#9 torch::cuda::nccl::all2all_single_unequal_split(void*, unsigned long const*, unsigned long const*, void*, unsigned long const*, unsigned long const*, unsigned long, c10::ScalarType, void*
, c10::cuda::CUDAStream&) from ??:0
[rank4]:pytorch#10 c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from ?
?:0
[rank4]:pytorch#11 c10d::ops::(anonymous namespace)::alltoall_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
> const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from Ops.cpp:0
[rank4]:pytorch#12 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >
 (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vec
tor<long, std::allocator<long> >, bool, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c
10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >
, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank4]:pytorch#13 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:pytorch#14 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGrou
p, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), void>::call(c10::Box
edKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
 > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from :0
[rank4]:pytorch#15 c10d::ProcessGroup::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from :0
[rank4]:pytorch#16 c10d::all_to_all_single(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ?
?:0
[rank4]:pytorch#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, st
d::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, s
td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::
allocator<c10::IValue> >*) from :0
[rank4]:pytorch#18 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:pytorch#19 c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat
or<char> >), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_stri
ng<char, std::char_traits<char>, std::allocator<char> >) from :0
[rank4]:pytorch#20 std::vector<at::Tensor, std::allocator<at::Tensor> > torch::autograd::CppNode_apply_functional<(anonymous namespace)::AllToAllSingle>(std::vector<at::Tensor, std::allocator<at::Tensor> >
&&, torch::autograd::AutogradContext&, std::vector<bool, std::allocator<bool> > const&, std::vector<torch::autograd::VariableInfo, std::allocator<torch::autograd::VariableInfo> > const&, std::__cxx1
1::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from Functional.cpp:0
[rank4]:pytorch#21 torch::autograd::CppNode<(anonymous namespace)::AllToAllSingle>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from Functional.cpp:0
[rank4]:pytorch#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
[rank4]:pytorch#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueu
e> const&) from ??:0
[rank4]:pytorch#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
[rank4]:pytorch#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
[rank4]:pytorch#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
[rank4]:pytorch#27 std::error_code::default_error_condition() const from ??:0
[rank4]:pytorch#28 start_thread from ??:0
[rank4]:pytorch#29 __clone3 from :0
[rank4]:
[rank4]:
[rank4]:The above exception was the direct cause of the following exception:
[rank4]:
[rank4]:Traceback (most recent call last):
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[rank4]:    self.run()
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 953, in run
[rank4]:    self._target(*self._args, **self._kwargs)
[rank4]:  File "/data/users/whc/torchtitan/torchtitan/distributed/dual_pipe_v.py", line 254, in run_backward
[rank4]:    backward_stage.backward_one_chunk(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 799, in backward_one_chunk
[rank4]:    grads_input, _ = self.backward_maybe_with_nosync(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 653, in backward_maybe_with_nosync
[rank4]:    result = perform_backward(backward_type)()
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 607, in <lambda>
[rank4]:    stage_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 425, in stage_backward
[rank4]:    raise RuntimeError(exc_msg) from e
[rank4]:RuntimeError:
[rank4]:        Failed to run stage backward:
[rank4]:        Stage output: ('Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)',)
[rank4]:        Output gradient: ('Tensor(torch.Size([1, 4096, 2048]), grad=False, dtype=torch.bfloat16)',)
[rank4]:        Input: ['Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)']
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants