[torchbench] hf_GPT2 (large, too) fails to run on bfloat16 dtype. #6521

Closed
ysiraichi opened this issue Feb 12, 2024 · 1 comment · Fixed by #6745
ysiraichi (Collaborator) commented Feb 12, 2024

🐛 Bug

After converting the hf_GPT2 model (and its large variant) to bfloat16 and running it with the command below, it fails with the following error:

python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda \
    --xla PJRT --dynamo None --test eval \
    --no-resume --print-subprocess \
    -k hf_GPT2
2024-02-10 02:55:58.205631: F ./torch_xla/csrc/runtime/debug_macros.h:20] Non-OK-status: status.status() status: INTERNAL: during context [Unknown]: Seen floating point types of different precisions in %add.281 = f32[16384,768]{1,0} add(f32[16384,768]{1,0} %dot.276, bf16[16384,768]{1,0} %broadcast.280), but mixed precision is disallowed.
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        std::unique_ptr<xla::PjRtLoadedExecutable, std::default_delete<xla::PjRtLoadedExecutable> > ConsumeValue<std::unique_ptr<xla::PjRtLoadedExecutable, std::default_delete<xla::PjRtLoadedExecutable> > >(absl::lts_20230802::StatusOr<std::unique_ptr<xla::PjRtLoadedExecutable, std::default_delete<xla::PjRtLoadedExecutable> > >&&)
        torch_xla::runtime::PjRtComputationClient::Compile(std::vector<torch_xla::runtime::ComputationClient::CompileInstance, std::allocator<torch_xla::runtime::ComputationClient::CompileInstance> >)
        torch_xla::XLAGraphExecutor::Compile(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > > const&, absl::lts_20230802::Span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, torch::lazy::LazyGraphExecutor::SyncTensorCollection const&, torch::lazy::LazyGraphExecutor::PostOrderData*, std::vector<torch::lazy::Value, std::allocator<torch::lazy::Value> > const&)
        torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20230802::Span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, torch::lazy::LazyGraphExecutor::SyncTensorsConfig const&, bool)
        torch_xla::XLAGraphExecutor::SyncTensorsGraph(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20230802::Span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, bool, bool, bool)
        torch_xla::XLAGraphExecutor::SyncLiveTensorsGraph(torch::lazy::BackendDevice const*, c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, bool)




        _PyObject_MakeTpCall
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode



        PyRun_SimpleFileExFlags
        Py_RunMain
        Py_BytesMain
        __libc_start_main
        _start
*** End stack trace ***
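For context, the failing HLO instruction is the bias add of a linear (dot + bias) layer: an f32[16384,768] matmul result (768 is GPT-2's hidden size; the 16384 rows are plausibly a batch of 16 sequences of 1024 tokens) added to a broadcast bf16 bias, with no convert in between. A hypothetical eager-mode sketch of the same computation, where PyTorch's type promotion would insert the cast that this HLO is missing:

import torch

# Shapes taken from the HLO above; the batch/sequence split is an assumption.
x = torch.randn(16384, 768)                 # f32 activations
w = torch.randn(768, 768)                   # f32 weight
b = torch.randn(768, dtype=torch.bfloat16)  # bf16 bias

# In eager mode, type promotion casts `b` up to f32 before the add; the HLO in
# the error performs the equivalent add(f32, bf16) without that convert, which
# XLA rejects as mixed precision.
y = x @ w + b
print(y.dtype)  # torch.float32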

To Reproduce
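Run the command above. Alternatively, a standalone sketch along the following lines should exercise the same path (hypothetical: the exact model setup, checkpoint, and input shape used by torchbench's hf_GPT2 may differ):

import torch
import torch_xla.core.xla_model as xm
from transformers import GPT2LMHeadModel

device = xm.xla_device()
model = GPT2LMHeadModel.from_pretrained("gpt2").to(torch.bfloat16).to(device).eval()
input_ids = torch.randint(0, 50257, (16, 1024), device=device)  # GPT-2 vocab size

with torch.no_grad():
    model(input_ids)

# Lazy tensors compile here; the PjRtComputationClient::Compile() frame in the
# stack trace above is where the mixed-precision error surfaces.
xm.mark_step()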

Affected Configurations

  • Non-Dynamo Inference
  • Dynamo Inference
  • Dynamo Training

Environment

  • Reproducible on XLA backend [CPU/TPU]: CUDA
  • torch_xla version: 408b376

cc @miladm

ysiraichi (Collaborator, Author) commented:

Both hf_GPT2 and hf_GPT2_large seem to be failing with this same error in the following cases (runner invocations sketched below):

  • Non-dynamo: inference
  • Dynamo: inference + training
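For reference, the non-dynamo inference case is the command in the description; the dynamo cases presumably correspond to invocations like the following (the openxla backend name is an assumption based on the runner's usual options):

python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda \
    --xla PJRT --dynamo openxla --test eval \
    --no-resume --print-subprocess \
    -k hf_GPT2

python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda \
    --xla PJRT --dynamo openxla --test train \
    --no-resume --print-subprocess \
    -k hf_GPT2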

On Feb 29, 2024, ysiraichi changed the title from "[benchmarks] hf_GPT2 (large, too) fails to run on bfloat16 dtype." to "[torchbench] hf_GPT2 (large, too) fails to run on bfloat16 dtype."