
[torchbench] vision_maskrcnn failing on inference with dynamo after bfloat16 conversion. #6557

Closed
ysiraichi opened this issue Feb 16, 2024 · 2 comments · Fixed by #7113

@ysiraichi (Collaborator)

🐛 Bug

After converting the vision_maskrcnn model to bfloat16, running inference with dynamo raises the following error:

python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda \
    --xla PJRT --dynamo openxla --test eval \
    --no-resume --print-subprocess \
    -k vision_maskrcnn
F0000 00:00:1708093149.740541   20316 debug_macros.h:20] Non-OK-status: status.status() status: INVALID_ARGUMENT: Slice size at index 0 in gather op is out of range, must be within [0, 1), got 1.
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        xla::Shape const* ConsumeValue<xla::Shape const*>(absl::lts_20230802::StatusOr<xla::Shape const*>&&)
        torch_xla::ShapeHelper::ShapeOfXlaOp(xla::XlaOp)
        torch_xla::InferOutputShape(absl::lts_20230802::Span<xla::Shape const>, std::function<xla::XlaOp (absl::lts_20230802::Span<xla::XlaOp const>)> const&)


        torch_xla::XlaNode::GetOpShape(std::function<xla::Shape ()> const&) const
        torch_xla::XlaNode::XlaNode(torch::lazy::OpKind, c10::ArrayRef<torch::lazy::Value>, std::function<xla::Shape ()> const&, unsigned long, torch::lazy::hash_t)
        torch_xla::IndexGet::IndexGet(torch::lazy::Value const&, torch::lazy::Value const&, long)
        torch_xla::IndexByTensors(c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > const&, absl::lts_20230802::Span<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > const>, long)
        torch_xla::tensor_methods::index(c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > const&, absl::lts_20230802::Span<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > const>, long)
        torch_xla::XLANativeFunctions::index(at::Tensor const&, c10::List<std::optional<at::Tensor> > const&)


        at::_ops::index_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::List<std::optional<at::Tensor> > const&)


        at::_ops::index_Tensor::call(at::Tensor const&, c10::List<std::optional<at::Tensor> > const&)
        torch::autograd::THPVariable_getitem(_object*, _object*)
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall

        _PyEval_EvalFrameDefault

        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall

        PyVectorcall_Call
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall

        PyVectorcall_Call
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall
        _PyObject_FastCallDict
        _PyObject_Call_Prepend

        _PyObject_MakeTpCall
        _PyEval_EvalFrameDefault

        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall

        PyVectorcall_Call
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall

        PyVectorcall_Call
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall
        _PyObject_FastCallDict
        _PyObject_Call_Prepend

        PyObject_Call
        _PyEval_EvalFrameDefault

        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall

        PyVectorcall_Call
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_Vectorcall
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode



        PyRun_SimpleFileExFlags
        Py_RunMain
        Py_BytesMain
        __libc_start_main
        _start
*** End stack trace ***

*** Check failure stack trace: ***
    @     0x7f33961176a9  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7f338c1ffde9  ConsumeValue<>()
    @     0x7f338c1ffe3e  torch_xla::ShapeHelper::ShapeOfXlaOp()
    @     0x7f338c173e5c  torch_xla::InferOutputShape()
    @     0x7f338c16a18d  torch_xla::(anonymous namespace)::NodeOutputShape()
    @     0x7f338c16a213  std::_Function_handler<>::_M_invoke()
    @     0x7f338c1f3b96  torch_xla::XlaNode::GetOpShape()
    @     0x7f338c1f4469  torch_xla::XlaNode::XlaNode()
    @     0x7f338c16a38d  torch_xla::IndexGet::IndexGet()
    @     0x7f338c171fc8  torch_xla::IndexByTensors()
    @     0x7f338beea872  torch_xla::tensor_methods::index()
    @     0x7f338be29be4  torch_xla::XLANativeFunctions::index()
    @     0x7f338c0edc1f  c10::impl::make_boxed_from_unboxed_functor<>::call()
    @     0x7f348a089d9b  (anonymous namespace)::functionalizeFallback()
    @     0x7f348ade0e79  at::_ops::index_Tensor::redispatch()
    @     0x55bd4ff82f10  (unknown)
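
For reference, the benchmark invocation above boils down to roughly the following standalone script. This is a minimal sketch only: the model constructor, weights argument, and input shape are assumptions for illustration, not the runner's actual code.

import torch
import torchvision
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Assumed equivalent of `--test eval` with bfloat16 conversion and the
# openxla dynamo backend; not taken verbatim from experiment_runner.py.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model = model.to(device=device, dtype=torch.bfloat16).eval()
compiled = torch.compile(model, backend="openxla")

images = [torch.rand(3, 224, 224, device=device, dtype=torch.bfloat16)]
with torch.no_grad():
    compiled(images)  # the INVALID_ARGUMENT above surfaces while the XLA graph is built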

Environment

  • Reproducible on XLA backend [CPU/TPU]: CUDA
  • torch_xla version: 20692cb

cc @miladm @JackCaoG

@ysiraichi ysiraichi changed the title [benchmarks] vision_maskrcnn failing on inference with dynamo after bfloat16 conversion. [torchbench] vision_maskrcnn failing on inference with dynamo after bfloat16 conversion. Feb 29, 2024
@miladm (Collaborator) commented Mar 25, 2024

Looks like a gather op issue.
@alanwaketan, can you review this bug, please?

@JackCaoG (Collaborator)
I mean, the error itself is pretty clear:

Slice size at index 0 in gather op is out of range, must be within [0, 1), got 1.

It failed during the graph-building stage at the IndexGet op. It seems like it is trying to access index 1 while the array only has 1 element.
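
For what it's worth, the dispatch path in the trace (THPVariable_getitem → XLANativeFunctions::index → tensor_methods::index → IndexByTensors → IndexGet) is the one taken by plain tensor indexing, e.g. the score/box filtering Mask R-CNN does in post-processing. A generic illustration of that call path, not a confirmed reproduction of this failure:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Tensor (advanced) indexing on an XLA tensor goes through
# XLANativeFunctions::index -> tensor_methods::index -> IndexByTensors,
# which constructs the IndexGet node whose shape inference aborts above.
scores = torch.rand(10, device=device, dtype=torch.bfloat16)
keep = (scores > 0.5).nonzero().squeeze(1)
filtered = scores[keep]  # lowered to an XLA gather over dim 0
xm.mark_step()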
