
How to compile torch-xla from source? #8847

Open · south-ocean opened this issue Mar 18, 2025 · 11 comments
Labels: build (Build process related matters, e.g. build system), question

@south-ocean commented Mar 18, 2025

❓ Questions and Help

I have reviewed the relevant materials on torch-xla but have not found a clear guide on how to compile it from source. The instructions mentioned on this page are somewhat disorganized. Could you provide a detailed compilation process? I need to build from source to verify my modifications. Thanks.

I am currently using python setup.py develop to build from source, but I encounter the error below. The exact command is XLA_CUDA=1 python setup.py install, and I am using torch-xla v2.5.1.

[screenshot: build error output]
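For reference, the rough sequence I am following looks like this (an illustrative sketch only; the v2.5.1 tags and the XLA_CUDA=1 flag are my assumptions, not an official recipe):

```sh
# illustrative sketch: build PyTorch from source first, then torch-xla against it
git clone --recursive --branch v2.5.1 https://github.com/pytorch/pytorch
cd pytorch
git clone --branch v2.5.1 https://github.com/pytorch/xla
python setup.py develop             # build PyTorch from source
cd xla
XLA_CUDA=1 python setup.py develop  # build torch-xla with CUDA enabled
```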

ysiraichi added the build and question labels Mar 18, 2025
@ysiraichi (Collaborator)

I have a couple of questions:

  • What OS are you using?
  • Are you using PyTorch/XLA development images? If not, I recommend doing so.
  • Could you try setting LD_LIBRARY_PATH=/usr/local/lib? (See the snippet below.)
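A minimal sketch of what I mean, assuming the missing shared libraries live under /usr/local/lib:

```sh
# point the loader (and the build) at libraries installed under /usr/local/lib
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
XLA_CUDA=1 python setup.py develop
```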

@iwknow (Contributor) commented Mar 18, 2025

FYI: I followed the instructions at https://github.com/pytorch/xla/blob/master/CONTRIBUTING.md without any problem.

Based on the error message, I highly suspect that some PATH setting is wrong.
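Roughly, the container-based flow from that guide looks like this (a sketch only; <dev-image> is a placeholder for whatever development image tag the guide currently lists):

```sh
# launch a PyTorch/XLA development container and build inside it;
# <dev-image> is a placeholder -- use the image tag CONTRIBUTING.md lists
docker run -it --privileged --name pytorch-xla-dev <dev-image> bash
# then run the clone/build steps from the guide inside the container
```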

@south-ocean (Author)

My operating system is Ubuntu 20.04.1, and I am using the open-source Ubuntu 20.04 image instead of the PyTorch/XLA development image you mentioned. I have already tried export LD_LIBRARY_PATH=/usr/local/lib, but the error still persists.

@south-ocean (Author)

> FYI: I followed the instructions at https://github.com/pytorch/xla/blob/master/CONTRIBUTING.md without any problem.
>
> Based on the error message, I highly suspect that some PATH setting is wrong.

Yes, I have seen that guide, but I could not find an XLA_GPU setting in it. Does this also work for NVIDIA GPUs?

@iwknow (Contributor) commented Mar 19, 2025

> Yes, I have seen that guide, but I could not find an XLA_GPU setting in it. Does this also work for NVIDIA GPUs?

According to https://github.com/pytorch/xla/blob/r2.0/docs/pjrt.md, GPU is supported. Here is a possibly outdated GPU guide, just for your reference: https://github.com/pytorch/xla/blob/r2.0/docs/gpu.md
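As a rough sketch of what those docs describe (variable names are from the r2.0-era docs and may have changed since; older docs use PJRT_DEVICE=GPU rather than CUDA):

```sh
# build with CUDA support enabled
XLA_CUDA=1 python setup.py develop

# run a test on the GPU through the PJRT runtime
PJRT_DEVICE=CUDA GPU_NUM_DEVICES=1 python test/test_train_mp_imagenet.py --fake_data
```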

@south-ocean (Author) commented Mar 19, 2025

@iwknow After running ldconfig /usr/local/lib:/usr/lib to register the library paths, I got past that point, but the build now fails with:

torch_xla/csrc/runtime/ifrt_computation_client.h:197:31: error: ‘ToPrimitiveType’ is not a member of ‘xla::ifrt’
  197 |       xla::ifrt::ToPrimitiveType(buffer->dtype()).value(),

[screenshot: compiler error output]

@iwknow (Contributor) commented Mar 19, 2025

After a quick search: ToPrimitiveType is a function defined in OpenXLA: https://github.com/openxla/xla/blob/efb03d062482f50c67dae9b5b909fd8bd0f1ed04/xla/python/pjrt_ifrt/pjrt_dtype.cc#L26

I suspect you still have an issue in your environment setup, especially the dependencies. Building locally requires a correct environment, which is a fairly complicated thing to get right. I recommend using one of the Docker containers described here to set up your environment.

@south-ocean (Author)

> After a quick search: ToPrimitiveType is a function defined in OpenXLA: https://github.com/openxla/xla/blob/efb03d062482f50c67dae9b5b909fd8bd0f1ed04/xla/python/pjrt_ifrt/pjrt_dtype.cc#L26
>
> I suspect you still have an issue in your environment setup, especially the dependencies. Building locally requires a correct environment, which is a fairly complicated thing to get right. I recommend using one of the Docker containers described here to set up your environment.

I understand your point, but I have a question. TensorFlow and JAX compile XLA through Bazel, with build options controlled via environment variables, yet I haven't seen any equivalent compilation settings in the torch-xla documentation, only python3 setup.py develop. Additionally, XLA is pinned in the WORKSPACE file, so in theory the versions should match; otherwise, the files wouldn't have been fetched in the first place. It looks strange.

@ysiraichi (Collaborator)

Apparently we started importing pjrt_dtype.h (which has the declaration of that function) after the OpenXLA pin update in #8267. Before that, the declaration lived inside pjrt_array.h (which we were including in 2.5.1).

Are you using a different OpenXLA version? That is the only reason I can think of for why you are hitting this error. If so, a quick fix would be to add #include "xla/python/pjrt_ifrt/pjrt_dtype.h" to torch_xla/csrc/runtime/ifrt_computation_client.h. But a better solution would be to base your OpenXLA version on the commit specified in the WORKSPACE file.

If that's not what you are doing, could you try compiling the latest PyTorch/XLA?
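For the WORKSPACE route, checking the pin could look like this (the grep pattern is only illustrative):

```sh
# the archive URL in WORKSPACE embeds the pinned OpenXLA commit hash
grep -n "openxla/xla" WORKSPACE

# quick workaround instead: add this include near the top of
# torch_xla/csrc/runtime/ifrt_computation_client.h
#   #include "xla/python/pjrt_ifrt/pjrt_dtype.h"
```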

@south-ocean (Author)

> Apparently we started importing pjrt_dtype.h (which has the declaration of that function) after the OpenXLA pin update in #8267. Before that, the declaration lived inside pjrt_array.h (which we were including in 2.5.1).
>
> Are you using a different OpenXLA version? That is the only reason I can think of for why you are hitting this error. If so, a quick fix would be to add #include "xla/python/pjrt_ifrt/pjrt_dtype.h" to torch_xla/csrc/runtime/ifrt_computation_client.h. But a better solution would be to base your OpenXLA version on the commit specified in the WORKSPACE file.
>
> If that's not what you are doing, could you try compiling the latest PyTorch/XLA?

Thanks. I read the source code and fixed that error. Now when I run python3 test_train_mp_imagenet.py --fake_data, the script executes, but at the end of the run it crashes with malloc(): unsorted double linked list corrupted. Do you have any suggestions?

[screenshot: malloc error output]

@ysiraichi (Collaborator)

I have not seen an error like this before. I'd suggest opening a new issue describing exactly what your setup is and what you are trying to do, so that we can try to reproduce it on our end.
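When you file it, a native backtrace usually makes a glibc abort like this easier to triage; an illustrative session, not a required step:

```sh
# reproduce under gdb and capture a backtrace at the abort
gdb --args python3 test_train_mp_imagenet.py --fake_data
# at the (gdb) prompt:
#   run
#   bt
```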
