
TPU Trillium Base Docker Image cannot initialize #8371

Open
hsebik opened this issue Nov 12, 2024 · 6 comments

Comments


hsebik commented Nov 12, 2024

TPU initialization fails

When I start a v6e-4 TPU VM with the v2-alpha-tpuv6e base image, using a pip environment with the XLA updates, I can initialize the TPUs without problems. However, when I dockerize my pipeline, TPU initialization fails. I have tried many torch_xla base images, but none of them could initialize the TPUs. It hangs every time I get the device from torch_xla.core.xla_model.xla_device().

I have checked the base images below. I suspect the v2-alpha-tpuv6e configuration is crucial; is there a matching base Docker image?

us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_20241028

us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-tpu-diffusers:v4

To Reproduce

DevDockerfile

FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_20241028

# Set environment variables to avoid prompts during installation
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y \
    vim \
    curl \
    git \
    bash \
    wget \
    libopenblas-base \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir --pre torch==2.6.0.dev20241028+cpu torchvision==0.20.0.dev20241028+cpu --index-url https://download.pytorch.org/whl/nightly/cpu
RUN pip install "torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev20241028-cp310-cp310-linux_x86_64.whl" -f https://storage.googleapis.com/libtpu-releases/index.html
RUN pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
COPY . .
CMD ["python3", "app.py"]

#app.py

# Quite simple to reproduce
import torch_xla.core.xla_model as xm

# Hangs here; the TPU is never initialized.
device = xm.xla_device()

Both files are in the same directory. Build the image with
docker build -f DevDockerfile -t tpu .
then run it with --privileged:
docker run -ti --rm -p 5000:5000 --privileged tpu

Expected behavior

xm.xla_device() returns a TPU device inside the container, just as it does on the host.

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: TPU
  • torch_xla version: torch_xla-2.6.0.dev20241028-cp310

hsebik commented Nov 12, 2024

Dear @JackCaoG,
This issue looks the same as one seen on earlier TPU versions (#3132). Is there any chance you could take a look?

@JackCaoG
Collaborator

@tengyifei do you know if Trillium requires a special libtpu? AFAIK recent release Docker images should work on Trillium as long as the base image is right.

@tengyifei
Collaborator

The appropriate libtpu should be baked into us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-tpu-diffusers:v4 and us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_20241028. I checked that pytorch-tpu-diffusers:v4 has a libtpu from 9/13, which has been used on v6e before.

Are there any error logs in particular? What if you add some logging env vars according to https://github.com/pytorch/xla/blob/master/docs/source/learn/troubleshoot.md#environment-variables?
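As an illustration, those variables can be exported on the host and forwarded into the container with `-e`. This is a minimal sketch: the variable names come from the troubleshooting guide linked above, and the `tpu` image tag is taken from the repro steps; exact output depends on the torch_xla build.

```shell
# Sketch: enable verbose torch_xla / runtime logging before launching.
export PT_XLA_DEBUG=1                 # python-level torch_xla debug messages
export TF_CPP_MIN_LOG_LEVEL=0         # show INFO-level C++ logs
export TF_CPP_VMODULE="tensorflow=5"  # per-module verbose logging

# Forward them into the container (image name assumed from the repro):
# docker run -ti --rm --privileged \
#   -e PT_XLA_DEBUG -e TF_CPP_MIN_LOG_LEVEL -e TF_CPP_VMODULE \
#   tpu python3 app.py
echo "PT_XLA_DEBUG=$PT_XLA_DEBUG TF_CPP_MIN_LOG_LEVEL=$TF_CPP_MIN_LOG_LEVEL"
```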


hsebik commented Nov 13, 2024

Thank you for your quick response. I am on a TPU v6e VM with 4 cores. When I try to start Docker, I get an error in the TPU initialization phase. I am building and running my Docker image inside the TPU VM itself. In the tpu-recipes benchmark code, the container is launched from outside the VM via ssh with --worker=all. The tpu-recipes version works, so why does my docker run version fail?

$ sudo docker run -ti --rm -p 5000:5000 --privileged -e XLA_IR_DEBUG=1 -e XLA_HLO_DEBUG=1 tpu python3
Python 3.10.15 (main, Oct 19 2024, 03:59:09) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch_xla.core.xla_model as xm
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
>>> device = xm.xla_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 167, in xla_device
    return runtime.xla_device(n, devkind)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 118, in xla_device
    return torch.device(torch_xla._XLAC._xla_get_default_device())
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Failed to connect to [::]:8353
>>> exit()

However, when I start without Docker, both my code and this example work well.

$ python3
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch_xla.core.xla_model as xm
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
>>> device = xm.xla_device()
>>> print(device)
xla:0
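Worth noting: the two sessions above run different interpreters (3.10.15 built with GCC 10.2.1 inside the container vs 3.10.12 built with GCC 11.4.0 on the host), so the environments are not identical. A minimal sketch for capturing the details in both places so they can be diffed; torch_xla may be absent where this runs, so its import is guarded:

```python
# Sketch: collect interpreter and torch_xla details so the in-container
# and on-host environments can be compared side by side.
import platform

def env_report():
    """Return a small dict describing the current Python environment."""
    report = {
        "python": platform.python_version(),
        "compiler": platform.python_compiler(),
    }
    try:
        import torch_xla
        report["torch_xla"] = torch_xla.__version__
    except ImportError:
        report["torch_xla"] = None
    return report

if __name__ == "__main__":
    for key, value in env_report().items():
        print(f"{key}: {value}")
```

Running this inside and outside the container and comparing the output would make any interpreter or library mismatch explicit.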

@zpcore
Collaborator

zpcore commented Nov 17, 2024

The libtpu versions used with and without Docker seem to be different. I saw the same issue caused by a broken libtpu in some October releases on v6e. Maybe nightly_3.10_tpuvm_20241028 is one of the broken versions? Can you try installing a different version in the Dockerfile, like:

RUN pip install https://storage.googleapis.com/libtpu-nightly-releases/wheels/libtpu-nightly/libtpu_nightly-0.1.dev20240913+nightly-py3-none-any.whl
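One way to confirm which libtpu actually ends up in each environment is to query the installed distributions. A minimal sketch; the distribution names listed are assumptions, so adjust them to whatever `pip list` shows in your environments:

```python
# Sketch: report installed versions of TPU-related distributions so the
# container and the host can be compared. The names below are guesses;
# replace them with the names `pip list` actually shows.
from importlib.metadata import PackageNotFoundError, version

def installed_versions(names=("libtpu", "libtpu-nightly", "torch_xla")):
    """Return {distribution: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If the container and the host report different libtpu versions, pinning the container to the known-good 9/13 wheel above would be the first thing to try.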

@miladm
Collaborator

miladm commented Nov 18, 2024

@tengyifei @zpcore it seems like we can improve our workflow to help users avoid this issue.
