build: Add Dockerfile that uses NGC pytorch image #897
Merged
Commits
42 commits
8494039
Optionally build RL with ngc torch container
chtruong814 29a9732
Do not install TE when using pytorch image
chtruong814 c5a07b3
Exclude transformer-engine-cu-12 when using pytorch container
chtruong814 9ff3a6b
Do not install custom python in ngc torch container
chtruong814 ced97c4
Add no-install-pytorch-deps
chtruong814 fc49eb3
Use system executable
chtruong814 d711bc3
Update mcore ref to use 0.13.0 fork with sahil cherry-picks
chtruong814 51fe21c
Revert original Dockerfile
chtruong814 bbee575
Revert "Revert original Dockerfile"
chtruong814 32e09ff
Build vllm with uv
chtruong814 b452cc0
Fix vllm build in uv
chtruong814 9dba4a7
Fix vllm output directory
chtruong814 ddf4d3c
Revert docker container
chtruong814 0c73b13
Add ngc pytorch Dockerfile
chtruong814 26b42a0
Use system executable given NEMO_RL_PY_EXECUTABLES_SYSTEM
chtruong814 45c9f9a
Ensure uv is installed
chtruong814 a15ac1c
Fix vllm build
chtruong814 bed0c3d
Fix vllm install
chtruong814 534e426
Fix no install env var
chtruong814 f0ad336
Fix uv install
chtruong814 d5517cf
Ensure numpy is not upgraded during install
chtruong814 150b17f
Fix ngc override location
chtruong814 653a39b
Do not use uv to build vllm
chtruong814 b17c3c6
Fix vllm build
chtruong814 aed522f
Build vllm with pip
chtruong814 4a2eec4
Ensure numpy is not upgraded during vllm install
chtruong814 2b05167
Do not install triton
chtruong814 3a1ddd0
Attempt to override numpy and vllm install
chtruong814 54f69b5
Install vllm with no-deps initially
chtruong814 1006b12
Use uv to install vllm with no-deps
chtruong814 0616be6
Prevent uv sync from removing vllm
chtruong814 92df91f
Add UV_CACHE_DIR
chtruong814 1c8bfb5
Remove ngc overrides file
chtruong814 eef6b59
Remove ngc overrides file from .gitignore
chtruong814 8995d0c
Remove prefetch in original Dockerfile
chtruong814 8675d2f
Fix lint error in ray actor registry
chtruong814 ea9803c
Revert "Update mcore ref to use 0.13.0 fork with sahil cherry-picks"
chtruong814 4e07ced
Update Dockerfile based on feedback
chtruong814 0096d65
Remove unused file from .gitignore
chtruong814 b734d24
Revert Dockerfile
chtruong814 aba3b85
Merge remote-tracking branch 'origin/main' into chtruong/build-ngc-torch
chtruong814 bfe5610
Add comment around installing new dependencies
chtruong814
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,128 @@ | ||
| # This Dockerfile is used to build a Docker image for NeMo RL with the NGC PyTorch base image. | ||
| # However, it is still a work in progress and is not yet ready for production use. | ||
| # | ||
| # Usage: | ||
| # Self-contained build (default: builds from main): docker buildx build -f docker/Dockerfile.ngc_pytorch --tag <registry>/nemo-rl:latest --push . | ||
| # Self-contained build (specific git ref): docker buildx build -f docker/Dockerfile.ngc_pytorch --build-arg NRL_GIT_REF=r0.3.0 --tag <registry>/nemo-rl:r0.3.0 --push . | ||
| # Self-contained build (remote NeMo RL source; no need for a local clone of NeMo RL): docker buildx build -f docker/Dockerfile.ngc_pytorch --build-arg NRL_GIT_REF=r0.3.0 --tag <registry>/nemo-rl:r0.3.0 --push https://github.com/NVIDIA-NeMo/RL.git | ||
| # Local NeMo RL source override: docker buildx build --build-context nemo-rl=. -f docker/Dockerfile.ngc_pytorch --tag <registry>/nemo-rl:latest --push . | ||
| # | ||
| # If installing new dependencies in the container, then use "uv pip install new-dependency" | ||
| ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:25.06-py3 | ||
| | ||
| FROM scratch AS nemo-rl | ||
| ARG NRL_GIT_REF=main | ||
| ADD --keep-git-dir=true https://github.com/NVIDIA-NeMo/RL.git#${NRL_GIT_REF} / | ||
| | ||
| FROM ${BASE_IMAGE} AS base | ||
| | ||
| # It is more convenient for users to run as root | ||
| USER root | ||
| | ||
| RUN <<"EOF" bash -exu -o pipefail | ||
| export DEBIAN_FRONTEND=noninteractive | ||
| export TZ=America/Los_Angeles | ||
| | ||
| apt-get update | ||
| apt-get install -y --no-install-recommends \ | ||
| jq \ | ||
| curl \ | ||
| git \ | ||
| rsync \ | ||
| wget \ | ||
| less \ | ||
| vim | ||
| | ||
| | ||
| apt-get clean | ||
| rm -rf /var/lib/apt/lists/* | ||
| EOF | ||
| | ||
| # Install uv at /usr/local/bin in case the root home directory is bind mounted | ||
| ARG UV_VERSION=0.7.2 | ||
| RUN curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | XDG_BIN_HOME=/usr/local/bin sh | ||
| | ||
| # Disable usage stats by default for users who are sensitive to sharing usage. | ||
| # Users are encouraged to enable if they wish. | ||
| ENV RAY_USAGE_STATS_ENABLED=0 | ||
| ENV NEMO_RL_VENV_DIR=/opt/ray_venvs | ||
| | ||
| # Build vLLM from source to use with the NVIDIA PyTorch base image | ||
| FROM base AS build_vllm | ||
| | ||
| ARG MAX_JOBS=32 | ||
| WORKDIR /opt | ||
| COPY --from=nemo-rl uv.lock /tmp/uv.lock | ||
| | ||
| RUN <<"EOF" bash -exu | ||
| echo "Building vLLM from source for PyTorch base image" | ||
| VLLM_VERSION=$(grep -A 1 'name = "vllm"' /tmp/uv.lock | grep 'version =' | sed 's/version = "\(.*\)"/\1/') && \ | ||
| echo "Building vLLM version: $VLLM_VERSION" | ||
| git clone https://github.com/vllm-project/vllm.git | ||
| cd vllm | ||
| git checkout v$VLLM_VERSION | ||
| python use_existing_torch.py | ||
| pip install -r requirements/build.txt | ||
| pip wheel --no-deps --no-build-isolation -v . | ||
| EOF | ||
| | ||
| FROM base AS hermetic | ||
| | ||
| WORKDIR /opt/nemo-rl | ||
| | ||
| # Variables to control the build of TE. If there are issues with parallelization, consider | ||
| # setting these to 1. | ||
| ARG MAX_JOBS | ||
| ARG NVTE_BUILD_THREADS_PER_JOB | ||
| | ||
| ENV UV_PROJECT_ENVIRONMENT=/opt/nemo_rl_venv | ||
| ENV UV_CACHE_DIR=/opt/uv_cache | ||
| ENV UV_LINK_MODE=copy | ||
| | ||
| # Define the no-install-package arguments for PyTorch base images | ||
| ARG BASE_IMAGE | ||
| ARG UV_NO_INSTALL_PACKAGES="--no-install-package torch --no-install-package torchvision --no-install-package triton --no-install-package nvidia-cublas-cu12 --no-install-package nvidia-cuda-cupti-cu12 --no-install-package nvidia-cuda-nvrtc-cu12 --no-install-package nvidia-cuda-runtime-cu12 --no-install-package nvidia-cudnn-cu12 --no-install-package nvidia-cufft-cu12 --no-install-package nvidia-cufile-cu12 --no-install-package nvidia-curand-cu12 --no-install-package nvidia-cusolver-cu12 --no-install-package nvidia-cusparse-cu12 --no-install-package nvidia-cusparselt-cu12 --no-install-package nvidia-nccl-cu12 --no-install-package vllm --no-install-package flash-attn --no-install-package transformer-engine --no-install-package transformer-engine-cu12 --no-install-package transformer-engine-torch --no-install-package numpy" | ||
| ENV UV_NO_INSTALL_PACKAGES=${UV_NO_INSTALL_PACKAGES} | ||
| ENV PATH="/opt/nemo_rl_venv/bin:$PATH" | ||
| | ||
| # First copy only the dependency files | ||
| COPY --from=nemo-rl pyproject.toml uv.lock ./ | ||
| COPY --from=nemo-rl --link 3rdparty/ ./3rdparty/ | ||
| | ||
| | ||
| RUN --mount=type=bind,from=build_vllm,source=/opt/,target=/tmp/build_vllm/ <<"EOF" bash -exu | ||
| | ||
| # uv sync has a more reliable resolver than a plain uv pip install, which can fail | ||
| # The venv is symlinked to avoid bloating the layer size | ||
| uv venv --system-site-packages ${UV_PROJECT_ENVIRONMENT} | ||
| uv pip install --no-cache-dir --no-deps /tmp/build_vllm/vllm/vllm*.whl | ||
| uv sync --link-mode symlink --locked --inexact --extra vllm --extra mcore --extra automodel --all-groups --no-install-project $UV_NO_INSTALL_PACKAGES | ||
| EOF | ||
| | ||
| ENV NEMO_RL_VENV_DIR=/opt/ray_venvs | ||
| | ||
| WORKDIR /opt/nemo-rl | ||
| | ||
| FROM hermetic AS release | ||
| | ||
| ARG NEMO_RL_COMMIT | ||
| ARG NVIDIA_BUILD_ID | ||
| ARG NVIDIA_BUILD_REF | ||
| ENV UV_NO_SYNC=1 | ||
| ENV NEMO_RL_COMMIT=${NEMO_RL_COMMIT:-<unknown>} | ||
| ENV NVIDIA_BUILD_ID=${NVIDIA_BUILD_ID:-<unknown>} | ||
| ENV NVIDIA_BUILD_REF=${NVIDIA_BUILD_REF:-<unknown>} | ||
| ENV NEMO_RL_PY_EXECUTABLES_SYSTEM=1 | ||
| # The 25.06 PyTorch container is not compatible with vLLM standalone compile, so we disable it | ||
| ENV VLLM_USE_STANDALONE_COMPILE=0 | ||
| | ||
| LABEL com.nvidia.build.id="${NVIDIA_BUILD_ID}" | ||
| LABEL com.nvidia.build.ref="${NVIDIA_BUILD_REF}" | ||
| | ||
| ENV NEMO_RL_VENV_DIR=/opt/ray_venvs | ||
| | ||
| # Copy in source from build context (defaults to cloned repo, can be overridden) | ||
| COPY --from=nemo-rl . /opt/nemo-rl | ||
| # Unshallow the repo to get the full history (in the case it was from the scratch layer). | ||
| # Potentially not necessary if the repo is passed in as a complete repository (w/ full git history), | ||
| # so do a quick check before trying to unshallow. | ||
| RUN git rev-parse --is-shallow-repository | grep -q true && git fetch --unshallow || true | ||
| RUN UV_LINK_MODE=symlink uv sync --locked --inexact $UV_NO_INSTALL_PACKAGES | ||
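
Note on the `build_vllm` stage: the vLLM tag to check out is derived from `uv.lock` with a `grep`/`sed` pipeline. The sketch below runs that same pipeline against a small lockfile fragment so its behavior can be checked outside a Docker build. The fragment, the package name `vllm-extra-example`, both version numbers, and the helper name `extract_vllm_version` are made up for illustration; only the pipeline itself is taken from the Dockerfile.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical uv.lock fragment; package names and versions are illustrative only.
lockfile="$(mktemp)"
cat > "$lockfile" <<'LOCK'
[[package]]
name = "vllm-extra-example"
version = "9.9.9"

[[package]]
name = "vllm"
version = "0.10.0"
source = { registry = "https://pypi.org/simple" }
LOCK

# Same pipeline as the Dockerfile: match the exact name = "vllm" line,
# keep the version line that follows it, and strip the key and quotes.
extract_vllm_version() {
  grep -A 1 'name = "vllm"' "$1" | grep 'version =' | sed 's/version = "\(.*\)"/\1/'
}

VLLM_VERSION="$(extract_vllm_version "$lockfile")"
echo "Would check out tag: v${VLLM_VERSION}"
rm -f "$lockfile"
```

The closing quote in the pattern keeps similarly prefixed package names from matching. The pipeline still assumes the `version` line directly follows the `name` line, and that no other line containing `name = "vllm"` is followed by a `version =` line, so it may be worth asserting that exactly one non-empty version comes back before running `git checkout`.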