[None][feat] Use a shell context to install dependancies #7383
Conversation
📝 Walkthrough

Replaces per-component Docker install steps with a centralized install.sh orchestrator invoked per component via flags.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Dev as Developer
    participant Docker as Docker build
    participant Devel as devel stage
    participant Install as install.sh
    participant Sub as per-component scripts
    Dev->>Docker: docker build (multi-stage)
    Docker->>Devel: build devel stage
    Devel->>Install: bash install.sh --base --cmake ... --protobuf
    Install->>Sub: invoke install_base.sh / install_cmake.sh / ...
    Sub-->>Install: completed
    Install-->>Devel: setup complete
    rect rgba(220,235,255,0.35)
        note over Docker,Devel: Triton integration (new stages)
        participant Triton as triton
        participant TritonDevel as tritondevel
        Docker->>Triton: FROM ${TRITON_IMAGE}:${TRITON_BASE_TAG}
        Triton-->>TritonDevel: copy UCX/etcd/etc.
        TritonDevel->>Install: run install_triton.sh / orchestrated steps
    end
```

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant SL as srun (build_sqsh)
    participant Container as BaseImage
    participant Install as docker/common/install.sh
    participant SQ as tensorrt_llm.devel.sqsh
    participant Run as srun (run_sqsh)
    User->>SL: make -C enroot build_sqsh
    SL->>Container: start $(BASE_IMAGE):$(BASE_TAG)
    Container->>Install: ./install.sh --all (env: TRT_VER, CUDA_VER, ...)
    Install-->>Container: components installed
    Container->>SQ: save SQSH
    User->>Run: make -C enroot run_sqsh
    Run->>SQ: launch enroot container
    SQ-->>User: interactive session (RUN_CMD)
```
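The flow in both diagrams reduces to one flag-dispatch loop inside install.sh. A minimal, self-contained sketch of that pattern (the `parse_flags` helper and the component names are illustrative only; the PR's script dispatches to real `install_*.sh` sub-scripts instead of echoing):

```shell
#!/bin/bash
# Minimal sketch of the flag-dispatch pattern described above.

selected=""

parse_flags() {
  selected=""
  while [ $# -gt 0 ]; do
    case $1 in
      --base)  selected="$selected base" ;;
      --cmake) selected="$selected cmake" ;;
      --all)   selected=" base cmake" ;;
      *) echo "Unknown option: $1" >&2; return 1 ;;
    esac
    shift
  done
}

# Each selected component maps to one sub-script invocation in the real script.
parse_flags --base --cmake
echo "selected:$selected"   # prints: selected: base cmake
```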
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 5
🧹 Nitpick comments (10)
docker/common/install.sh (6)
16-21: Gate noisy version echos behind selection.Printing empty TRT/CUDA/etc. values on every run is noisy. Only echo them when
--tensorrtis chosen.-echo "TRT_VER: $TRT_VER" -echo "CUDA_VER: $CUDA_VER" -echo "CUDNN_VER: $CUDNN_VER" -echo "NCCL_VER: $NCCL_VER" -echo "CUBLAS_VER: $CUBLAS_VER" +# Shown later only if --tensorrt is selected
36-96: Add `--help` and usage; fail fast on no flags.

Small UX improvement for shell usage on HPC jobs.

```diff
+usage() {
+  cat <<'EOF'
+Usage: install.sh [--base] [--cmake] [--ccache] [--cuda_toolkit] [--tensorrt]
+                  [--polygraphy] [--mpi4py] [--pytorch] [--opencv] [--protobuf] [--all]
+Env:
+  GITHUB_MIRROR       Propagated to installers (optional)
+  TORCH_INSTALL_TYPE  e.g., wheels|source|skip (default: wheels)
+EOF
+}
+
 while [[ $# -gt 0 ]]; do
   case $1 in
+    -h|--help)
+      usage; exit 0 ;;
     --base)
       base=1
       shift 1
       ;;
@@
     --all)
       base=1
       cmake=1
       ccache=1
       cuda_toolkit=1
       tensorrt=1
       polygraphy=1
       mpi4py=1
       pytorch=1
       opencv=1
       protobuf=1
       shift 1
       ;;
     *)
       echo "Unknown option: $1"
-      exit 1
+      usage; exit 1
       ;;
   esac
 done
+
+# Optional: require at least one selection
+if (( base + cmake + ccache + cuda_toolkit + tensorrt + polygraphy + mpi4py + pytorch + opencv + protobuf == 0 )); then
+  echo "No components selected." >&2
+  usage; exit 1
+fi
```
119-123: Same here: use SCRIPT_DIR.

```diff
-GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cuda_toolkit.sh
+GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cuda_toolkit.sh"
```
124-132: Add env echo after validation and use SCRIPT_DIR.

```diff
-    echo "Installing TensorRT..."
+    echo "Installing TensorRT..."
+    echo "TRT_VER=${TRT_VER} CUDA_VER=${CUDA_VER} CUDNN_VER=${CUDNN_VER} NCCL_VER=${NCCL_VER} CUBLAS_VER=${CUBLAS_VER}"
-    bash ./install_tensorrt.sh \
+    bash "$SCRIPT_DIR/install_tensorrt.sh" \
```
134-147: Mirror pass-through for all installers that support it; use SCRIPT_DIR and TORCH_INSTALL_TYPE consistently.

```diff
-    bash ./install_polygraphy.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_polygraphy.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_mpi4py.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_mpi4py.sh"
@@
-    bash ./install_pytorch.sh $TORCH_INSTALL_TYPE
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_pytorch.sh" "$TORCH_INSTALL_TYPE"
```
156-159: Consider constraints and reproducibility for protobuf.

An unpinned lower bound can drift. If you have a constraints.txt, route through it; otherwise pin via env for reproducibility.

```diff
-    pip3 install --upgrade --no-cache-dir \
-        "protobuf>=4.25.8"
+    pip3 install --upgrade --no-cache-dir ${PROTOBUF_PKG:-protobuf}${PROTOBUF_VERSION:+==${PROTOBUF_VERSION}} \
+        ${PIP_CONSTRAINT:+-c ${PIP_CONSTRAINT}}
```

docker/Dockerfile.multi (4)
42-42: Avoid duplicate PYTORCH_CUDA_ALLOC_CONF definitions.

You set it both here and inside install.sh; keep one. I'd prefer the Dockerfile (visibility at build/run time).

```diff
-ENV PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
+# Already exported by install.sh; or keep here and remove from install.sh for clarity.
```
47-52: Export version envs in a single RUN, or use `ENV` for layer-caching clarity.

The current pattern works; optionally set them as `ENV` and then invoke `--tensorrt` in a separate RUN to improve cache hits across rebuilds with stable versions.
59-67: Combine cleanup with the last install step to reduce layers; consider a wildcard.

Minor image-size optimization.

```diff
-RUN rm install.sh \
-    install_base.sh \
-    install_cmake.sh \
-    install_ccache.sh \
-    install_cuda_toolkit.sh \
-    install_tensorrt.sh \
-    install_polygraphy.sh \
-    install_mpi4py.sh \
-    install_pytorch.sh
+RUN bash ./install.sh --protobuf && \
+    rm install.sh install_*.sh
```
44-57: Option: use a single orchestrator call for build determinism.

`RUN bash ./install.sh --all` simplifies the layer graph, but you lose per-step caching. Keep as-is if caching is desired.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- docker/Dockerfile.multi (2 hunks)
- docker/common/install.sh (1 hunk)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-20T15:04:42.885Z
Learnt from: dbari
PR: NVIDIA/TensorRT-LLM#7095
File: docker/Dockerfile.multi:168-168
Timestamp: 2025-08-20T15:04:42.885Z
Learning: In docker/Dockerfile.multi, wildcard COPY for benchmarks (${CPP_BUILD_DIR}/benchmarks/*Benchmark) is intentionally used instead of directory copy because the benchmarks directory contains various other build artifacts during C++ builds, and only specific benchmark executables should be copied to the final image.
Applied to files:
docker/Dockerfile.multi
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docker/common/install.sh (1)
102-118: Call sub-scripts via SCRIPT_DIR and propagate the mirror consistently.

Use the resolved path everywhere; mirror propagation is already consistent here.
docker/Dockerfile.multi (1)
9-17: Copying the orchestrator and installers looks good.

Paths align with RUN usage; no exec bit is needed since they are invoked via bash.
Actionable comments posted: 1
♻️ Duplicate comments (3)
docker/common/install.sh (3)
138-143: Make OpenCV replacement idempotent and Python-version agnostic; prefer `python -m pip`.

Covers both opencv-python and opencv-python-headless, removes cv2 from the actual site-packages, and supports pinning via env.

```diff
-if [ $opencv -eq 1 ]; then
+if [ $opencv -eq 1 ]; then
     echo "Installing OpenCV..."
-    pip3 uninstall -y opencv && \
-    rm -rf /usr/local/lib/python3*/dist-packages/cv2/ && \
-    pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
+    python3 -m pip uninstall -y opencv opencv-python opencv-python-headless || true
+    python3 - <<'PY'
+import site, shutil, glob, pathlib
+paths = [p for p in (site.getsitepackages() + [site.getusersitepackages()]) if p]
+for p in paths:
+    for d in glob.glob(str(pathlib.Path(p)/"cv2*")):
+        shutil.rmtree(d, ignore_errors=True)
+PY
+    python3 -m pip install "${OPENCV_PKG:-opencv-python-headless}${OPENCV_VERSION:+==${OPENCV_VERSION}}" \
+        --force-reinstall --no-deps --no-cache-dir
 fi
```
1-1: Enable strict mode and resolve script-relative paths; call sub-scripts via absolute paths.

Prevents masked failures and CWD-dependent breakage (common with enroot/pyxis). Replace all `./install_*.sh` invocations with paths under the script's directory.

```diff
-#!/bin/bash
+#!/bin/bash
+set -Eeuo pipefail
+shopt -s nullglob
+trap 'echo "[install.sh] Error on line $LINENO" >&2' ERR
+
+# Resolve script directory for robust relative pathing
+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_base.sh $PYTHON_VERSION
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "$PYTHON_VERSION"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cmake.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cmake.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_ccache.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_ccache.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cuda_toolkit.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cuda_toolkit.sh"
@@
-    bash ./install_tensorrt.sh \
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_tensorrt.sh" \
         --TRT_VER=${TRT_VER} \
         --CUDA_VER=${CUDA_VER} \
         --CUDNN_VER=${CUDNN_VER} \
         --NCCL_VER=${NCCL_VER} \
         --CUBLAS_VER=${CUBLAS_VER}
@@
-    bash ./install_polygraphy.sh
+    bash "$SCRIPT_DIR/install_polygraphy.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_mpi4py.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_mpi4py.sh"
@@
-    bash ./install_pytorch.sh $TORCH_INSTALL_TYPE
+    bash "$SCRIPT_DIR/install_pytorch.sh" "$TORCH_INSTALL_TYPE"
```

Also applies to: 95-96, 100-101, 105-106, 110-111, 115-121, 125-126, 130-131, 135-136
83-86: Don't clobber environment-provided defaults; choose sane fallbacks.

Respect upstream-provided GITHUB_MIRROR and TORCH_INSTALL_TYPE; keep PYTORCH_CUDA_ALLOC_CONF configurable.

```diff
-GITHUB_MIRROR=""
-TORCH_INSTALL_TYPE="skip"
-export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
+GITHUB_MIRROR="${GITHUB_MIRROR:-}"
+# Prefer prebuilt wheels unless caller overrides
+TORCH_INSTALL_TYPE="${TORCH_INSTALL_TYPE:-wheels}"
+export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-garbage_collection_threshold:0.99999}"
```
🧹 Nitpick comments (4)
docker/common/install.sh (4)
21-81: Add --help and a usage banner for discoverability.

Improves UX and aids CI logs.

```diff
+usage() {
+  cat <<'USAGE'
+Usage: install.sh [--base] [--cmake] [--ccache] [--cuda_toolkit] [--tensorrt]
+                  [--polygraphy] [--mpi4py] [--pytorch] [--opencv] [--protobuf]
+                  [--all] [-h|--help]
+Environment:
+  GITHUB_MIRROR, TORCH_INSTALL_TYPE, PYTHON_VERSION, OPENCV_PKG, OPENCV_VERSION, PROTOBUF_VERSION
+USAGE
+}
@@
   case $1 in
+    -h|--help)
+      usage
+      exit 0
+      ;;
     --base)
```
92-96: Parameterize the Python version via env; quote args.

Allows callers to override without editing the script.

```diff
-    PYTHON_VERSION="3.12.3"
+    PYTHON_VERSION="${PYTHON_VERSION:-3.12.3}"
     echo "Using Python version: $PYTHON_VERSION"
-
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_base.sh $PYTHON_VERSION
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "$PYTHON_VERSION"
```
145-148: Allow a protobuf version override and use `python -m pip`.

Keeps flexibility for compatibility bumps.

```diff
-if [ $protobuf -eq 1 ]; then
-    pip3 install --upgrade --no-cache-dir \
-        "protobuf>=4.25.8"
-fi
+if [ $protobuf -eq 1 ]; then
+    python3 -m pip install --upgrade --no-cache-dir \
+        "${PROTOBUF_SPEC:-protobuf>=4.25.8}"
+fi
```
3-7: Minor: gate version echoes behind a verbose flag.

Reduces noise in CI logs; enable via VERBOSE=1.

```diff
-echo "TRT_VER: $TRT_VER"
-echo "CUDA_VER: $CUDA_VER"
-echo "CUDNN_VER: $CUDNN_VER"
-echo "NCCL_VER: $NCCL_VER"
-echo "CUBLAS_VER: $CUBLAS_VER"
+[ "${VERBOSE:-0}" -eq 1 ] && {
+    echo "TRT_VER: ${TRT_VER:-}"
+    echo "CUDA_VER: ${CUDA_VER:-}"
+    echo "CUDNN_VER: ${CUDNN_VER:-}"
+    echo "NCCL_VER: ${NCCL_VER:-}"
+    echo "CUBLAS_VER: ${CUBLAS_VER:-}"
+}
```
📒 Files selected for processing (2)
- docker/Dockerfile.multi (2 hunks)
- docker/common/install.sh (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (1)
- docker/Dockerfile.multi
🔇 Additional comments (1)
docker/common/install.sh (1)
133-136: Confirm the interface and default for install_pytorch.sh.

A default of TORCH_INSTALL_TYPE="wheels" is assumed; verify the sub-script expects a single positional argument, and switch to a flag if needed.
Would you like me to scan Dockerfile.multi and install_pytorch.sh to align the invocation?
Actionable comments posted: 0
♻️ Duplicate comments (5)
docker/common/install.sh (5)
133-138: Make OpenCV replacement idempotent and Python-version agnostic.

The current sequence exits on a missing package and hardcodes the dist-packages path.

```diff
-    pip3 uninstall -y opencv && \
-    rm -rf /usr/local/lib/python3*/dist-packages/cv2/ && \
-    pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
+    pip3 uninstall -y opencv opencv-python opencv-python-headless || true
+    python3 - <<'PY'
+import site, shutil, glob, pathlib
+paths = [p for p in (site.getsitepackages() + [site.getusersitepackages()]) if p]
+for p in paths:
+    for d in glob.glob(str(pathlib.Path(p)/"cv2*")):
+        shutil.rmtree(d, ignore_errors=True)
+PY
+    pip3 install "${OPENCV_PKG:-opencv-python-headless}${OPENCV_VERSION:+==${OPENCV_VERSION}}" \
+        --force-reinstall --no-deps --no-cache-dir
```
1-3: Add a trap, nullglob, and SCRIPT_DIR resolution for robust execution across CWDs.

Prevents masked failures and fixes relative-path breakage when invoked from outside docker/common.

```diff
 #!/bin/bash
 set -Eeuo pipefail
+
+shopt -s nullglob
+trap 'echo "[install.sh] Error on line $LINENO" >&2' ERR
+
+# Resolve script directory for robust relative pathing
+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"
```
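A quick demonstration of why the `BASH_SOURCE`-based resolution matters: the resolved directory is independent of the caller's working directory. The paths below are scratch locations created only for this demo:

```shell
#!/bin/bash
# Show that SCRIPT_DIR resolution is CWD-independent.
set -Eeuo pipefail

demo=$(cd "$(mktemp -d)" && pwd -P)   # physical, symlink-free scratch dir
mkdir -p "$demo/docker/common"

cat > "$demo/docker/common/whereami.sh" <<'EOF'
#!/bin/bash
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"
echo "$SCRIPT_DIR"
EOF

cd /   # deliberately run from an unrelated CWD
resolved=$(bash "$demo/docker/common/whereami.sh")
echo "$resolved"
```

With a plain `./install_base.sh` call, the equivalent invocation from `/` would fail to find the sub-script; the resolved path always points at the script's own directory.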
78-81: Don't clobber env; provide sane defaults instead.

Respect caller-supplied GITHUB_MIRROR/TORCH_INSTALL_TYPE and keep the CUDA alloc conf overrideable.

```diff
-GITHUB_MIRROR=""
-TORCH_INSTALL_TYPE="skip"
-export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
+GITHUB_MIRROR="${GITHUB_MIRROR:-}"
+# Default to prebuilt wheels unless explicitly overridden
+TORCH_INSTALL_TYPE="${TORCH_INSTALL_TYPE:-wheels}"
+# Keep overrideable for standalone usage
+export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-garbage_collection_threshold:0.99999}"
```
93-131: Call sub-scripts via SCRIPT_DIR and quote all arguments.

Fixes failures when the CWD is not the script directory (e.g. enroot/srun usage) and avoids word splitting.

```diff
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cmake.sh
+    GITHUB_MIRROR="${GITHUB_MIRROR}" bash "${SCRIPT_DIR}/install_cmake.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_ccache.sh
+    GITHUB_MIRROR="${GITHUB_MIRROR}" bash "${SCRIPT_DIR}/install_ccache.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cuda_toolkit.sh
+    GITHUB_MIRROR="${GITHUB_MIRROR}" bash "${SCRIPT_DIR}/install_cuda_toolkit.sh"
@@
-    bash ./install_tensorrt.sh \
+    GITHUB_MIRROR="${GITHUB_MIRROR}" bash "${SCRIPT_DIR}/install_tensorrt.sh" \
         --TRT_VER=${TRT_VER} \
         --CUDA_VER=${CUDA_VER} \
         --CUDNN_VER=${CUDNN_VER} \
         --NCCL_VER=${NCCL_VER} \
         --CUBLAS_VER=${CUBLAS_VER}
@@
-    bash ./install_polygraphy.sh
+    bash "${SCRIPT_DIR}/install_polygraphy.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_mpi4py.sh
+    GITHUB_MIRROR="${GITHUB_MIRROR}" bash "${SCRIPT_DIR}/install_mpi4py.sh"
@@
-    bash ./install_pytorch.sh $TORCH_INSTALL_TYPE
+    bash "${SCRIPT_DIR}/install_pytorch.sh" "${TORCH_INSTALL_TYPE}"
```
76-81: Validate required vars for TensorRT; fail fast with a clear error.

Prevents opaque failures when the TRT/CUDA/CUDNN/NCCL/CUBLAS versions aren't set.

```diff
 done
+
+# Require named environment variables to be non-empty
+require_vars() {
+    local missing=()
+    for v in "$@"; do
+        if [ -z "${!v:-}" ]; then missing+=("$v"); fi
+    done
+    if [ "${#missing[@]}" -ne 0 ]; then
+        echo "Missing required variables: ${missing[*]}" >&2
+        exit 2
+    fi
+}
```

And invoke it before calling install_tensorrt.sh:

```diff
 if [ $tensorrt -eq 1 ]; then
     echo "Installing TensorRT..."
+    require_vars TRT_VER CUDA_VER CUDNN_VER NCCL_VER CUBLAS_VER
```
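The guard above can be exercised standalone. This sketch uses `return` instead of `exit` so the demo continues after a failed check; the version values are illustrative:

```shell
#!/bin/bash
# Standalone sketch of the require_vars guard suggested in this review.
require_vars() {
  local missing=()
  for v in "$@"; do
    # ${!v} is indirect expansion: the value of the variable named by $v.
    if [ -z "${!v:-}" ]; then missing+=("$v"); fi
  done
  if [ "${#missing[@]}" -ne 0 ]; then
    echo "Missing required variables: ${missing[*]}" >&2
    return 2
  fi
}

TRT_VER="10.0"
CUDA_VER=""          # empty counts as missing

status=0
require_vars TRT_VER || status=$?
echo "TRT_VER only: status=$status"      # prints: TRT_VER only: status=0

require_vars TRT_VER CUDA_VER || status=$?
echo "with CUDA_VER: status=$status"     # prints: with CUDA_VER: status=2
```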
🧹 Nitpick comments (3)
docker/common/install.sh (3)
82-91: Make the Python version configurable via env; quote args.

Improves portability and reproducibility.

```diff
-    PYTHON_VERSION="3.12.3"
-    echo "Using Python version: $PYTHON_VERSION"
+    PYTHON_VERSION="${PYTHON_VERSION:-3.12.3}"
+    echo "Using Python version: ${PYTHON_VERSION}"
-
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_base.sh $PYTHON_VERSION
+    GITHUB_MIRROR="${GITHUB_MIRROR}" bash "${SCRIPT_DIR}/install_base.sh" "${PYTHON_VERSION}"
```
140-143: Allow a configurable protobuf version/package; keep the minimum as default.

Improves flexibility for environments that require a pinned protobuf.

```diff
-    pip3 install --upgrade --no-cache-dir \
-        "protobuf>=4.25.8"
+    pip3 install --upgrade --no-cache-dir \
+        "${PROTOBUF_PKG:-protobuf}${PROTOBUF_VERSION:->=4.25.8}"
```
16-76: Add --help usage for discoverability.

Small UX boost; harmless default behavior.

```diff
   case $1 in
+    -h|--help)
+      cat <<'USAGE'
+Usage: docker/common/install.sh [options]
+  --base --cmake --ccache --cuda_toolkit --tensorrt --polygraphy --mpi4py --pytorch --opencv --protobuf
+  --all        Install all components
+  -h, --help   Show this help
+Environment:
+  GITHUB_MIRROR, TORCH_INSTALL_TYPE (default: wheels), PYTHON_VERSION, OPENCV_PKG/OPENCV_VERSION,
+  PROTOBUF_PKG/PROTOBUF_VERSION, and version vars for --tensorrt: TRT_VER, CUDA_VER, CUDNN_VER, NCCL_VER, CUBLAS_VER
+USAGE
+      exit 0
+      ;;
```
📒 Files selected for processing (1)
- docker/common/install.sh (1 hunk)
Signed-off-by: Shobhit Verma <[email protected]>
Force-pushed: 584524d to 1a2b1bc (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (5)
docker/common/install.sh (4)
133-138: Make OpenCV replacement idempotent and Python-version agnostic.

Handle both opencv-python and opencv-python-headless, tolerate their absence, and remove cv2 via site-packages discovery.

```diff
 if [ $opencv -eq 1 ]; then
     echo "Installing OpenCV..."
-    pip3 uninstall -y opencv && \
-    rm -rf /usr/local/lib/python3*/dist-packages/cv2/ && \
-    pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
+    pip3 uninstall -y opencv opencv-python opencv-python-headless || true
+    python3 - <<'PY'
+import site, shutil, glob, pathlib
+paths = [p for p in (site.getsitepackages()+[site.getusersitepackages()]) if p]
+for p in paths:
+    for d in glob.glob(str(pathlib.Path(p)/"cv2*")):
+        shutil.rmtree(d, ignore_errors=True)
+PY
+    pip3 install "${OPENCV_PKG:-opencv-python-headless}${OPENCV_VERSION:+==${OPENCV_VERSION}}" \
+        --force-reinstall --no-deps --no-cache-dir
 fi
```
1-1: Enable script-relative execution of sub-scripts.

Calls like `bash ./install_*.sh` break when the CWD is not the script directory (e.g. Enroot). Resolve SCRIPT_DIR and call via absolute paths.

```diff
 #!/bin/bash
 set -Eeuo pipefail
+shopt -s nullglob
+
+# Resolve script directory
+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"
```

Replace all sub-script invocations:

```diff
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_base.sh $PYTHON_VERSION
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "$PYTHON_VERSION"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cmake.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cmake.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_ccache.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_ccache.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cuda_toolkit.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cuda_toolkit.sh"
@@
-    bash ./install_tensorrt.sh \
+    bash "$SCRIPT_DIR/install_tensorrt.sh" \
@@
-    bash ./install_polygraphy.sh
+    bash "$SCRIPT_DIR/install_polygraphy.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_mpi4py.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_mpi4py.sh"
@@
-    bash ./install_pytorch.sh $TORCH_INSTALL_TYPE
+    bash "$SCRIPT_DIR/install_pytorch.sh" "$TORCH_INSTALL_TYPE"
```
78-81: Preserve env-provided mirrors and set a sensible default for TORCH_INSTALL_TYPE.

Don't clobber GITHUB_MIRROR; default TORCH_INSTALL_TYPE to "wheels" so --pytorch actually installs.

```diff
-GITHUB_MIRROR=""
-TORCH_INSTALL_TYPE="skip"
-export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
+GITHUB_MIRROR="${GITHUB_MIRROR:-}"
+TORCH_INSTALL_TYPE="${TORCH_INSTALL_TYPE:-wheels}"
+export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-garbage_collection_threshold:0.99999}"
```
108-116: Fail fast if required TRT env versions are missing.

Avoid opaque errors in install_tensorrt.sh by validating the version variables first.

```diff
+require_vars() {
+    local missing=()
+    for v in "$@"; do [ -z "${!v:-}" ] && missing+=("$v"); done
+    if [ "${#missing[@]}" -ne 0 ]; then
+        echo "Missing required variables: ${missing[*]}" >&2
+        exit 2
+    fi
+}
 if [ $tensorrt -eq 1 ]; then
     echo "Installing TensorRT..."
+    require_vars TRT_VER CUDA_VER CUDNN_VER NCCL_VER CUBLAS_VER
     bash "$SCRIPT_DIR/install_tensorrt.sh" \
         --TRT_VER=${TRT_VER} \
         --CUDA_VER=${CUDA_VER} \
         --CUDNN_VER=${CUDNN_VER} \
         --NCCL_VER=${NCCL_VER} \
         --CUBLAS_VER=${CUBLAS_VER}
 fi
```

docker/Dockerfile.multi (1)
31-55: Propagate GITHUB_MIRROR and TORCH_INSTALL_TYPE into the devel stage.

ARGs don't cross stages; without re-declaring them, install.sh sees empty values and may skip PyTorch.

```diff
 FROM base AS devel
 ARG TRT_VER
 ARG CUDA_VER
 ARG CUDNN_VER
 ARG NCCL_VER
 ARG CUBLAS_VER
+ARG GITHUB_MIRROR=""
+ARG TORCH_INSTALL_TYPE="wheels"
+ENV GITHUB_MIRROR=${GITHUB_MIRROR}
+ENV TORCH_INSTALL_TYPE=${TORCH_INSTALL_TYPE}
 ENV PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
 RUN bash ./install.sh --base
 RUN bash ./install.sh --cmake
 RUN bash ./install.sh --ccache
 RUN bash ./install.sh --cuda_toolkit
@@
 RUN bash ./install.sh --mpi4py
 RUN bash ./install.sh --pytorch
 RUN bash ./install.sh --opencv
 RUN bash ./install.sh --protobuf
```
🧹 Nitpick comments (12)
enroot/Makefile (5)
10-23: Quote mount paths and mark the target as phony.

Avoids failures when SOURCE_DIR contains spaces and prevents make from treating build_sqsh as a file target.

```diff
+.PHONY: build_sqsh
 build_sqsh:
 	@echo "Building trtllm sqsh image. Base image: $(BASE_IMAGE):$(BASE_TAG). Location: $(SQSH_PATH)"
 	srun \
 		--container-image $(BASE_IMAGE):$(BASE_TAG) \
 		--container-save $(SQSH_PATH) \
-		--container-mounts $(SOURCE_DIR):/code/tensorrt_llm --container-workdir /code/tensorrt_llm/docker/common \
+		--container-mounts "$(SOURCE_DIR)":/code/tensorrt_llm --container-workdir /code/tensorrt_llm/docker/common \
 		--container-mount-home --container-remap-root \
 		--container-env TRT_VER \
 		--container-env CUDA_VER \
 		--container-env CUDNN_VER \
 		--container-env NCCL_VER \
 		--container-env CUBLAS_VER \
 		./install.sh --all
```
1-2: Harden base image/tag derivation and set sane fallbacks.

If grep fails, BASE_IMAGE/BASE_TAG become empty. Provide defaults to keep make usable.

```diff
-BASE_IMAGE ?= $(shell grep '^ARG BASE_IMAGE=' ../docker/Dockerfile.multi | grep -o '=.*' | tr -d '="')
-BASE_TAG ?= $(shell grep '^ARG BASE_TAG=' ../docker/Dockerfile.multi | grep -o '=.*' | tr -d '="')
+BASE_IMAGE ?= $(shell grep -m1 '^ARG BASE_IMAGE=' ../docker/Dockerfile.multi | sed -E 's/.*= *"?([^" ]+).*/\1/')
+BASE_TAG ?= $(shell grep -m1 '^ARG BASE_TAG=' ../docker/Dockerfile.multi | sed -E 's/.*= *"?([^" ]+).*/\1/')
+BASE_IMAGE := $(if $(BASE_IMAGE),$(BASE_IMAGE),nvcr.io/nvidia/pytorch)
+BASE_TAG := $(if $(BASE_TAG),$(BASE_TAG),25.06-py3)
```
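The proposed sed extraction can be sanity-checked outside make. The sample ARG lines below are illustrative values, not taken from the repo:

```shell
#!/bin/bash
# Verify the sed pattern handles both quoted and unquoted ARG values.
set -eu

img=$(printf 'ARG BASE_IMAGE="nvcr.io/nvidia/pytorch"\n' | sed -E 's/.*= *"?([^" ]+).*/\1/')
tag=$(printf 'ARG BASE_TAG=25.06-py3\n' | sed -E 's/.*= *"?([^" ]+).*/\1/')

echo "$img"   # prints: nvcr.io/nvidia/pytorch
echo "$tag"   # prints: 25.06-py3
```

The capture group `([^" ]+)` stops at a closing quote or space, so the pattern works whether or not the Dockerfile quotes the value.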
18-23: Pass through the optional env knobs used by install.sh.

Forward GITHUB_MIRROR and TORCH_INSTALL_TYPE so users can control mirrors and the PyTorch flow.

```diff
 		--container-env NCCL_VER \
 		--container-env CUBLAS_VER \
+		--container-env GITHUB_MIRROR \
+		--container-env TORCH_INSTALL_TYPE \
 		./install.sh --all
```
11-11: Use consistent naming in the status message.

Use "TensorRT-LLM" (or "tensorrt_llm") instead of "trtllm" for clarity.

```diff
-	@echo "Building trtllm sqsh image. Base image: $(BASE_IMAGE):$(BASE_TAG). Location: $(SQSH_PATH)"
+	@echo "Building TensorRT-LLM SQSH image. Base image: $(BASE_IMAGE):$(BASE_TAG). Location: $(SQSH_PATH)"
```
1-1: Optional: add the conventional targets flagged by checkmake.

Minimal all/clean/test targets quell the CI warnings without changing behavior.

```diff
+.PHONY: all clean test
+all: build_sqsh
+clean:
+	@rm -f -- $(SQSH_PATH)
+test:
+	@echo "No tests for enroot Makefile"
```

docker/common/install.sh (2)
87-91: Parameterize the Python version for the base install.

Allow an override via env; print the actual version used.

```diff
-    PYTHON_VERSION="3.12.3"
-    echo "Using Python version: $PYTHON_VERSION"
+    PYTHON_VERSION="${PYTHON_VERSION:-3.12.3}"
+    echo "Using Python version: ${PYTHON_VERSION}"
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_base.sh $PYTHON_VERSION
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "${PYTHON_VERSION}"
```
71-76: Add a minimal usage/help message.

Improves UX for unknown options and CI logs.

```diff
+    -h|--help)
+      echo "Usage: $0 [--base|--cmake|--ccache|--cuda_toolkit|--tensorrt|--polygraphy|--mpi4py|--pytorch|--opencv|--protobuf|--all]"
+      exit 0
+      ;;
     *)
-      echo "Unknown option: $1"
+      echo "Unknown option: $1" >&2
       exit 1
```

docker/Dockerfile.multi (5)
8-17: Copy installer scripts to a fixed path and use absolute paths.

Relying on the image WORKDIR is brittle across NGC tags. Copy to /opt/tensorrt_llm/install and invoke via absolute paths.

```diff
-FROM ${BASE_IMAGE}:${BASE_TAG} AS base
-COPY docker/common/install.sh install.sh
-COPY docker/common/install_base.sh install_base.sh
-COPY docker/common/install_cmake.sh install_cmake.sh
-COPY docker/common/install_ccache.sh install_ccache.sh
-COPY docker/common/install_cuda_toolkit.sh install_cuda_toolkit.sh
-COPY docker/common/install_tensorrt.sh install_tensorrt.sh
-COPY docker/common/install_polygraphy.sh install_polygraphy.sh
-COPY docker/common/install_mpi4py.sh install_mpi4py.sh
-COPY docker/common/install_pytorch.sh install_pytorch.sh
+FROM ${BASE_IMAGE}:${BASE_TAG} AS base
+WORKDIR /opt/tensorrt_llm/install
+COPY docker/common/install.sh .
+COPY docker/common/install_base.sh .
+COPY docker/common/install_cmake.sh .
+COPY docker/common/install_ccache.sh .
+COPY docker/common/install_cuda_toolkit.sh .
+COPY docker/common/install_tensorrt.sh .
+COPY docker/common/install_polygraphy.sh .
+COPY docker/common/install_mpi4py.sh .
+COPY docker/common/install_pytorch.sh .
```

Then adjust the calls in devel:

```diff
-RUN bash ./install.sh --base
+RUN bash /opt/tensorrt_llm/install/install.sh --base
 # ...apply similarly to the subsequent RUN lines...
```
56-64: Ensure cleanup matches the new install path.

If adopting absolute paths, remove the scripts from the fixed directory.

```diff
-RUN rm install.sh \
-    install_base.sh \
-    install_cmake.sh \
-    install_ccache.sh \
-    install_cuda_toolkit.sh \
-    install_tensorrt.sh \
-    install_polygraphy.sh \
-    install_mpi4py.sh \
-    install_pytorch.sh
+RUN rm -f /opt/tensorrt_llm/install/install.sh \
+    /opt/tensorrt_llm/install/install_base.sh \
+    /opt/tensorrt_llm/install/install_cmake.sh \
+    /opt/tensorrt_llm/install/install_ccache.sh \
+    /opt/tensorrt_llm/install/install_cuda_toolkit.sh \
+    /opt/tensorrt_llm/install/install_tensorrt.sh \
+    /opt/tensorrt_llm/install/install_polygraphy.sh \
+    /opt/tensorrt_llm/install/install_mpi4py.sh \
+    /opt/tensorrt_llm/install/install_pytorch.sh
```
93-99: Remove the duplicate COPY of benchmarks.

The directory is copied twice back-to-back.

```diff
-COPY benchmarks benchmarks
 COPY cpp cpp
 COPY benchmarks benchmarks
```
39-39: Optional: keep PYTORCH_CUDA_ALLOC_CONF solely in the Dockerfile, not inside install.sh.

It's already exported here; let install.sh respect the env instead of re-exporting. No code change is required if you accept the earlier env-preservation fix.
91-99: Stage alias via ARG can be brittle on older Docker; consider a direct alias.

If older builders are used, `FROM ${DEVEL_IMAGE} AS wheel` may not resolve the prior stage alias. Using `FROM devel AS wheel` is safer.

```diff
-FROM ${DEVEL_IMAGE} AS wheel
+FROM devel AS wheel
```
📒 Files selected for processing (4)
- .gitignore (1 hunk)
- docker/Dockerfile.multi (2 hunks)
- docker/common/install.sh (1 hunk)
- enroot/Makefile (1 hunk)
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🧰 Additional context used
📓 Path-based instructions (1)
**/*
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Filenames compiled into a target must be case-insensitively unique
Files:
- enroot/Makefile
- docker/common/install.sh
- docker/Dockerfile.multi
🪛 checkmake (0.2.2)
enroot/Makefile
[warning] 1-1: Missing required phony target "all"
(minphony)
[warning] 1-1: Missing required phony target "clean"
(minphony)
[warning] 1-1: Missing required phony target "test"
(minphony)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Signed-off-by: Shobhit Verma <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (5)
docker/common/install.sh (5)
95-95: Quote SCRIPT_DIR calls to avoid word-splitting/globbing. Safer even if paths contain spaces; aligns with earlier guidance.

```diff
-GITHUB_MIRROR=$GITHUB_MIRROR bash $SCRIPT_DIR/install_base.sh $PYTHON_VERSION
+GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "$PYTHON_VERSION"
-GITHUB_MIRROR=$GITHUB_MIRROR bash $SCRIPT_DIR/install_cmake.sh
+GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cmake.sh"
-GITHUB_MIRROR=$GITHUB_MIRROR bash $SCRIPT_DIR/install_ccache.sh
+GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_ccache.sh"
-GITHUB_MIRROR=$GITHUB_MIRROR bash $SCRIPT_DIR/install_cuda_toolkit.sh
+GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cuda_toolkit.sh"
-bash $SCRIPT_DIR/install_tensorrt.sh \
+bash "$SCRIPT_DIR/install_tensorrt.sh" \
-bash $SCRIPT_DIR/install_polygraphy.sh
+bash "$SCRIPT_DIR/install_polygraphy.sh"
-GITHUB_MIRROR=$GITHUB_MIRROR bash $SCRIPT_DIR/install_mpi4py.sh
+GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_mpi4py.sh"
-bash $SCRIPT_DIR/install_pytorch.sh $TORCH_INSTALL_TYPE
+bash "$SCRIPT_DIR/install_pytorch.sh" "$TORCH_INSTALL_TYPE"
```

Also applies to: 100-100, 105-105, 110-110, 115-115, 125-125, 130-130, 135-135
138-143: Make OpenCV swap idempotent and Python-version agnostic. Covers opencv-python/headless variants and removes cv2 from the actual site-packages.

```diff
-    pip3 uninstall -y opencv && \
-    rm -rf /usr/local/lib/python3*/dist-packages/cv2/ && \
-    pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
+    pip3 uninstall -y opencv opencv-python opencv-python-headless || true
+    python3 - <<'PY'
+import site, shutil, glob, pathlib
+paths = list(filter(None, site.getsitepackages() + [site.getusersitepackages()]))
+for p in paths:
+    for d in glob.glob(str(pathlib.Path(p)/"cv2*")):
+        shutil.rmtree(d, ignore_errors=True)
+PY
+    pip3 install "${OPENCV_PKG:-opencv-python-headless}${OPENCV_VERSION:+==${OPENCV_VERSION}}" \
+        --force-reinstall --no-deps --no-cache-dir
```
83-85: Don’t clobber environment; use sane defaults. Overwriting GITHUB_MIRROR and TORCH_INSTALL_TYPE can break Docker/CI overrides; also avoid forcing PYTORCH_CUDA_ALLOC_CONF if already set.

```diff
-GITHUB_MIRROR=""
-TORCH_INSTALL_TYPE="skip"
-export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
+GITHUB_MIRROR="${GITHUB_MIRROR:-}"
+# Prefer wheels by default; keep build-from-source opt-in.
+TORCH_INSTALL_TYPE="${TORCH_INSTALL_TYPE:-wheels}"
+# Respect existing setting if provided.
+export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-garbage_collection_threshold:0.99999}"
```
81-86: Fail fast: add a required-var validation helper. Prevents opaque failures when TRT_* and CUDA* versions are missing.

```diff
 done
+
+# Require named environment variables to be non-empty
+require_vars() {
+    local missing=()
+    for v in "$@"; do
+        if [ -z "${!v:-}" ]; then missing+=("$v"); fi
+    done
+    if [ "${#missing[@]}" -ne 0 ]; then
+        echo "Missing required variables: ${missing[*]}" >&2
+        exit 2
+    fi
+}
```
113-121: TensorRT: validate inputs and pass through the mirror. Ensures deterministic installs and mirror-usage parity with other steps.

```diff
 if [ $tensorrt -eq 1 ]; then
     echo "Installing TensorRT..."
-    bash $SCRIPT_DIR/install_tensorrt.sh \
+    require_vars TRT_VER CUDA_VER CUDNN_VER NCCL_VER CUBLAS_VER
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_tensorrt.sh" \
         --TRT_VER=${TRT_VER} \
         --CUDA_VER=${CUDA_VER} \
         --CUDNN_VER=${CUDNN_VER} \
         --NCCL_VER=${NCCL_VER} \
         --CUBLAS_VER=${CUBLAS_VER}
 fi
```
🧹 Nitpick comments (3)
docker/common/install.sh (3)
92-96: Allow overriding the Python version; quote arguments. Makes the base install adaptable to future NGC tags.

```diff
-    PYTHON_VERSION="3.12.3"
+    PYTHON_VERSION="${PYTHON_VERSION:-3.12.3}"
     echo "Using Python version: $PYTHON_VERSION"
-
-    GITHUB_MIRROR=$GITHUB_MIRROR bash $SCRIPT_DIR/install_base.sh $PYTHON_VERSION
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "$PYTHON_VERSION"
```
21-81: Add --help and guard against no-selection. Improves UX when invoked without flags.

```diff
 while [[ $# -gt 0 ]]; do
     case $1 in
+        -h|--help)
+            cat <<'USAGE'
+Usage: install.sh [--base] [--cmake] [--ccache] [--cuda_toolkit] [--tensorrt]
+                  [--polygraphy] [--mpi4py] [--pytorch] [--opencv] [--protobuf] [--all]
+Env: GITHUB_MIRROR, TORCH_INSTALL_TYPE, PYTHON_VERSION, OPENCV_PKG, OPENCV_VERSION, PROTOBUF_SPEC
+USAGE
+            exit 0
+            ;;
         --base)
@@
     esac
 done
+
+# If nothing selected, print help.
+if [ $base -eq 0 ] && [ $cmake -eq 0 ] && [ $ccache -eq 0 ] && \
+   [ $cuda_toolkit -eq 0 ] && [ $tensorrt -eq 0 ] && [ $polygraphy -eq 0 ] && \
+   [ $mpi4py -eq 0 ] && [ $pytorch -eq 0 ] && [ $opencv -eq 0 ] && [ $protobuf -eq 0 ]; then
+    echo "No components selected. Use --help for options." >&2
+    exit 2
+fi
```
145-148: Parameterize the protobuf spec. Allows pinning via env when needed for ABI compatibility.

```diff
-    pip3 install --upgrade --no-cache-dir \
-        "protobuf>=4.25.8"
+    pip3 install --upgrade --no-cache-dir \
+        "${PROTOBUF_SPEC:-protobuf>=4.25.8}"
```
📒 Files selected for processing (1)
- docker/common/install.sh (1 hunks)
🔇 Additional comments (1)
docker/common/install.sh (1)
2-7: Nice: strict mode, ERR trap, and robust SCRIPT_DIR. Good foundations for reliability and script-relative execution.
Signed-off-by: Shobhit Verma <[email protected]>
Signed-off-by: Shobhit Verma <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (3)
docker/common/install.sh (3)
138-143: Make OpenCV replacement idempotent and Python-agnostic; prefer python -m pip. Handles both opencv-python and -headless, cleans cv2 from the active site-packages, and allows a version pin via env.

```diff
-if [ $opencv -eq 1 ]; then
-    echo "Installing OpenCV..."
-    pip3 uninstall -y opencv && \
-    rm -rf /usr/local/lib/python3*/dist-packages/cv2/ && \
-    pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
-fi
+if [ $opencv -eq 1 ]; then
+    echo "Installing OpenCV..."
+    python3 -m pip uninstall -y opencv opencv-python opencv-python-headless || true
+    python3 - <<'PY'
+import site, shutil, glob, pathlib
+paths = [p for p in (site.getsitepackages() + [site.getusersitepackages()]) if p]
+for p in paths:
+    for d in glob.glob(str(pathlib.Path(p)/"cv2*")):
+        shutil.rmtree(d, ignore_errors=True)
+PY
+    python3 -m pip install ${OPENCV_PKG:-opencv-python-headless}${OPENCV_VERSION:+==${OPENCV_VERSION}} \
+        --force-reinstall --no-deps --no-cache-dir
+fi
```
63-75: Bug: --all won’t install PyTorch because TORCH_INSTALL_TYPE=skip. With --all, pytorch=1 but install_pytorch.sh gets "skip", resulting in no install. Default TORCH_INSTALL_TYPE to a sensible value and don’t clobber incoming env.

```diff
-GITHUB_MIRROR=""
-TORCH_INSTALL_TYPE="skip"
-export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
+GITHUB_MIRROR="${GITHUB_MIRROR:-}"
+# Prefer prebuilt wheels by default; caller can override to "source".
+TORCH_INSTALL_TYPE="${TORCH_INSTALL_TYPE:-wheels}"
+# Respect existing setting if provided.
+export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-garbage_collection_threshold:0.99999}"
@@
-    bash $SCRIPT_DIR/install_pytorch.sh $TORCH_INSTALL_TYPE
+    bash "$SCRIPT_DIR/install_pytorch.sh" "$TORCH_INSTALL_TYPE"
```

Also applies to: 83-86, 133-136
21-21: Fail fast: validate required vars for TensorRT. Prevent opaque failures by checking inputs before invoking install_tensorrt.sh.

```diff
@@
-while [[ $# -gt 0 ]]; do
+require_vars() {
+    local missing=()
+    for v in "$@"; do
+        if [ -z "${!v:-}" ]; then missing+=("$v"); fi
+    done
+    if [ "${#missing[@]}" -ne 0 ]; then
+        echo "Missing required variables: ${missing[*]}" >&2
+        exit 2
+    fi
+}
+while [[ $# -gt 0 ]]; do
@@
 if [ $tensorrt -eq 1 ]; then
     echo "Installing TensorRT..."
+    require_vars TRT_VER CUDA_VER CUDNN_VER NCCL_VER CUBLAS_VER
     GITHUB_MIRROR="$GITHUB_MIRROR" bash "$SCRIPT_DIR/install_tensorrt.sh" \
```

Also applies to: 113-121
🧹 Nitpick comments (5)
docker/common/install.sh (3)
1-8: Good: strict mode and robust SCRIPT_DIR. `set -Eeuo pipefail`, the trap, and the SCRIPT_DIR resolution look solid. Minor: quote "$SCRIPT_DIR/..." in all calls below to avoid word-splitting.
21-81: Add --help usage and a friendly error. Provide a usage string and handle -h|--help; improves UX and makes unknown options actionable.

```diff
@@
-while [[ $# -gt 0 ]]; do
+usage() {
+    cat <<'USAGE'
+Usage: install.sh [--base] [--cmake] [--ccache] [--cuda_toolkit] [--tensorrt]
+                  [--polygraphy] [--mpi4py] [--pytorch] [--opencv] [--protobuf] [--all]
+Env:
+  GITHUB_MIRROR       Optional Git mirror base URL
+  TORCH_INSTALL_TYPE  wheels|source (default: wheels)
+  PYTHON_VERSION      e.g. 3.12.3 (default: 3.12.3)
+USAGE
+}
+while [[ $# -gt 0 ]]; do
     case $1 in
+        -h|--help)
+            usage; exit 0
+            ;;
@@
-        *)
-            echo "Unknown option: $1"
-            exit 1
+        *)
+            echo "Unknown option: $1" >&2
+            usage; exit 1
             ;;
```
145-148: Nit: prefer python -m pip for consistency.

```diff
-    pip3 install --upgrade --no-cache-dir \
+    python3 -m pip install --upgrade --no-cache-dir \
         "protobuf>=4.25.8"
```

enroot/Makefile (2)
4-4: Make SOURCE_DIR resolution portable (macOS/dev nodes). readlink -f is not universally available. Use realpath if present, else a Python fallback.

```diff
-SOURCE_DIR ?= $(shell readlink -f ..)
+SOURCE_DIR ?= $(shell (command -v realpath >/dev/null && realpath -m ..) || python3 -c 'import os,sys;print(os.path.realpath(".."))')
```
6-8: Satisfy checkmake: add standard phony targets. Define all, clean, and test, and mark them PHONY to appease linters and improve DX.

```diff
 ifndef MAKEFILE_PYXIS_INCLUDED
 MAKEFILE_PYXIS_INCLUDED := 1
+.PHONY: all build_sqsh clean test
+all: build_sqsh
+clean:
+	@rm -f -- "$(SQSH_PATH)"
+test:
+	@echo "No tests; run 'make build_sqsh' to build the SQSH image."
@@
 endif
```

Also applies to: 10-10, 25-25
📒 Files selected for processing (2)
- docker/common/install.sh (1 hunks)
- enroot/Makefile (1 hunks)
🔇 Additional comments (1)
enroot/Makefile (1)
1-2: OK: derive base image/tag from Dockerfile.Anchored grep keeps this reasonably stable. Keep in mind changes to ARG names will break this.
Signed-off-by: Shobhit Verma <[email protected]>
Signed-off-by: Shobhit Verma <[email protected]>
Signed-off-by: Shobhit Verma <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (1)
enroot/Makefile (1)
16-26: Quote paths/tags and forward extra env needed by install.sh.

```diff
-	srun \
-		--container-image $(BASE_IMAGE):$(BASE_TAG) \
-		--container-save $(SQSH_PATH) \
-		--container-mounts $(SOURCE_DIR):/code/tensorrt_llm --container-workdir /code/tensorrt_llm/docker/common \
+	srun \
+		--container-image "$(BASE_IMAGE):$(BASE_TAG)" \
+		--container-save "$(SQSH_PATH)" \
+		--container-mounts "$(SOURCE_DIR):/code/tensorrt_llm" --container-workdir "/code/tensorrt_llm/docker/common" \
 		--container-mount-home --container-remap-root \
 		--container-env TRT_VER \
 		--container-env CUDA_VER \
 		--container-env CUDNN_VER \
 		--container-env NCCL_VER \
 		--container-env CUBLAS_VER \
+		--container-env GITHUB_MIRROR \
+		--container-env TORCH_INSTALL_TYPE \
 		./install.sh --all
```
🧹 Nitpick comments (6)
docs/source/installation/build-from-source-linux.md (3)
90-96: Tighten wording and proper nouns ("Docker"); keep it concise.

```diff
-If you wish to use enroot instead of docker, then you can build a sqsh file that has the identical environment as the development image `tensorrt_llm/devel:latest` as follows.
+If you wish to use enroot instead of Docker, you can build an SQSH image with an identical environment to the development image `tensorrt_llm/devel:latest`:
```
97-105: Clarify the default image path explicitly (relative to the repo root) to avoid ambiguity.

```diff
-    # Using default sqsh filename (enroot/tensorrt_llm.devel.sqsh)
+    # Using default SQSH filename (relative to the repo root): enroot/tensorrt_llm.devel.sqsh
```
106-115: Show both "custom" and "default" SQSH_PATH exports so users can copy/paste.

```diff
-    export SQSH_PATH=/path/to/dev_trtllm_image.sqsh
+    # If you used a custom path in Step 2:
+    export SQSH_PATH=/path/to/dev_trtllm_image.sqsh
+    # Or, if you used the default path:
+    # export SQSH_PATH=enroot/tensorrt_llm.devel.sqsh
```

enroot/Makefile (3)
1-6: Add PHONY targets (and an optional clean) to appease checkmake and improve UX.

```diff
+.PHONY: build_sqsh run_sqsh clean
+
+clean:
+	@rm -f "$(SQSH_PATH)"
```
11-15: Minor: standardize naming in logs ("TensorRT-LLM").

```diff
-	@echo "Building trtllm sqsh image."
+	@echo "Building TensorRT-LLM SQSH image."
```
1-2: Validate BASE_IMAGE/TAG discovery and fail fast if unset.

```diff
 BASE_IMAGE ?= $(shell grep '^ARG BASE_IMAGE=' ../docker/Dockerfile.multi | grep -o '=.*' | tr -d '="')
 BASE_TAG ?= $(shell grep '^ARG BASE_TAG=' ../docker/Dockerfile.multi | grep -o '=.*' | tr -d '="')
+ifeq ($(strip $(BASE_IMAGE)),)
+  $(error BASE_IMAGE not found in ../docker/Dockerfile.multi; set BASE_IMAGE explicitly)
+endif
+ifeq ($(strip $(BASE_TAG)),)
+  $(error BASE_TAG not found in ../docker/Dockerfile.multi; set BASE_TAG explicitly)
+endif
```
📒 Files selected for processing (3)
- docker/common/install.sh (1 hunks)
- docs/source/installation/build-from-source-linux.md (1 hunks)
- enroot/Makefile (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- docker/common/install.sh
Signed-off-by: Shobhit Verma <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
docker/Dockerfile.multi (1)
24-33: Propagate GITHUB_MIRROR into the devel stage (and avoid clobbering in install.sh). devel's RUNs don't see GITHUB_MIRROR; declare and export it so the orchestrator and sub-scripts can use it. Also ensure install.sh doesn't reset it.

```diff
 FROM base AS devel
 ARG TRT_VER
 ARG CUDA_VER
 ARG CUDNN_VER
 ARG NCCL_VER
 ARG CUBLAS_VER
+ARG GITHUB_MIRROR=""
+ENV GITHUB_MIRROR=${GITHUB_MIRROR}
 ENV PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
 COPY docker/common/install.sh install.sh
```
🧹 Nitpick comments (5)
docker/Dockerfile.multi (5)
35-45: Make install invocations fail fast. Add strict bash flags so partial failures don't silently continue.

```diff
-RUN bash ./install.sh --base
+RUN bash -euo pipefail ./install.sh --base
@@
-RUN bash ./install.sh --cmake
+RUN bash -euo pipefail ./install.sh --cmake
@@
-RUN bash ./install.sh --ccache
+RUN bash -euo pipefail ./install.sh --ccache
@@
-RUN bash ./install.sh --cuda_toolkit
+RUN bash -euo pipefail ./install.sh --cuda_toolkit
```
47-52: Scope env to the single RUN instead of exporting. Keeps the build env cleaner and avoids accidental reuse by later layers.

```diff
-RUN export TRT_VER=${TRT_VER} \
-    && export CUDA_VER=${CUDA_VER} \
-    && export CUDNN_VER=${CUDNN_VER} \
-    && export NCCL_VER=${NCCL_VER} \
-    && export CUBLAS_VER=${CUBLAS_VER} \
-    && bash ./install.sh --tensorrt
+RUN TRT_VER="${TRT_VER}" \
+    CUDA_VER="${CUDA_VER}" \
+    CUDNN_VER="${CUDNN_VER}" \
+    NCCL_VER="${NCCL_VER}" \
+    CUBLAS_VER="${CUBLAS_VER}" \
+    bash -euo pipefail ./install.sh --tensorrt
```
55-65: Confirm the orchestrator implements --opencv/--protobuf or copy the helpers. If install.sh delegates to per-component scripts, these two are not copied and will fail. Either keep it all inside install.sh or copy/remove them consistently.
Option A (copy helpers):
```diff
 COPY docker/common/install_pytorch.sh install_pytorch.sh
-RUN bash ./install.sh --pytorch
+RUN bash -euo pipefail ./install.sh --pytorch
+
+COPY docker/common/install_opencv.sh install_opencv.sh
+COPY docker/common/install_protobuf.sh install_protobuf.sh
-RUN bash ./install.sh --opencv
-RUN bash ./install.sh --protobuf
+RUN bash -euo pipefail ./install.sh --opencv
+RUN bash -euo pipefail ./install.sh --protobuf
```

And update the cleanup (see next comment) to remove them.
66-74: Harmonize cleanup with what was copied. If you adopt Option A above, include the opencv/protobuf scripts in the cleanup; also use -f for idempotency.

```diff
-RUN rm install.sh \
-    install_base.sh \
-    install_cmake.sh \
-    install_ccache.sh \
-    install_cuda_toolkit.sh \
-    install_tensorrt.sh \
-    install_polygraphy.sh \
-    install_mpi4py.sh \
-    install_pytorch.sh
+RUN rm -f install.sh \
+    install_base.sh \
+    install_cmake.sh \
+    install_ccache.sh \
+    install_cuda_toolkit.sh \
+    install_tensorrt.sh \
+    install_polygraphy.sh \
+    install_mpi4py.sh \
+    install_pytorch.sh \
+    install_opencv.sh \
+    install_protobuf.sh
```
103-106: Remove duplicate COPY of benchmarks. There are two identical COPY benchmarks directives; keep one to reduce context size and layers.

```diff
 COPY benchmarks benchmarks
 COPY cpp cpp
-COPY benchmarks benchmarks
 COPY scripts scripts
```
📒 Files selected for processing (1)
- docker/Dockerfile.multi (1 hunks)
🔇 Additional comments (2)
docker/Dockerfile.multi (2)
80-85: Guard COPYs from the TRITON image (paths can drift). Some TRITON images don't ship /opt/tritonserver/caches or may restructure dirs. Please confirm these exist for TRITON_BASE_TAG=25.06-py3 or add a minimal preflight check in the stage before COPY (or drop non-essential dirs).
118-119: Nice use of BuildKit caches for pip/ccache. This should materially improve rebuild times for wheel builds.
Signed-off-by: Shobhit Verma <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
/bot run

PR_Github #18214 [ run ] triggered by Bot
Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
PR_Github #18214 [ run ] completed with state
Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
/bot run

PR_Github #18330 [ run ] triggered by Bot

PR_Github #18330 [ run ] completed with state

/bot run --stage-list "Build-Docker-Images"

PR_Github #18342 [ run ] triggered by Bot

PR_Github #18342 [ run ] completed with state
Signed-off-by: Shobhit Verma <[email protected]> Signed-off-by: v-shobhit <[email protected]> Co-authored-by: Zhihan Jiang <[email protected]>
[feat] Use a shell context to install dependencies
Description
TRT-LLM uses a Docker build flow that pulls a base PyTorch image and installs dependencies. With scale-out systems and HPC clusters choosing pyxis/enroot over Docker daemons, we need a shell context for installing dependencies so that it can be called within an `srun` job step to create `enroot` images.

Current:

The current flow is as follows:
- The Dockerfile runs `COPY docker/common/install_dep.sh` followed by `RUN install_dep.sh && rm install_dep.sh` for each dependency.
- The build produces the `tensorrt_llm/devel:latest` image (imported into enroot via `dockerd://tensorrt_llm/devel:latest`).

Pyxis

Currently, to use enroot, we rely on a Docker daemon to build `tensorrt_llm/devel:latest` and then import it, so a dependency on the Docker daemon remains. To eliminate the dependency on Docker, we first write a master shell script that calls the `install_dep.sh` scripts to install dependencies. Users can now call this script directly within an `srun` job step. The script is re-used in Dockerfile.multi to avoid duplicating the dependency-installation logic.
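The master script dispatches on per-component flags. As a minimal sketch of that dispatch pattern (illustrative only, not the actual `docker/common/install.sh` — variable and flag names here are simplified), each `--<component>` flag toggles one install step and `--all` enables every step:

```shell
# Sketch of the component-flag dispatch pattern: each flag sets a toggle,
# and the caller later runs the per-component install script for every
# toggle that is set. Only three example components are shown.
base=0; cmake=0; tensorrt=0

parse_components() {
  for arg in "$@"; do
    case "$arg" in
      --base) base=1 ;;
      --cmake) cmake=1 ;;
      --tensorrt) tensorrt=1 ;;
      --all) base=1; cmake=1; tensorrt=1 ;;
      *) echo "Unknown option: $arg" >&2; return 1 ;;
    esac
  done
}

# Example: select everything, as the enroot flow does with `./install.sh --all`.
parse_components --all
echo "base=$base cmake=$cmake tensorrt=$tensorrt"
```

Each selected component then shells out to its dedicated `install_*.sh` helper, which is the same logic Dockerfile.multi invokes stage by stage — so the Docker and enroot paths share one installation codepath.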
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: does NOT update GitHub check status.
docs/source/reference/ci-overview.mdand the
scripts/test_to_stage_mapping.pyhelper.kill
killKill all running builds associated with pull request.
skip

`skip --comment COMMENT`: Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`: Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Refactor
Documentation
Chores