Conversation


@v-shobhit v-shobhit commented Aug 29, 2025

[feat] Use a shell context to install dependencies

Description

TRT-LLM uses a Docker build flow to pull a base PyTorch image and install dependencies. With scale-out systems and HPC clusters choosing pyxis/enroot over Docker daemons, we need a shell context for installing dependencies so that it can be invoked within an srun job step to create enroot images.

Current:

The current flow is as follows:

  • Pull a base Docker image.
  • Run a docker build whose steps take the form:
COPY docker/common/install_dep.sh
RUN install_dep.sh && rm install_dep.sh
  • A base image is produced of the form dockerd://tensorrt_llm/devel:latest.

Pyxis

Currently, to use enroot, we rely on a Docker daemon to build tensorrt_llm/devel:latest and then import it, so a dependency on the Docker daemon remains.

To solve this and eliminate the dependency on Docker, we first write a master shell script that calls the install_dep.sh script to install dependencies. Users can now invoke this directly:

make -C enroot build_sqsh SQSH_PATH=... BASE_IMAGE=... BASE_TAG=...

This script is reused in Dockerfile.multi to avoid duplicating the dependency-installation logic.
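As a sketch, such a flag-driven orchestrator parses component selectors and dispatches to the existing per-component scripts. The snippet below is illustrative only, not the PR's actual install.sh: the component set is trimmed, and the dispatch is stubbed with echo so it runs standalone (the real script would invoke the install_*.sh helpers instead).

```shell
#!/bin/bash
# Illustrative sketch of a flag-driven install orchestrator.
# Dispatch is stubbed with echo; a real script would run the install_*.sh helpers.
set -euo pipefail

plan_install() {
    local base=0 cmake=0 tensorrt=0
    while [[ $# -gt 0 ]]; do
        case $1 in
            --base)     base=1 ;;
            --cmake)    cmake=1 ;;
            --tensorrt) tensorrt=1 ;;
            --all)      base=1; cmake=1; tensorrt=1 ;;
            *) echo "Unknown option: $1" >&2; return 1 ;;
        esac
        shift
    done
    # Report (or, in the real script, execute) the selected component installers.
    if [ "$base" -eq 1 ];     then echo "install_base.sh"; fi
    if [ "$cmake" -eq 1 ];    then echo "install_cmake.sh"; fi
    if [ "$tensorrt" -eq 1 ]; then echo "install_tensorrt.sh"; fi
}

plan_install --base --tensorrt
```

Because the same entry point works from a plain shell, the Dockerfile and an srun job step can share one code path for dependency installation.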

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Summary by CodeRabbit

  • New Features

    • Enroot-based development workflow with make targets to build/run SQSH images.
    • Triton-enabled multi-stage build flow for development, build, and release artifacts.
    • Build-time options to customize CUDA, TensorRT and related component versions.
    • PYTORCH_CUDA_ALLOC_CONF tuning exposed for GPU memory behavior.
  • Refactor

    • Consolidated image setup into a single install-orchestrator driven flow.
  • Documentation

    • Added Enroot workflow to Linux build-from-source guide.
  • Chores

    • Ignore generated Enroot SQSH image files.

@v-shobhit v-shobhit requested review from a team as code owners August 29, 2025 22:09

coderabbitai bot commented Aug 29, 2025

📝 Walkthrough

Walkthrough

Replaces per-component Docker install steps with a centralized docker/common/install.sh orchestrator, adds Triton multi-stage build stages (triton, tritondevel, tritonbuild, tritonrelease), introduces an Enroot SQSH workflow and Makefile, documents the Enroot option, and adds .gitignore entry for the SQSH image.

Changes

Cohort / File(s) Summary of Changes
Docker multi-stage refactor & Triton integration
docker/Dockerfile.multi
Replaces individual install script invocations with docker/common/install.sh; adds ARGs (TRT_VER, CUDA_VER, CUDNN_VER, NCCL_VER, CUBLAS_VER), PYTORCH_CUDA_ALLOC_CONF, COPY docker/common/install.sh install.sh; introduces stages triton, tritondevel, tritonbuild, tritonrelease; integrates Triton artifacts into build/release graph.
Installer orchestrator
docker/common/install.sh
Adds new flag-driven orchestrator supporting --base, --cmake, --ccache, --cuda_toolkit, --tensorrt, --polygraphy, --mpi4py, --pytorch, --opencv, --protobuf, --all; resolves script dir, enables strict error handling, and dispatches to existing per-component install scripts with env args.
Enroot workflow / Makefile
enroot/Makefile
Adds include guard, new variables (e.g., BASE_IMAGE/BASE_TAG/SQSH_PATH/PYTHON_VERSION/TORCH_INSTALL_TYPE and TRT/CUDA-related versions), and targets build_sqsh and run_sqsh; build_sqsh uses srun to build an Enroot SQSH and runs install.sh --all; run_sqsh launches saved SQSH with forwarded env.
Documentation
docs/source/installation/build-from-source-linux.md
Adds an Enroot-based development workflow as an alternative to Docker: steps to allocate nodes, build the SQSH image, and run an Enroot sandbox mirroring the Docker dev flow.
Git ignore
.gitignore
Adds comment and ignore pattern enroot/tensorrt_llm.devel.sqsh.
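In shell terms, the build_sqsh flow amounts to an srun job step that starts the base image via pyxis, runs the install orchestrator inside it, and saves the result as a squashfs. The helper below only composes and prints such a command line as a dry-run sketch — the image name, mount layout, and exact pyxis flags here are assumptions for illustration, not the Makefile's literal recipe.

```shell
#!/bin/bash
# Dry-run sketch: compose the srun/pyxis command that a build_sqsh target would run.
# base_image/base_tag/sqsh_path mirror the Makefile's BASE_IMAGE/BASE_TAG/SQSH_PATH;
# the specific flags and paths are illustrative assumptions.
set -euo pipefail

compose_build_sqsh() {
    local base_image=$1 base_tag=$2 sqsh_path=$3
    # Pull the base image with pyxis, run the orchestrator inside it, save the SQSH.
    echo "srun --container-image=${base_image}:${base_tag}" \
         "--container-save=${sqsh_path}" \
         "--container-mounts=${PWD}:/workspace" \
         "bash /workspace/docker/common/install.sh --all"
}

compose_build_sqsh nvcr.io/nvidia/pytorch 25.01-py3 enroot/tensorrt_llm.devel.sqsh
```

Printing the composed command rather than executing it keeps the sketch runnable off-cluster; on a Slurm system the echoed line would be executed directly by the make target.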

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Dev as Developer
  participant Docker as Docker build
  participant Devel as devel stage
  participant Install as install.sh
  participant Sub as per-component scripts

  Dev->>Docker: docker build (multi-stage)
  Docker->>Devel: build devel stage
  Devel->>Install: bash install.sh --base --cmake ... --protobuf
  Install->>Sub: invoke install_base.sh / install_cmake.sh / ...
  Sub-->>Install: completed
  Install-->>Devel: setup complete

  rect rgba(220,235,255,0.35)
    note over Docker,Devel: Triton integration (new stages)
    participant Triton as triton
    participant TritonDevel as tritondevel
    Docker->>Triton: FROM ${TRITON_IMAGE}:${TRITON_BASE_TAG}
    Triton-->>TritonDevel: copy UCX/etcd/etc.
    TritonDevel->>Install: run install_triton.sh / orchestrated steps
  end
sequenceDiagram
  autonumber
  actor User
  participant SL as srun (build_sqsh)
  participant Container as BaseImage
  participant Install as docker/common/install.sh
  participant SQ as tensorrt_llm.devel.sqsh
  participant Run as srun (run_sqsh)

  User->>SL: make -C enroot build_sqsh
  SL->>Container: start $(BASE_IMAGE):$(BASE_TAG)
  Container->>Install: ./install.sh --all (env: TRT_VER, CUDA_VER, ...)
  Install-->>Container: components installed
  Container->>SQ: save SQSH

  User->>Run: make -C enroot run_sqsh
  Run->>SQ: launch enroot container
  SQ-->>User: interactive session (RUN_CMD)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • MartinMarciniszyn
  • chzblych
  • kaiyux

@v-shobhit v-shobhit marked this pull request as draft August 29, 2025 22:10

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (10)
docker/common/install.sh (6)

16-21: Gate noisy version echoes behind selection.

Printing empty TRT/CUDA/etc. values on every run is noisy. Only echo them when --tensorrt is chosen.

-echo "TRT_VER: $TRT_VER"
-echo "CUDA_VER: $CUDA_VER"
-echo "CUDNN_VER: $CUDNN_VER"
-echo "NCCL_VER: $NCCL_VER"
-echo "CUBLAS_VER: $CUBLAS_VER"
+# Shown later only if --tensorrt is selected

36-96: Add --help and usage; fail fast on no flags.

Small UX improvement for shell usage on HPC jobs.

+usage() {
+  cat <<'EOF'
+Usage: install.sh [--base] [--cmake] [--ccache] [--cuda_toolkit] [--tensorrt]
+                  [--polygraphy] [--mpi4py] [--pytorch] [--opencv] [--protobuf] [--all]
+Env:
+  GITHUB_MIRROR           Propagated to installers (optional)
+  TORCH_INSTALL_TYPE      e.g., wheels|source|skip (default: wheels)
+EOF
+}
+
 while [[ $# -gt 0 ]]; do
     case $1 in
+        -h|--help)
+            usage; exit 0 ;;
         --base)
             base=1
             shift 1
             ;;
@@
         --all)
             base=1
             cmake=1
             ccache=1
             cuda_toolkit=1
             tensorrt=1
             polygraphy=1
             mpi4py=1
             pytorch=1
             opencv=1
             protobuf=1
             shift 1
             ;;
         *)
             echo "Unknown option: $1"
-            exit 1
+            usage; exit 1
             ;;
     esac
 done
+
+# Optional: require at least one selection
+if (( base + cmake + ccache + cuda_toolkit + tensorrt + polygraphy + mpi4py + pytorch + opencv + protobuf == 0 )); then
+  echo "No components selected." >&2
+  usage; exit 1
+fi

119-123: Same here: use SCRIPT_DIR.

-GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cuda_toolkit.sh
+GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cuda_toolkit.sh"

124-132: Add env echo after validation and use SCRIPT_DIR.

-    echo "Installing TensorRT..."
+    echo "Installing TensorRT..."
+    echo "TRT_VER=${TRT_VER} CUDA_VER=${CUDA_VER} CUDNN_VER=${CUDNN_VER} NCCL_VER=${NCCL_VER} CUBLAS_VER=${CUBLAS_VER}"
-    bash ./install_tensorrt.sh \
+    bash "$SCRIPT_DIR/install_tensorrt.sh" \

134-147: Mirror pass-through for all that support it; ensure SCRIPT_DIR and TORCH_INSTALL_TYPE usage.

-    bash ./install_polygraphy.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_polygraphy.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_mpi4py.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_mpi4py.sh"
@@
-    bash ./install_pytorch.sh $TORCH_INSTALL_TYPE
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_pytorch.sh" "$TORCH_INSTALL_TYPE"

156-159: Consider constraints and reproducibility for protobuf.

Unpinned lower bound can drift. If you have constraints.txt, route through it; otherwise pin via env for reproducibility.

-    pip3 install --upgrade --no-cache-dir \
-    "protobuf>=4.25.8"
+    pip3 install --upgrade --no-cache-dir ${PROTOBUF_PKG:-protobuf}${PROTOBUF_VERSION:+==${PROTOBUF_VERSION}} \
+        ${PIP_CONSTRAINT:+-c ${PIP_CONSTRAINT}}
docker/Dockerfile.multi (4)

42-42: Avoid duplicate PYTORCH_CUDA_ALLOC_CONF definitions.

You set it here and inside install.sh; keep one. I’d prefer it in Dockerfile (visibility at build/run).

-ENV PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
+# Already exported by install.sh or keep here and remove from install.sh for clarity.

47-52: Export version envs in a single RUN or use ENV for layer caching clarity.

Current pattern works; optionally set them as ENV then invoke --tensorrt in a separate RUN to improve cache hits across rebuilds with stable versions.


59-67: Combine cleanup with last install step to reduce layers; consider wildcard.

Minor image-size optimization.

-RUN rm install.sh \
-    install_base.sh \
-    install_cmake.sh \
-    install_ccache.sh \
-    install_cuda_toolkit.sh \
-    install_tensorrt.sh \
-    install_polygraphy.sh \
-    install_mpi4py.sh \
-    install_pytorch.sh
+RUN bash ./install.sh --protobuf && \
+    rm install.sh install_*.sh

44-57: Option: use a single orchestrator call for build determinism.

RUN bash ./install.sh --all simplifies the layer graph; but you lose per-step caching. Keep as-is if caching is desired.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 43cb50f and 553ec7b.

📒 Files selected for processing (2)
  • docker/Dockerfile.multi (2 hunks)
  • docker/common/install.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-20T15:04:42.885Z
Learnt from: dbari
PR: NVIDIA/TensorRT-LLM#7095
File: docker/Dockerfile.multi:168-168
Timestamp: 2025-08-20T15:04:42.885Z
Learning: In docker/Dockerfile.multi, wildcard COPY for benchmarks (${CPP_BUILD_DIR}/benchmarks/*Benchmark) is intentionally used instead of directory copy because the benchmarks directory contains various other build artifacts during C++ builds, and only specific benchmark executables should be copied to the final image.

Applied to files:

  • docker/Dockerfile.multi
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docker/common/install.sh (1)

102-118: Call sub-scripts via SCRIPT_DIR and propagate mirror consistently.

Use the resolved path everywhere; mirror propagation is already consistent here.

docker/Dockerfile.multi (1)

9-17: Copying orchestrator and installers looks good.

Paths align with RUN usage; no exec-bit needed since invoked via bash.

@v-shobhit v-shobhit changed the title Use a shell context to install dependancies [None][feat] Use a shell context to install dependancies Aug 29, 2025
@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Aug 29, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
docker/common/install.sh (3)

138-143: Make OpenCV replacement idempotent and Python-version agnostic; prefer python -m pip.

Covers both opencv-python(-headless), removes cv2 from actual site-packages, supports pin via env.

-if [ $opencv -eq 1 ]; then
+if [ $opencv -eq 1 ]; then
     echo "Installing OpenCV..."
-    pip3 uninstall -y opencv && \
-    rm -rf /usr/local/lib/python3*/dist-packages/cv2/ && \
-    pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
+    python3 -m pip uninstall -y opencv opencv-python opencv-python-headless || true
+    python3 - <<'PY'
+import site, shutil, glob, pathlib
+paths = [p for p in (site.getsitepackages()+[site.getusersitepackages()]) if p]
+for p in paths:
+    for d in glob.glob(str(pathlib.Path(p)/"cv2*")):
+        shutil.rmtree(d, ignore_errors=True)
+PY
+    python3 -m pip install "${OPENCV_PKG:-opencv-python-headless}${OPENCV_VERSION:+==${OPENCV_VERSION}}" \
+        --force-reinstall --no-deps --no-cache-dir
 fi

1-1: Enable strict mode and resolve script-relative paths; call sub-scripts via absolute paths.

Prevents masked failures and CWD-dependent breakage (common with enroot/pyxis). Replace all ./install_*.sh invocations to use the script’s directory.

-#!/bin/bash
+#!/bin/bash
+set -Eeuo pipefail
+shopt -s nullglob
+trap 'echo "[install.sh] Error on line $LINENO" >&2' ERR
+
+# Resolve script directory for robust relative pathing
+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"

@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_base.sh $PYTHON_VERSION
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "$PYTHON_VERSION"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cmake.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cmake.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_ccache.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_ccache.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cuda_toolkit.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cuda_toolkit.sh"
@@
-    bash ./install_tensorrt.sh \
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_tensorrt.sh" \
         --TRT_VER=${TRT_VER} \
         --CUDA_VER=${CUDA_VER} \
         --CUDNN_VER=${CUDNN_VER} \
         --NCCL_VER=${NCCL_VER} \
         --CUBLAS_VER=${CUBLAS_VER}
@@
-    bash ./install_polygraphy.sh
+    bash "$SCRIPT_DIR/install_polygraphy.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_mpi4py.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_mpi4py.sh"
@@
-    bash ./install_pytorch.sh $TORCH_INSTALL_TYPE
+    bash "$SCRIPT_DIR/install_pytorch.sh" "$TORCH_INSTALL_TYPE"

Also applies to: 95-96, 100-101, 105-106, 110-111, 115-121, 125-126, 130-131, 135-136


83-86: Don’t clobber environment-provided defaults; choose sane fallbacks.

Respect upstream-provided GITHUB_MIRROR and TORCH_INSTALL_TYPE; keep PYTORCH_CUDA_ALLOC_CONF configurable.

-GITHUB_MIRROR=""
-TORCH_INSTALL_TYPE="skip"
-export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
+GITHUB_MIRROR="${GITHUB_MIRROR:-}"
+# Prefer prebuilt wheels unless caller overrides
+TORCH_INSTALL_TYPE="${TORCH_INSTALL_TYPE:-wheels}"
+export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-garbage_collection_threshold:0.99999}"
🧹 Nitpick comments (4)
docker/common/install.sh (4)

21-81: Add --help and a usage banner for discoverability.

Improves UX and aids CI logs.

+usage() {
+  cat <<'USAGE'
+Usage: install.sh [--base] [--cmake] [--ccache] [--cuda_toolkit] [--tensorrt]
+                  [--polygraphy] [--mpi4py] [--pytorch] [--opencv] [--protobuf]
+                  [--all] [-h|--help]
+Environment:
+  GITHUB_MIRROR, TORCH_INSTALL_TYPE, PYTHON_VERSION, OPENCV_PKG, OPENCV_VERSION, PROTOBUF_VERSION
+USAGE
+}
@@
     case $1 in
+        -h|--help)
+            usage
+            exit 0
+            ;;
         --base)

92-96: Parameterize Python version via env; quote args.

Allows callers to override without editing the script.

-    PYTHON_VERSION="3.12.3"
+    PYTHON_VERSION="${PYTHON_VERSION:-3.12.3}"
     echo "Using Python version: $PYTHON_VERSION"
-
-    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "$PYTHON_VERSION"
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "$PYTHON_VERSION"

145-148: Allow protobuf version override and use python -m pip.

Keeps flexibility for compatibility bumps.

-if [ $protobuf -eq 1 ]; then
-    pip3 install --upgrade --no-cache-dir \
-    "protobuf>=4.25.8"
-fi
+if [ $protobuf -eq 1 ]; then
+    python3 -m pip install --upgrade --no-cache-dir \
+        "${PROTOBUF_SPEC:-protobuf>=4.25.8}"
+fi

3-7: Minor: gate version echoes behind a verbose flag.

Reduces noise in CI logs; enable via VERBOSE=1.

-echo "TRT_VER: $TRT_VER"
-echo "CUDA_VER: $CUDA_VER"
-echo "CUDNN_VER: $CUDNN_VER"
-echo "NCCL_VER: $NCCL_VER"
-echo "CUBLAS_VER: $CUBLAS_VER"
+[ "${VERBOSE:-0}" -eq 1 ] && {
+  echo "TRT_VER: ${TRT_VER:-}"
+  echo "CUDA_VER: ${CUDA_VER:-}"
+  echo "CUDNN_VER: ${CUDNN_VER:-}"
+  echo "NCCL_VER: ${NCCL_VER:-}"
+  echo "CUBLAS_VER: ${CUBLAS_VER:-}"
+}

📥 Commits

Reviewing files that changed from the base of the PR and between 553ec7b and 284b965.

📒 Files selected for processing (2)
  • docker/Dockerfile.multi (2 hunks)
  • docker/common/install.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docker/Dockerfile.multi
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docker/common/install.sh (1)

133-136: Confirm interface for install_pytorch.sh and default.

Current default TORCH_INSTALL_TYPE=“wheels” assumed; verify the sub-script expects a single positional arg; update to a flag if needed.

Would you like me to scan Dockerfile.multi and install_pytorch.sh to align the invocation?


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
docker/common/install.sh (5)

133-138: Make OpenCV replacement idempotent and Python-version agnostic.

Current sequence exits on missing package and hardcodes dist-packages path.

-    pip3 uninstall -y opencv && \
-    rm -rf /usr/local/lib/python3*/dist-packages/cv2/ && \
-    pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
+    pip3 uninstall -y opencv opencv-python opencv-python-headless || true
+    python3 - <<'PY'
+import site, shutil, glob, pathlib
+paths = [p for p in (site.getsitepackages() + [site.getusersitepackages()]) if p]
+for p in paths:
+    for d in glob.glob(str(pathlib.Path(p)/"cv2*")):
+        shutil.rmtree(d, ignore_errors=True)
+PY
+    pip3 install "${OPENCV_PKG:-opencv-python-headless}${OPENCV_VERSION:+==${OPENCV_VERSION}}" \
+        --force-reinstall --no-deps --no-cache-dir

1-3: Add trap, nullglob, and resolve SCRIPT_DIR for robust execution across CWDs.

Prevents masked failures and fixes relative-path breakage when invoked from outside docker/common.

 #!/bin/bash
 set -Eeuo pipefail
+
+shopt -s nullglob
+trap 'echo "[install.sh] Error on line $LINENO" >&2' ERR
+
+# Resolve script directory for robust relative pathing
+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"

78-81: Don’t clobber env; provide sane defaults instead.

Respect caller-supplied GITHUB_MIRROR/TORCH_INSTALL_TYPE and keep CUDA alloc conf overrideable.

-GITHUB_MIRROR=""
-TORCH_INSTALL_TYPE="skip"
-export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
+GITHUB_MIRROR="${GITHUB_MIRROR:-}"
+# Default to prebuilt wheels unless explicitly overridden
+TORCH_INSTALL_TYPE="${TORCH_INSTALL_TYPE:-wheels}"
+# Keep overrideable for standalone usage
+export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-garbage_collection_threshold:0.99999}"

93-131: Call sub-scripts via SCRIPT_DIR and quote all arguments.

Fixes failures when CWD != script dir (e.g., enroot/srun usage) and avoids word-splitting.

-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cmake.sh
+    GITHUB_MIRROR="${GITHUB_MIRROR}" bash "${SCRIPT_DIR}/install_cmake.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_ccache.sh
+    GITHUB_MIRROR="${GITHUB_MIRROR}" bash "${SCRIPT_DIR}/install_ccache.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cuda_toolkit.sh
+    GITHUB_MIRROR="${GITHUB_MIRROR}" bash "${SCRIPT_DIR}/install_cuda_toolkit.sh"
@@
-    bash ./install_tensorrt.sh \
+    GITHUB_MIRROR="${GITHUB_MIRROR}" bash "${SCRIPT_DIR}/install_tensorrt.sh" \
         --TRT_VER=${TRT_VER} \
         --CUDA_VER=${CUDA_VER} \
         --CUDNN_VER=${CUDNN_VER} \
         --NCCL_VER=${NCCL_VER} \
         --CUBLAS_VER=${CUBLAS_VER}
@@
-    bash ./install_polygraphy.sh
+    bash "${SCRIPT_DIR}/install_polygraphy.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_mpi4py.sh
+    GITHUB_MIRROR="${GITHUB_MIRROR}" bash "${SCRIPT_DIR}/install_mpi4py.sh"
@@
-    bash ./install_pytorch.sh $TORCH_INSTALL_TYPE
+    bash "${SCRIPT_DIR}/install_pytorch.sh" "${TORCH_INSTALL_TYPE}"

76-81: Validate required vars for TensorRT; fail fast with a clear error.

Prevents opaque failures when TRT/CUDA/CUDNN/NCCL/CUBLAS versions aren’t set.

 done
 
+# Require named environment variables to be non-empty
+require_vars() {
+    local missing=()
+    for v in "$@"; do
+        if [ -z "${!v:-}" ]; then missing+=("$v"); fi
+    done
+    if [ "${#missing[@]}" -ne 0 ]; then
+        echo "Missing required variables: ${missing[*]}" >&2
+        exit 2
+    fi
+}

And invoke before calling install_tensorrt.sh:

 if [ $tensorrt -eq 1 ]; then
     echo "Installing TensorRT..."
+    require_vars TRT_VER CUDA_VER CUDNN_VER NCCL_VER CUBLAS_VER
🧹 Nitpick comments (3)
docker/common/install.sh (3)

82-91: Make Python version configurable via env; quote args.

Improves portability and reproducibility.

-    PYTHON_VERSION="3.12.3"
-    echo "Using Python version: $PYTHON_VERSION"
+    PYTHON_VERSION="${PYTHON_VERSION:-3.12.3}"
+    echo "Using Python version: ${PYTHON_VERSION}"
-
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_base.sh $PYTHON_VERSION
+    GITHUB_MIRROR="${GITHUB_MIRROR}" bash "${SCRIPT_DIR}/install_base.sh" "${PYTHON_VERSION}"

140-143: Allow configurable protobuf version/package; keep minimum default.

Improves flexibility across environments requiring pinned protobuf.

-    pip3 install --upgrade --no-cache-dir \
-    "protobuf>=4.25.8"
+    pip3 install --upgrade --no-cache-dir \
+    "${PROTOBUF_SPEC:-protobuf>=4.25.8}"

16-76: Add --help usage for discoverability.

Small UX boost; harmless default behavior.

     case $1 in
+        -h|--help)
+            cat <<'USAGE'
+Usage: docker/common/install.sh [options]
+  --base --cmake --ccache --cuda_toolkit --tensorrt --polygraphy --mpi4py --pytorch --opencv --protobuf
+  --all            Install all components
+  -h, --help       Show this help
+Environment:
+  GITHUB_MIRROR, TORCH_INSTALL_TYPE (default: wheels), PYTHON_VERSION, OPENCV_PKG/OPENCV_VERSION,
+  PROTOBUF_PKG/PROTOBUF_VERSION, and version vars for --tensorrt: TRT_VER, CUDA_VER, CUDNN_VER, NCCL_VER, CUBLAS_VER
+USAGE
+            exit 0
+            ;;

📥 Commits

Reviewing files that changed from the base of the PR and between 284b965 and 584524d.

📒 Files selected for processing (1)
  • docker/common/install.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Signed-off-by: Shobhit Verma <[email protected]>
@v-shobhit v-shobhit force-pushed the dev/shobhitv/shell-build-ctx branch from 584524d to 1a2b1bc Compare September 2, 2025 17:30

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
docker/common/install.sh (4)

133-138: Make OpenCV replacement idempotent and Python-version agnostic.

Handle both opencv-python and opencv-python-headless, tolerate their absence, and remove cv2 via site-packages discovery.

 if [ $opencv -eq 1 ]; then
     echo "Installing OpenCV..."
-    pip3 uninstall -y opencv && \
-    rm -rf /usr/local/lib/python3*/dist-packages/cv2/ && \
-    pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
+    pip3 uninstall -y opencv opencv-python opencv-python-headless || true
+    python3 - <<'PY'
+import site, shutil, glob, pathlib
+paths = [p for p in (site.getsitepackages()+[site.getusersitepackages()]) if p]
+for p in paths:
+    for d in glob.glob(str(pathlib.Path(p)/"cv2*")):
+        shutil.rmtree(d, ignore_errors=True)
+PY
+    pip3 install "${OPENCV_PKG:-opencv-python-headless}${OPENCV_VERSION:+==${OPENCV_VERSION}}" \
+        --force-reinstall --no-deps --no-cache-dir
 fi

1-1: Enable script-relative execution of sub-scripts.

Calls like “bash ./install_*.sh” break when CWD ≠ script dir (e.g., Enroot). Resolve SCRIPT_DIR and call via absolute paths.

 #!/bin/bash
 set -Eeuo pipefail
+shopt -s nullglob
+
+# Resolve script directory
+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"

Replace all sub-script invocations:

-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_base.sh $PYTHON_VERSION
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "$PYTHON_VERSION"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cmake.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cmake.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_ccache.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_ccache.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_cuda_toolkit.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cuda_toolkit.sh"
@@
-    bash ./install_tensorrt.sh \
+    bash "$SCRIPT_DIR/install_tensorrt.sh" \
@@
-    bash ./install_polygraphy.sh
+    bash "$SCRIPT_DIR/install_polygraphy.sh"
@@
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_mpi4py.sh
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_mpi4py.sh"
@@
-    bash ./install_pytorch.sh $TORCH_INSTALL_TYPE
+    bash "$SCRIPT_DIR/install_pytorch.sh" "$TORCH_INSTALL_TYPE"

78-81: Preserve env-provided mirrors and set a sensible default for TORCH_INSTALL_TYPE.

Don’t clobber GITHUB_MIRROR; default TORCH_INSTALL_TYPE to “wheels” so --pytorch actually installs.

-GITHUB_MIRROR=""
-TORCH_INSTALL_TYPE="skip"
-export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
+GITHUB_MIRROR="${GITHUB_MIRROR:-}"
+TORCH_INSTALL_TYPE="${TORCH_INSTALL_TYPE:-wheels}"
+export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-garbage_collection_threshold:0.99999}"

108-116: Fail fast if required TRT env versions are missing.

Avoid opaque errors in install_tensorrt.sh by validating version variables first.

+require_vars() {
+  local missing=()
+  for v in "$@"; do [ -n "${!v:-}" ] || missing+=("$v"); done
+  if [ "${#missing[@]}" -ne 0 ]; then
+    echo "Missing required variables: ${missing[*]}" >&2
+    exit 2
+  fi
+}
 if [ $tensorrt -eq 1 ]; then
     echo "Installing TensorRT..."
+    require_vars TRT_VER CUDA_VER CUDNN_VER NCCL_VER CUBLAS_VER
     bash "$SCRIPT_DIR/install_tensorrt.sh" \
         --TRT_VER=${TRT_VER} \
         --CUDA_VER=${CUDA_VER} \
         --CUDNN_VER=${CUDNN_VER} \
         --NCCL_VER=${NCCL_VER} \
         --CUBLAS_VER=${CUBLAS_VER}
 fi
docker/Dockerfile.multi (1)

31-55: Propagate GITHUB_MIRROR and TORCH_INSTALL_TYPE into the devel stage.

ARGs don’t cross stages; without re-declaring, install.sh sees empty values and may skip PyTorch.

 FROM base AS devel
 
 ARG TRT_VER
 ARG CUDA_VER
 ARG CUDNN_VER
 ARG NCCL_VER
 ARG CUBLAS_VER
+ARG GITHUB_MIRROR=""
+ARG TORCH_INSTALL_TYPE="wheels"
+ENV GITHUB_MIRROR=${GITHUB_MIRROR}
+ENV TORCH_INSTALL_TYPE=${TORCH_INSTALL_TYPE}
 
 ENV PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
 RUN bash ./install.sh --base
 RUN bash ./install.sh --cmake
 RUN bash ./install.sh --ccache
 RUN bash ./install.sh --cuda_toolkit
@@
 RUN bash ./install.sh --mpi4py
 RUN bash ./install.sh --pytorch
 RUN bash ./install.sh --opencv
 RUN bash ./install.sh --protobuf
🧹 Nitpick comments (12)
enroot/Makefile (5)

10-23: Quote mount paths and mark target as phony.

Avoid failures when SOURCE_DIR has spaces and prevent make from treating build_sqsh as a file target.

+.PHONY: build_sqsh
 build_sqsh:
 	@echo "Building trtllm sqsh image. Base image: $(BASE_IMAGE):$(BASE_TAG). Location: $(SQSH_PATH)"
 
 	srun \
 		--container-image $(BASE_IMAGE):$(BASE_TAG) \
 		--container-save $(SQSH_PATH) \
-		--container-mounts $(SOURCE_DIR):/code/tensorrt_llm --container-workdir /code/tensorrt_llm/docker/common \
+		--container-mounts "$(SOURCE_DIR)":/code/tensorrt_llm --container-workdir /code/tensorrt_llm/docker/common \
 		--container-mount-home --container-remap-root \
 		--container-env TRT_VER \
 		--container-env CUDA_VER \
 		--container-env CUDNN_VER \
 		--container-env NCCL_VER \
 		--container-env CUBLAS_VER \
 		./install.sh --all

1-2: Harden base image/tag derivation and set sane fallbacks.

If grep fails, BASE_IMAGE/TAG become empty. Provide defaults to keep make usable.

-BASE_IMAGE         ?= $(shell grep '^ARG BASE_IMAGE=' ../docker/Dockerfile.multi | grep -o '=.*' | tr -d '="')
-BASE_TAG           ?= $(shell grep '^ARG BASE_TAG=' ../docker/Dockerfile.multi | grep -o '=.*' | tr -d '="')
+BASE_IMAGE         ?= $(shell grep -m1 '^ARG BASE_IMAGE=' ../docker/Dockerfile.multi | sed -E 's/.*= *"?([^" ]+).*/\1/')
+BASE_TAG           ?= $(shell grep -m1 '^ARG BASE_TAG='   ../docker/Dockerfile.multi | sed -E 's/.*= *"?([^" ]+).*/\1/')
+BASE_IMAGE         := $(if $(BASE_IMAGE),$(BASE_IMAGE),nvcr.io/nvidia/pytorch)
+BASE_TAG           := $(if $(BASE_TAG),$(BASE_TAG),25.06-py3)

18-23: Pass through optional env knobs used by install.sh.

Forward GITHUB_MIRROR and TORCH_INSTALL_TYPE so users can control mirrors and PyTorch flow.

 		--container-env NCCL_VER \
 		--container-env CUBLAS_VER \
+		--container-env GITHUB_MIRROR \
+		--container-env TORCH_INSTALL_TYPE \
 		./install.sh --all

11-11: Consistent naming in status message.

Use “TensorRT-LLM” (or “tensorrt_llm”) instead of “trtllm” for clarity.

-	@echo "Building trtllm sqsh image. Base image: $(BASE_IMAGE):$(BASE_TAG). Location: $(SQSH_PATH)"
+	@echo "Building TensorRT-LLM SQSH image. Base image: $(BASE_IMAGE):$(BASE_TAG). Location: $(SQSH_PATH)"

1-1: Optional: add conventional targets flagged by checkmake.

Minimal “all” and “clean” targets quell CI warnings without changing behavior.

+.PHONY: all clean test
+all: build_sqsh
+clean:
+	@rm -f -- $(SQSH_PATH)
+test:
+	@echo "No tests for enroot Makefile"
docker/common/install.sh (2)

87-91: Parameterize Python version for base install.

Allow override via env; print actual version used.

-    PYTHON_VERSION="3.12.3"
-    echo "Using Python version: $PYTHON_VERSION"
+    PYTHON_VERSION="${PYTHON_VERSION:-3.12.3}"
+    echo "Using Python version: ${PYTHON_VERSION}"
 
-    GITHUB_MIRROR=$GITHUB_MIRROR bash ./install_base.sh $PYTHON_VERSION
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "${PYTHON_VERSION}"

71-76: Add a minimal usage/help message.

Improves UX for unknown options and CI logs.

-        *)
-            echo "Unknown option: $1"
-            exit 1
+        -h|--help)
+            echo "Usage: $0 [--base|--cmake|--ccache|--cuda_toolkit|--tensorrt|--polygraphy|--mpi4py|--pytorch|--opencv|--protobuf|--all]"
+            exit 0
+            ;;
+        *)
+            echo "Unknown option: $1" >&2
+            exit 1
docker/Dockerfile.multi (5)

8-17: Copy installer scripts to a fixed path and use absolute paths.

Relying on image WORKDIR is brittle across NGC tags. Copy to /opt/tensorrt_llm/install and invoke via absolute path.

-FROM ${BASE_IMAGE}:${BASE_TAG} AS base
-COPY docker/common/install.sh install.sh
-COPY docker/common/install_base.sh install_base.sh
-COPY docker/common/install_cmake.sh install_cmake.sh
-COPY docker/common/install_ccache.sh install_ccache.sh
-COPY docker/common/install_cuda_toolkit.sh install_cuda_toolkit.sh
-COPY docker/common/install_tensorrt.sh install_tensorrt.sh
-COPY docker/common/install_polygraphy.sh install_polygraphy.sh
-COPY docker/common/install_mpi4py.sh install_mpi4py.sh
-COPY docker/common/install_pytorch.sh install_pytorch.sh
+FROM ${BASE_IMAGE}:${BASE_TAG} AS base
+WORKDIR /opt/tensorrt_llm/install
+COPY docker/common/install.sh .
+COPY docker/common/install_base.sh .
+COPY docker/common/install_cmake.sh .
+COPY docker/common/install_ccache.sh .
+COPY docker/common/install_cuda_toolkit.sh .
+COPY docker/common/install_tensorrt.sh .
+COPY docker/common/install_polygraphy.sh .
+COPY docker/common/install_mpi4py.sh .
+COPY docker/common/install_pytorch.sh .

Then adjust calls in devel:

-RUN bash ./install.sh --base
+RUN bash /opt/tensorrt_llm/install/install.sh --base
# ...apply similarly for subsequent RUN lines...

56-64: Ensure cleanup matches new install path.

If adopting absolute paths, remove from the fixed dir.

-RUN rm install.sh \
-    install_base.sh \
-    install_cmake.sh \
-    install_ccache.sh \
-    install_cuda_toolkit.sh \
-    install_tensorrt.sh \
-    install_polygraphy.sh \
-    install_mpi4py.sh \
-    install_pytorch.sh
+RUN rm -f /opt/tensorrt_llm/install/install.sh \
+          /opt/tensorrt_llm/install/install_base.sh \
+          /opt/tensorrt_llm/install/install_cmake.sh \
+          /opt/tensorrt_llm/install/install_ccache.sh \
+          /opt/tensorrt_llm/install/install_cuda_toolkit.sh \
+          /opt/tensorrt_llm/install/install_tensorrt.sh \
+          /opt/tensorrt_llm/install/install_polygraphy.sh \
+          /opt/tensorrt_llm/install/install_mpi4py.sh \
+          /opt/tensorrt_llm/install/install_pytorch.sh

93-99: Remove duplicate COPY of benchmarks.

The directory is copied twice back-to-back.

 COPY benchmarks benchmarks
 COPY cpp cpp
-COPY benchmarks benchmarks

39-39: Optional: keep PYTORCH_CUDA_ALLOC_CONF solely in Dockerfile, not inside install.sh.

It’s already exported here; let install.sh respect env instead of re-exporting. No code change required if you accept earlier env-preservation fix.
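A minimal sketch of the env-respecting pattern (illustrative only; the default value shown matches the Dockerfile's):

```shell
# Apply the default only when the caller has not already set the variable;
# an existing value (e.g. exported by the Dockerfile) is left untouched.
export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-garbage_collection_threshold:0.99999}"
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

With this form, install.sh can keep a sane standalone default while the Dockerfile remains the single source of truth when it sets the variable first.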


91-99: Stage alias via ARG can be brittle on older Docker; consider direct alias.

If older builders are used, “FROM ${DEVEL_IMAGE} AS wheel” may not resolve the prior stage alias. Using “FROM devel AS wheel” is safer.

-FROM ${DEVEL_IMAGE} AS wheel
+FROM devel AS wheel
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 584524d and 1a2b1bc.

📒 Files selected for processing (4)
  • .gitignore (1 hunks)
  • docker/Dockerfile.multi (2 hunks)
  • docker/common/install.sh (1 hunks)
  • enroot/Makefile (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🧰 Additional context used
📓 Path-based instructions (1)
**/*

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Filenames compiled into a target must be case-insensitively unique

Files:

  • enroot/Makefile
  • docker/common/install.sh
  • docker/Dockerfile.multi
🧠 Learnings (1)
📚 Learning: 2025-08-20T15:04:42.885Z
Learnt from: dbari
PR: NVIDIA/TensorRT-LLM#7095
File: docker/Dockerfile.multi:168-168
Timestamp: 2025-08-20T15:04:42.885Z
Learning: In docker/Dockerfile.multi, wildcard COPY for benchmarks (${CPP_BUILD_DIR}/benchmarks/*Benchmark) is intentionally used instead of directory copy because the benchmarks directory contains various other build artifacts during C++ builds, and only specific benchmark executables should be copied to the final image.

Applied to files:

  • docker/Dockerfile.multi
🪛 checkmake (0.2.2)
enroot/Makefile

[warning] 1-1: Missing required phony target "all"

(minphony)


[warning] 1-1: Missing required phony target "clean"

(minphony)


[warning] 1-1: Missing required phony target "test"

(minphony)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Signed-off-by: Shobhit Verma <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
docker/common/install.sh (5)

95-95: Quote SCRIPT_DIR calls to avoid word-splitting/globbing.

Safer even if paths contain spaces; aligns with earlier guidance.

-GITHUB_MIRROR=$GITHUB_MIRROR bash $SCRIPT_DIR/install_base.sh $PYTHON_VERSION
+GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "$PYTHON_VERSION"

-GITHUB_MIRROR=$GITHUB_MIRROR bash $SCRIPT_DIR/install_cmake.sh
+GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cmake.sh"

-GITHUB_MIRROR=$GITHUB_MIRROR bash $SCRIPT_DIR/install_ccache.sh
+GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_ccache.sh"

-GITHUB_MIRROR=$GITHUB_MIRROR bash $SCRIPT_DIR/install_cuda_toolkit.sh
+GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_cuda_toolkit.sh"

-bash $SCRIPT_DIR/install_tensorrt.sh \
+bash "$SCRIPT_DIR/install_tensorrt.sh" \

-bash $SCRIPT_DIR/install_polygraphy.sh
+bash "$SCRIPT_DIR/install_polygraphy.sh"

-GITHUB_MIRROR=$GITHUB_MIRROR bash $SCRIPT_DIR/install_mpi4py.sh
+GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_mpi4py.sh"

-bash $SCRIPT_DIR/install_pytorch.sh $TORCH_INSTALL_TYPE
+bash "$SCRIPT_DIR/install_pytorch.sh" "$TORCH_INSTALL_TYPE"

Also applies to: 100-100, 105-105, 110-110, 115-115, 125-125, 130-130, 135-135


138-143: Make OpenCV swap idempotent and Python-version agnostic.

Covers opencv-python/headless variants and removes cv2 from actual site-packages.

-    pip3 uninstall -y opencv && \
-    rm -rf /usr/local/lib/python3*/dist-packages/cv2/ && \
-    pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
+    pip3 uninstall -y opencv opencv-python opencv-python-headless || true
+    python3 - <<'PY'
+import site, shutil, glob, pathlib
+paths = list(filter(None, site.getsitepackages() + [site.getusersitepackages()]))
+for p in paths:
+    for d in glob.glob(str(pathlib.Path(p)/"cv2*")):
+        shutil.rmtree(d, ignore_errors=True)
+PY
+    pip3 install "${OPENCV_PKG:-opencv-python-headless}${OPENCV_VERSION:+==${OPENCV_VERSION}}" \
+        --force-reinstall --no-deps --no-cache-dir

83-85: Don’t clobber environment; use sane defaults.

Overwriting GITHUB_MIRROR and TORCH_INSTALL_TYPE can break Docker/CI overrides; also avoid forcing PYTORCH_CUDA_ALLOC_CONF if already set.

-GITHUB_MIRROR=""
-TORCH_INSTALL_TYPE="skip"
-export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
+GITHUB_MIRROR="${GITHUB_MIRROR:-}"
+# Prefer wheels by default; keep build-from-source opt-in.
+TORCH_INSTALL_TYPE="${TORCH_INSTALL_TYPE:-wheels}"
+# Respect existing setting if provided.
+export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-garbage_collection_threshold:0.99999}"

81-86: Fail fast: add required-var validation helper.

Prevents opaque failures when TRT_* and CUDA* versions are missing.

 done
+
+# Require named environment variables to be non-empty
+require_vars() {
+  local missing=()
+  for v in "$@"; do
+    if [ -z "${!v:-}" ]; then missing+=("$v"); fi
+  done
+  if [ "${#missing[@]}" -ne 0 ]; then
+    echo "Missing required variables: ${missing[*]}" >&2
+    exit 2
+  fi
+}

113-121: TensorRT: validate inputs and pass through mirror.

Ensures deterministic installs and mirror usage parity with other steps.

 if [ $tensorrt -eq 1 ]; then
     echo "Installing TensorRT..."
-    bash $SCRIPT_DIR/install_tensorrt.sh \
+    require_vars TRT_VER CUDA_VER CUDNN_VER NCCL_VER CUBLAS_VER
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_tensorrt.sh" \
         --TRT_VER=${TRT_VER} \
         --CUDA_VER=${CUDA_VER} \
         --CUDNN_VER=${CUDNN_VER} \
         --NCCL_VER=${NCCL_VER} \
         --CUBLAS_VER=${CUBLAS_VER}
 fi
🧹 Nitpick comments (3)
docker/common/install.sh (3)

92-96: Allow overriding Python version; quote arguments.

Makes base install adaptable to future NGC tags.

-    PYTHON_VERSION="3.12.3"
+    PYTHON_VERSION="${PYTHON_VERSION:-3.12.3}"
     echo "Using Python version: $PYTHON_VERSION"
-
-    GITHUB_MIRROR=$GITHUB_MIRROR bash $SCRIPT_DIR/install_base.sh $PYTHON_VERSION
+    GITHUB_MIRROR=$GITHUB_MIRROR bash "$SCRIPT_DIR/install_base.sh" "$PYTHON_VERSION"

21-81: Add --help and guard against no-selection.

Improves UX when invoked without flags.

 while [[ $# -gt 0 ]]; do
     case $1 in
+        -h|--help)
+            cat <<'USAGE'
+Usage: install.sh [--base] [--cmake] [--ccache] [--cuda_toolkit] [--tensorrt]
+                  [--polygraphy] [--mpi4py] [--pytorch] [--opencv] [--protobuf] [--all]
+Env: GITHUB_MIRROR, TORCH_INSTALL_TYPE, PYTHON_VERSION, OPENCV_PKG, OPENCV_VERSION, PROTOBUF_SPEC
+USAGE
+            exit 0
+            ;;
         --base)
@@
     esac
 done
+
+# If nothing selected, print help.
+if [ $base -eq 0 ] && [ $cmake -eq 0 ] && [ $ccache -eq 0 ] && \
+   [ $cuda_toolkit -eq 0 ] && [ $tensorrt -eq 0 ] && [ $polygraphy -eq 0 ] && \
+   [ $mpi4py -eq 0 ] && [ $pytorch -eq 0 ] && [ $opencv -eq 0 ] && [ $protobuf -eq 0 ]; then
+  echo "No components selected. Use --help for options." >&2
+  exit 2
+fi

145-148: Parameterize protobuf spec.

Allows pinning via env when needed for ABI compatibility.

-    pip3 install --upgrade --no-cache-dir \
-    "protobuf>=4.25.8"
+    pip3 install --upgrade --no-cache-dir \
+    "${PROTOBUF_SPEC:-protobuf>=4.25.8}"
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 1a2b1bc and be25635.

📒 Files selected for processing (1)
  • docker/common/install.sh (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Filenames compiled into a target must be case-insensitively unique

Files:

  • docker/common/install.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docker/common/install.sh (1)

2-7: Nice: strict mode, ERR trap, and robust SCRIPT_DIR.

Good foundations for reliability and script-relative execution.

Signed-off-by: Shobhit Verma <[email protected]>
Signed-off-by: Shobhit Verma <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
docker/common/install.sh (3)

138-143: Make OpenCV replacement idempotent and Python-agnostic; prefer python -m pip.

Handles both opencv-python and opencv-python-headless, cleans cv2 from the active site-packages, and allows pinning a version via env.

-if [ $opencv -eq 1 ]; then
-    echo "Installing OpenCV..."
-    pip3 uninstall -y opencv && \
-    rm -rf /usr/local/lib/python3*/dist-packages/cv2/ && \
-    pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
-fi
+if [ $opencv -eq 1 ]; then
+    echo "Installing OpenCV..."
+    python3 -m pip uninstall -y opencv opencv-python opencv-python-headless || true
+    python3 - <<'PY'
+import site, shutil, glob, pathlib
+paths = [p for p in (site.getsitepackages() + [site.getusersitepackages()]) if p]
+for p in paths:
+    for d in glob.glob(str(pathlib.Path(p)/"cv2*")):
+        shutil.rmtree(d, ignore_errors=True)
+PY
+    python3 -m pip install ${OPENCV_PKG:-opencv-python-headless}${OPENCV_VERSION:+==${OPENCV_VERSION}} \
+        --force-reinstall --no-deps --no-cache-dir
+fi

63-75: Bug: --all won’t install PyTorch because TORCH_INSTALL_TYPE=skip.

With --all, pytorch=1 but install_pytorch.sh gets "skip", resulting in no install. Default TORCH_INSTALL_TYPE to a sensible value and don’t clobber incoming env.

-GITHUB_MIRROR=""
-TORCH_INSTALL_TYPE="skip"
-export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
+GITHUB_MIRROR="${GITHUB_MIRROR:-}"
+# Prefer prebuilt wheels by default; caller can override to "source".
+TORCH_INSTALL_TYPE="${TORCH_INSTALL_TYPE:-wheels}"
+# Respect existing setting if provided.
+export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-garbage_collection_threshold:0.99999}"
@@
-    bash $SCRIPT_DIR/install_pytorch.sh $TORCH_INSTALL_TYPE
+    bash "$SCRIPT_DIR/install_pytorch.sh" "$TORCH_INSTALL_TYPE"

Also applies to: 83-86, 133-136


21-21: Fail fast: validate required vars for TensorRT.

Prevent opaque failures by checking inputs before invoking install_tensorrt.sh.

@@
-while [[ $# -gt 0 ]]; do
+require_vars() {
+  local missing=()
+  for v in "$@"; do
+    if [ -z "${!v:-}" ]; then missing+=("$v"); fi
+  done
+  if [ "${#missing[@]}" -ne 0 ]; then
+    echo "Missing required variables: ${missing[*]}" >&2
+    exit 2
+  fi
+}
@@
 if [ $tensorrt -eq 1 ]; then
     echo "Installing TensorRT..."
+    require_vars TRT_VER CUDA_VER CUDNN_VER NCCL_VER CUBLAS_VER
     GITHUB_MIRROR="$GITHUB_MIRROR" bash "$SCRIPT_DIR/install_tensorrt.sh" \

Also applies to: 113-121

🧹 Nitpick comments (5)
docker/common/install.sh (3)

1-8: Good: strict mode and robust SCRIPT_DIR.

set -Eeuo pipefail, trap, and SCRIPT_DIR resolution look solid. Minor: quote "$SCRIPT_DIR/..." in all calls below to avoid word-splitting.
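
A quick illustration of the word-splitting failure mode (hypothetical path with a space; the script name stands in for the real helpers):

```shell
# A script directory containing a space: unquoted $dir would split into
# two words, and bash would try to execute "/tmp/trtllm" instead.
dir="/tmp/trtllm demo"
mkdir -p "$dir"
printf '%s\n' 'echo ok' > "$dir/install_base.sh"
bash "$dir/install_base.sh"    # quoted: runs the script
# bash $dir/install_base.sh    # unquoted: "No such file or directory"
```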


21-81: Add --help usage and a friendly error.

Provide a usage string and handle -h|--help; improves UX and makes unknown options actionable.

@@
-while [[ $# -gt 0 ]]; do
+usage() {
+  cat <<'USAGE'
+Usage: install.sh [--base] [--cmake] [--ccache] [--cuda_toolkit] [--tensorrt]
+                  [--polygraphy] [--mpi4py] [--pytorch] [--opencv] [--protobuf] [--all]
+Env:
+  GITHUB_MIRROR           Optional Git mirror base URL
+  TORCH_INSTALL_TYPE      wheels|source (default: wheels)
+  PYTHON_VERSION          e.g. 3.12.3 (default: 3.12.3)
+USAGE
+}
+while [[ $# -gt 0 ]]; do
     case $1 in
+        -h|--help)
+            usage; exit 0
+            ;;
@@
-        *)
-            echo "Unknown option: $1"
-            exit 1
+        *)
+            echo "Unknown option: $1" >&2
+            usage; exit 1
             ;;

145-148: Nit: prefer python -m pip for consistency.

-    pip3 install --upgrade --no-cache-dir \
+    python3 -m pip install --upgrade --no-cache-dir \
     "protobuf>=4.25.8"
enroot/Makefile (2)

4-4: Make SOURCE_DIR resolution portable (macOS/dev nodes).

readlink -f is not universally available. Use realpath if present, otherwise fall back to Python.

-SOURCE_DIR        ?= $(shell readlink -f ..)
+SOURCE_DIR        ?= $(shell (command -v realpath >/dev/null && realpath -m ..) || python3 -c 'import os,sys;print(os.path.realpath(".."))')

6-8: Satisfy checkmake: add standard phony targets.

Define all, clean, test, and mark PHONY to appease linters and improve DX.

 ifndef MAKEFILE_PYXIS_INCLUDED
 MAKEFILE_PYXIS_INCLUDED := 1
 
+.PHONY: all build_sqsh clean test
+all: build_sqsh
+clean:
+	@rm -f -- "$(SQSH_PATH)"
+test:
+	@echo "No tests; run 'make build_sqsh' to build the SQSH image."
@@
 endif

Also applies to: 10-10, 25-25

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between be25635 and 719d46a.

📒 Files selected for processing (2)
  • docker/common/install.sh (1 hunks)
  • enroot/Makefile (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Filenames compiled into a target must be case-insensitively unique

Files:

  • docker/common/install.sh
  • enroot/Makefile
🪛 checkmake (0.2.2)
enroot/Makefile

[warning] 1-1: Missing required phony target "all"

(minphony)


[warning] 1-1: Missing required phony target "clean"

(minphony)


[warning] 1-1: Missing required phony target "test"

(minphony)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
enroot/Makefile (1)

1-2: OK: derive base image/tag from Dockerfile.

Anchored grep keeps this reasonably stable. Keep in mind changes to ARG names will break this.
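
The derivation can be exercised in isolation; here the Dockerfile content is a hypothetical stand-in with representative values:

```shell
# Stand-in for ../docker/Dockerfile.multi (values are hypothetical).
printf 'ARG BASE_IMAGE=nvcr.io/nvidia/pytorch\nARG BASE_TAG=25.06-py3\n' > /tmp/Dockerfile.demo

# Same pipeline the Makefile uses: anchor on the ARG name, keep everything
# from '=' onward, then strip the '=' and any quote characters.
BASE_IMAGE=$(grep '^ARG BASE_IMAGE=' /tmp/Dockerfile.demo | grep -o '=.*' | tr -d '="')
BASE_TAG=$(grep '^ARG BASE_TAG=' /tmp/Dockerfile.demo | grep -o '=.*' | tr -d '="')
echo "$BASE_IMAGE:$BASE_TAG"   # → nvcr.io/nvidia/pytorch:25.06-py3
```

Renaming either ARG in the Dockerfile silently breaks this, which is why the fail-fast check suggested above is worth adding.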

Signed-off-by: Shobhit Verma <[email protected]>
Signed-off-by: Shobhit Verma <[email protected]>
Signed-off-by: Shobhit Verma <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
enroot/Makefile (1)

16-26: Quote paths/tags and forward extra env needed by install.sh.

-	srun \
-		--container-image $(BASE_IMAGE):$(BASE_TAG) \
-		--container-save $(SQSH_PATH) \
-		--container-mounts $(SOURCE_DIR):/code/tensorrt_llm --container-workdir /code/tensorrt_llm/docker/common \
+	srun \
+		--container-image "$(BASE_IMAGE):$(BASE_TAG)" \
+		--container-save "$(SQSH_PATH)" \
+		--container-mounts "$(SOURCE_DIR):/code/tensorrt_llm" --container-workdir "/code/tensorrt_llm/docker/common" \
 		--container-mount-home --container-remap-root \
 		--container-env TRT_VER \
 		--container-env CUDA_VER \
 		--container-env CUDNN_VER \
 		--container-env NCCL_VER \
 		--container-env CUBLAS_VER \
+		--container-env GITHUB_MIRROR \
+		--container-env TORCH_INSTALL_TYPE \
 		./install.sh --all
🧹 Nitpick comments (6)
docs/source/installation/build-from-source-linux.md (3)

90-96: Tighten wording and proper nouns (“Docker”), keep it concise.

-If you wish to use enroot instead of docker, then you can build a sqsh file that has the identical environment as the development image `tensorrt_llm/devel:latest` as follows.
+If you wish to use enroot instead of Docker, you can build an SQSH image with an identical environment to the development image `tensorrt_llm/devel:latest`:

97-105: Clarify the default image path explicitly (relative to repo root) to avoid ambiguity.

-    # Using default sqsh filename (enroot/tensorrt_llm.devel.sqsh)
+    # Using default SQSH filename (relative to the repo root): enroot/tensorrt_llm.devel.sqsh

106-115: Show both “custom” and “default” SQSH_PATH exports so users can copy/paste.

-    export SQSH_PATH=/path/to/dev_trtllm_image.sqsh
+    # If you used a custom path in Step 2:
+    export SQSH_PATH=/path/to/dev_trtllm_image.sqsh
+    # Or, if you used the default path:
+    # export SQSH_PATH=enroot/tensorrt_llm.devel.sqsh
enroot/Makefile (3)

1-6: Add PHONY targets (and optional clean) to appease checkmake and UX.

+ .PHONY: build_sqsh run_sqsh clean
+
+clean:
+	@rm -f "$(SQSH_PATH)"

11-15: Minor: standardize naming in logs (“TensorRT-LLM”).

-	@echo "Building trtllm sqsh image."
+	@echo "Building TensorRT-LLM SQSH image."

1-2: Validate BASE_IMAGE/TAG discovery and fail fast if unset.

 BASE_IMAGE         ?= $(shell grep '^ARG BASE_IMAGE=' ../docker/Dockerfile.multi | grep -o '=.*' | tr -d '="')
 BASE_TAG           ?= $(shell grep '^ARG BASE_TAG=' ../docker/Dockerfile.multi | grep -o '=.*' | tr -d '="')
+ifeq ($(strip $(BASE_IMAGE)),)
+  $(error BASE_IMAGE not found in ../docker/Dockerfile.multi; set BASE_IMAGE explicitly)
+endif
+ifeq ($(strip $(BASE_TAG)),)
+  $(error BASE_TAG not found in ../docker/Dockerfile.multi; set BASE_TAG explicitly)
+endif
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 719d46a and 38409bd.

📒 Files selected for processing (3)
  • docker/common/install.sh (1 hunks)
  • docs/source/installation/build-from-source-linux.md (1 hunks)
  • enroot/Makefile (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docker/common/install.sh
🧰 Additional context used
📓 Path-based instructions (1)
**/*

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Filenames compiled into a target must be case-insensitively unique

Files:

  • docs/source/installation/build-from-source-linux.md
  • enroot/Makefile
🪛 checkmake (0.2.2)
enroot/Makefile

[warning] 1-1: Missing required phony target "all"

(minphony)


[warning] 1-1: Missing required phony target "clean"

(minphony)


[warning] 1-1: Missing required phony target "test"

(minphony)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Signed-off-by: Shobhit Verma <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
docker/Dockerfile.multi (1)

24-33: Propagate GITHUB_MIRROR into devel stage (and avoid clobbering in install.sh).

devel’s RUN instructions don’t see GITHUB_MIRROR; declare and export it so the orchestrator and sub-scripts can use it. Also ensure install.sh doesn’t reset it.

 FROM base AS devel
 
 ARG TRT_VER
 ARG CUDA_VER
 ARG CUDNN_VER
 ARG NCCL_VER
 ARG CUBLAS_VER
+ARG GITHUB_MIRROR=""
+ENV GITHUB_MIRROR=${GITHUB_MIRROR}
 
 ENV PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.99999"
 
 COPY docker/common/install.sh install.sh
🧹 Nitpick comments (5)
docker/Dockerfile.multi (5)

35-45: Make install invocations fail fast.

Add strict bash flags so partial failures don’t silently continue.

-RUN bash ./install.sh --base
+RUN bash -euo pipefail ./install.sh --base
@@
-RUN bash ./install.sh --cmake
+RUN bash -euo pipefail ./install.sh --cmake
@@
-RUN bash ./install.sh --ccache
+RUN bash -euo pipefail ./install.sh --ccache
@@
-RUN bash ./install.sh --cuda_toolkit
+RUN bash -euo pipefail ./install.sh --cuda_toolkit

47-52: Scope env to single RUN instead of exporting.

Keeps the build env cleaner and avoids accidental reuse by later layers.

-RUN export TRT_VER=${TRT_VER} \
-    && export CUDA_VER=${CUDA_VER} \
-    && export CUDNN_VER=${CUDNN_VER} \
-    && export NCCL_VER=${NCCL_VER} \
-    && export CUBLAS_VER=${CUBLAS_VER} \
-    && bash ./install.sh --tensorrt
+RUN TRT_VER="${TRT_VER}" \
+    CUDA_VER="${CUDA_VER}" \
+    CUDNN_VER="${CUDNN_VER}" \
+    NCCL_VER="${NCCL_VER}" \
+    CUBLAS_VER="${CUBLAS_VER}" \
+    bash -euo pipefail ./install.sh --tensorrt
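The suggestion relies on the shell rule that an assignment prefixed to a command is scoped to that command alone; a quick sketch (the version string is a placeholder):

```shell
# The prefix assignment is visible inside the command it prefixes...
TRT_VER="10.0-placeholder" bash -c 'echo "inside: ${TRT_VER}"'

# ...but does not leak into the surrounding shell, unlike export.
echo "after: ${TRT_VER:-unset}"
# prints: after: unset
```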

55-65: Confirm orchestrator implements --opencv/--protobuf or copy the helpers.

If install.sh delegates to per-component scripts, these two are not copied and will fail. Either keep it all inside install.sh or copy/remove them consistently.

Option A (copy helpers):

 COPY docker/common/install_pytorch.sh install_pytorch.sh
-RUN bash ./install.sh --pytorch
+RUN bash -euo pipefail ./install.sh --pytorch
+
+COPY docker/common/install_opencv.sh install_opencv.sh
+COPY docker/common/install_protobuf.sh install_protobuf.sh
-RUN bash ./install.sh --opencv
-RUN bash ./install.sh --protobuf
+RUN bash -euo pipefail ./install.sh --opencv
+RUN bash -euo pipefail ./install.sh --protobuf

And update cleanup (see next comment) to remove them.


66-74: Harmonize cleanup with what was copied.

If you adopt Option A above, include opencv/protobuf scripts in the cleanup; also use -f for idempotency.

-RUN rm install.sh \
-    install_base.sh \
-    install_cmake.sh \
-    install_ccache.sh \
-    install_cuda_toolkit.sh \
-    install_tensorrt.sh \
-    install_polygraphy.sh \
-    install_mpi4py.sh \
-    install_pytorch.sh
+RUN rm -f install.sh \
+       install_base.sh \
+       install_cmake.sh \
+       install_ccache.sh \
+       install_cuda_toolkit.sh \
+       install_tensorrt.sh \
+       install_polygraphy.sh \
+       install_mpi4py.sh \
+       install_pytorch.sh \
+       install_opencv.sh \
+       install_protobuf.sh
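The -f flag matters because a plain rm in this layer fails the build if any listed script was never copied; a minimal sketch of the difference:

```shell
tmp=$(mktemp -d)

# Plain rm exits non-zero when the file does not exist.
rm "$tmp/missing.sh" 2>/dev/null || echo "plain rm failed on a missing file"

# rm -f succeeds whether or not the file exists (idempotent cleanup).
rm -f "$tmp/missing.sh" && echo "rm -f succeeded anyway"

rmdir "$tmp"
```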

103-106: Remove duplicate COPY of benchmarks.

There are two identical COPY benchmarks directives; keep one to reduce context size and layers.

 COPY benchmarks benchmarks
 COPY cpp cpp
-COPY benchmarks benchmarks
 COPY scripts scripts
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 38409bd and 4f4bb4d.

📒 Files selected for processing (1)
  • docker/Dockerfile.multi (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Filenames compiled into a target must be case-insensitively unique

Files:

  • docker/Dockerfile.multi
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docker/Dockerfile.multi (2)

80-85: Guard COPYs from TRITON image (paths can drift).

Some TRITON images don’t ship /opt/tritonserver/caches or may restructure dirs. Please confirm these exist for TRITON_BASE_TAG=25.06-py3 or add a minimal preflight check in the stage before COPY (or drop non-essential dirs).
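A preflight check of the suggested kind could be sketched as below; the directory names are placeholders for the real Triton paths, and in the Dockerfile this would run as a RUN step before the COPYs:

```shell
# Fail fast if any expected directory is missing from the base image.
check_dirs() {
  for d in "$@"; do
    [ -d "$d" ] || { echo "preflight: missing $d" >&2; return 1; }
  done
}

# Demo against a throwaway tree instead of /opt/tritonserver.
tmp=$(mktemp -d)
mkdir -p "$tmp/backends" "$tmp/caches"
check_dirs "$tmp/backends" "$tmp/caches" && echo "preflight ok"
check_dirs "$tmp/does-not-exist" 2>/dev/null || echo "preflight caught a missing dir"
rm -rf "$tmp"
```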


118-119: Nice use of BuildKit caches for pip/ccache.

This should materially improve rebuild times for wheel builds.

Signed-off-by: Shobhit Verma <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
@v-shobhit
Contributor Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18214 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18214 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13650 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
@v-shobhit
Contributor Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #18330 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18330 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13747 completed with status: 'FAILURE'

@v-shobhit
Contributor Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Collaborator

PR_Github #18342 [ run ] triggered by Bot

@MartinMarciniszyn MartinMarciniszyn merged commit 0652514 into NVIDIA:main Sep 10, 2025
4 of 5 checks passed
@tensorrt-cicd
Collaborator

PR_Github #18342 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13758 (Partly Tested) completed with status: 'FAILURE'

Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025
Signed-off-by: Shobhit Verma <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
Co-authored-by: Zhihan Jiang <[email protected]>
MrGeva pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Sep 21, 2025
Signed-off-by: Shobhit Verma <[email protected]>
Signed-off-by: v-shobhit <[email protected]>
Co-authored-by: Zhihan Jiang <[email protected]>

Labels

Community want to contribute (PRs initiated from Community)
