[None][fix] Fix build of tritonbuild/tritonrelease image #7003
📝 Walkthrough
Docker build and local build scripts were updated: Dockerfile.multi now forwards a derived Triton short tag (-s) to the inflight_batcher_llm build script; the build script gained option parsing, absolute path handling, TRITON_SHORT_TAG resolution, and cmake/make install steps; the Makefile canonicalizes USER_CACHE_DIR; .dockerignore ignores examples/**/*.safetensors.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant DockerBuild as Docker Build
participant Dockerfile as Dockerfile (tritonbuild stage)
participant BuildScript as inflight_batcher_llm build.sh
participant TagScript as get_triton_tag.sh
participant CMake as cmake & make
participant FS as Filesystem (/opt backends)
DockerBuild->>Dockerfile: run tritonbuild stage
Dockerfile->>BuildScript: RUN build.sh -s "r${TRITON_BASE_TAG%-py3}" ...
BuildScript->>BuildScript: resolve DIRNAME, BUILD_DIR (realpath)
alt -s provided
BuildScript->>BuildScript: set TRITON_SHORT_TAG from -s
else -s missing
BuildScript->>TagScript: invoke get_triton_tag.sh
TagScript-->>BuildScript: return TRITON_SHORT_TAG
end
BuildScript->>CMake: cmake ... -DTRITON_*_REPO_TAG="${TRITON_SHORT_TAG}"
CMake->>CMake: make && make install
CMake->>FS: ensure /opt/tritonserver/backends/tensorrtllm exists
Note over BuildScript,CMake: exports/LD_LIBRARY_PATH and conditional BUILD_TESTS handling
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 0
🧹 Nitpick comments (6)
docker/Makefile (1)
140-140: Make USER_CACHE_DIR resolution portable and resilient
To ensure compatibility across Linux, macOS/BSD, and cases where the directory doesn’t yet exist, detect which tool is available and fall back to the raw path:
• File: docker/Makefile (around line 140)
• Replace the existing line with the following:
-USER_CACHE_DIR ?= $(shell readlink -f "${HOME_DIR}/.cache")
+USER_CACHE_DIR ?= $(shell \
+  command -v realpath >/dev/null 2>&1 && realpath "${HOME_DIR}/.cache" 2>/dev/null || \
+  command -v readlink >/dev/null 2>&1 && readlink -f "${HOME_DIR}/.cache" 2>/dev/null || \
+  echo "${HOME_DIR}/.cache" \
+)
• This will:
- Use realpath if installed (supports GNU & BSD variants).
- Otherwise try readlink -f on Linux.
- Otherwise fall back to the unevaluated path.
Optional: Remove or update the Linux-only test script in the comment, or add a note to verify behavior on macOS and Linux hosts.
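One caveat worth checking before adopting the chained form: in a shell list, && and || have equal precedence and group left to right, so when the realpath branch succeeds the readlink branch may still run and print a second line. A minimal sketch of the same fallback order written with explicit if blocks follows. The function name resolve_dir is hypothetical and not part of docker/Makefile.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): resolve a directory path, trying
# realpath first, then readlink -f, then falling back to the raw path.
resolve_dir() {
    local path="$1"
    # Prefer realpath when it is installed and succeeds; it prints the result.
    if command -v realpath >/dev/null 2>&1 && realpath "$path" 2>/dev/null; then
        return 0
    fi
    # Otherwise try readlink -f, which is not available on every platform.
    if command -v readlink >/dev/null 2>&1 && readlink -f "$path" 2>/dev/null; then
        return 0
    fi
    # Last resort: echo the path unchanged.
    printf '%s\n' "$path"
}

resolve_dir "${HOME:-/tmp}/.cache"
```

Because each branch returns as soon as it succeeds, exactly one line is printed regardless of which tool is available.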
docker/Dockerfile.multi (1)
178-179: Copy only what build.sh actually needs to reduce build-context size and cache invalidations.
The build script needs jenkins/scripts/get_triton_tag.sh and docker/Dockerfile.multi. Copying the entire jenkins and docker trees may bloat the layer and invalidate cache more often.
Apply this narrower copy:
-COPY ./jenkins/ ./jenkins/
-COPY ./docker/ ./docker/
+COPY ./jenkins/scripts ./jenkins/scripts
+COPY ./docker/Dockerfile.multi ./docker/Dockerfile.multi
If other files are required later, extend incrementally. Also ensure .dockerignore doesn’t accidentally exclude these paths.
triton_backend/inflight_batcher_llm/scripts/build.sh (4)
38-41: Fail fast and keep the image layer small (combine apt ops + cleanup).
- Add set -euo pipefail to stop on the first error.
- Chain apt-get update && apt-get install and clean /var/lib/apt/lists to avoid leaving cache in the image layer.
-set -x
-apt-get update
-apt-get install -y --no-install-recommends rapidjson-dev
+set -euo pipefail
+set -x
+export DEBIAN_FRONTEND=noninteractive
+apt-get update && \
+  apt-get install -y --no-install-recommends rapidjson-dev && \
+  rm -rf /var/lib/apt/lists/*
42-48: Use mkdir -p and quote paths to avoid failures on reruns and spaces.
mkdir $BUILD_DIR fails if the dir already exists. Quoting prevents issues with spaces.
-DIRNAME="$(dirname "$(realpath $0)")"
-
-BUILD_DIR="$DIRNAME/../build"
-mkdir $BUILD_DIR
-BUILD_DIR=$(cd -- "$BUILD_DIR" && pwd)
-cd $BUILD_DIR
+DIRNAME="$(dirname "$(realpath "$0")")"
+BUILD_DIR="$DIRNAME/../build"
+mkdir -p "$BUILD_DIR"
+BUILD_DIR=$(cd -- "$BUILD_DIR" && pwd)
+cd "$BUILD_DIR"
49-54: Minor: avoid redundant CMake args duplication for tests flag.
-DUSE_CXX11_ABI=ON is included in both branches. Keep it once to reduce drift.
-BUILD_TESTS_ARG="-DUSE_CXX11_ABI=ON"
-if [[ "$BUILD_UNIT_TESTS" == "true" ]]; then
-  BUILD_TESTS_ARG="-DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON"
-fi
+BUILD_TESTS_ARG="-DUSE_CXX11_ABI=ON"
+if [[ "$BUILD_UNIT_TESTS" == "true" ]]; then
+  BUILD_TESTS_ARG="${BUILD_TESTS_ARG} -DBUILD_TESTS=ON"
+fi
56-59: Guard TRITON tag resolution with a clear error to aid debugging.
If get_triton_tag.sh isn’t present or fails, the build proceeds and errors later in less-obvious ways. Add a small check.
-LLM_ROOT="${DIRNAME}/../../.."
-TRITON_SHORT_TAG=$("$LLM_ROOT/jenkins/scripts/get_triton_tag.sh" "$LLM_ROOT")
+LLM_ROOT="${DIRNAME}/../../.."
+GET_TAG_SCRIPT="$LLM_ROOT/jenkins/scripts/get_triton_tag.sh"
+if [[ ! -x "$GET_TAG_SCRIPT" ]]; then
+  echo "Error: missing $GET_TAG_SCRIPT in image. Ensure Dockerfile.multi copies jenkins/scripts." >&2
+  exit 1
+fi
+TRITON_SHORT_TAG=$("$GET_TAG_SCRIPT" "$LLM_ROOT")
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- docker/Dockerfile.multi (1 hunks)
- docker/Makefile (1 hunks)
- triton_backend/inflight_batcher_llm/scripts/build.sh (2 hunks)
/bot run
PR_Github #15791 [ run ] triggered by Bot
tburt-nv
left a comment
I don't have expertise here. I'll remove my review in favor of Martin's review, since his review has already been requested.
PR_Github #15791 [ run ] completed with state
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
triton_backend/inflight_batcher_llm/scripts/build.sh (1)
59-70: CMake flags are merged into one argument when -u is set.
Quoting "${BUILD_TESTS_ARG}" turns “-DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON” into a single argv, which CMake won’t parse into two -D options. Use an argv array and avoid duplicated -DUSE_CXX11_ABI.
-BUILD_TESTS_ARG="-DUSE_CXX11_ABI=ON"
-if [[ "$BUILD_UNIT_TESTS" == "true" ]]; then
-  BUILD_TESTS_ARG="-DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON"
-fi
+BUILD_TESTS_ARGS=(-DUSE_CXX11_ABI=ON)
+if [[ "$BUILD_UNIT_TESTS" == "true" ]]; then
+  BUILD_TESTS_ARGS=(-DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON)
+fi
@@
-cmake -DCMAKE_INSTALL_PREFIX:PATH="$(pwd)/install" \
-  "${BUILD_TESTS_ARG}" \
+cmake -DCMAKE_INSTALL_PREFIX:PATH="$(pwd)/install" \
+  "${BUILD_TESTS_ARGS[@]}" \
   -DTRITON_COMMON_REPO_TAG="${TRITON_SHORT_TAG}" \
   -DTRITON_CORE_REPO_TAG="${TRITON_SHORT_TAG}" \
   -DTRITON_THIRD_PARTY_REPO_TAG="${TRITON_SHORT_TAG}" \
   -DTRITON_BACKEND_REPO_TAG="${TRITON_SHORT_TAG}" \
   ..
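The word-splitting behaviour described in this comment can be checked without running CMake. The sketch below counts how many arguments a command receives from a quoted string versus a quoted array expansion. count_args and the variable names are illustrative only and do not appear in build.sh.

```shell
#!/usr/bin/env bash
# Illustrative helper: report how many arguments were received.
count_args() { echo "$#"; }

# The two flags stored as one string and as a bash array.
flags_string="-DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON"
flags_array=(-DBUILD_TESTS=ON -DUSE_CXX11_ABI=ON)

# A quoted string expands to exactly one argument.
string_count="$(count_args "${flags_string}")"
# A quoted array expansion yields one argument per element.
array_count="$(count_args "${flags_array[@]}")"

echo "quoted string -> ${string_count} argument(s)"
echo "quoted array  -> ${array_count} argument(s)"
```

The array form keeps each flag as its own argument while still protecting any element that contains spaces.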
🧹 Nitpick comments (5)
.dockerignore (1)
12-12: Consider ignoring other heavyweight artifact formats in examples.
Optionally add other common weight/checkpoint formats to prevent accidental bloat:
- examples/**/*.{pt,pth,ckpt,gguf,tar.gz,zip}
Only if these exist in your workflows.
docker/Dockerfile.multi (1)
179-179: Pass a more robust Triton short tag for variant TRITON_BASE_TAGs.
The current expansion strips only a trailing “-py3”. For tags like 25.06-py3-something, this yields r25.06-py3-something, which likely won’t match Triton repo tags. Prefer taking the prefix before the first dash.
Apply:
-RUN bash ./triton_backend/inflight_batcher_llm/scripts/build.sh -s "r${TRITON_BASE_TAG%-py3}"
+RUN bash ./triton_backend/inflight_batcher_llm/scripts/build.sh -s "r${TRITON_BASE_TAG%%-*}"
This addresses the prior feedback to pass the tag directly and makes it resilient across tag variants.
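The difference between the two parameter expansions can be seen in isolation. This is a small sketch with hypothetical function names; the tag values are the ones used as examples in the comment above, not values read from Dockerfile.multi.

```shell
#!/usr/bin/env bash
# ${var%-py3} removes the suffix "-py3" only when the value ends with it.
strip_py3_suffix() { printf 'r%s\n' "${1%-py3}"; }

# ${var%%-*} removes the longest suffix matching "-*", i.e. everything
# from the first dash onward.
strip_after_first_dash() { printf 'r%s\n' "${1%%-*}"; }

for tag in "25.06-py3" "25.06-py3-something"; do
    echo "${tag}: $(strip_py3_suffix "$tag") vs $(strip_after_first_dash "$tag")"
done
```

Note that the second form would also truncate any tag whose version part itself contains a dash, so it only fits tag schemes where the version comes before the first dash.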
triton_backend/inflight_batcher_llm/scripts/build.sh (3)
18-29: Update help text to reflect the new -s option.
The parser now supports -s but Help() doesn’t mention it. Add -s <triton_short_tag> to avoid confusion.
Suggested Help snippet (outside changed lines):
echo "Syntax: build.sh [-h] [-t <trt_root>] [-u] [-s <triton_short_tag>]"
echo "options:"
echo "h Print this Help."
echo "t Location of TensorRT library"
echo "u Build unit tests"
echo "s Triton short tag (e.g., r25.06). If omitted, derived from repo."
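For context, option parsing of this shape is commonly written with the getopts builtin. The following is a minimal sketch under the assumption that the script accepts exactly the four options listed in the help text; it is not the actual parser from build.sh, and parse_args is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: parse -h, -t <trt_root>, -u and -s <triton_short_tag>.
# Results are stored in global variables named after those in the review.
parse_args() {
    TRT_ROOT=""
    BUILD_UNIT_TESTS="false"
    TRITON_SHORT_TAG=""
    local OPTIND=1 opt
    # Leading ':' enables silent error handling; 't:' and 's:' take a value.
    while getopts ":ht:us:" opt; do
        case "$opt" in
            h) echo "Syntax: build.sh [-h] [-t <trt_root>] [-u] [-s <triton_short_tag>]"
               return 0 ;;
            t) TRT_ROOT="$OPTARG" ;;
            u) BUILD_UNIT_TESTS="true" ;;
            s) TRITON_SHORT_TAG="$OPTARG" ;;
            *) echo "Error: invalid option or missing argument" >&2
               return 1 ;;
        esac
    done
}

parse_args -u -s r25.06
echo "BUILD_UNIT_TESTS=${BUILD_UNIT_TESTS} TRITON_SHORT_TAG=${TRITON_SHORT_TAG}"
```

Keeping the usage string next to the option string makes it easier to notice when one is updated without the other.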
40-47: Guard fallback tag resolution and fix misleading comment.
If -s isn’t provided, you call jenkins/scripts/get_triton_tag.sh. Add a sanity check and clarify the comment to avoid implying the value comes from Dockerfile.
-DIRNAME="$(dirname "$(realpath "$0")")"
-if [ -z "$TRITON_SHORT_TAG" ]; then
-  # Get TRITON_SHORT_TAG from docker/Dockerfile.multi
-  LLM_ROOT="${DIRNAME}/../../.."
-  TRITON_SHORT_TAG=$("$LLM_ROOT/jenkins/scripts/get_triton_tag.sh" "$LLM_ROOT")
-fi
+DIRNAME="$(dirname "$(realpath "$0")")"
+if [ -z "$TRITON_SHORT_TAG" ]; then
+  # Derive TRITON_SHORT_TAG from repository if not provided via -s
+  LLM_ROOT="${DIRNAME}/../../.."
+  GET_TAG="$LLM_ROOT/jenkins/scripts/get_triton_tag.sh"
+  if [[ ! -x "$GET_TAG" ]]; then
+    echo "Error: $GET_TAG not found or not executable; provide -s <triton_short_tag>." >&2
+    exit 1
+  fi
+  TRITON_SHORT_TAG="$("$GET_TAG" "$LLM_ROOT")"
+fi
 echo "Using TRITON_SHORT_TAG=${TRITON_SHORT_TAG}"
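The guarded fallback can be exercised on its own by factoring it into a function. This is a sketch with a hypothetical name, resolve_triton_tag, and a hypothetical argument order; the helper script path is passed in rather than hard-coded so the logic can be tried outside the repository.

```shell
#!/usr/bin/env bash
# Sketch: return the explicit tag if given, otherwise call a helper script.
#   $1 = tag supplied by the caller (may be empty)
#   $2 = path to the helper that prints the tag
#   $3 = repository root passed to the helper
resolve_triton_tag() {
    local provided="$1" helper="$2" root="$3"
    if [[ -n "$provided" ]]; then
        printf '%s\n' "$provided"
        return 0
    fi
    # Fail early with a clear message instead of continuing with an empty tag.
    if [[ ! -x "$helper" ]]; then
        echo "Error: $helper not found or not executable; provide a tag explicitly." >&2
        return 1
    fi
    "$helper" "$root"
}

# With an explicit tag the helper is never consulted.
resolve_triton_tag "r25.06" "/nonexistent/get_triton_tag.sh" "."
```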
53-55: Use mkdir -p to allow idempotent rebuilds.
Current mkdir will fail if the directory already exists.
-BUILD_DIR=$(realpath "$DIRNAME/../build")
-mkdir "$BUILD_DIR"
-cd "$BUILD_DIR" || exit 1
+BUILD_DIR="$(realpath "$DIRNAME/../build")"
+mkdir -p "$BUILD_DIR"
+cd "$BUILD_DIR" || exit 1
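A quick way to see the idempotence is to prepare the same directory twice. The sketch below uses a hypothetical prepare_build_dir function; it creates the directory first and resolves it afterwards, so it does not depend on how realpath treats a path that does not exist yet.

```shell
#!/usr/bin/env bash
# Sketch: create a build directory if needed and print its absolute path.
prepare_build_dir() {
    local dir="$1"
    # -p makes mkdir succeed when the directory already exists.
    mkdir -p "$dir" || return 1
    # Resolve to a physical absolute path in a subshell, leaving the
    # caller's working directory untouched.
    (cd -- "$dir" && pwd -P)
}

scratch="$(mktemp -d)"
prepare_build_dir "$scratch/build"   # first run creates the directory
prepare_build_dir "$scratch/build"   # second run succeeds as well
```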
📒 Files selected for processing (3)
- .dockerignore (1 hunks)
- docker/Dockerfile.multi (1 hunks)
- triton_backend/inflight_batcher_llm/scripts/build.sh (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-18T09:08:07.661Z
Learnt from: tongyuantongyu
PR: NVIDIA/TensorRT-LLM#6984
File: cpp/tensorrt_llm/CMakeLists.txt:297-299
Timestamp: 2025-08-18T09:08:07.661Z
Learning: In the TensorRT-LLM project, artifacts are manually copied rather than installed via `cmake --install`, so INSTALL_RPATH properties are not needed - only BUILD_RPATH affects the final artifacts.
Applied to files:
triton_backend/inflight_batcher_llm/scripts/build.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
.dockerignore (1)
12-12: LGTM: excluding safetensors reduces Docker build context.
Adding examples/**/*.safetensors is a good call to keep contexts lean and avoid leaking large model weights into images.
triton_backend/inflight_batcher_llm/scripts/build.sh (2)
37-39: TRT_ROOT is never used; confirm intended wiring.
You accept -t to set TRT_ROOT, but don’t pass it to CMake (e.g., -DTENSORRT_ROOT=...). If the build relies on auto-discovery this is fine; otherwise, wire the flag into CMake.
Would you like me to propose wiring -DTENSORRT_ROOT="${TRT_ROOT}" (or the correct variable your CMakeLists expects)?
73-76: Verify artifact locations vs install prefix.
After make install with CMAKE_INSTALL_PREFIX="$(pwd)/install", the built artifacts might reside under ./install. You’re copying from the build directory. Confirm libtriton_tensorrtllm.so and trtllmExecutorWorker are present in the CWD; otherwise copy from ./install paths.
I can adjust the script to copy from the install dir if that’s the actual output location.
Signed-off-by: Dimitrios Bariamis <[email protected]>
281fa74 to d17f13a
Signed-off-by: Dimitrios Bariamis <[email protected]>
/bot run --skip-test
hypdeb
left a comment
Added some suggestions, maybe out of scope, I'll leave it to your judgement.
PR_Github #15916 [ run ] triggered by Bot
PR_Github #15916 [ run ] completed with state
Signed-off-by: Dimitrios Bariamis <[email protected]> Co-authored-by: Dimitrios Bariamis <[email protected]>
Signed-off-by: Dimitrios Bariamis <[email protected]> Co-authored-by: Dimitrios Bariamis <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Summary by CodeRabbit
Chores
Refactor
Description
Building the tritonbuild or tritonrelease image failed due to mismatching directories in triton_backend/inflight_batcher_llm/scripts/build.sh, which is called from docker/Dockerfile.multi. Also, some files from the docker and jenkins directories are needed for build.sh to get the desired tag of the Triton repositories used when compiling. These have been added to docker/Dockerfile.multi.
A couple of small usability improvements:
- docker/Makefile now resolves ${HOME_DIR}/.cache if it's a symlink. Without this, make -C docker xyz_run LOCAL_USER=1 used to fail.
- examples/**/*.safetensors was added to .dockerignore in case there are checkpoints generated by integration tests in the working directory.
Edit: Saw recently merged PR #6898 by @Tabrizian with a similar fix. As discussed with @MartinMarciniszyn, I addressed the comments and rebased onto release/1.0.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.