- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.8k
[TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures #6836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 📝 WalkthroughWalkthroughAdds an env-driven UCXX fetch override in CMake, exposes TRITON build args and switches base registry in docker Makefile, refactors Jenkins build to use a retry wrapper and mirror rewrite, updates L0 test image/mirror logic, adds auto_trigger in an integration YAML, and removes CUDA_HOME auto-default in a unit test. Changes
 Sequence Diagram(s)sequenceDiagram
  autonumber
  participant Jenkins
  participant Groovy as BuildDockerImage.groovy
  participant Utils as trtllm_utils.llmExecStepWithRetry
  participant Docker
  participant Make
  Jenkins->>Groovy: Start pipeline
  Groovy->>Groovy: Parse BASE_IMAGE/TRITON_IMAGE/TRITON_BASE_TAG from files
  Groovy->>Groovy: Override BASE_IMAGE for rockylinux8 if present
  Groovy->>Groovy: Rewrite registry nvcr.io -> urm.nvidia.com/docker
  Groovy->>Utils: llmExecStepWithRetry(docker pull/build ...)
  Utils->>Docker: Execute pull/build (with retries)
  Docker-->>Utils: result
  Utils-->>Groovy: status
  Groovy->>Utils: llmExecStepWithRetry(make target ... BASE_IMAGE/TRITON_IMAGE ...)
  Utils->>Make: Execute make with args
  Make-->>Utils: result
  Utils-->>Groovy: status
  Groovy-->>Jenkins: Build complete
sequenceDiagram
  autonumber
  participant CMake
  participant Env
  participant FS as fetch_rapids.cmake
  CMake->>CMake: Find UCX
  CMake->>Env: Check GITHUB_MIRROR
  alt UCX found and GITHUB_MIRROR set and file exists
    CMake->>FS: Read contents
    CMake->>FS: Replace rapids-cmake URL with mirror URL
    CMake->>FS: Write contents
    CMake-->>CMake: Emit warning about replacement
  end
  CMake-->>CMake: Continue UCXX setup
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
 Suggested reviewers
 Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. 📜 Recent review detailsConfiguration used: .coderabbit.yaml 💡 Knowledge Base configuration: 
 You can enable these settings in your CodeRabbit configuration. 📒 Files selected for processing (6)
 💤 Files with no reviewable changes (1)
 🚧 Files skipped from review as they are similar to previous changes (4)
 ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
 🔇 Additional comments (4)
 ✨ Finishing Touches
 🧪 Generate unit tests
 Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit: 
 SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type  Other keywords and placeholders
 Status, Documentation and Community
 | 
| /bot run --skip-test --extra-stage "Build-Docker-Images" | 
| PR_Github #14995 [ run ] triggered by Bot | 
| /bot run --stage-list "Build-Docker-Images" | 
    
      
        1 similar comment
      
    
  
    | /bot run --stage-list "Build-Docker-Images" | 
| PR_Github #14999 [ run ] triggered by Bot | 
| PR_Github #14995 [ run ] completed with state  | 
| PR_Github #14999 [ run ] completed with state  | 
| /bot run --stage-list "Build-Docker-Images" | 
| PR_Github #15224 [ run ] triggered by Bot | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
jenkins/BuildDockerImage.groovy (1)
347-353: Wrap custom tag stage with the same retry strategyThe custom tag flow still uses a plain sh call. For consistency and resiliency, wrap it with llmExecStepWithRetry and pre-pull the TRITON base.
Apply this diff:
- stage ("custom tag: ${customTag} (${arch})") { - sh """ - cd ${LLM_ROOT} && make -C docker ${target}_${action} - BASE_IMAGE=${BASE_IMAGE} - TRITON_IMAGE=${TRITON_IMAGE} - TORCH_INSTALL_TYPE=${torchInstallType} - IMAGE_WITH_TAG=${customImageWithTag} - STAGE=${dockerfileStage} - BUILD_WHEEL_OPTS='-j ${build_jobs}' ${args} - """ - } + stage ("custom tag: ${customTag} (${arch})") { + def randomSleep = (Math.random() * 300 + 300).toInteger() + trtllm_utils.llmExecStepWithRetry(this, script: "docker pull ${TRITON_IMAGE}:${TRITON_BASE_TAG}", sleepInSecs: randomSleep, shortCommondRunTimeMax: 7200) + trtllm_utils.llmExecStepWithRetry(this, script: """ + cd ${LLM_ROOT} && make -C docker ${target}_${action} \ + BASE_IMAGE=${BASE_IMAGE} \ + TRITON_IMAGE=${TRITON_IMAGE} \ + TORCH_INSTALL_TYPE=${torchInstallType} \ + IMAGE_WITH_TAG=${customImageWithTag} \ + STAGE=${dockerfileStage} \ + BUILD_WHEEL_OPTS='-j ${build_jobs}' ${args} + """, sleepInSecs: randomSleep, numRetries: 3, shortCommondRunTimeMax: 7200) + }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cpp/CMakeLists.txt(1 hunks)
- docker/Makefile(3 hunks)
- jenkins/BuildDockerImage.groovy(4 hunks)
- jenkins/L0_Test.groovy(1 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml(1 hunks)
- tests/unittest/test_pip_install.py(0 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/test_pip_install.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
🔇 Additional comments (9)
jenkins/L0_Test.groovy (1)
42-42: Mirror switch looks fine; ensure cluster can pull from URMThe DLFW image now points to URM. Verify that K8s nodes running these stages have imagePullSecrets or registry access to urm.nvidia.com; otherwise pulls will fail at runtime.
tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)
218-221: Aligns with mako auto_trigger flowAdding auto_trigger: others to the Triton post_merge block matches how getMakoArgsFromStageName emits auto_trigger and will correctly filter these tests.
cpp/CMakeLists.txt (1)
498-508: Action needed — cannot verify: fetch_rapids.cmake not found in repo; please confirm or apply safer rewriteI searched the repository and did not find fetch_rapids.cmake, so I cannot confirm whether it always uses branch refs (tags/SHAs may be used). Please confirm the file path (or whether it's generated at build time). If the file is present/managed here, apply the safer rewrite below to avoid breaking non-branch refs and to reduce noisy logs.
Files/locations to check:
- cpp/CMakeLists.txt (the snippet at lines ~498–508 that rewrites rapids-cmake URLs)
- Expected target file: ${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake (may be absent or generated at build time)
Suggested patch:
- if(DEFINED ENV{GITHUB_MIRROR} AND NOT "$ENV{GITHUB_MIRROR}" STREQUAL "") - if(EXISTS "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake") - file(READ "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" FILE_CONTENTS) - string( - REPLACE "https://raw.githubusercontent.com/rapidsai/rapids-cmake" - "$ENV{GITHUB_MIRROR}/rapidsai/rapids-cmake/raw/refs/heads" - FILE_CONTENTS "${FILE_CONTENTS}") - file(WRITE "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" "${FILE_CONTENTS}") - message(WARNING "Replace UCXX fetch_rapids.cmake with ${FILE_CONTENTS}") - endif() - endif() + if(DEFINED ENV{GITHUB_MIRROR} AND NOT "$ENV{GITHUB_MIRROR}" STREQUAL "") + set(_rapids_fetch "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake") + if(EXISTS "${_rapids_fetch}") + file(READ "${_rapids_fetch}" _FILE_CONTENTS) + set(_OLD "https://raw.githubusercontent.com/rapidsai/rapids-cmake") + # Use generic /raw so tags and SHAs also work + set(_NEW "$ENV{GITHUB_MIRROR}/rapidsai/rapids-cmake/raw") + string(REPLACE "${_OLD}" "${_NEW}" _NEW_CONTENTS "${_FILE_CONTENTS}") + if(NOT "${_NEW_CONTENTS}" STREQUAL "${_FILE_CONTENTS}") + file(WRITE "${_rapids_fetch}" "${_NEW_CONTENTS}") + message(STATUS "Rewrote rapids-cmake URLs in ${_rapids_fetch} to ${_NEW}/*") + else() + message(STATUS "No rapids-cmake URL to rewrite in ${_rapids_fetch}") + endif() + endif() + endif()docker/Makefile (3)
4-5: Good: plumbs TRITON metadata from Dockerfile.multiDeriving TRITON_IMAGE and TRITON_BASE_TAG here keeps Makefile self-sufficient when Jenkins doesn’t pass them explicitly.
85-87: Build args: pass TRITON metadata through to DockerfilePlumbing TRITON_IMAGE and TRITON_BASE_TAG into docker build ensures consistent base pull. Matches Jenkins’ pre-pull behavior.
194-204: Base image source switched to nvcr.io/nvidia/cuda for RL8/Ubuntu22 targetsLooks correct and aligns with the Jenkins rewrite to URM at runtime. No functional issues.
jenkins/BuildDockerImage.groovy (3)
261-261: Minor: env | sort is useful for debuggingThis helps reproducibility and auditing of build context. Keep it.
298-309: LGTM: pre-pull and retry wrapper mitigate external flakinessPre-pulling the TRITON base with randomized backoff is a solid way to reduce thundering-herd and registry spikes.
326-337: LGTM: main build also benefits from pre-pull + retryConsistent use of the retry helper and staggered start times should improve success rates for multi-job builds.
| /bot run --stage-list "Build-Docker-Images" | 
| PR_Github #15286 [ run ] triggered by Bot | 
… failures Signed-off-by: Yanchao Lu <[email protected]>
Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Yanchao Lu <[email protected]>
…first Signed-off-by: Yanchao Lu <[email protected]>
| PR_Github #15286 [ run ] completed with state  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| Is it expected that the mirror will fix the ucxx issue? Judging by the failed pipeline, the ucxx problem will still occur. | 
| /bot skip --comment "LLM/main/L0_MergeRequest_PR pipeline #11539 + LLM/main/L0_MergeRequest pipeline #29153" | 
| PR_Github #15378 [ skip ] triggered by Bot | 
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Summary by CodeRabbit
New Features
Chores
Tests