[https://nvbugs/5448754][fix] Download HF model for all nodes. #6824

Conversation
📝 Walkthrough

Adds MPI-coordinated task submission for model-related operations in `CachedModelLoader`, so the Hugging Face model is downloaded on every node rather than only on the driver.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant CachedModelLoader
    participant MPI_Session
    participant Node0
    participant NodeN
    participant HF_Hub
    Client->>CachedModelLoader: request load HF model
    CachedModelLoader->>MPI_Session: submit_sync(_node_download_hf_model, model, revision)
    MPI_Session->>Node0: invoke _node_download_hf_model
    MPI_Session->>NodeN: invoke _node_download_hf_model
    Node0->>HF_Hub: download_hf_model(model, revision)
    HF_Hub-->>Node0: Path to model dir
    NodeN-->>MPI_Session: None
    Node0-->>MPI_Session: Path
    MPI_Session-->>CachedModelLoader: [Path, None, ...]
    CachedModelLoader->>CachedModelLoader: set _hf_model_dir = result[0]
```
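A minimal sketch of the per-node leader pattern the diagram describes, written against plain mpi4py and huggingface_hub for illustration; the PR itself routes the call through `MpiSession.submit_sync`, and the function name `node_download` here is hypothetical:

```python
from pathlib import Path
from typing import Optional

from huggingface_hub import snapshot_download
from mpi4py import MPI


def node_download(model: str, revision: Optional[str] = None) -> Optional[Path]:
    """Download the model on each node's leader rank; return None elsewhere."""
    # Ranks on the same physical node share a COMM_TYPE_SHARED communicator,
    # so rank 0 within it plays the role of local_mpi_rank() == 0 in the PR.
    node_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
    if node_comm.Get_rank() == 0:
        # snapshot_download returns a str; wrap it in Path like download_hf_model does.
        return Path(snapshot_download(repo_id=model, revision=revision))
    return None
```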
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~15 minutes
/bot run
PR_Github #14948 [ run ] triggered by Bot
Actionable comments posted: 2
🧹 Nitpick comments (2)
tensorrt_llm/llmapi/llm_utils.py (2)
630-634: Ensure all MPI ranks synchronize after model download.

The implementation correctly uses MPI to coordinate model downloads, with only the local rank 0 downloading. However, consider adding an MPI barrier after line 634 to ensure all ranks wait for the download to complete before proceeding.
Add an MPI barrier after retrieving the model directory:
```diff
 hf_model_dirs = self.mpi_session.submit_sync(
     CachedModelLoader._node_download_hf_model,
     model=self.model_loader.model_obj.model_name,
     revision=self.llm_args.revision)
 self._hf_model_dir = hf_model_dirs[0]
+# Ensure all ranks have the model directory before proceeding
+if self.mpi_session and self.mpi_session.world_size > 1:
+    self.mpi_session.barrier()
```
811-821: Static method implementation looks good, but consider adding error handling.

The `_node_download_hf_model` static method correctly implements the MPI-coordinated download pattern. However, consider wrapping the download call in a try-except block to handle potential download failures gracefully.

Consider adding error handling for download failures:
```diff
 @print_traceback_on_error
 @staticmethod
 def _node_download_hf_model(
     model: str,
     revision: Optional[str] = None,
 ) -> Optional[Path]:
     if local_mpi_rank() == 0:
-        return download_hf_model(model, revision)
+        try:
+            return download_hf_model(model, revision)
+        except Exception as e:
+            logger.error(f"Failed to download HF model '{model}': {e}")
+            raise
     else:
         return None
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/llmapi/llm_utils.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/llmapi/llm_utils.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/llmapi/llm_utils.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/llmapi/llm_utils.py (2)
15-16: LGTM! Import statement correctly updated.

The addition of `local_mpi_rank` to the imports from `.._utils` is appropriate for the new MPI-coordinated download functionality.
813-816: Return type annotation is accurate.

The `download_hf_model` implementation in `tensorrt_llm/llmapi/utils.py` wraps the string path returned by `snapshot_download` in a `Path` object:

```python
def download_hf_model(...) -> Path:
    ...
    hf_folder = snapshot_download(...)
    return Path(hf_folder)
```

In `_node_download_hf_model`, you return either that `Path` or `None` based on MPI rank, matching the `Optional[Path]` annotation. No changes are needed here.
PR_Github #14948 [ run ] completed with state

/bot run

PR_Github #15064 [ run ] triggered by Bot
Actionable comments posted: 0
🔭 Outside diff range comments (1)
tensorrt_llm/llmapi/llm_utils.py (1)
1-1: Add NVIDIA copyright header to comply with repository standards.

Per coding guidelines, prepend the current-year NVIDIA copyright header to all source files.
Apply at the very top of the file:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
```
♻️ Duplicate comments (2)
tensorrt_llm/llmapi/llm_utils.py (2)
610-620: Guard against None mpi_session in multi-GPU path to avoid AttributeError.

If multi-GPU is enabled but mpi_session is None, submit_sync will raise. Add an explicit check with a clear error message.
Apply this diff:
```diff
 def submit_to_all_workers(
     self,
     task: Callable[..., Any],
     *args,
     **kwargs,
 ) -> List[Any]:
-    if self.llm_args.parallel_config.is_multi_gpu:
-        return self.mpi_session.submit_sync(task, *args, **kwargs)
+    if self.llm_args.parallel_config.is_multi_gpu:
+        if self.mpi_session is None:
+            raise RuntimeError(
+                f"MPI session is required to run task {getattr(task, '__name__', str(task))} "
+                f"in multi-GPU mode for model={self.llm_args.model} (revision={self.llm_args.revision})"
+            )
+        return self.mpi_session.submit_sync(task, *args, **kwargs)
     else:
         return [task(*args, **kwargs)]
```
641-646: Pick the local node's download dir instead of index 0; validate return values.

submit_to_all_workers executes the task on all workers; `_node_download_hf_model` returns a `Path` only on local rank 0 of each node and `None` elsewhere. Using `hf_model_dirs[0]` risks selecting a path from a different node, breaking non-shared-storage setups and contradicting the "download for all nodes" objective.
Use this process’s local node root to select the correct entry and validate:
```diff
 hf_model_dirs = self.submit_to_all_workers(
     CachedModelLoader._node_download_hf_model,
     model=self.model_loader.model_obj.model_name,
     revision=self.llm_args.revision)
-self._hf_model_dir = hf_model_dirs[0]
+# Choose the entry corresponding to this node's local rank-0 process
+idx = (global_mpi_rank() - local_mpi_rank()
+       if self.llm_args.parallel_config.is_multi_gpu else 0)
+if not hf_model_dirs or idx < 0 or idx >= len(hf_model_dirs):
+    raise RuntimeError(
+        f"Invalid MPI result for HF download "
+        f"(len={len(hf_model_dirs) if hf_model_dirs is not None else 'None'}), "
+        f"global_rank={global_mpi_rank()}, local_rank={local_mpi_rank()}."
+    )
+self._hf_model_dir = hf_model_dirs[idx]
+if self._hf_model_dir is None:
+    raise RuntimeError(
+        f"Failed to download HF model for this node: "
+        f"{self.model_loader.model_obj.model_name} "
+        f"(revision={self.llm_args.revision}); got None at index {idx}."
+    )
```
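As a concrete check of the indexing arithmetic (an illustration, assuming results come back ordered by global rank): with 2 nodes and 4 ranks per node, the process at global rank 5 has local rank 1, so idx = 5 - 1 = 4, which is exactly the slot of its node leader (global rank 4).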
Follow-up:

- If `MpiSession.submit_sync` does not return results ordered by global rank, adjust selection accordingly or return the local node path to all local ranks from `_node_download_hf_model`.
I can update _node_download_hf_model to locally broadcast the path to all ranks on the node if there’s a per-node communicator available in utils; want me to draft that?
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_utils.py (1)
822-832: Verify `submit_sync` ordering or broadcast HF model path per node.

The helper `_node_download_hf_model` only returns a `Path` on `local_mpi_rank() == 0` and `None` elsewhere. Callers today rely on indexing into the list returned by `session.submit_sync(_node_download_hf_model)` using

```python
idx = global_mpi_rank() - local_mpi_rank()
path = results[idx]
```

but this only works if `submit_sync` returns results strictly in global-rank order.

- In `MpiCommSession.submit`, we invoke `thread_pool.submit` for rank 0 first, then call `mpi_pool.submit` in a loop of `n_workers - 1`. However, MPICommExecutor does not formally guarantee that the sequence of `.submit` calls maps one-to-one with ascending global ranks.
- If the ordering is not guaranteed, the indexing logic can pick the wrong entry.

Please verify whether MPICommExecutor's `.submit` indeed dispatches tasks in global-rank order. If it does, document that assumption. Otherwise, to simplify and harden correctness, consider broadcasting the downloaded model path to all local ranks (e.g., via an `MPI.COMM_TYPE_SHARED` split and `mpi_barrier`) so every rank on the node receives the `Path` directly; a minimal sketch of that pattern follows.
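A hedged sketch of that broadcast variant, assuming mpi4py is available and reusing the PR's `download_hf_model` helper; the name `node_download_and_broadcast` is illustrative, not the PR's API:

```python
from pathlib import Path
from typing import Optional

from mpi4py import MPI

from tensorrt_llm.llmapi import utils


def node_download_and_broadcast(model: str,
                                revision: Optional[str] = None) -> Path:
    # Split out a per-node communicator; its rank 0 is the node leader.
    node_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
    path: Optional[Path] = None
    if node_comm.Get_rank() == 0:
        path = utils.download_hf_model(model, revision)  # leader downloads once per node
    # bcast ships the Path to every local rank and doubles as the per-node
    # barrier, so no rank proceeds before the download finishes.
    return node_comm.bcast(path, root=0)
```

Every rank would then hold the correct `Path` for its own node, removing the dependence on `submit_sync` result ordering entirely.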
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/llmapi/llm_utils.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/llmapi/llm_utils.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/llmapi/llm_utils.py
🧬 Code Graph Analysis (1)
tensorrt_llm/llmapi/llm_utils.py (3)

- tensorrt_llm/_utils.py (1): `local_mpi_rank` (501-502)
- tensorrt_llm/llmapi/llm_args.py (2): `parallel_config` (1313-1314), `is_multi_gpu` (270-271)
- tensorrt_llm/llmapi/utils.py (2): `download_hf_model` (172-181), `print_traceback_on_error` (30-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/llmapi/llm_utils.py (2)
10-10: Imports update looks good.
15-16: Good: local_mpi_rank is now imported for per-node logic.
PR_Github #15064 [ run ] completed with state
nvrohanv left a comment
Tested it, see details on https://nvbugspro.nvidia.com/bug/5448754 and it worked.
Can we change the

/bot help
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.

run

Launch build/test pipelines. All previously running jobs will be killed.

kill

Kill all running builds associated with pull request.

skip

Skip testing for latest commit on pull request.

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
/bot run --disable-fail-fast

PR_Github #16436 [ run ] triggered by Bot

PR_Github #16436 [ run ] completed with state
Signed-off-by: Yuxian Qiu <[email protected]>

Force-pushed from 46243f2 to 7e78862
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-7"

PR_Github #16520 [ run ] triggered by Bot

PR_Github #16520 [ run ] completed with state
Signed-off-by: Yuxian Qiu <[email protected]>
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-7" |
…A#6824) Signed-off-by: Yuxian Qiu <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
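For example, the runs triggered earlier in this PR used `/bot run --disable-fail-fast` and `/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-7"`.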
For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.
skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.