[https://nvbugs/5448754][fix] Download HF model for all nodes. #6824

Conversation
📝 Walkthrough

Adds MPI-coordinated task submission for model-related operations in `CachedModelLoader`, so the Hugging Face model is downloaded on every node rather than only on the driver.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant CachedModelLoader
    participant MPI_Session
    participant Node0
    participant NodeN
    participant HF_Hub
    Client->>CachedModelLoader: request load HF model
    CachedModelLoader->>MPI_Session: submit_sync(_node_download_hf_model, model, revision)
    MPI_Session->>Node0: invoke _node_download_hf_model
    MPI_Session->>NodeN: invoke _node_download_hf_model
    Node0->>HF_Hub: download_hf_model(model, revision)
    HF_Hub-->>Node0: Path to model dir
    NodeN-->>MPI_Session: None
    Node0-->>MPI_Session: Path
    MPI_Session-->>CachedModelLoader: [Path, None, ...]
    CachedModelLoader->>CachedModelLoader: set _hf_model_dir = result[0]
```
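A minimal sketch of the per-node leader pattern the diagram describes, written against plain mpi4py and huggingface_hub for illustration; the PR itself routes the call through `MpiSession.submit_sync`, and the function name `node_download` here is hypothetical:

```python
from pathlib import Path
from typing import Optional

from huggingface_hub import snapshot_download
from mpi4py import MPI


def node_download(model: str, revision: Optional[str] = None) -> Optional[Path]:
    """Download the model on each node's leader rank; return None elsewhere."""
    # Ranks on the same physical node share a COMM_TYPE_SHARED communicator,
    # so rank 0 within it plays the role of local_mpi_rank() == 0 in the PR.
    node_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
    if node_comm.Get_rank() == 0:
        # snapshot_download returns a str; wrap it in Path like download_hf_model does.
        return Path(snapshot_download(repo_id=model, revision=revision))
    return None
```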
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~15 minutes
/bot run
PR_Github #14948 [ run ] triggered by Bot
Actionable comments posted: 2
🧹 Nitpick comments (2)
tensorrt_llm/llmapi/llm_utils.py (2)
630-634: Ensure all MPI ranks synchronize after model download.

The implementation correctly uses MPI to coordinate model downloads, with only the local rank 0 downloading. However, consider adding an MPI barrier after line 634 to ensure all ranks wait for the download to complete before proceeding.
Add an MPI barrier after retrieving the model directory:
```diff
 hf_model_dirs = self.mpi_session.submit_sync(
     CachedModelLoader._node_download_hf_model,
     model=self.model_loader.model_obj.model_name,
     revision=self.llm_args.revision)
 self._hf_model_dir = hf_model_dirs[0]
+# Ensure all ranks have the model directory before proceeding
+if self.mpi_session and self.mpi_session.world_size > 1:
+    self.mpi_session.barrier()
```
811-821: Static method implementation looks good, but consider adding error handling.

The `_node_download_hf_model` static method correctly implements the MPI-coordinated download pattern. However, consider wrapping the download call in a try-except block to handle potential download failures gracefully.

Consider adding error handling for download failures:
```diff
 @print_traceback_on_error
 @staticmethod
 def _node_download_hf_model(
     model: str,
     revision: Optional[str] = None,
 ) -> Optional[Path]:
     if local_mpi_rank() == 0:
-        return download_hf_model(model, revision)
+        try:
+            return download_hf_model(model, revision)
+        except Exception as e:
+            logger.error(f"Failed to download HF model '{model}': {e}")
+            raise
     else:
         return None
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/llmapi/llm_utils.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/llmapi/llm_utils.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/llmapi/llm_utils.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/llmapi/llm_utils.py (2)
15-16: LGTM! Import statement correctly updated.

The addition of `local_mpi_rank` to the imports from `.._utils` is appropriate for the new MPI-coordinated download functionality.
813-816: Return type annotation is accurate.

The `download_hf_model` implementation in `tensorrt_llm/llmapi/utils.py` wraps the string path returned by `snapshot_download` in a `Path` object:

```python
def download_hf_model(...) -> Path:
    ...
    hf_folder = snapshot_download(...)
    return Path(hf_folder)
```

In `_node_download_hf_model`, you return either that `Path` or `None` based on MPI rank, matching the `Optional[Path]` annotation. No changes are needed here.
PR_Github #14948 [ run ] completed with state

/bot run

PR_Github #15064 [ run ] triggered by Bot
Actionable comments posted: 0
🔭 Outside diff range comments (1)
tensorrt_llm/llmapi/llm_utils.py (1)
1-1: Add NVIDIA copyright header to comply with repository standards.

Per coding guidelines, prepend the current-year NVIDIA copyright header to all source files.
Apply at the very top of the file:
```diff
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
```
♻️ Duplicate comments (2)
tensorrt_llm/llmapi/llm_utils.py (2)
610-620: Guard against None mpi_session in multi-GPU path to avoid AttributeError.

If multi-GPU is enabled but mpi_session is None, submit_sync will raise. Add an explicit check with a clear error message.
Apply this diff:
```diff
 def submit_to_all_workers(
     self,
     task: Callable[..., Any],
     *args,
     **kwargs,
 ) -> List[Any]:
-    if self.llm_args.parallel_config.is_multi_gpu:
-        return self.mpi_session.submit_sync(task, *args, **kwargs)
+    if self.llm_args.parallel_config.is_multi_gpu:
+        if self.mpi_session is None:
+            raise RuntimeError(
+                f"MPI session is required to run task {getattr(task, '__name__', str(task))} "
+                f"in multi-GPU mode for model={self.llm_args.model} (revision={self.llm_args.revision})"
+            )
+        return self.mpi_session.submit_sync(task, *args, **kwargs)
     else:
         return [task(*args, **kwargs)]
```
641-646: Pick the local node's download dir instead of index 0; validate return values.

submit_to_all_workers executes the task on all workers; `_node_download_hf_model` returns a `Path` only on local rank 0 of each node and `None` elsewhere. Using `hf_model_dirs[0]` risks selecting a path from a different node, breaking non-shared-storage setups and contradicting the "download for all nodes" objective.
Use this process’s local node root to select the correct entry and validate:
```diff
 hf_model_dirs = self.submit_to_all_workers(
     CachedModelLoader._node_download_hf_model,
     model=self.model_loader.model_obj.model_name,
     revision=self.llm_args.revision)
-self._hf_model_dir = hf_model_dirs[0]
+# Choose the entry corresponding to this node's local rank-0 process
+idx = (global_mpi_rank() - local_mpi_rank()
+       if self.llm_args.parallel_config.is_multi_gpu else 0)
+if not hf_model_dirs or idx < 0 or idx >= len(hf_model_dirs):
+    raise RuntimeError(
+        f"Invalid MPI result for HF download "
+        f"(len={len(hf_model_dirs) if hf_model_dirs is not None else 'None'}), "
+        f"global_rank={global_mpi_rank()}, local_rank={local_mpi_rank()}."
+    )
+self._hf_model_dir = hf_model_dirs[idx]
+if self._hf_model_dir is None:
+    raise RuntimeError(
+        f"Failed to download HF model for this node: "
+        f"{self.model_loader.model_obj.model_name} "
+        f"(revision={self.llm_args.revision}); got None at index {idx}."
+    )
```
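As a concrete check of the indexing arithmetic (an illustration, assuming results come back ordered by global rank): with 2 nodes and 4 ranks per node, the process at global rank 5 has local rank 1, so idx = 5 - 1 = 4, which is exactly the slot of its node leader (global rank 4).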
Follow-up:

- If `MpiSession.submit_sync` does not return results ordered by global rank, adjust selection accordingly or return the local node path to all local ranks from `_node_download_hf_model`.
I can update _node_download_hf_model to locally broadcast the path to all ranks on the node if there’s a per-node communicator available in utils; want me to draft that?
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_utils.py (1)
822-832: Verify `submit_sync` ordering or broadcast HF model path per node.

The helper `_node_download_hf_model` only returns a `Path` on `local_mpi_rank() == 0` and `None` elsewhere. Callers today rely on indexing into the list returned by `session.submit_sync(_node_download_hf_model)` using

```python
idx = global_mpi_rank() - local_mpi_rank()
path = results[idx]
```

but this only works if `submit_sync` returns results strictly in global-rank order.

- In `MpiCommSession.submit`, we invoke `thread_pool.submit` for rank 0 first, then call `mpi_pool.submit` in a loop of `n_workers - 1`. However, MPICommExecutor does not formally guarantee that the sequence of `.submit` calls maps one-to-one with ascending global ranks.
- If the ordering is not guaranteed, the indexing logic can pick the wrong entry.

Please verify whether MPICommExecutor's `.submit` indeed dispatches tasks in global-rank order. If it does, document that assumption. Otherwise, to simplify and harden correctness, consider broadcasting the downloaded model path to all local ranks (e.g., via an `MPI.COMM_TYPE_SHARED` split and `mpi_barrier`) so every rank on the node receives the `Path` directly; a minimal sketch of that pattern follows.
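A hedged sketch of that broadcast variant, assuming mpi4py is available and reusing the PR's `download_hf_model` helper; the name `node_download_and_broadcast` is illustrative, not the PR's API:

```python
from pathlib import Path
from typing import Optional

from mpi4py import MPI

from tensorrt_llm.llmapi import utils


def node_download_and_broadcast(model: str,
                                revision: Optional[str] = None) -> Path:
    # Split out a per-node communicator; its rank 0 is the node leader.
    node_comm = MPI.COMM_WORLD.Split_type(MPI.COMM_TYPE_SHARED)
    path: Optional[Path] = None
    if node_comm.Get_rank() == 0:
        path = utils.download_hf_model(model, revision)  # leader downloads once per node
    # bcast ships the Path to every local rank and doubles as the per-node
    # barrier, so no rank proceeds before the download finishes.
    return node_comm.bcast(path, root=0)
```

Every rank would then hold the correct `Path` for its own node, removing the dependence on `submit_sync` result ordering entirely.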
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/llmapi/llm_utils.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/llmapi/llm_utils.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/llmapi/llm_utils.py
🧬 Code Graph Analysis (1)
tensorrt_llm/llmapi/llm_utils.py (3)

- tensorrt_llm/_utils.py (1): `local_mpi_rank` (501-502)
- tensorrt_llm/llmapi/llm_args.py (2): `parallel_config` (1313-1314), `is_multi_gpu` (270-271)
- tensorrt_llm/llmapi/utils.py (2): `download_hf_model` (172-181), `print_traceback_on_error` (30-41)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/llmapi/llm_utils.py (2)
10-10: Imports update looks good.
15-16: Good: local_mpi_rank is now imported for per-node logic.
PR_Github #15064 [ run ] completed with state
nvrohanv left a comment
Tested it, see details on https://nvbugspro.nvidia.com/bug/5448754 and it worked.
Can we change the

/bot help
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.

run

Launch build/test pipelines. All previously running jobs will be killed.

kill

Kill all running builds associated with pull request.

skip

Skip testing for latest commit on pull request.

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
/bot run --disable-fail-fast

PR_Github #16436 [ run ] triggered by Bot

PR_Github #16436 [ run ] completed with state
Signed-off-by: Yuxian Qiu <[email protected]>

Force-pushed from 46243f2 to 7e78862
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-7"

PR_Github #16520 [ run ] triggered by Bot

PR_Github #16520 [ run ] completed with state
Signed-off-by: Yuxian Qiu <[email protected]>
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-7" |
…A#6824) Signed-off-by: Yuxian Qiu <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
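For example, the runs triggered earlier in this PR used `/bot run --disable-fail-fast` and `/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-7"`.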
For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.
skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.