[None][fix] cherrypick to main: Fix possible mpi broadcast and gather issue on large object #7854
Conversation
/bot run --disable-fail-fast
PR_Github #19261 [ run ] triggered by Bot
📝 Walkthrough
Adds chunked MPI-based safe_broadcast and safe_gather utilities with serialization, headers, and chunked payload transfer. Integrates these into MPIDist.broadcast/tp_broadcast/tp_gather with a new chunk_size parameter. Includes ENABLE_MULTI_DEVICE guards, an mpi4py requirement, input validation, and error propagation across ranks.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    autonumber
    participant App as Caller
    participant Root as Root Rank
    participant Peer as Non-root Ranks
    participant MPI as MPI Communicator
    App->>Root: safe_broadcast(obj, root, chunk_size)
    Note over Root,Peer: If ENABLE_MULTI_DEVICE is False, return early
    Root->>Root: pickle.dumps(obj) or set error flag
    Root->>MPI: Bcast([ok_flag, total_size, num_chunks])
    Peer->>MPI: Bcast(header recv)
    alt ok_flag == 1
        loop for each chunk
            Root->>MPI: Bcast(chunk bytes)
            Peer->>MPI: Bcast(recv chunk)
        end
        Peer->>Peer: pickle.loads(buffer)
    else error
        Note over Root,Peer: All ranks raise serialization error
    end
```
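To make the broadcast protocol concrete, here is a minimal self-contained sketch, assuming mpi4py and numpy; `chunked_broadcast` and every name in it are illustrative, not the PR's actual `safe_broadcast` code:

```python
import pickle

import numpy as np
from mpi4py import MPI


def chunked_broadcast(obj, root=0, chunk_size=4 * 1024 * 1024,
                      comm=MPI.COMM_WORLD):
    rank = comm.Get_rank()
    # Header carries [ok_flag, total_size, num_chunks] so every rank agrees
    # on how many Bcast rounds follow, or learns that serialization failed.
    header = np.zeros(3, dtype=np.int64)
    payload = b""
    if rank == root:
        try:
            payload = pickle.dumps(obj)
            header[:] = (1, len(payload),
                         (len(payload) + chunk_size - 1) // chunk_size)
        except (pickle.PicklingError, TypeError):
            header[0] = 0  # propagate failure to peers via the header
    comm.Bcast(header, root=root)
    ok, total_size, num_chunks = (int(x) for x in header)
    if not ok:
        raise RuntimeError("root rank failed to serialize the object")
    buf = bytearray(payload) if rank == root else bytearray(total_size)
    view = np.frombuffer(buf, dtype=np.uint8)
    for i in range(num_chunks):
        beg = i * chunk_size
        end = min(beg + chunk_size, total_size)
        # Each round moves at most chunk_size bytes, bounding peak memory
        # inside the MPI layer for very large objects.
        comm.Bcast(view[beg:end], root=root)
    return obj if rank == root else pickle.loads(bytes(buf))
```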
```mermaid
sequenceDiagram
    autonumber
    participant All as All Ranks
    participant Root as Root Rank
    participant MPI as MPI Communicator
    Note over All: safe_gather(obj, root, chunk_size)
    All->>All: pickle.dumps(local_obj), get local_size
    All->>MPI: allgather(local_size) -> sizes[]
    All->>All: compute displacements, num_chunks
    loop for each chunk round
        All->>MPI: Gatherv(chunk bytes to root)
    end
    Root->>Root: slice concatenated buffer per rank
    Root->>Root: pickle.loads per rank -> list[objects]
    Note over All: Non-root return None (or per implementation)
```
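A companion sketch for the gather path under the same assumptions (mpi4py, numpy, illustrative names). The key property shown in the diagram is that all ranks derive the same number of Gatherv rounds from the allgathered sizes, so the collective stays aligned even for uneven payloads:

```python
import pickle

import numpy as np
from mpi4py import MPI


def chunked_gather(obj, root=0, chunk_size=4 * 1024 * 1024,
                   comm=MPI.COMM_WORLD):
    rank, size = comm.Get_rank(), comm.Get_size()
    payload = np.frombuffer(pickle.dumps(obj), dtype=np.uint8)
    sizes = np.array(comm.allgather(payload.size), dtype=np.int64)
    displs = np.zeros(size, dtype=np.int64)
    np.cumsum(sizes[:-1], out=displs[1:])  # start of each rank's region
    num_rounds = int((int(sizes.max()) + chunk_size - 1) // chunk_size)
    recv = np.empty(int(sizes.sum()), dtype=np.uint8) if rank == root else None
    for r in range(num_rounds):
        beg = r * chunk_size
        # Ranks whose payload is already exhausted contribute zero bytes
        # but still participate, keeping the collective call aligned.
        counts = np.clip(sizes - beg, 0, chunk_size)
        round_displs = displs + np.minimum(sizes, beg)
        sendbuf = [payload[beg:beg + int(counts[rank])], MPI.BYTE]
        recvbuf = ([recv, counts, round_displs, MPI.BYTE]
                   if rank == root else None)
        comm.Gatherv(sendbuf, recvbuf, root=root)
    if rank != root:
        return None
    return [pickle.loads(recv[displs[i]:displs[i] + sizes[i]].tobytes())
            for i in range(size)]
```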
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (8)
tensorrt_llm/_torch/distributed/communicator.py (8)
1-1: Consider sorting imports for better readability. While not a requirement, grouping standard library imports together would improve readability.
11-14: Improve exception handling specificity. Catching a generic Exception can mask other unexpected errors. Consider catching more specific exceptions like ImportError or ModuleNotFoundError. Apply this diff:

```diff
 try:
     from mpi4py import MPI
-except Exception:
+except ImportError:
     MPI = None  # deferred; functions will error if used when ENABLE_MULTI_DEVICE is True
```
121-123: Early return optimization for disabled multi-device. The function checks not ENABLE_MULTI_DEVICE first (lines 119-120) and returns, then checks ENABLE_MULTI_DEVICE and MPI is None (lines 121-123). The second condition can be simplified since ENABLE_MULTI_DEVICE is already known to be True at this point. Apply this diff:

```diff
 if not ENABLE_MULTI_DEVICE:
     return obj
-if ENABLE_MULTI_DEVICE and MPI is None:
+if MPI is None:
     raise RuntimeError(
         "mpi4py is required when ENABLE_MULTI_DEVICE is True")
```
168-168: Remove unused loop variable. The loop variable i is not used in the loop body. Apply this diff:

```diff
-for i in range(num_chunks):
+for _ in range(num_chunks):
```
225-227: Simplify ENABLE_MULTI_DEVICE check. As in safe_broadcast, the second condition can be simplified. Apply this diff:

```diff
 if not ENABLE_MULTI_DEVICE:
     return [obj]
-if ENABLE_MULTI_DEVICE and MPI is None:
+if MPI is None:
     raise RuntimeError(
         "mpi4py is required when ENABLE_MULTI_DEVICE is True")
```
233-234: Remove duplicate validation. The chunk_size <= 0 validation appears twice (lines 228-229 and 233-234). Apply this diff to remove the duplicate:

```diff
 rank = comm.Get_rank()
 size = comm.Get_size()
-if chunk_size <= 0:
-    raise ValueError("chunk_size must be > 0")
```
242-242: Remove unnecessary int() cast. The value being cast is already an integer. Apply this diff:

```diff
-_ = comm.allgather(int(-1))
+_ = comm.allgather(-1)
```
309-311: Consider consistent handling of empty payloads. The function returns None for ranks with zero-size data. Consider documenting this behavior or using pickle.loads(b'') to maintain consistency with the serialization format.
Please clarify whether returning None for empty payloads is the intended behavior, or whether you want to preserve whatever was originally serialized (even if it was an empty object).
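One nuance worth noting when weighing those options (an editorial illustration, not from the review): an empty buffer is not a valid pickle stream, so unconditionally unpickling a zero-size payload would raise:

```python
import pickle

try:
    pickle.loads(b"")  # zero bytes cannot be deserialized
except EOFError as exc:
    print(exc)  # "Ran out of input"
```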
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/distributed/communicator.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/distributed/communicator.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/distributed/communicator.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/distributed/communicator.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/distributed/communicator.py (1)
tensorrt_llm/_utils.py (9)
mpi_allgather (540-541), mpi_barrier (526-528), mpi_comm (482-483), mpi_isend (544-549), mpi_isend_object (572-575), mpi_recv (560-564), mpi_recv_object (578-581), mpi_send (552-557), mpi_send_object (567-569)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/distributed/communicator.py
13-13: Do not catch blind exception: Exception
(BLE001)
122-123: Avoid specifying long messages outside the exception class
(TRY003)
125-125: Avoid specifying long messages outside the exception class
(TRY003)
142-142: Avoid specifying long messages outside the exception class
(TRY003)
142-142: Use explicit conversion flag
Replace with conversion flag
(RUF010)
151-151: Avoid specifying long messages outside the exception class
(TRY003)
168-168: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
189-191: Avoid specifying long messages outside the exception class
(TRY003)
193-193: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
195-195: Avoid specifying long messages outside the exception class
(TRY003)
195-195: Use explicit conversion flag
Replace with conversion flag
(RUF010)
198-200: Avoid specifying long messages outside the exception class
(TRY003)
202-202: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
204-204: Avoid specifying long messages outside the exception class
(TRY003)
204-204: Use explicit conversion flag
Replace with conversion flag
(RUF010)
226-227: Avoid specifying long messages outside the exception class
(TRY003)
229-229: Avoid specifying long messages outside the exception class
(TRY003)
234-234: Avoid specifying long messages outside the exception class
(TRY003)
242-242: Value being cast to int is already an integer
Remove unnecessary int call
(RUF046)
243-243: Avoid specifying long messages outside the exception class
(TRY003)
250-250: Avoid specifying long messages outside the exception class
(TRY003)
250-250: f-string without any placeholders
Remove extraneous f prefix
(F541)
316-316: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
318-319: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/distributed/communicator.py (5)
193-193: Security consideration: pickle deserialization. The use of pickle.loads() can execute arbitrary code if the data is from an untrusted source. Since this is MPI communication between trusted ranks within the same application, the risk is minimal. The nosec comments acknowledge this security consideration.
Also applies to: 202-202, 316-316
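Should untrusted peers ever become a concern, a common mitigation (not part of this PR) is the documented pickle.Unpickler.find_class hook; a minimal allowlist sketch:

```python
import io
import pickle


class RestrictedUnpickler(pickle.Unpickler):
    """Only resolve globals from a small allowlist during unpickling."""
    ALLOWED = {("builtins", "dict"), ("builtins", "list"), ("builtins", "str")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Drop-in replacement for pickle.loads with the allowlist enforced."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```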
283-283: Verify correct displacement calculation for chunked gather. The displacement calculation round_displs64 = displs + np.minimum(np.maximum(lengths, 0), round_offs) appears to add the round offset to the base displacement. This logic seems correct for keeping track of where each rank's data should go in the receive buffer across rounds.
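A quick numeric check of that formula with toy values (illustrative, not taken from the PR's tests):

```python
import numpy as np

displs = np.array([0, 10, 30])   # base offset of each rank's region
lengths = np.array([10, 20, 5])  # per-rank payload sizes
round_offs = 8                   # bytes already gathered in earlier rounds
round_displs64 = displs + np.minimum(np.maximum(lengths, 0), round_offs)
print(round_displs64)  # [ 8 18 35]; rank 2 (length 5 < 8) stays pinned at its end
```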
105-205: Well-implemented chunked broadcast solution. The safe_broadcast implementation properly handles:
- Error propagation across all ranks via the header mechanism
- Memory-efficient chunking to avoid large allocations
- Proper cleanup and validation of data sizes
- Clear separation between root and non-root logic
207-323: Robust chunked gather implementation. The safe_gather implementation effectively handles:
- Deterministic round calculation ensuring all ranks execute the same number of Gatherv operations
- Proper displacement and count calculations for variable-size payloads
- Error handling that maintains collective alignment
- Efficient use of Allgather for metadata sharing
332-334: Updated MPIDist methods preserve backward compatibility. The addition of chunk_size parameters with sensible defaults (4 MB) to broadcast, tp_gather, and tp_broadcast maintains backward compatibility while enabling the new chunked transfer functionality.
Also applies to: 374-376, 378-380
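As a usage illustration (hypothetical call sites: dist is assumed to be an MPIDist instance, and the parameter names follow the review's description of the new signatures):

```python
# Default: payloads move in 4 MB chunks behind the scenes.
cfg = dist.broadcast(large_config, root=0)

# Explicit 1 MB chunks, e.g. for memory-constrained runs.
cfg = dist.broadcast(large_config, root=0, chunk_size=1 << 20)
stats = dist.tp_gather(local_stats, chunk_size=1 << 20)
```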
PR_Github #19261 [ run ] completed with state
…ct (NVIDIA#7507) Signed-off-by: Dongxu Yang <[email protected]>
f23a211 to 67d9573
/bot run --disable-fail-fast
PR_Github #19336 [ run ] triggered by Bot
PR_Github #19336 [ run ] completed with state
… issue on large object (NVIDIA#7854) Signed-off-by: Dongxu Yang <[email protected]>
…ct (#7507)
Summary by CodeRabbit
Description
Cherrypick PR #7507 to main.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.