[nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs #5964

ixlmar · 2025-07-11T12:21:31Z

nvbugs-5318143 fix: restrict PyTorch memory usage to avoid OOMs

Description

Please explain the issue and the solution in short.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing results without careful validation can break the top of tree.

ixlmar · 2025-07-11T12:26:35Z

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"

ixlmar · 2025-07-11T12:57:04Z

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"

tensorrt-cicd · 2025-07-11T13:02:48Z

PR_Github #11651 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-11T15:41:05Z

PR_Github #11651 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #229 (Partly Tested) completed with status: 'SUCCESS'

ixlmar · 2025-07-11T21:53:21Z

/bot run

tensorrt-cicd · 2025-07-11T21:58:46Z

PR_Github #11676 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-12T07:23:16Z

PR_Github #11676 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #234 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

tensorrt_llm/_torch/pyexecutor/_util.py

juney-nvidia · 2025-07-14T08:26:44Z

This bug has been pushed to 1.0, so please retarget this PR to land directly on the main branch rather than on 0.21.

Signed-off-by: ixlmar <[email protected]>

ixlmar · 2025-07-14T08:45:47Z

/bot run

tensorrt-cicd · 2025-07-14T08:50:57Z

PR_Github #11790 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-14T15:10:03Z

PR_Github #11790 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8735 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

…IDIA#5964) Signed-off-by: ixlmar <[email protected]>

ixlmar requested review from MartinMarciniszyn and dcampora July 11, 2025 12:32

ixlmar marked this pull request as ready for review July 11, 2025 14:19

ixlmar requested review from a team as code owners July 11, 2025 14:19

QiJune requested a review from HuiGao-NV July 13, 2025 11:01

QiJune reviewed Jul 13, 2025

tensorrt_llm/_torch/pyexecutor/_util.py (resolved)

ixlmar force-pushed the fix/nvbugs-5318143 branch from ce1303a to 71fc568 Compare July 14, 2025 08:44

ixlmar changed the base branch from release/0.21 to main July 14, 2025 08:44

fix: restrict PyTorch memory usage to avoid OOMs

560b281

Signed-off-by: ixlmar <[email protected]>

ixlmar force-pushed the fix/nvbugs-5318143 branch from 71fc568 to 560b281 Compare July 14, 2025 08:45

ixlmar requested a review from QiJune July 14, 2025 10:10

HuiGao-NV approved these changes Jul 14, 2025

HuiGao-NV merged commit f225f5c into NVIDIA:main Jul 14, 2025
3 checks passed

ixlmar deleted the fix/nvbugs-5318143 branch July 15, 2025 06:05

evezhier pushed a commit to evezhier/TensorRT-LLM that referenced this pull request Jul 16, 2025

[nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs (NV…

e863069

…IDIA#5964) Signed-off-by: ixlmar <[email protected]>

ixlmar mentioned this pull request Jul 16, 2025

[fix] Update jenkins container images #6094

Merged

5 participants