
[Bugfix] Disable TMA on Blackwell GPUs to fix Triton autotuner OOM in fla/solve_tril #36325

Open
Rks2302 wants to merge 3 commits into vllm-project:main from Rks2302:fix/blackwell-tma-oom

Conversation


@Rks2302 Rks2302 commented Mar 7, 2026

Summary

Fixes Triton autotuner OOM crash in fla/ops/solve_tril.py when running
Qwen3.5 models on Blackwell GPUs (RTX 5090, compute capability sm_12x).

Root Cause

is_tma_supported evaluates to True on any GPU with compute capability >= 9,
which includes Blackwell (sm_12x). During first inference, the Triton autotuner
benchmarks the merge_fn kernel in solve_tril with TMA enabled, causing
oversized descriptor buffer allocations that OOM even when model weights fit
comfortably in VRAM.

Error

RuntimeError: Triton Error [CUDA]: out of memory
File "fla/ops/solve_tril.py", line 545, in solve_tril
merge_fn[NT, B * H](..., USE_TMA=is_tma_supported)
File "triton/runtime/autotuner.py"
timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}

Fix

Add an upper bound (compute capability < 12) so TMA is enabled only on Hopper
(sm_90x). TMA works correctly on Hopper but triggers the Triton autotuner OOM
on Blackwell (sm_12x).
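The fix boils down to turning the open-ended capability check into a range check. A minimal pure-Python sketch of the gating logic (with `major` standing in for `torch.cuda.get_device_capability(0)[0]`, and `has_descriptor_api` standing in for the Triton tensor-descriptor attribute checks; both names are hypothetical stand-ins, not the real vLLM code):

```python
def tma_gate(major: int, has_descriptor_api: bool = True) -> bool:
    """Stand-in for the fixed is_tma_supported condition.

    TMA is enabled only for compute capability in [9, 12): Hopper qualifies,
    Blackwell (major == 12) is excluded to avoid the autotuner OOM, and
    pre-Hopper parts lack TMA hardware entirely.
    """
    return 9 <= major < 12 and has_descriptor_api

print(tma_gate(9))    # Hopper sm_90 -> True
print(tma_gate(12))   # Blackwell sm_120 -> False (disabled by this fix)
print(tma_gate(8))    # Ampere sm_8x -> False (no TMA hardware)
```

Writing the bounds as a single chained comparison also avoids calling the device-capability query twice, which is the refactor the review bot suggests below.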

Testing

  • GPU: NVIDIA RTX 5090 (Blackwell, sm_120)
  • vLLM: 0.17.0
  • Model: Qwen3.5-35B-A3B-AWQ, Qwen3.5-27B-AWQ
  • CUDA: 12.8

After this fix, Qwen3.5 AWQ models run successfully on RTX 5090 without
--enforce-eager. Full inference pipeline verified working.

Related Issues

@github-actions

github-actions bot commented Mar 7, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the bug Something isn't working label Mar 7, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an out-of-memory error on Blackwell GPUs by disabling the Tensor Memory Accelerator (TMA) path for that architecture. The fix is correct and targeted. My review includes a suggestion to refactor the check to improve its readability and efficiency by avoiding a redundant function call.

Comment on lines +155 to +158
is_tma_supported = (
    (is_nvidia and torch.cuda.get_device_capability(0)[0] >= 9)
    and (
        hasattr(triton.language, "_experimental_make_tensor_descriptor")
        or hasattr(triton.language, "make_tensor_descriptor")
    )
) and torch.cuda.get_device_capability(0)[0] < 12  # Disable on Blackwell (sm_12x): Triton autotuner OOM
Contributor


high

While this fix is correct, the implementation can be improved for readability and efficiency. The expression for is_tma_supported now calls torch.cuda.get_device_capability(0)[0] twice and the formatting makes the line very long. It's better to combine the two compute capability checks into a single range check to avoid the redundant call and make the condition clearer.

Suggested change
is_tma_supported = (
    (is_nvidia and torch.cuda.get_device_capability(0)[0] >= 9)
    and (
        hasattr(triton.language, "_experimental_make_tensor_descriptor")
        or hasattr(triton.language, "make_tensor_descriptor")
    )
) and torch.cuda.get_device_capability(0)[0] < 12  # Disable on Blackwell (sm_12x): Triton autotuner OOM
# Disable on Blackwell (sm_12x): Triton autotuner OOM
is_tma_supported = (is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12) and (
hasattr(triton.language, "_experimental_make_tensor_descriptor")
or hasattr(triton.language, "make_tensor_descriptor")
)

@mergify

mergify bot commented Mar 7, 2026

Hi @Rks2302, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

1 similar comment

Rks2302 added 3 commits March 7, 2026 18:40
…M in solve_tril

Signed-off-by: Rks2302 <rahulksharma2302@gmail.com>
Signed-off-by: Rks2302 <rahulksharma2302@gmail.com>
Signed-off-by: Rks2302 <rahulksharma2302@gmail.com>
@Rks2302 Rks2302 force-pushed the fix/blackwell-tma-oom branch from efd1eeb to ae40230 Compare March 7, 2026 13:10
@ZJY0516
Member

ZJY0516 commented Mar 9, 2026

I remember Hopper also had this OOM issue. We should find a better way to both avoid OOM and maintain performance.


Labels

bug Something isn't working
