[Bugfix] Disable TMA on Blackwell GPUs (sm_12x) to fix Triton autotuner OOM in fla/solve_tril #36325
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of tests runs automatically, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request addresses an out-of-memory error on Blackwell GPUs by disabling the Tensor Memory Accelerator (TMA) for that architecture. The fix is correct and targeted. My review includes a suggestion to refactor the check for readability and efficiency by avoiding a redundant function call.
```diff
 is_tma_supported = (is_nvidia and torch.cuda.get_device_capability(0)[0] >= 9) and (
     hasattr(triton.language, "_experimental_make_tensor_descriptor")
     or hasattr(triton.language, "make_tensor_descriptor")
-)
+) and torch.cuda.get_device_capability(0)[0] < 12  # Disable on Blackwell (sm_12x): Triton autotuner OOM
```
While this fix is correct, the implementation can be improved for readability and efficiency. The expression for `is_tma_supported` now calls `torch.cuda.get_device_capability(0)[0]` twice, and the formatting makes the line very long. It's better to combine the two compute-capability checks into a single range check to avoid the redundant call and make the condition clearer.
```diff
-is_tma_supported = (is_nvidia and torch.cuda.get_device_capability(0)[0] >= 9) and (
-    hasattr(triton.language, "_experimental_make_tensor_descriptor")
-    or hasattr(triton.language, "make_tensor_descriptor")
-) and torch.cuda.get_device_capability(0)[0] < 12  # Disable on Blackwell (sm_12x): Triton autotuner OOM
+# Disable on Blackwell (sm_12x): Triton autotuner OOM
+is_tma_supported = (is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12) and (
+    hasattr(triton.language, "_experimental_make_tensor_descriptor")
+    or hasattr(triton.language, "make_tensor_descriptor")
+)
```
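The suggested range check also leans on a Python detail worth noting: a chained comparison such as `9 <= f() < 12` evaluates the middle expression only once, unlike spelling it as two separate comparisons. A minimal sketch (plain Python, no torch required; `get_capability_major` is a hypothetical stand-in for `torch.cuda.get_device_capability(0)[0]`):

```python
# Count how often the capability getter is invoked under each style.
calls = 0

def get_capability_major():
    """Hypothetical stand-in for torch.cuda.get_device_capability(0)[0]."""
    global calls
    calls += 1
    return 12  # pretend we are on Blackwell (sm_12x)

# Chained form: one call, short-circuits like `9 <= x and x < 12`.
in_range = 9 <= get_capability_major() < 12
assert in_range is False  # Blackwell is excluded
assert calls == 1

# Two-comparison form: the getter runs twice.
calls = 0
in_range = get_capability_major() >= 9 and get_capability_major() < 12
assert in_range is False
assert calls == 2
```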
Hi @Rks2302, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
1 similar comment
fix: disable TMA on Blackwell (sm_12x) to prevent Triton autotuner OOM in solve_tril

Signed-off-by: Rks2302 <rahulksharma2302@gmail.com>

force-pushed from efd1eeb to ae40230 Compare
I remember Hopper also had this OOM issue. We should find a better way to both avoid OOM and maintain performance.
Summary

Fixes a Triton autotuner OOM crash in `fla/ops/solve_tril.py` when running Qwen3.5 models on Blackwell GPUs (RTX 5090, compute capability sm_12x).

Root Cause

`is_tma_supported` evaluates to `True` on any GPU with compute capability >= 9, which includes Blackwell (sm_12x). During the first inference, the Triton autotuner benchmarks the `merge_fn` kernel in `solve_tril` with TMA enabled, causing oversized descriptor buffer allocations that OOM even when the model weights fit comfortably in VRAM.
Error

```
RuntimeError: Triton Error [CUDA]: out of memory
  File "fla/ops/solve_tril.py", line 545, in solve_tril
    merge_fn[NT, B * H](..., USE_TMA=is_tma_supported)
  File "triton/runtime/autotuner.py"
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
```
Fix

Add an upper bound of `< 12` to restrict TMA to Hopper (sm_90x) only. TMA works correctly on Hopper but causes Triton autotuner OOM on Blackwell (sm_12x).
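The combined gate can be sketched as a pure function. This is an illustration of the condition only, not vLLM's actual code: `tma_supported` is a hypothetical helper, and the capability major version and descriptor-API probe are passed in rather than queried from torch/triton.

```python
def tma_supported(is_nvidia: bool, cc_major: int, has_descriptor_api: bool) -> bool:
    """Mirror of the PR's condition: enable TMA only on NVIDIA Hopper-class parts.

    sm_90x (Hopper) has working TMA; sm_12x (Blackwell) is excluded because
    the Triton autotuner OOMs there; anything below sm_90 lacks TMA hardware.
    """
    return is_nvidia and 9 <= cc_major < 12 and has_descriptor_api

assert tma_supported(True, 9, True)       # Hopper: enabled
assert not tma_supported(True, 12, True)  # Blackwell: disabled by this PR
assert not tma_supported(True, 8, True)   # Ampere: no TMA hardware
assert not tma_supported(False, 9, True)  # non-NVIDIA: disabled
```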
Testing

After this fix, Qwen3.5 AWQ models run successfully on RTX 5090 without `--enforce-eager`. The full inference pipeline was verified working.

Related Issues