
UPSTREAM PR #19230: cuda: skip CUDA_SCALE_LAUNCH_QUEUES scaling on Jetson/Tegra#1113

Open
loci-dev wants to merge 1 commit into main from loci/pr-19230-fix-skip-launch-queue-scaling-tegra

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#19230

Summary

On Jetson Orin AGX (SM87, unified memory), the CUDA_SCALE_LAUNCH_QUEUES=4x environment variable introduced in #19042 causes llama-server to deadlock during warmup decode. This affects all MoE models tested (GLM-4.7-Flash, Qwen3-Coder-30B-A3B).

This PR detects Tegra/Jetson systems at runtime via /sys/devices/soc0/family (which returns "Tegra" on all Jetson devices) and skips the launch queue scaling. This preserves the multi-GPU pipeline parallelism optimization on desktop systems while avoiding the deadlock on Jetson unified memory.

Alternative to full revert in #19227.

Environment tested

Component        Version
---------------  --------------------
Device           Jetson AGX Orin 64GB
JetPack / L4T    R36.4.7
Kernel           5.15.148-tegra
NVIDIA Driver    540.4.0
CUDA Toolkit     12.6.2 (V12.6.77)
cuDNN            9.4.0.58
GCC/G++          11.4.0
CMake            4.1.0

Build flags

cmake -B build -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=87 \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_CUDA_F16=ON \
  -DCMAKE_BUILD_TYPE=Release

Verification

  • Without fix: server hangs indefinitely on warmup decode
  • With fix: 20.4 tok/s generation, works perfectly
  • Bisect: 9-step bisect confirmed a83c73a18 as the single breaking commit

How the fix works

Before any CUDA API call, the code reads /sys/devices/soc0/family, a sysfs node that the kernel exposes for the SoC. On Tegra/Jetson it returns "Tegra"; on desktop/server systems the file either doesn't exist or returns a different value. No CUDA calls are needed for the detection.

Fixes #19219

@loci-review

loci-review bot commented Jan 31, 2026

No meaningful performance changes were detected across 115327 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 4 times, most recently from bf8b018 to dcfc127 Compare January 31, 2026 18:12
The CUDA_SCALE_LAUNCH_QUEUES=4x optimization for multi-GPU pipeline
parallelism causes deadlocks in the MMQ kernel path on single-GPU
systems (tested on Jetson Orin AGX 64GB, SM87, unified memory).

Root cause: FORCE_CUBLAS=ON (which bypasses MMQ kernels) avoids the
deadlock entirely, confirming this is specific to the custom matrix
multiply kernel dispatch with scaled launch queues on single-GPU
systems — not a general CUDA driver issue.

Fix: count NVIDIA GPU device nodes via /dev/nvidia* before any CUDA
API call. Only apply the 4x scaling when multiple GPUs are detected.
This preserves the multi-GPU pipeline parallelism optimization while
avoiding the single-GPU deadlock.

On Windows, the scaling is applied unconditionally (preserving current
behavior) with a TODO to add device counting via registry/SetupAPI.

Tested on Jetson AGX Orin 64GB (SM87, CUDA 12.6, driver 540.4.0):
- Without fix: deadlocks on warmup decode with MoE models
- With fix: 20.8 tok/s generation, no deadlock
- CUDA_SCALE_LAUNCH_QUEUES=2x also deadlocks (any scaling > 1x)
- FORCE_CUBLAS=ON avoids deadlock even with 4x (confirms MMQ path)

Fixes: ggml-org/llama.cpp#19219
@loci-dev loci-dev force-pushed the loci/pr-19230-fix-skip-launch-queue-scaling-tegra branch from dc6b403 to e20398b Compare January 31, 2026 18:46
@loci-review

loci-review bot commented Jan 31, 2026

No meaningful performance changes were detected across 115327 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-tts, build.bin.libllama.so, build.bin.llama-bench, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-gemma3-cli, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.libggml-base.so, build.bin.llama-tokenize, build.bin.llama-qwen2vl-cli.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 18 times, most recently from 6a853c2 to 8603870 Compare February 1, 2026 14:11
@loci-dev loci-dev force-pushed the main branch 28 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 0cb533b to ef7afbe Compare February 13, 2026 02:17
