UPSTREAM PR #19230: cuda: skip CUDA_SCALE_LAUNCH_QUEUES scaling on Jetson/Tegra#1113
Conversation
No meaningful performance changes were detected across 115327 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so. 🔎 Full breakdown: Loci Inspector.
Force-pushed bf8b018 to dcfc127
The `CUDA_SCALE_LAUNCH_QUEUES=4x` optimization for multi-GPU pipeline parallelism causes deadlocks in the MMQ kernel path on single-GPU systems (tested on Jetson Orin AGX 64GB, SM87, unified memory).

Root cause: `FORCE_CUBLAS=ON` (which bypasses MMQ kernels) avoids the deadlock entirely, confirming this is specific to the custom matrix multiply kernel dispatch with scaled launch queues on single-GPU systems, not a general CUDA driver issue.

Fix: count NVIDIA GPU device nodes via `/dev/nvidia*` before any CUDA API call, and only apply the 4x scaling when multiple GPUs are detected. This preserves the multi-GPU pipeline parallelism optimization while avoiding the single-GPU deadlock. On Windows, the scaling is applied unconditionally (preserving current behavior) with a TODO to add device counting via registry/SetupAPI.

Tested on Jetson AGX Orin 64GB (SM87, CUDA 12.6, driver 540.4.0):
- Without fix: deadlocks on warmup decode with MoE models
- With fix: 20.8 tok/s generation, no deadlock
- `CUDA_SCALE_LAUNCH_QUEUES=2x` also deadlocks (any scaling > 1x)
- `FORCE_CUBLAS=ON` avoids deadlock even with 4x (confirms MMQ path)

Fixes: ggml-org/llama.cpp#19219
Force-pushed dc6b403 to e20398b
No meaningful performance changes were detected across 115327 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-tts, build.bin.libllama.so, build.bin.llama-bench, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-gemma3-cli, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.libggml-base.so, build.bin.llama-tokenize, build.bin.llama-qwen2vl-cli. 🔎 Full breakdown: Loci Inspector.
Force-pushed 6a853c2 to 8603870
Force-pushed 048ad94 to 6c1fde6
Force-pushed 0cb533b to ef7afbe
Note
Source pull request: ggml-org/llama.cpp#19230
Summary
On Jetson Orin AGX (SM87, unified memory), the `CUDA_SCALE_LAUNCH_QUEUES=4x` environment variable introduced in #19042 causes `llama-server` to deadlock during warmup decode. This affects all MoE models tested (GLM-4.7-Flash, Qwen3-Coder-30B-A3B).

This PR detects Tegra/Jetson systems at runtime via `/sys/devices/soc0/family` (which returns `"Tegra"` on all Jetson devices) and skips the launch queue scaling. This preserves the multi-GPU pipeline parallelism optimization on desktop systems while avoiding the deadlock on Jetson unified memory.

Alternative to the full revert in #19227.
Environment tested
Build flags
Verification
a83c73a18 identified as the single breaking commit

How the fix works
Before any CUDA API call, the code reads `/sys/devices/soc0/family`, a sysfs node exposed by the kernel's SoC driver. On Tegra/Jetson it returns `"Tegra"`; on desktop/server systems the file either doesn't exist or returns a different value. No CUDA calls are needed for detection.

Fixes #19219