UPSTREAM PR #19230: cuda: skip CUDA_SCALE_LAUNCH_QUEUES scaling on Jetson/Tegra#1113
Conversation
No meaningful performance changes were detected across 115327 analyzed functions in the following binaries: build.bin.libllama.so, build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so. 🔎 Full breakdown: Loci Inspector.
Force-pushed bf8b018 to dcfc127
The `CUDA_SCALE_LAUNCH_QUEUES=4x` optimization for multi-GPU pipeline parallelism causes deadlocks in the MMQ kernel path on single-GPU systems (tested on Jetson Orin AGX 64GB, SM87, unified memory).

Root cause: `FORCE_CUBLAS=ON` (which bypasses MMQ kernels) avoids the deadlock entirely, confirming this is specific to the custom matrix multiply kernel dispatch with scaled launch queues on single-GPU systems, not a general CUDA driver issue.

Fix: count NVIDIA GPU device nodes via `/dev/nvidia*` before any CUDA API call, and only apply the 4x scaling when multiple GPUs are detected. This preserves the multi-GPU pipeline parallelism optimization while avoiding the single-GPU deadlock. On Windows, the scaling is applied unconditionally (preserving current behavior) with a TODO to add device counting via registry/SetupAPI.

Tested on Jetson AGX Orin 64GB (SM87, CUDA 12.6, driver 540.4.0):
- Without fix: deadlocks on warmup decode with MoE models
- With fix: 20.8 tok/s generation, no deadlock
- `CUDA_SCALE_LAUNCH_QUEUES=2x` also deadlocks (any scaling > 1x)
- `FORCE_CUBLAS=ON` avoids deadlock even with 4x (confirms MMQ path)

Fixes: ggml-org/llama.cpp#19219
Force-pushed dc6b403 to e20398b
No meaningful performance changes were detected across 115327 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-tts, build.bin.libllama.so, build.bin.llama-bench, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-gemma3-cli, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.libggml-base.so, build.bin.llama-tokenize, build.bin.llama-qwen2vl-cli. 🔎 Full breakdown: Loci Inspector.
Force-pushed 6a853c2 to 8603870
Force-pushed 048ad94 to 6c1fde6
Force-pushed 0cb533b to ef7afbe
Note
Source pull request: ggml-org/llama.cpp#19230
Summary
On Jetson Orin AGX (SM87, unified memory), the `CUDA_SCALE_LAUNCH_QUEUES=4x` environment variable introduced in #19042 causes `llama-server` to deadlock during warmup decode. This affects all MoE models tested (GLM-4.7-Flash, Qwen3-Coder-30B-A3B).

This PR detects Tegra/Jetson systems at runtime via `/sys/devices/soc0/family` (which returns `"Tegra"` on all Jetson devices) and skips the launch queue scaling. This preserves the multi-GPU pipeline parallelism optimization on desktop systems while avoiding the deadlock on Jetson unified memory.

Alternative to the full revert in #19227.
Environment tested
Build flags
Verification
a83c73a18 identified as the single breaking commit

How the fix works
Before any CUDA API call, the code reads `/sys/devices/soc0/family`, a sysfs node exposed by the kernel's SoC driver. On Tegra/Jetson it returns `"Tegra"`; on desktop/server systems the file either doesn't exist or returns a different value. No CUDA calls are needed for detection.

Fixes #19219