Conversation

@DajanaV (Collaborator) commented Oct 29, 2025

Mirrored from ggml-org/llama.cpp#16796

Overview

We see a crash in ggml_vk_guess_matmul_id_pipeline_align when a vk_matmul_pipeline has all of its member pipelines empty.
[screenshot: crash point]

This happens because the following logic in ggml_vk_get_mul_mat_mat_id_pipeline checks for a nullptr, a condition that never seems to hold: the reference itself is valid (so the code concludes that fp16acc is supported), but all of its member pipelines are empty, which leads to the crash later in ggml_vk_guess_matmul_id_pipeline_align.

    bool support_fp16acc = ctx->device->pipeline_dequant_mul_mat_mat_id[src0_type].f16acc != nullptr;
    bool support_fp32acc = ctx->device->pipeline_dequant_mul_mat_mat_id[src0_type].f32acc != nullptr;
[screenshot: mismatch between support flags and actual pipeline contents]
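
To make the failure mode concrete, here is a minimal, self-contained sketch. The vk_pipeline/vk_matmul_pipeline definitions are simplified stand-ins for the shared_ptr-based types in ggml-vulkan.cpp, and the stricter check is illustrative, not necessarily the fix applied in this PR.

    // Simplified stand-ins for the shared_ptr-based types in ggml-vulkan.cpp.
    #include <cassert>
    #include <memory>

    struct vk_pipeline_struct { /* compiled pipeline state elided */ };
    using vk_pipeline = std::shared_ptr<vk_pipeline_struct>;

    struct vk_matmul_pipeline_struct {
        vk_pipeline l, m, s; // large/medium/small shader variants
    };
    using vk_matmul_pipeline = std::shared_ptr<vk_matmul_pipeline_struct>;

    // The existing check only tests the outer reference, so an allocated but
    // hollow pipeline set still reports "supported".
    static bool support_fp16acc_current(const vk_matmul_pipeline &p) {
        return p != nullptr;
    }

    // A stricter, illustrative check also requires at least one populated
    // member pipeline.
    static bool support_fp16acc_strict(const vk_matmul_pipeline &p) {
        return p && (p->l || p->m || p->s);
    }

    int main() {
        auto p = std::make_shared<vk_matmul_pipeline_struct>(); // members all empty
        assert(support_fp16acc_current(p));  // passes: the reference is non-null
        assert(!support_fp16acc_strict(p));  // correctly reports "unsupported"
        return 0;
    }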

We originally found this issue when running ggml-org/gpt-oss-20b-GGUF with src0_type set to GGML_TYPE_MXFP4, though it can potentially happen with any src0_type.

Steps to reproduce

The crash currently occurs on the following unit test with the Windows Lunar Lake driver 32.0.101.5730. It should also reproduce on Ubuntu with Mesa 25.0.7-0ubuntu0.24.04.2 drivers.

  • build\bin\Debug\test-backend-ops.exe -o MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=4,k=256,o=1)
λ build\bin\Debug\test-backend-ops.exe -o MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=4,k=256,o=1)
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-vulkan.dll
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Arc(TM) 140V GPU (16GB))
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-alderlake.dll score: 108
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-haswell.dll score: 44
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-icelake.dll score: 0
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-sandybridge.dll score: 17
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-skylakex.dll score: 0
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-sse42.dll score: 5
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-x64.dll score: 1
load_backend: loaded CPU backend from C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-alderlake.dll
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Core(TM) Ultra 7 268V 2.20GHz)
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) 140V GPU (16GB)
  Device memory: 16191 MB (17677 MB free)

(crashes here)

@loci-agentic-ai-dev

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: llama.cpp PR #10

Key Findings

Performance Impact Assessment

  • Minimal Performance Degradation: All identified performance changes are sub-nanosecond level and within measurement variance
    • Response Time: Worst degradation in _Vector_impl_data@plt (+0.066%, +0.005ns)
    • Throughput: Worst degradation in __invoke_r@plt (+0.066%, +0.005ns)
    • Bottleneck: Worst degradation in _Construct (+0.131%, +0.026ns)
  • Power Consumption: Negligible decrease of 0.55nJ (-0.0002%) indicating stable energy efficiency

Core Function Impact Analysis

  • No Impact on Critical Inference Path: The performance degradations occur in standard library PLT stubs and C++ constructor functions, not in llama.cpp's core inference components
  • Vulkan Backend Isolation: Code changes are confined to ggml-vulkan.cpp pipeline selection logic, separate from CPU-based inference and core model operations
  • Matrix Operations Preserved: No changes to performance-critical matrix multiplication kernels, attention mechanisms, or quantization routines

Technical Analysis Results

Flame Graph Analysis:

  • Single-level execution profile for the degraded function confirms PLT stub behavior
  • 7.37ns execution time represents minimal, optimized constructor overhead
  • No recursive patterns or complex branching detected

CFG Comparison:

  • Identical control flow graphs between versions for the degraded function
  • Byte-for-byte identical assembly code in PLT resolution
  • Performance variance attributed to micro-architectural timing rather than code changes

Code Review Findings:

  • Crash Prevention: PR addresses critical Vulkan backend crash on Intel Arc GPUs and Mesa drivers
  • Hardware Compatibility: Enhanced capability detection for FP16 accumulation support
  • Defensive Programming: Improved error handling for unsupported GPU configurations

Critical Issues Identified

  • Assertion Risk: New GGML_ASSERT(false) may cause abrupt termination on unsupported hardware
  • Limited Hardware Validation: Changes tested primarily on Intel Arc 140V GPU configuration
  • Pipeline Assumption: Fallback logic assumes complete pipeline availability for certain hardware types

Actionable Steps

Priority 1: Risk Mitigation

  1. Replace Hard Assertions: Convert GGML_ASSERT(false) to a graceful CPU backend fallback; caller-side handling is sketched after this list

    // Recommended change in ggml-vulkan.cpp:5117
    if (!ctx->device->coopmat_acc_f16_support && !ctx->device->coopmat_acc_f32_support) {
        return nullptr; // Allow graceful fallback
    }
  2. Expand Hardware Testing: Validate changes on AMD and NVIDIA GPU configurations to prevent regressions
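
As referenced in item 1, a minimal, self-contained sketch of the caller-side fallback pattern follows; the function names and signatures are illustrative stand-ins, not the actual ggml-vulkan API.

    // Sketch of the graceful-fallback pattern; names are hypothetical.
    #include <cstdio>
    #include <memory>

    struct pipeline { /* compiled state elided */ };

    // Stand-in for pipeline selection: return nullptr instead of asserting
    // when neither accumulation mode has usable pipelines.
    static std::shared_ptr<pipeline> select_mul_mat_id_pipeline(bool f16acc_ok, bool f32acc_ok) {
        if (!f16acc_ok && !f32acc_ok) {
            return nullptr; // signal "unsupported" to the caller
        }
        return std::make_shared<pipeline>();
    }

    int main() {
        auto p = select_mul_mat_id_pipeline(false, false);
        if (!p) {
            // The caller declines the op (e.g. via the backend's supports_op
            // hook) so the scheduler can route it to the CPU backend instead
            // of aborting inside the Vulkan backend.
            std::puts("MUL_MAT_ID unsupported on this device; falling back to CPU");
        }
        return 0;
    }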

Priority 2: Monitoring and Validation

  1. Add Debug Logging: Implement pipeline selection decision logging for troubleshooting (a small sketch follows this list)
  2. Runtime Validation: Add pipeline integrity checks during device initialization
  3. Performance Benchmarking: Profile FP16 vs FP32 accumulation performance across hardware variants
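
A small sketch of what the selection logging from item 1 might look like; the helper name and message format are hypothetical.

    // Hypothetical logging helper for pipeline selection decisions.
    #include <cstdio>

    static void log_mul_mat_id_selection(int src0_type, bool f16acc, bool f32acc) {
        std::fprintf(stderr,
            "ggml_vulkan: mul_mat_id pipeline select: src0_type=%d f16acc=%d f32acc=%d\n",
            src0_type, f16acc, f32acc);
    }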

Priority 3: Code Quality Enhancement

  1. Capability Matrix: Implement a comprehensive hardware capability detection framework (see the sketch after this list)
  2. Error Recovery: Design robust fallback mechanisms for partial GPU support scenarios
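
One possible shape for the capability matrix from item 1, sketched with hypothetical names; a real implementation would populate it during device initialization from the pipelines that actually compiled.

    // Hypothetical capability matrix: record, per quantization type, which
    // accumulation modes have verified non-empty pipelines, and consult this
    // table at selection time instead of testing shared_ptr references.
    #include <array>
    #include <cstddef>

    enum acc_mode { ACC_F16, ACC_F32, ACC_MODE_COUNT };
    constexpr std::size_t TYPE_COUNT = 40; // stand-in for GGML_TYPE_COUNT

    struct device_capability_matrix {
        // Entries are set to true only after verifying that the corresponding
        // pipeline set contains at least one compiled variant.
        std::array<std::array<bool, ACC_MODE_COUNT>, TYPE_COUNT> mul_mat_id{};

        bool supports(std::size_t src0_type, acc_mode mode) const {
            return src0_type < TYPE_COUNT && mul_mat_id[src0_type][mode];
        }
    };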

Overall Assessment

Change Impact Evaluation

  • Positive Reliability Impact: Eliminates critical crashes on specific GPU configurations while maintaining performance
  • Minimal Performance Cost: Sub-nanosecond timing variations are within normal measurement noise
  • Targeted Scope: Changes are well-isolated to Vulkan backend without affecting core inference algorithms

Maintainability Considerations

  • Code Complexity: Moderate increase in conditional logic balanced by improved hardware compatibility
  • Technical Debt: Enhanced capability detection reduces future debugging overhead
  • Documentation Need: Hardware compatibility matrix documentation recommended

Future Performance Outlook

  • Stable Foundation: Core llama.cpp inference performance remains unaffected
  • Scalability Preserved: Matrix multiplication and attention mechanisms maintain optimization potential
  • Hardware Evolution Ready: Enhanced capability detection framework supports future GPU architectures

Recommendation: Approve PR with suggested assertion handling improvements. The reliability gains significantly outweigh the minimal complexity increase, and performance impact is negligible for the core inference pipeline.

@DajanaV force-pushed the main branch 2 times, most recently from 1983956 to 326a60a on October 29, 2025 12:13
@DajanaV added the dev-stale label Oct 30, 2025
@DajanaV deleted the branch main October 30, 2025 15:25
@DajanaV closed this Oct 30, 2025
@DajanaV deleted the upstream-PR16796-branch_rillomas-fix-accf16-capability-crash branch October 30, 2025 15:25