Conversation

@DajanaV (Collaborator) commented Oct 29, 2025

Mirrored from ggml-org/llama.cpp#16796

Overview

We see a crash in ggml_vk_guess_matmul_id_pipeline_align when a vk_matmul_pipeline has all of its member pipelines empty.
[screenshot: crash point]

This happens because the following logic in ggml_vk_get_mul_mat_mat_id_pipeline checks for a nullptr, a condition that never seems to hold: the reference itself is valid (so the code concludes that fp16acc is supported), but all of its member pipelines are empty, which leads to the crash later in ggml_vk_guess_matmul_id_pipeline_align.

    bool support_fp16acc = ctx->device->pipeline_dequant_mul_mat_mat_id[src0_type].f16acc != nullptr;
    bool support_fp32acc = ctx->device->pipeline_dequant_mul_mat_mat_id[src0_type].f32acc != nullptr;
[screenshot: mismatch between support flags and actual pipeline contents]
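
To make the failure mode concrete, here is a minimal, self-contained sketch. The vk_pipeline/vk_matmul_pipeline definitions are simplified stand-ins for the shared_ptr-based types in ggml-vulkan.cpp, and the stricter check is illustrative, not necessarily the fix applied in this PR.

    // Simplified stand-ins for the shared_ptr-based types in ggml-vulkan.cpp.
    #include <cassert>
    #include <memory>

    struct vk_pipeline_struct { /* compiled pipeline state elided */ };
    using vk_pipeline = std::shared_ptr<vk_pipeline_struct>;

    struct vk_matmul_pipeline_struct {
        vk_pipeline l, m, s; // large/medium/small shader variants
    };
    using vk_matmul_pipeline = std::shared_ptr<vk_matmul_pipeline_struct>;

    // The existing check only tests the outer reference, so an allocated but
    // hollow pipeline set still reports "supported".
    static bool support_fp16acc_current(const vk_matmul_pipeline &p) {
        return p != nullptr;
    }

    // A stricter, illustrative check also requires at least one populated
    // member pipeline.
    static bool support_fp16acc_strict(const vk_matmul_pipeline &p) {
        return p && (p->l || p->m || p->s);
    }

    int main() {
        auto p = std::make_shared<vk_matmul_pipeline_struct>(); // members all empty
        assert(support_fp16acc_current(p));  // passes: the reference is non-null
        assert(!support_fp16acc_strict(p));  // correctly reports "unsupported"
        return 0;
    }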

We originally found this issue when running ggml-org/gpt-oss-20b-GGUF with src0_type set to GGML_TYPE_MXFP4, though it can potentially happen with any src0_type.

Steps to reproduce

The crash currently occurs on the following unit test with the Windows Lunar Lake driver 32.0.101.5730. It should also reproduce on Ubuntu with Mesa 25.0.7-0ubuntu0.24.04.2 drivers.

  • build\bin\Debug\test-backend-ops.exe -o MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=4,k=256,o=1)
λ build\bin\Debug\test-backend-ops.exe -o MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=4,k=256,o=1)
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) 140V GPU (16GB) (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-vulkan.dll
register_backend: registered backend Vulkan (1 devices)
register_device: registered device Vulkan0 (Intel(R) Arc(TM) 140V GPU (16GB))
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-alderlake.dll score: 108
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-haswell.dll score: 44
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-icelake.dll score: 0
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-sandybridge.dll score: 17
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-skylakex.dll score: 0
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-sse42.dll score: 5
ggml_backend_load_best: C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-x64.dll score: 1
load_backend: loaded CPU backend from C:\Users\Administrator\repo\llama.cpp_mine\build\bin\Debug\ggml-cpu-alderlake.dll
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Core(TM) Ultra 7 268V 2.20GHz)
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) 140V GPU (16GB)
  Device memory: 16191 MB (17677 MB free)

(crashes here)

@loci-agentic-ai-dev

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: llama.cpp PR #10

Key Findings

Performance Impact Assessment

  • Minimal Performance Degradation: All identified performance changes are sub-nanosecond level and within measurement variance
    • Response Time: Worst degradation in _Vector_impl_data@plt (+0.066%, +0.005ns)
    • Throughput: Worst degradation in __invoke_r@plt (+0.066%, +0.005ns)
    • Bottleneck: Worst degradation in _Construct (+0.131%, +0.026ns)
  • Power Consumption: Negligible decrease of 0.55nJ (-0.0002%) indicating stable energy efficiency

Core Function Impact Analysis

  • No Impact on Critical Inference Path: The performance degradations occur in standard library PLT stubs and C++ constructor functions, not in llama.cpp's core inference components
  • Vulkan Backend Isolation: Code changes are confined to ggml-vulkan.cpp pipeline selection logic, separate from CPU-based inference and core model operations
  • Matrix Operations Preserved: No changes to performance-critical matrix multiplication kernels, attention mechanisms, or quantization routines

Technical Analysis Results

Flame Graph Analysis:

  • Single-level execution profile for the degraded function confirms PLT stub behavior
  • 7.37ns execution time represents minimal, optimized constructor overhead
  • No recursive patterns or complex branching detected

CFG Comparison:

  • Identical control flow graphs between versions for the degraded function
  • Byte-for-byte identical assembly code in PLT resolution
  • Performance variance attributed to micro-architectural timing rather than code changes

Code Review Findings:

  • Crash Prevention: PR addresses critical Vulkan backend crash on Intel Arc GPUs and Mesa drivers
  • Hardware Compatibility: Enhanced capability detection for FP16 accumulation support
  • Defensive Programming: Improved error handling for unsupported GPU configurations

Critical Issues Identified

  • Assertion Risk: New GGML_ASSERT(false) may cause abrupt termination on unsupported hardware
  • Limited Hardware Validation: Changes tested primarily on Intel Arc 140V GPU configuration
  • Pipeline Assumption: Fallback logic assumes complete pipeline availability for certain hardware types

Actionable Steps

Priority 1: Risk Mitigation

  1. Replace Hard Assertions: Convert GGML_ASSERT(false) to a graceful CPU backend fallback; caller-side handling is sketched after this list

    // Recommended change in ggml-vulkan.cpp:5117
    if (!ctx->device->coopmat_acc_f16_support && !ctx->device->coopmat_acc_f32_support) {
        return nullptr; // Allow graceful fallback
    }
  2. Expand Hardware Testing: Validate changes on AMD and NVIDIA GPU configurations to prevent regressions
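
As referenced in item 1, a minimal, self-contained sketch of the caller-side fallback pattern follows; the function names and signatures are illustrative stand-ins, not the actual ggml-vulkan API.

    // Sketch of the graceful-fallback pattern; names are hypothetical.
    #include <cstdio>
    #include <memory>

    struct pipeline { /* compiled state elided */ };

    // Stand-in for pipeline selection: return nullptr instead of asserting
    // when neither accumulation mode has usable pipelines.
    static std::shared_ptr<pipeline> select_mul_mat_id_pipeline(bool f16acc_ok, bool f32acc_ok) {
        if (!f16acc_ok && !f32acc_ok) {
            return nullptr; // signal "unsupported" to the caller
        }
        return std::make_shared<pipeline>();
    }

    int main() {
        auto p = select_mul_mat_id_pipeline(false, false);
        if (!p) {
            // The caller declines the op (e.g. via the backend's supports_op
            // hook) so the scheduler can route it to the CPU backend instead
            // of aborting inside the Vulkan backend.
            std::puts("MUL_MAT_ID unsupported on this device; falling back to CPU");
        }
        return 0;
    }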

Priority 2: Monitoring and Validation

  1. Add Debug Logging: Implement pipeline selection decision logging for troubleshooting (a small sketch follows this list)
  2. Runtime Validation: Add pipeline integrity checks during device initialization
  3. Performance Benchmarking: Profile FP16 vs FP32 accumulation performance across hardware variants
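
A small sketch of what the selection logging from item 1 might look like; the helper name and message format are hypothetical.

    // Hypothetical logging helper for pipeline selection decisions.
    #include <cstdio>

    static void log_mul_mat_id_selection(int src0_type, bool f16acc, bool f32acc) {
        std::fprintf(stderr,
            "ggml_vulkan: mul_mat_id pipeline select: src0_type=%d f16acc=%d f32acc=%d\n",
            src0_type, f16acc, f32acc);
    }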

Priority 3: Code Quality Enhancement

  1. Capability Matrix: Implement a comprehensive hardware capability detection framework (see the sketch after this list)
  2. Error Recovery: Design robust fallback mechanisms for partial GPU support scenarios
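
One possible shape for the capability matrix from item 1, sketched with hypothetical names; a real implementation would populate it during device initialization from the pipelines that actually compiled.

    // Hypothetical capability matrix: record, per quantization type, which
    // accumulation modes have verified non-empty pipelines, and consult this
    // table at selection time instead of testing shared_ptr references.
    #include <array>
    #include <cstddef>

    enum acc_mode { ACC_F16, ACC_F32, ACC_MODE_COUNT };
    constexpr std::size_t TYPE_COUNT = 40; // stand-in for GGML_TYPE_COUNT

    struct device_capability_matrix {
        // Entries are set to true only after verifying that the corresponding
        // pipeline set contains at least one compiled variant.
        std::array<std::array<bool, ACC_MODE_COUNT>, TYPE_COUNT> mul_mat_id{};

        bool supports(std::size_t src0_type, acc_mode mode) const {
            return src0_type < TYPE_COUNT && mul_mat_id[src0_type][mode];
        }
    };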

Overall Assessment

Change Impact Evaluation

  • Positive Reliability Impact: Eliminates critical crashes on specific GPU configurations while maintaining performance
  • Minimal Performance Cost: Sub-nanosecond timing variations are within normal measurement noise
  • Targeted Scope: Changes are well-isolated to Vulkan backend without affecting core inference algorithms

Maintainability Considerations

  • Code Complexity: Moderate increase in conditional logic balanced by improved hardware compatibility
  • Technical Debt: Enhanced capability detection reduces future debugging overhead
  • Documentation Need: Hardware compatibility matrix documentation recommended

Future Performance Outlook

  • Stable Foundation: Core llama.cpp inference performance remains unaffected
  • Scalability Preserved: Matrix multiplication and attention mechanisms maintain optimization potential
  • Hardware Evolution Ready: Enhanced capability detection framework supports future GPU architectures

Recommendation: Approve PR with suggested assertion handling improvements. The reliability gains significantly outweigh the minimal complexity increase, and performance impact is negligible for the core inference pipeline.

@DajanaV force-pushed the main branch 2 times, most recently from 1983956 to 326a60a on October 29, 2025 12:13
@DajanaV added the dev-stale label Oct 30, 2025
@DajanaV deleted the branch main October 30, 2025 15:25
@DajanaV closed this Oct 30, 2025
@DajanaV deleted the upstream-PR16796-branch_rillomas-fix-accf16-capability-crash branch October 30, 2025 15:25