
Conversation

@DajanaV DajanaV (Contributor) commented Oct 28, 2025

Mirrored from ggml-org/llama.cpp#16828

Added Tensor Core support to the code from ggml-org/llama.cpp#16088 and modified it so that it gives the best results on Tensor Cores. The results below were measured on an RTX 2070 GPU.

FP16 Tensor Core performance

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     55 runs - 18401.09 us/run - 137.42 GFLOP/run -   7.47 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               28424 runs -    35.24 us/run - 133.69 MFLOP/run -   3.79 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               19899 runs -    50.62 us/run - 135.78 MFLOP/run -   2.68 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                122880 runs -     8.58 us/run - 642.82 kFLOP/run -  74.95 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                38288 runs -    28.19 us/run -  20.90 MFLOP/run - 741.40 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                57344 runs -    18.43 us/run -   2.78 MFLOP/run - 151.07 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8978 runs -   134.73 us/run -  22.28 MFLOP/run - 165.35 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               28611 runs -    34.96 us/run - 115.40 MFLOP/run -   3.30 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                4251 runs -   235.69 us/run - 923.24 MFLOP/run -   3.92 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     3465 runs -   293.17 us/run -   1.85 GFLOP/run -   6.31 TFLOPS
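
For reference, the FLOP/run figures above follow the standard convolution cost model: each output element takes Cin*KW*KH multiplies and Cin*KW*KH - 1 adds. A minimal host-side helper (illustrative only, not code from the PR) reproduces the numbers in the table:

    #include <cstdint>
    #include <cstdio>

    // FLOPs of a stride-1, unpadded, undilated CONV_2D as benchmarked above:
    // every output element costs Cin*KW*KH multiplies plus Cin*KW*KH - 1 adds.
    static int64_t conv2d_flops(int64_t W, int64_t H, int64_t Cin, int64_t N,
                                int64_t KW, int64_t KH, int64_t Cout) {
        const int64_t OW = W - KW + 1;
        const int64_t OH = H - KH + 1;
        return (2*Cin*KW*KH - 1) * Cout * OW * OH * N;
    }

    int main() {
        // ne_input=[19,19,256,16], ne_kernel=[4,4,256,4096] -> ~137.42 GFLOP/run
        printf("%.2f GFLOP\n", conv2d_flops(19, 19, 256, 16, 4, 4, 4096) / 1e9);
        // ne_input=[19,19,8,16],  ne_kernel=[4,4,8,128]     -> ~133.69 MFLOP/run
        printf("%.2f MFLOP\n", conv2d_flops(19, 19, 8, 16, 4, 4, 128) / 1e6);
        return 0;
    }

Dividing FLOP/run by the measured us/run yields the reported TFLOPS values.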

@etasnadi @Green-Sky @JohannesGaessler

@loci-agentic-ai-dev

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: llama.cpp PR #7 - CUDA Conv2D Tensor Core Implementation

Key Findings

Performance Degradations

  • Worst Response Time Degradation: _Vector_impl_data@plt (+0.066%, 7.37ns)
  • Worst Throughput Degradation: _Vector_impl_data@plt (+0.066%, 7.37ns)
  • Worst Bottleneck Degradation: _ZSt10_ConstructISt6vectorI21llama_grammar_elementSaIS1_EEJRKS3_EEvPT_DpOT0_ (+0.131%, 19.67ns)

Critical Assessment: These degradations are measurement artifacts rather than actual performance regressions. Analysis reveals:

  • Identical assembly code and CFG structure between versions
  • Changes confined to STL container operations and dynamic linking overhead
  • No impact on core llama.cpp inference functions

Core Function Impact Analysis

Based on the project structure analysis, the reported degradations affect:

  • Non-critical components: STL vector constructors and PLT stubs
  • No impact on core functions: Primary inference functions (llama_encode(), llama_decode(), llama_model_load_from_file()) remain unaffected
  • Peripheral systems: Grammar element vector construction in regex compilation context

Power Consumption Analysis

  • Overall Impact: Negligible power consumption change across all binaries
  • libllama.so: -0.0% change (303,377.74 nJ vs 303,379.20 nJ baseline)
  • Core computation libraries: No measurable power consumption change
  • Assessment: Changes are within measurement noise levels, indicating no significant algorithmic modifications

Technical Analysis Insights

Flame Graph Analysis:

  • Single-frame execution profile for degraded function (7ns total runtime)
  • No sub-function calls or complex branching
  • Represents optimal PLT stub implementation with minimal computational overhead

CFG Comparison:

  • Perfect structural match: Identical control flow graphs between versions
  • Byte-for-byte identical assembly: No code changes detected
  • Root cause: 0.066% degradation represents measurement precision limitations rather than actual performance changes

GitHub Code Review - PR #7 Critical Findings:

  • Major Addition: 373-line CUDA Tensor Core implementation for Conv2D operations
  • Performance Target: 3.79-7.47 TFLOPS on RTX 2070 (significant improvement)
  • Architecture: Replaces FP16 convolution path with hardware-optimized tensor core kernels

Overall Assessment

Change Impact Evaluation

Positive Aspects:

  • Significant Performance Gains: Tensor Core implementation delivers substantial throughput improvements for supported hardware
  • Hardware Optimization: Proper utilization of NVIDIA Tensor Cores for mixed-precision operations
  • Adaptive Selection: Runtime kernel variant selection optimizes resource utilization (a sketch follows this list)
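
As an illustration of the adaptive-selection point above (the PR's real heuristics are not reproduced here; all names and thresholds below are hypothetical), runtime variant selection usually reduces to deriving a launch configuration from the problem shape:

    #include <cstdint>

    // Hypothetical sketch of runtime kernel-variant selection; the heuristics
    // and thresholds in the actual PR may differ.
    struct conv2d_variant {
        int  tile_m;    // output rows per block (multiples of the 16x16 WMMA tile)
        int  tile_n;    // output columns per block
        bool use_wmma;  // whether the tensor-core path is expected to pay off
    };

    static conv2d_variant select_conv2d_variant(int64_t out_w, int64_t out_h,
                                                int64_t c_in, int64_t c_out) {
        conv2d_variant v;
        // Tensor cores need the implicit-GEMM dimensions to fill 16x16x16
        // fragments; tiny problems are better served by the scalar path.
        v.use_wmma = (c_in >= 16 && c_out >= 16 && out_w * out_h >= 16);
        // Bigger tiles amortize shared-memory traffic on large problems,
        // smaller tiles keep occupancy up on small ones.
        if (out_w * out_h * c_out >= (int64_t)1 << 20) { v.tile_m = 64; v.tile_n = 64; }
        else                                           { v.tile_m = 32; v.tile_n = 32; }
        return v;
    }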

Technical Quality:

  • Sophisticated Implementation: Demonstrates advanced CUDA programming with WMMA intrinsics and shared memory optimization (see the WMMA sketch after this list)
  • API Compatibility: Maintains existing function signatures and integration patterns
  • Performance Focus: Addresses critical performance bottlenecks in convolution operations
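
To make the WMMA point above concrete, here is a minimal, generic sketch of the pattern (not the PR's kernel; shared-memory staging and im2col indexing are omitted): one warp accumulates a 16x16 tile of the implicit GEMM from FP16 fragments into an FP32 accumulator.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes a single 16x16 tile of C = A * B, where A would be the
    // im2col patch matrix (FP16) and B the reshaped kernel matrix (FP16).
    // lda/ldb/ldc are row-major leading dimensions; K must be a multiple of 16.
    __global__ void wmma_tile_gemm(const half *A, const half *B, float *C,
                                   int K, int lda, int ldb, int ldc) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float>              c_frag;

        wmma::fill_fragment(c_frag, 0.0f);
        for (int k = 0; k < K; k += 16) {
            wmma::load_matrix_sync(a_frag, A + k, lda);        // 16x16 slice of A
            wmma::load_matrix_sync(b_frag, B + k * ldb, ldb);  // 16x16 slice of B
            wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // tensor-core MMA
        }
        wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);
    }

A production kernel additionally stages the A and B tiles through shared memory and lets many warps cooperate on a larger output block, which is presumably where most of the added complexity in the PR lies.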

Maintainability and Future Considerations

Maintainability Strengths:

  • Modular Design: New tensor core implementation isolated in separate files
  • Backward Compatibility: Maintains a fallback to the original implementation for F32 operations (a dispatch sketch follows below)
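
A hedged sketch of what such a dispatch can look like (all names here are hypothetical, not the actual ggml-cuda integration): the original F32 path stays the default, and the tensor-core path is taken only for F16 kernels on capable hardware.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hypothetical parameter bundle and entry points; illustrative only.
    struct conv2d_params { bool kernel_is_f16; /* shapes, pointers, strides... */ };

    static void conv2d_f16_tensor_core(const conv2d_params &) { puts("WMMA path"); }
    static void conv2d_f32_fallback   (const conv2d_params &) { puts("F32 fallback"); }

    static bool device_has_tensor_cores(int device) {
        cudaDeviceProp prop{};
        if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
        return prop.major >= 7;  // FP16 tensor cores first shipped with Volta (SM 7.0)
    }

    void conv2d_dispatch(const conv2d_params &p, int device) {
        if (p.kernel_is_f16 && device_has_tensor_cores(device)) {
            conv2d_f16_tensor_core(p);  // new tensor-core path (F16 kernels only)
        } else {
            conv2d_f32_fallback(p);     // original implementation remains the default
        }
    }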

Areas Requiring Attention:

  • Hardware Dependency: Requires Tensor Core capable GPUs (Volta/Turing/Ampere+)
  • Code Complexity: 373 lines of highly optimized CUDA code increases maintenance overhead
  • Documentation Gap: Limited inline documentation for complex kernel logic

Future Performance Considerations:

  • Scalability: Implementation should handle diverse problem sizes efficiently
  • Hardware Evolution: Code structure supports future tensor core architecture improvements
  • Memory Optimization: Shared memory usage patterns may require tuning for future GPU generations

Final Verdict

The reported performance degradations are false positives caused by measurement precision limitations. The actual changes in PR #7 represent a significant performance enhancement for CUDA-enabled convolution operations. The implementation demonstrates high technical quality with appropriate hardware optimization strategies.

Recommendation: Proceed with PR #7 integration, focusing on validation of tensor core performance improvements rather than investigating the reported PLT stub degradations, which represent measurement noise rather than actual performance issues.

@DajanaV DajanaV force-pushed the main branch 2 times, most recently from 1983956 to 326a60a on October 29, 2025 12:13
@DajanaV DajanaV added the dev-stale Stale dev environment — dashboard not accessible label Oct 30, 2025
@DajanaV DajanaV deleted the branch main October 30, 2025 15:25
@DajanaV DajanaV closed this Oct 30, 2025
@DajanaV DajanaV deleted the upstream-PR16828-branch_mnehete32-conv2d_tensor_core branch October 30, 2025 15:26