@DajanaV commented on Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16941

Add a new model openPangu-Embedded-1/7B-V1.1.
You can get the model from the model path.
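
For reviewers who want a quick smoke test once the model has been converted to GGUF, here is a minimal sketch against the llama.h C API (not part of this PR; the file name is a placeholder, and the function names assume a recent llama.cpp — older releases use llama_load_model_from_file/llama_free_model instead):

```cpp
// Minimal smoke test: load the converted GGUF and print its description and
// parameter count. "openPangu-Embedded-7B-V1.1.gguf" is a placeholder path.
#include "llama.h"

#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params params = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("openPangu-Embedded-7B-V1.1.gguf", params);
    if (model == nullptr) {
        std::fprintf(stderr, "failed to load model\n");
        return 1;
    }

    char desc[256];
    llama_model_desc(model, desc, sizeof(desc));
    std::printf("model: %s, %llu parameters\n", desc,
                (unsigned long long) llama_model_n_params(model));

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```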

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Based on my analysis of the performance data and code changes, here's the comprehensive performance impact assessment:

Performance Analysis Summary

Critical Function Performance Changes

Primary Performance Degradation

  • Function: std::vector<std::pair<...>>::begin() (regex_automaton.h:873-874)
  • Response Time: 208% increase (85ns → 261ns)
  • Throughput: 282% degradation (62ns → 239ns self-time)
  • Root Cause: Assembly analysis reveals a missing stack canary load instruction and inefficient block splitting

Secondary Performance Impact

  • Function: std::vector<unsigned int>::back() (regex_automaton.h:1233-1237)
  • Bottleneck: 538% increase (32ns → 204ns)
  • Throughput: 266% degradation (70ns → 255ns); see the micro-benchmark sketch below for an independent cross-check of both accessor regressions
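
To sanity-check these two accessor regressions outside the profiler, a coarse std::chrono micro-benchmark sketch follows (my own harness, not from this PR; absolute numbers will not match the per-call figures above, only the relative difference between two builds is meaningful, and the asm barrier is GCC/Clang-specific):

```cpp
// Coarse micro-benchmark for vector<pair<...>>::begin() and vector<unsigned int>::back().
// Build it twice with the same flags as the two libllama.so builds and compare.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<uint32_t, uint32_t>> pairs(1024, {1, 2});
    std::vector<unsigned int> ids(1024, 7);

    constexpr int iters = 10'000'000;
    uint64_t sink = 0;  // accumulate results so the accessor calls are not elided

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        sink += pairs.begin()->first;   // exercises vector<pair<...>>::begin()
        sink += ids.back();             // exercises vector<unsigned int>::back()
        asm volatile("" : "+r"(sink));  // GCC/Clang barrier: keep the loop from being folded away
    }
    const auto t1 = std::chrono::steady_clock::now();

    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("avg %.2f ns per iteration (sink=%llu)\n", ns / iters, (unsigned long long) sink);
    return 0;
}
```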

KPI Impact Analysis

1. Tokens Per Second Impact

Status: No Direct Impact Expected

Analysis: The degraded functions are located in regex processing components, not in core inference functions:

  • llama_decode() - No performance changes detected
  • llama_encode() - No performance changes detected
  • llama_tokenize() - No performance changes detected

Conclusion: Based on the reference figure that a 2ms slower llama_decode() results in a 7% tokens/second reduction, the current regex-related degradation should not impact inference throughput, as these functions are not on the critical inference path.
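
As a back-of-the-envelope check on that reference figure (my own arithmetic, assuming the 2ms applies per decode call and throughput scales inversely with per-call latency), a 7% drop from a 2ms slowdown implies a baseline per-token decode time of roughly

$$\frac{1/(t + 2\,\mathrm{ms})}{1/t} = 0.93 \;\Rightarrow\; t = \frac{0.93 \times 2\,\mathrm{ms}}{0.07} \approx 26.6\,\mathrm{ms},$$

i.e. a baseline of about 37 tokens/s.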

2. Power Consumption Impact

Status: Minimal Impact

Affected Binary: build.bin.libllama.so

  • Power Consumption Change: +0.169% (280,662nJ → 281,136nJ)
  • Absolute Delta: +474nJ increase
  • Other Binaries: No measurable changes across the remaining 15 binaries

Root Cause: Increased CPU cycles from inefficient STL container operations in regex processing.

3. Quantization Efficiency

Status: No Impact

Analysis: No changes detected in quantization-related functions:

  • llama_model_quantize() - No performance changes
  • Quantization format handling - Unchanged
  • GGUF loading mechanisms - Stable performance

4. Memory Usage

Status: Potential Indirect Impact

Affected Areas:

  • Regex Container Operations: Degraded std::vector::begin() and back() operations suggest potential memory access inefficiencies
  • Stack Management: The missing stack canary load instruction indicates changed memory-protection code generation
  • Container Growth: Performance degradation may correlate with increased container sizes in tokenization patterns

No Direct Impact on core memory management functions:

  • llama_memory_clear() - No changes
  • llama_memory_seq_rm() - No changes
  • KV cache operations - Stable performance

5. Batch Processing

Status: No Impact

Analysis: Core batch processing functions show no performance degradation:

  • llama_batch_init() - No changes
  • llama_batch_get_one() - No changes
  • llama_decode() with batches - No changes

Root Cause Analysis

Assembly-Level Issues

  1. Missing Stack Canary Load: The stack canary load instruction present in the baseline build is absent in the current version (a minimal reproducer for comparing stack-protector codegen follows this list)
  2. Inefficient Block Splitting: Entry block unnecessarily divided, increasing execution overhead
  3. Compiler Optimization Regression: Suboptimal code generation affecting STL container performance
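
To verify item 1 independently, one option is a small reproducer compiled under both build configurations so the disassembly of the two binaries can be diffed. This is a sketch under assumed x86-64 Linux with GCC/Clang and -fstack-protector-strong; with the protector active, the prologue loads the canary from fs:0x28 and the epilogue branches to __stack_chk_fail on mismatch:

```cpp
// Minimal reproducer for comparing stack-protector code generation between builds.
// Disassemble touch_buffer() from both binaries and look for the canary load
// (fs:0x28 on x86-64 Linux) and the __stack_chk_fail check in the epilogue.
#include <cstring>

__attribute__((noinline)) void touch_buffer(const char * src) {
    char buf[64];                        // local array triggers -fstack-protector-strong
    std::strncpy(buf, src, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    volatile char sink = buf[0];         // keep buf live so it is not optimized out
    (void) sink;
}

int main() {
    touch_buffer("canary check");
    return 0;
}
```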

Code Changes Context

The performance degradation appears unrelated to PR #69 (the openPangu-Embedded model addition), suggesting:

  • Compiler optimization changes between builds
  • Build system modifications affecting STL container code generation
  • Separate changes not captured in the analyzed pull request

Action Items

Immediate Code-Level Actions

  1. Investigate Compiler Settings: Review optimization flags and STL container code generation between versions
  2. Assembly Code Analysis: Restore the missing stack canary load instruction in the std::vector::begin() implementation
  3. Block Structure Optimization: Eliminate unnecessary block splitting in entry sequences

Build System Actions

  1. Compiler Version Verification: Ensure consistent compiler versions and optimization settings
  2. STL Implementation Check: Verify STL library versions and container implementation consistency
  3. Debug Symbol Analysis: Compare debug information between versions to identify optimization differences

Performance Monitoring Focus

  1. Regex Processing Paths: Monitor tokenization performance for patterns using the regex automaton (a timing-probe sketch follows this list)
  2. Container Operation Profiling: Track STL container performance in text processing pipelines
  3. Memory Access Patterns: Analyze cache efficiency in container operations
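
For the first monitoring item, a hedged sketch of a tokenization timing probe is below. The GGUF path and the repeated mixed-script text are placeholders, and the calls assume the current llama.h where tokenization takes a llama_vocab obtained via llama_model_get_vocab; older releases pass the llama_model directly:

```cpp
// Timing probe for the tokenizer paths that exercise the regex automaton.
#include "llama.h"

#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    llama_backend_init();

    llama_model * model = llama_model_load_from_file("model.gguf", llama_model_default_params());
    if (model == nullptr) {
        return 1;
    }
    const llama_vocab * vocab = llama_model_get_vocab(model);

    // Punctuation-heavy, mixed-script text tends to stress the pre-tokenizer regexes.
    std::string text;
    for (int i = 0; i < 1000; ++i) {
        text += "héllo, wörld! 123#4.5@ 你好 ... ";
    }

    std::vector<llama_token> tokens(text.size() + 16);

    const auto t0 = std::chrono::steady_clock::now();
    const int32_t n = llama_tokenize(vocab, text.c_str(), (int32_t) text.size(),
                                     tokens.data(), (int32_t) tokens.size(),
                                     /*add_special=*/true, /*parse_special=*/false);
    const auto t1 = std::chrono::steady_clock::now();

    std::printf("tokens: %d, tokenize time: %.3f ms\n", n,
                std::chrono::duration<double, std::milli>(t1 - t0).count());

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```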

Conclusion

The identified performance regression is isolated to regex processing components and does not directly impact core inference performance metrics. The 0.169% power consumption increase in libllama.so represents minimal system impact. The primary concern is the missing stack canary instruction, which requires immediate attention for both performance and security reasons.

@DajanaV force-pushed the main branch 30 times, most recently from 6196a56 to 39290d7 on November 16, 2025 at 02:41