Skip to content

UPSTREAM PR #17602: common : add minimalist multi-thread progress bar#368

Open
loci-dev wants to merge 1 commit intomainfrom
upstream-PR17602-branch_angt-common-add-minimalist-multi-thread-progress-bar
Open

UPSTREAM PR #17602: common : add minimalist multi-thread progress bar#368
loci-dev wants to merge 1 commit intomainfrom
upstream-PR17602-branch_angt-common-add-minimalist-multi-thread-progress-bar

Conversation

@loci-dev
Copy link

Mirrored from ggml-org/llama.cpp#17602

I intentionally kept the bar simple without specifying part numbers (which ultimately don't matter much) the only thing we care about is tracking progress

@loci-review
Copy link

loci-review bot commented Nov 29, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #368

Overview

PR #368 introduces multi-threaded progress bar functionality to common/download.cpp, adding mutex-based synchronization and ANSI terminal sequences for concurrent download progress display. The modification affects the print_progress function, which is not part of the inference pipeline.

Key Findings

Impacted Function:

  • print_progress: Response time increased by 10508 ns (604 ns → 11113 ns) in llama-cvector-generator and 10433 ns (608 ns → 11041 ns) in llama-tts. Throughput increased by 301 ns and 275 ns respectively.

Code Changes:
The implementation adds thread-safe progress tracking using std::mutex and std::map<std::thread::id, int> for line assignment, plus ANSI escape sequences for cursor positioning. The response time increase stems from mutex acquisition (20-50 ns), map lookup operations (30-50 ns), and console I/O for ANSI sequences (5-8 microseconds across multiple std::cout calls).

Inference Impact:
No impact on tokens per second. The print_progress function operates during model loading and downloading operations, not during inference execution. Functions responsible for tokenization and inference (llama_decode, llama_encode, llama_tokenize) remain unmodified. The performance change is isolated to progress reporting, which occurs outside the token generation pipeline.

Power Consumption:

  • llama-tts: 0.321% increase (720 nJ total)
  • llama-cvector-generator: 0.159% increase (350 nJ total)

The power increase reflects the cumulative throughput changes in progress reporting functions. Since progress updates occur infrequently during downloads rather than continuously during inference, the total energy impact per operation remains minimal.

Context:
The 18x response time increase is confined to user-facing progress display during file operations. The added synchronization overhead enables clean multi-threaded progress bars without affecting model inference performance or token generation rates.

@loci-dev loci-dev force-pushed the main branch 5 times, most recently from 1854a53 to 1b177fe Compare November 30, 2025 15:08
@loci-dev loci-dev force-pushed the upstream-PR17602-branch_angt-common-add-minimalist-multi-thread-progress-bar branch from 3f49035 to 09717b6 Compare November 30, 2025 18:40
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
@loci-review
Copy link

loci-review bot commented Nov 30, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #368

Overview

PR #368 introduces a multi-threaded progress bar for download operations in common/download.cpp. The changes are isolated to the HTTPLIB download path and do not modify any performance-critical inference components.

Analysis Scope

Modified Components:

  • Single file: common/download.cpp (+71 lines, -25 lines)
  • Affected path: Download utility (non-inference)
  • Core inference functions: No changes

Performance Metrics Context:
The observed performance regressions in previous analyses (std::map operations showing 67-210% throughput increases, std::mutex overhead) correlate with this PR's introduction of std::map<const ProgressBar*, int> and std::mutex for progress tracking. However, these changes occur exclusively in the download code path, which is not part of the inference pipeline.

Key Findings

Impact on Inference Performance

Tokens per Second: No Impact

The PR does not modify any tokenization or inference functions:

  • llama_decode - Not modified
  • llama_encode - Not modified
  • llama_tokenize - Not modified
  • llama_detokenize - Not modified

All changes are confined to common_pull_file() and related download utilities. These functions execute during model download operations, not during inference. Therefore, tokens per second remains unchanged regardless of the reference metric (7% reduction per 2 ms slowdown in llama_decode).

Affected Functions Analysis

Download Path Functions:

The ProgressBar::update() method introduces overhead in the download path:

  • Mutex acquisition: 20-50 ns per update (uncontended)
  • Map lookup: 60 ns per update (increased from 36 ns baseline)
  • Total per-update overhead: 100-200 ns

With throttling limiting updates to approximately 1000 per download, the cumulative overhead is 100-200 microseconds per file download. This occurs only during model acquisition, not during inference execution.

Power Consumption Analysis

Affected Binaries:

From previous power analysis, two binaries showed measurable increases:

  • llama-tts: +846 nJ (+0.377%)
  • llama-cvector-generator: +496 nJ (+0.225%)

These increases correlate with STL container operations (std::map, std::mutex) introduced in this PR. However, the power impact manifests only during download operations. The inference binaries (libllama.so, llama-run) show no measurable power consumption change (0.0%), confirming that inference performance remains unaffected.

Code Change Characterization

The PR transforms stateless progress functions into a stateful ProgressBar class with:

  • Static mutex for thread synchronization
  • Static map tracking concurrent progress bars by instance pointer
  • RAII-based cleanup ensuring proper resource management
  • ANSI escape sequences for multi-line terminal output

This architectural change enables concurrent downloads with independent progress tracking. The implementation is thread-safe, exception-safe, and properly handles cleanup through destructors.

Conclusion

PR #368 successfully implements multi-threaded progress tracking for downloads without impacting inference performance. The observed performance regressions in std::map and std::mutex operations are confined to the download utility path. Tokens per second, model loading performance, and inference execution remain unchanged. The power consumption increases in utility binaries reflect download-time overhead only and do not affect inference workloads.

@loci-dev loci-dev force-pushed the upstream-PR17602-branch_angt-common-add-minimalist-multi-thread-progress-bar branch from 09717b6 to 4387ab2 Compare November 30, 2025 20:36
@loci-review
Copy link

loci-review bot commented Nov 30, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #368

Overview

PR #368 introduces a multi-threaded progress bar for download operations in common/download.cpp. The change refactors a simple function-based progress display into a class-based implementation with thread-safety mechanisms. Analysis shows no impact on inference performance or tokens per second, as the modifications are isolated to the download module.

Scope Assessment

Modified Components:

  • 1 file: common/download.cpp
  • 0 core inference functions affected
  • Changes limited to download progress display logic

Performance-Critical Areas Impact:

  • Model Processing Module: No changes
  • Token Processing Module: No changes
  • Memory Management Module: No changes
  • Batch Processing Module: No changes
  • Inference functions (llama_decode, llama_encode, llama_tokenize): No changes

Condition: This analysis falls under Condition 1 - no changes in performance metrics for inference operations.

Analysis Results

Function-Level Changes:
The 10 functions with highest response time changes are all STL container operations (std::_Rb_tree::begin, std::vector::_S_max_size, std::_Hashtable::begin) in binaries llama-cvector-generator and llama-tts. These changes are unrelated to PR #368 and represent compiler optimization differences between build versions.

Code Changes in PR #368:

  • Replaced static print_progress() function with ProgressBar class
  • Added mutex and map for thread-safe multi-line progress tracking
  • Introduced ANSI escape sequences for cursor management
  • Changes affect only download display, not model inference

Inference Performance Impact:
No functions in the inference pipeline were modified. The functions llama_decode, llama_encode, and llama_tokenize show no response time or throughput changes. Tokens per second remains unaffected.

Power Consumption:
Two binaries show minimal power increases: llama-tts (+846 nJ, +0.377%) and llama-cvector-generator (+496 nJ, +0.225%). These changes are attributed to STL container operations, not the progress bar implementation. The download module is not invoked during inference.

Conclusion:
PR #368 has zero impact on inference performance or tokens per second. All observed performance variations are in non-inference binaries and unrelated to the progress bar changes.

@loci-dev loci-dev force-pushed the main branch 14 times, most recently from 56f593b to eb7b6bf Compare December 2, 2025 10:10
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from df48f9e to cb46586 Compare December 6, 2025 12:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants