Skip to content

UPSTREAM PR #17535: refactor : use common download in tools/run#341

Open
loci-dev wants to merge 4 commits intomainfrom
upstream-PR17535-branch_angt-refactor-download
Open

UPSTREAM PR #17535: refactor : use common download in tools/run#341
loci-dev wants to merge 4 commits intomainfrom
upstream-PR17535-branch_angt-refactor-download

Conversation

@loci-dev
Copy link

Mirrored from ggml-org/llama.cpp#17535

  • update the common/download interface to be directly usable by tools/run (removing duplicated code).
  • fix ollama downloads by implementing manual redirect handling (addressing issues with cpp-httplib).

angt added 4 commits November 26, 2025 22:26
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
@loci-review
Copy link

loci-review bot commented Nov 27, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #341

Overview

PR #341 refactors download functionality in tools/run by removing 368 lines of duplicated HTTP code and delegating to the centralized common/download interface. The changes introduce manual redirect handling for Ollama downloads and modernize the header API from string-based to type-safe pairs.

Key Findings

Performance-Critical Functions Impact

Model Initialization Path:

  • LlamaData::resolve_model: Response time increased by 4,710,905 ns (4.7 ms), throughput decreased by 73,977 ns
  • LlamaData::initialize_model: Response time increased by 4,711,447 ns (4.7 ms), throughput increased by 108,066 ns

The 4.7 ms increase stems from new network operations: common_resolve_redirects() adds 1-5 HTTP GET probes for redirect resolution (100-400 ms typical), and resolve_endpoint() fetches manifests via common_get_hf_file() (100-500 ms). Additional validation and deeper call stacks contribute 50-100 ms. This overhead occurs once per model load at application startup, not during inference.

STL Container Accessors:
Multiple vector accessor functions show 60-226% throughput increases with 30-135 microsecond absolute changes. These appear related to build configuration rather than code logic changes, as the pattern is uniform across different container types in multiple binaries.

Inference Performance Impact

Tokens per Second: No Impact

The core inference functions (llama_decode, llama_encode, llama_tokenize) show no modifications in this PR. All performance changes are isolated to model loading and resolution logic executed during initialization. Since the reference shows 7% tokens/second reduction when llama_decode is 2 ms slower, and no inference-path functions are affected here, tokens per second remains unchanged.

Power Consumption Analysis

Binary-Level Impact:

  • build.bin.llama-run: +2.95% (+5,662 nJ) - driven by additional HTTP requests and extended execution time in model resolution
  • build.bin.llama-tts: +0.12% (+262 nJ) - minimal impact from STL accessor changes
  • build.bin.llama-cvector-generator: -0.02% (-53 nJ) - negligible variation
  • Core libraries (libggml, libllama): 0% change

The power increase in llama-run correlates with 3-5 additional HTTP requests during model resolution (1,000-2,000 nJ per request), matching the observed 5,662 nJ increase.

Code Changes Analysis

The refactoring replaces custom download implementations with shared library calls. The resolve_model() function changed from 556 lines with inline HTTP handling to 131 lines delegating to common_download_model() and common_docker_resolve_model(). New functions common_resolve_redirects() and common_set_clean_host_header() implement manual redirect handling to fix Ollama registry compatibility issues where automatic redirect following failed.

The performance regression is a one-time initialization cost, not affecting inference throughput. Subsequent runs benefit from file caching, bypassing the network overhead entirely.

@loci-dev loci-dev force-pushed the main branch 24 times, most recently from 9368c2d to 50d76f4 Compare December 1, 2025 09:13
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 3ba49e2 to 4ba0a8d Compare December 5, 2025 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants