UPSTREAM PR #17535: refactor : use common download in tools/run by loci-dev · Pull Request #341 · auroralabs-loci/llama.cpp

loci-dev · 2025-11-26T23:34:05Z

Mirrored from ggml-org/llama.cpp#17535

update the common/download interface to be directly usable by tools/run (removing duplicated code).
fix ollama downloads by implementing manual redirect handling (addressing issues with cpp-httplib).

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

loci-review · 2025-11-27T00:15:06Z

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #341

Overview

PR #341 refactors download functionality in tools/run by removing 368 lines of duplicated HTTP code and delegating to the centralized common/download interface. The changes introduce manual redirect handling for Ollama downloads and modernize the header API from string-based to type-safe pairs.

Key Findings

Performance-Critical Functions Impact

Model Initialization Path:

LlamaData::resolve_model: Response time increased by 4,710,905 ns (4.7 ms), throughput decreased by 73,977 ns
LlamaData::initialize_model: Response time increased by 4,711,447 ns (4.7 ms), throughput increased by 108,066 ns

The 4.7 ms increase stems from new network operations: common_resolve_redirects() adds 1-5 HTTP GET probes for redirect resolution (100-400 ms typical), and resolve_endpoint() fetches manifests via common_get_hf_file() (100-500 ms). Additional validation and deeper call stacks contribute 50-100 ms. This overhead occurs once per model load at application startup, not during inference.

STL Container Accessors:
Multiple vector accessor functions show 60-226% throughput increases with 30-135 microsecond absolute changes. These appear related to build configuration rather than code logic changes, as the pattern is uniform across different container types in multiple binaries.

Inference Performance Impact

Tokens per Second: No Impact

The core inference functions (llama_decode, llama_encode, llama_tokenize) show no modifications in this PR. All performance changes are isolated to model loading and resolution logic executed during initialization. Since the reference shows 7% tokens/second reduction when llama_decode is 2 ms slower, and no inference-path functions are affected here, tokens per second remains unchanged.

Power Consumption Analysis

Binary-Level Impact:

build.bin.llama-run: +2.95% (+5,662 nJ) - driven by additional HTTP requests and extended execution time in model resolution
build.bin.llama-tts: +0.12% (+262 nJ) - minimal impact from STL accessor changes
build.bin.llama-cvector-generator: -0.02% (-53 nJ) - negligible variation
Core libraries (libggml, libllama): 0% change

The power increase in llama-run correlates with 3-5 additional HTTP requests during model resolution (1,000-2,000 nJ per request), matching the observed 5,662 nJ increase.

Code Changes Analysis

The refactoring replaces custom download implementations with shared library calls. The resolve_model() function changed from 556 lines with inline HTTP handling to 131 lines delegating to common_download_model() and common_docker_resolve_model(). New functions common_resolve_redirects() and common_set_clean_host_header() implement manual redirect handling to fix Ollama registry compatibility issues where automatic redirect following failed.

The performance regression is a one-time initialization cost, not affecting inference throughput. Subsequent runs benefit from file caching, bypassing the network overhead entirely.

angt added 4 commits November 26, 2025 22:26

common : allow custom headers when downloading

925182f

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

common : allow custom oci params

bca24e0

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

tools : update run.cpp to use download.cpp

4579a64

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

common : fix redirects

ec02657

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

loci-dev temporarily deployed to PROD__AL_DEMO November 26, 2025 23:34 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 24 times, most recently from 9368c2d to 50d76f4 Compare December 1, 2025 09:13

loci-dev force-pushed the main branch 30 times, most recently from 3ba49e2 to 4ba0a8d Compare December 5, 2025 09:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #17535: refactor : use common download in tools/run#341

UPSTREAM PR #17535: refactor : use common download in tools/run#341
loci-dev wants to merge 4 commits intomainfrom
upstream-PR17535-branch_angt-refactor-download

loci-dev commented Nov 26, 2025

Uh oh!

loci-review bot commented Nov 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

loci-dev commented Nov 26, 2025

Uh oh!

loci-review bot commented Nov 27, 2025

Performance Analysis Summary: PR #341

Overview

Key Findings

Performance-Critical Functions Impact

Inference Performance Impact

Power Consumption Analysis

Code Changes Analysis

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants