
Conversation

loci-dev commented on Dec 1, 2025

Mirrored from ggml-org/llama.cpp#17632

Replace the hardcoded token length (8 bytes) in common_sampler_prev_str() with a dynamic computation based on the actual vocabulary.

Problem

common/sampling.cpp assumes 8 bytes per token when pre-allocating memory:

result.reserve(n * 8);  // Hardcoded assumption

This is inaccurate for many models: some average 6 bytes per token (Phi-3), others 10 bytes (Command-R).

Solution

Sample up to 1000 tokens from the vocabulary to compute the average token length, and cache the result in a static variable so there is no runtime overhead after the initial computation.
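
A minimal sketch of such a helper is shown below. It assumes the current llama.cpp C API (llama_vocab_n_tokens() and a llama_token_to_piece() that takes a llama_vocab pointer); the helper name matches the implementation notes later in this description, but the body is illustrative rather than the exact patch:

#include <algorithm>
#include <cstddef>
#include <cstdint>

#include "llama.h"

// Illustrative sketch: average detokenized piece length, sampled evenly across
// the vocabulary (at most 1000 tokens), with a fallback to the old value of 8.
static size_t compute_avg_token_length(const llama_vocab * vocab) {
    const int32_t n_vocab = llama_vocab_n_tokens(vocab);
    if (n_vocab <= 0) {
        return 8; // fall back to the previous hardcoded assumption
    }

    const int32_t n_sample = std::min<int32_t>(n_vocab, 1000);
    const int32_t step     = std::max<int32_t>(n_vocab / n_sample, 1);

    size_t  total = 0;
    int32_t count = 0;
    for (int32_t id = 0; id < n_vocab && count < n_sample; id += step, ++count) {
        char buf[256];
        const int32_t len = llama_token_to_piece(vocab, id, buf, (int32_t) sizeof(buf), 0, true);
        if (len > 0) {
            total += (size_t) len;
        }
    }

    return (count > 0 && total > 0) ? total / count : 8;
}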

Testing

Tested with multiple vocabularies:

Model        Vocab Size   Avg Length   Memory Impact
LLaMA SPM    32K          6 bytes      -25% (saves memory)
Command-R    256K         10 bytes     +25% (prevents realloc)
DeepSeek     32K          7 bytes      -12.5% (saves memory)
Phi-3        32K          6 bytes      -25% (saves memory)

Builds cleanly; no performance regression.

Replace hardcoded token length (8) with dynamic computation based on
actual vocabulary. Sample up to 1000 tokens to determine average length,
cache result in static variable for one-time cost.

Implementation:
- Add compute_avg_token_length() helper function
- Sample evenly across vocabulary (max 1000 tokens or 10%)
- Use static caching to compute only once
- Fallback to 8 if computation fails

Benefits:
- Adapts to any vocabulary automatically
- Improves memory allocation accuracy (±25% depending on model)
- No runtime overhead after initial computation
- Backward compatible with existing models
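
Putting the notes above together, the call site in common_sampler_prev_str() could look roughly like this. This is a sketch, not the verbatim patch: the lambda-based static initialization and the elided body are assumptions, and compute_avg_token_length() is the hypothetical helper sketched earlier.

#include <string>

#include "llama.h"
#include "sampling.h"

// Sketch of the modified call site: the average token length is computed on
// the first call, cached in a function-local static, and defaults to 8 if the
// vocabulary cannot be queried. Note that the cached value is derived from the
// vocabulary of the first context seen.
std::string common_sampler_prev_str(common_sampler * gsmpl, llama_context * ctx_main, int n) {
    static const size_t avg_len = [&]() -> size_t {
        const llama_model * model = llama_get_model(ctx_main);
        const llama_vocab * vocab = model ? llama_model_get_vocab(model) : nullptr;
        return vocab ? compute_avg_token_length(vocab) : 8; // fallback to 8
    }();

    std::string result;
    result.reserve(n * avg_len); // previously: result.reserve(n * 8);

    // ... append the last n tokens from gsmpl's history, as before ...

    return result;
}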
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #382

Analysis Scope: Single file modification (common/sampling.cpp) affecting the common_sampler_prev_str function across 7 binaries.

Overview

PR #382 replaces a hardcoded 8-byte token length assumption with dynamic vocabulary sampling to compute average token length. The change adds vocabulary retrieval and sampling logic that executes on every function call, introducing 440 ns average overhead per invocation.

Key Findings

Most-Impacted Function:

common_sampler_prev_str shows a consistent degradation in throughput time across all affected binaries:

  • llama-cvector-generator: +441 ns throughput time (342 ns → 783 ns)
  • llama-tts: +442 ns throughput time (353 ns → 795 ns)
  • llama-run: +431 ns throughput time (345 ns → 776 ns)
  • llama-bench: +420 ns throughput time (345 ns → 765 ns)
  • llama-tokenize: +425 ns throughput time (353 ns → 778 ns)
  • llama-quantize: +406 ns throughput time (342 ns → 748 ns)
  • llama-gguf-split: +400 ns throughput time (357 ns → 757 ns)

The overhead stems from three additional API calls (llama_get_model, llama_model_get_vocab, llama_vocab_n_tokens) executed on every invocation, plus disruption of compiler optimizations that prevents the function from being inlined. CFG analysis shows the code path expanding from 80 to 140 instructions, crossing instruction cache boundaries and increasing the number of memory accesses.
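
For reference, the three lookups named above chain from the context to the model to the vocabulary. The following schematic uses a hypothetical wrapper function purely to make that call sequence concrete:

#include <cstdint>

#include "llama.h"

// Hypothetical wrapper illustrating the context -> model -> vocabulary chain
// that, per the analysis, runs on each call of common_sampler_prev_str().
static int32_t vocab_size_of(const llama_context * ctx_main) {
    const llama_model * model = llama_get_model(ctx_main);     // context -> model
    const llama_vocab * vocab = llama_model_get_vocab(model);  // model -> vocabulary
    return llama_vocab_n_tokens(vocab);                        // vocabulary size
}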

Inference Impact:

The modified function is not in the inference critical path. Functions llama_decode, llama_encode, and llama_tokenize show no changes in response time or throughput. Token generation performance remains unaffected as common_sampler_prev_str handles display and logging of token history, not token processing itself. Tokens per second metrics are unchanged.

Power Consumption:

Minor increases observed across affected binaries:

  • llama-gguf-split: +83 nJ (+0.267%)
  • llama-tokenize: +70 nJ (+0.235%)
  • llama-quantize: +78 nJ (+0.232%)
  • llama-bench: +107 nJ (+0.228%)
  • llama-cvector-generator: +122 nJ (+0.055%)
  • llama-tts: +51 nJ (+0.023%)

The power increases correlate with throughput time increases but remain under 0.3% across all binaries.

loci-dev force-pushed the main branch 27 times, most recently from 6eae205 to 565a9d5 on December 3, 2025 at 12:15
loci-dev force-pushed the main branch 30 times, most recently from e8163f9 to 10ad295 on December 7, 2025 at 13:12