perf(core): Estimate output tokens via sampling with CV-based fallback#1397
Conversation
📝 Walkthrough

Introduced sampling-based token estimation for very large content outputs that attempts extrapolation before falling back to full tokenization when sampling variance exceeds thresholds. Refactored tokenization logic into separate helper functions and removed the prior parallel-execution decision from main control flow.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
⚡ Performance Benchmark
History:

e27732b perf(core): Raise sampling threshold from 500KB to 3MB
23923aa perf(core): Estimate output tokens via sampling with CV-based fallback
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1397      +/-   ##
==========================================
+ Coverage   87.40%   87.47%   +0.06%
==========================================
  Files         116      116
  Lines        4392     4431      +39
  Branches     1018     1026       +8
==========================================
+ Hits         3839     3876      +37
- Misses        553      555       +2
```

☔ View full report in Codecov by Sentry.
Code Review
This pull request introduces a sampling-based estimation mechanism for calculating token counts in large outputs, aimed at improving performance by avoiding full tokenization when content is sufficiently uniform. It includes a coefficient of variation (CV) check to detect heterogeneous content and fall back to full tokenization when necessary. Feedback indicates that the current sampling threshold of 500KB is too low relative to the parallel processing threshold, potentially causing overhead and redundant processing for files in the 500KB to 1MB range; increasing this threshold to 2MB is recommended.
```ts
// Sampling constants for token count estimation on large outputs.
// Instead of full BPE tokenization, we sample evenly spaced portions and extrapolate.
const OUTPUT_SAMPLING_THRESHOLD = 500_000; // 500KB - outputs below this are fully tokenized
```
The `OUTPUT_SAMPLING_THRESHOLD` of 500KB is lower than the `MIN_CONTENT_LENGTH_FOR_PARALLEL` threshold (1MB). This introduces several inefficiencies for files in the 500KB–1MB range:

- **Increased Overhead:** Files in this range were previously processed in a single task. Now, they trigger `tryEstimateBySampling`, which launches up to 10 parallel tasks. The overhead of worker communication for 10 small tasks likely outweighs any benefit, especially since the total amount of data tokenized remains the same or even increases due to overlaps.
- **Overlapping Samples:** Since `OUTPUT_SAMPLE_COUNT * OUTPUT_SAMPLE_SIZE` is 1MB, any file smaller than 1MB will have overlapping samples (because the `stride` will be less than the sample size). This results in redundant tokenization of the same characters across different worker tasks.
- **Inaccuracy Risk:** For files where sampling doesn't actually reduce the workload (e.g., 600KB), you are introducing potential estimation error for no performance gain.

Consider increasing `OUTPUT_SAMPLING_THRESHOLD` to a value where sampling provides a significant reduction in work, such as 2MB.
```diff
- const OUTPUT_SAMPLING_THRESHOLD = 500_000; // 500KB - outputs below this are fully tokenized
+ const OUTPUT_SAMPLING_THRESHOLD = 2_000_000; // 2MB - outputs below this are fully tokenized
```
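The overlap claim above can be checked with plain arithmetic; the constants mirror the ones quoted in the comment, and the 600KB file size is an assumed example:

```typescript
// Overlap check for sub-1MB files (constants from the comment above).
const SAMPLE_COUNT = 10;
const SAMPLE_SIZE = 100_000; // 100KB per sample

const fileLength = 600_000; // an example 600KB file in the problematic range
const stride = Math.floor(fileLength / SAMPLE_COUNT); // 60_000 < SAMPLE_SIZE
const overlap = Math.max(0, SAMPLE_SIZE - stride); // 40_000 chars shared per adjacent pair
const totalSampled = SAMPLE_COUNT * SAMPLE_SIZE; // 1_000_000: more than the file itself

console.log({ stride, overlap, totalSampled });
```

Since the total sampled volume (1MB) exceeds the file itself, sampling here tokenizes more characters than full tokenization would.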
Deploying repomix with Cloudflare Pages.

- Latest commit: 309fbf2
- Status: ✅ Deploy successful!
- Preview URL: https://993ae52b.repomix.pages.dev
- Branch Preview URL: https://perf-sampling-estimation-wit.repomix.pages.dev
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/core/metrics/calculateOutputMetrics.test.ts (1)
Lines 192-272: Please add a low-CV aliasing regression test. These cases cover uniform data and intentionally noisy samples, but they do not exercise the deterministic aliasing case where every 100KB sample lands on the same phase and CV still looks healthy. An alternating-band fixture would lock in the fix for that failure mode.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/metrics/calculateOutputMetrics.test.ts` around lines 192 - 272, Add a new test that reproduces the low-CV aliasing regression by constructing content made of repeating alternating-band blocks (e.g., 100KB block A then 100KB block B repeated) so every 100KB sample lands on the same phase and produces an artificially low CV; call calculateOutputMetrics with a mock TaskRunner (mocking TokenCountTask run and using the task.path sample naming that includes '-sample-') where sample runs return a consistent token/char ratio (so sampling appears stable) but full-chunk runs return a different ratio, then assert that the function detects aliasing and falls back to full tokenization (i.e., the mock runCallCount should be >10 and the final result should reflect the full-chunk ratio). Ensure the test references calculateOutputMetrics, TokenCountTask, and uses the same sample path pattern ('-sample-') so it targets the sampling logic.
ℹ️ Review info

⚙️ Run configuration
- Configuration used: Path: .coderabbit.yaml
- Review profile: CHILL
- Plan: Pro
- Run ID: 5d9c24ee-a764-4e26-9d42-e36f127b01f8

📒 Files selected for processing (2)
- src/core/metrics/calculateOutputMetrics.ts
- tests/core/metrics/calculateOutputMetrics.test.ts
```ts
if (content.length > OUTPUT_SAMPLING_THRESHOLD) {
  // For large outputs, try sampling estimation first
  const estimated = await tryEstimateBySampling(content, encoding, path, deps);
  if (estimated !== null) {
    result = estimated;
  } else {
    // Sampling variance too high, fall back to full tokenization
    result = await fullTokenize(content, encoding, path, deps);
  }
}
```
Surface when this count is estimated.
This branch now returns an extrapolated value through the same `Promise<number>` contract, but `src/core/metrics/calculateMetrics.ts` still reduces these numbers into `totalTokens` as if they were exact. That silently changes the semantics of large-output metrics for every downstream caller. Please propagate estimate metadata or keep the exact path on the public result.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/core/metrics/calculateOutputMetrics.ts` around lines 34 - 42, The branch
in calculateOutputMetrics.ts that returns an extrapolated numeric value from
tryEstimateBySampling hides that the count is estimated; change the public
contract returned by calculateOutputMetrics (and values produced by
tryEstimateBySampling and fullTokenize) from a bare number to a small result
object such as { count: number, estimated: boolean } (or add an explicit
isEstimated flag) so callers can distinguish exact vs extrapolated counts,
update calculateMetrics.ts to aggregate using .count and to track/propagate
estimation metadata, and ensure fullTokenize returns estimated=false while
tryEstimateBySampling returns estimated=true when extrapolating.
```ts
const stride = Math.floor(content.length / sampleCount);

const sampleResults = await Promise.all(
  Array.from({ length: sampleCount }, (_, i) => {
    const start = i * stride;
    const sampleContent = content.slice(start, start + OUTPUT_SAMPLE_SIZE);
    return deps.taskRunner
      .run({
        content: sampleContent,
        encoding,
        path: path ? `${path}-sample-${i}` : undefined,
      })
      .then((tokens) => ({ chars: sampleContent.length, tokens }));
  }),
);

const validSamples = sampleResults.filter((s) => s.tokens > 0 && s.chars > 0);
if (validSamples.length < 2) {
  return null;
}

// Compute per-sample chars/token ratios and check coefficient of variation (CV = stddev / mean).
// High CV indicates the content is heterogeneous (e.g. mixed CJK/ASCII, or periodic structure
// resonating with the sample stride), making the extrapolation unreliable.
const ratios = validSamples.map((s) => s.chars / s.tokens);
const mean = ratios.reduce((sum, r) => sum + r, 0) / ratios.length;
const variance = ratios.reduce((sum, r) => sum + (r - mean) ** 2, 0) / ratios.length;
const cv = Math.sqrt(variance) / mean;

if (cv > SAMPLING_CV_THRESHOLD) {
  logger.trace(
    `Sampling CV ${cv.toFixed(3)} exceeds threshold ${SAMPLING_CV_THRESHOLD}, falling back to full tokenization`,
  );
  return null;
}

// Extrapolate total token count from the overall sample ratio
const totalSampleTokens = validSamples.reduce((sum, s) => sum + s.tokens, 0);
const totalSampleChars = validSamples.reduce((sum, s) => sum + s.chars, 0);
const estimated = Math.round((content.length / totalSampleChars) * totalSampleTokens);

logger.trace(
  `Estimated output tokens from ${validSamples.length} samples: ${estimated} (CV=${cv.toFixed(3)}, ${(totalSampleChars / totalSampleTokens).toFixed(2)} chars/token)`,
);
```
The current sample placement can miss the exact resonance case the CV gate is supposed to catch.
`start = i * stride` always samples the leading edge of each band and can leave the tail unsampled. On alternating 100KB ASCII/CJK blocks, every sample can land on the same phase, CV stays near zero, and the estimate is still badly wrong. Please distribute starts across the full `[0, content.length - OUTPUT_SAMPLE_SIZE]` range—ideally centered or jittered per bucket—before relying on CV as the safety check.
🩹 One way to de-bias the sample starts

```diff
- const stride = Math.floor(content.length / sampleCount);
+ const lastStart = Math.max(0, content.length - OUTPUT_SAMPLE_SIZE);
+ const bucketSize = content.length / sampleCount;
  const sampleResults = await Promise.all(
    Array.from({ length: sampleCount }, (_, i) => {
-     const start = i * stride;
+     const rawStart =
+       i === 0
+         ? 0
+         : i === sampleCount - 1
+           ? lastStart
+           : (i + 0.5) * bucketSize - OUTPUT_SAMPLE_SIZE / 2;
+     const start = Math.max(0, Math.min(lastStart, Math.round(rawStart)));
      const sampleContent = content.slice(start, start + OUTPUT_SAMPLE_SIZE);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/core/metrics/calculateOutputMetrics.ts` around lines 106 - 149, The
sampling start calculation using start = i * stride biases samples to each band
edge and can miss phase/resonance patterns; change the sampling in the
Array.from mapping (where stride, start, and sampleContent are computed inside
the sampleResults creation) to distribute starts across the full [0,
content.length - OUTPUT_SAMPLE_SIZE] range instead of using i * stride — e.g.,
compute maxStart = Math.max(0, content.length - OUTPUT_SAMPLE_SIZE) and set
start = Math.floor(((i + 0.5) / sampleCount) * maxStart) (or add a small
per-bucket jitter) to center/jitter each bucket, keeping the rest of the logic
(validSamples, ratios, cv) unchanged and ensuring start is clamped to valid
bounds.
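The resonance failure mode described in this comment can be reproduced with a toy fixture. The tokenizer below is a deliberately artificial assumption ('a' averaging 4 chars/token stands in for cheap ASCII, 'b' at 1 char/token for expensive CJK-like text); with stride-aligned starts, every sample lands on an 'a' band, CV comes out exactly zero, and the estimate is 2.5x too low:

```typescript
// Toy demonstration of stride/band resonance defeating the CV check.
const BAND = 100_000;
const parts: string[] = [];
for (let i = 0; i < 20; i++) parts.push((i % 2 === 0 ? "a" : "b").repeat(BAND));
const content = parts.join(""); // 2MB of alternating 100KB 'a'/'b' bands

// Assumed toy tokenizer: 'a' costs 1/4 token, anything else 1 token (not a real BPE).
const countTokens = (s: string): number => {
  let a = 0;
  for (let i = 0; i < s.length; i++) if (s[i] === "a") a++;
  return Math.ceil(a / 4) + (s.length - a);
};

const sampleCount = 10;
const sampleSize = 100_000;
const stride = Math.floor(content.length / sampleCount); // 200_000: resonates with the bands
const ratios = Array.from({ length: sampleCount }, (_, i) => {
  const sample = content.slice(i * stride, i * stride + sampleSize);
  return sample.length / countTokens(sample); // chars per token
});
const mean = ratios.reduce((s, r) => s + r, 0) / ratios.length;
const variance = ratios.reduce((s, r) => s + (r - mean) ** 2, 0) / ratios.length;
const cv = Math.sqrt(variance) / mean; // 0: every sample saw only 'a' chars

const estimated = Math.round(content.length / mean); // extrapolated count
const exactCount = countTokens(content); // true count
console.log({ cv, estimated, exactCount });
```

Centered or jittered starts, as in the diff suggested above, would put samples across both bands, inflate the CV past the threshold, and correctly trigger the full-tokenization fallback.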
Code Review

Overall this is a well-structured PR with clean code and good separation of concerns.

Key Findings

1. Wasted work on CV fallback (Medium)

When the CV check fails, all 10 sample tokenizations are discarded and the full content is tokenized anyway, so the sampling pass is pure overhead on that path. Consider an early-exit strategy: run 2–3 pilot samples first, check CV, and only commit to all 10 if the pilot looks stable. This would cap the wasted work to 2–3 calls instead of 10.

2. Benchmark results need validation at 3MB threshold (Medium)

The benchmark on commit 1 (500KB threshold) showed regressions on Ubuntu (+6.6%) and macOS (+8.5%). The threshold was raised to 3MB to escape this regression, but there are no benchmark results confirming the 3MB threshold actually improves performance for real-world large outputs. The repomix self-pack benchmark likely produces output well under 3MB, meaning the optimization never fires in the benchmark. It would be valuable to benchmark against a genuinely large mono-repo output (>3MB) to validate the win.

3. Last segment is never sampled (Low)

With `start = i * stride`, the tail of the content beyond the final sample window is never examined. Anchoring the last sample to `content.length - OUTPUT_SAMPLE_SIZE` would cover it.

4. Test assertion could be tighter (Low)

In the "should fall back to full tokenization when sampling CV is too high" test,

5. Missing test for

Codecov flags 3 uncovered lines. The

What Looks Good

Summary

The approach is sound for genuinely large outputs. The main concern is validating that the 3MB threshold actually produces measurable speedups for real-world large repos, and mitigating the wasted-work cost on the CV fallback path. The code quality itself is high.

🤖 Generated with Claude Code
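The pilot-sample early exit from finding 1 could be sketched as follows; the helper name and the synchronous tokenizer callback are assumptions for illustration, not code from this PR:

```typescript
// Sketch: tokenize a small pilot set first; only commit to the full sample
// set when the pilot's coefficient of variation looks stable.
function pilotLooksStable(
  samples: string[],
  countTokens: (text: string) => number, // stand-in for real tokenization
  pilotSize = 3,
  cvThreshold = 0.15,
): boolean {
  const pilot = samples.slice(0, pilotSize);
  const ratios = pilot.map((s) => s.length / countTokens(s)); // chars per token
  const mean = ratios.reduce((sum, r) => sum + r, 0) / ratios.length;
  const variance = ratios.reduce((sum, r) => sum + (r - mean) ** 2, 0) / ratios.length;
  return Math.sqrt(variance) / mean <= cvThreshold;
}

// If the pilot fails, fall back to full tokenization immediately:
// only 2-3 sample calls are wasted instead of all 10.
```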
For outputs larger than 500KB, estimate the total token count by sampling 10 evenly spaced 100KB portions and extrapolating the chars-per-token ratio. This avoids running full BPE tokenization on the entire output. To guard against worst-case scenarios (periodic structure resonating with the sample stride, or mixed CJK/ASCII content), compute the coefficient of variation (CV) of per-sample chars/token ratios. If CV exceeds 0.15, fall back to full tokenization to maintain accuracy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 500KB threshold caused regression on Ubuntu (+6.6%) and macOS (+8.5%) because outputs in the 500KB-1MB range went from 1 worker call (direct processing) to 10 calls (sampling), adding significant overhead. Raising the threshold to 3MB ensures sampling only kicks in when full parallel chunking would require 30+ tasks, making the reduction to 10 samples worthwhile. The else branch now uses fullTokenize to preserve the existing parallel chunking path for 1-3MB outputs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from e27732b to 309fbf2 (Compare)
Closing: sampling-based estimation regresses performance on Ubuntu/macOS. The parallel worker pool is already efficient enough on Unix that reducing 39 chunks to 10 samples doesn't help — it only adds overhead. Windows benefits (-18.7%) but doesn't justify the Unix regression (+7-15%). May revisit with a different approach (e.g., platform-aware thresholds or single-sample ratio estimation).
For outputs larger than 500KB, estimate the total token count by sampling 10 evenly spaced 100KB portions and extrapolating the chars-per-token ratio. This avoids running full BPE tokenization on the entire output.
To guard against worst-case scenarios (periodic structure resonating with the sample stride, or mixed CJK/ASCII content), compute the coefficient of variation (CV) of per-sample chars/token ratios. If CV exceeds 0.15, fall back to full tokenization to maintain accuracy.
How it works
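A minimal synchronous sketch of the scheme described above, using a stand-in `countTokens` callback; the real implementation dispatches each sample to a worker-based task runner rather than tokenizing inline:

```typescript
// Sampling-based token estimation with CV fallback (illustrative sketch).
function estimateTokens(
  content: string,
  countTokens: (text: string) => number, // stand-in for real BPE tokenization
  sampleCount = 10,
  sampleSize = 100_000,
  cvThreshold = 0.15,
): number {
  if (content.length <= sampleCount * sampleSize) {
    return countTokens(content); // small output: tokenize exactly
  }
  const stride = Math.floor(content.length / sampleCount);
  const samples = Array.from({ length: sampleCount }, (_, i) =>
    content.slice(i * stride, i * stride + sampleSize),
  );
  const tokens = samples.map((s) => countTokens(s));
  const ratios = samples.map((s, i) => s.length / tokens[i]); // chars per token
  const mean = ratios.reduce((sum, r) => sum + r, 0) / ratios.length;
  const variance = ratios.reduce((sum, r) => sum + (r - mean) ** 2, 0) / ratios.length;
  if (Math.sqrt(variance) / mean > cvThreshold) {
    return countTokens(content); // heterogeneous content: fall back to exact count
  }
  const sampleChars = samples.reduce((sum, s) => sum + s.length, 0);
  const sampleTokens = tokens.reduce((sum, t) => sum + t, 0);
  return Math.round((content.length / sampleChars) * sampleTokens); // extrapolate
}
```

On uniform content the extrapolation matches full tokenization; on heterogeneous content the CV gate routes back to the exact path.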
Worst-case handling
Checklist
- `npm run test`
- `npm run lint`