perf(metrics): Reduce output token counting chunks from ~1000 to ~10 #1373
CHUNK_SIZE was used as the number of chunks (1000), creating ~1KB chunks for 1MB output. Each chunk dispatched a worker task with ~0.5ms overhead for serialization, scheduling, and callback resolution, totaling ~500ms of overhead that dominated the actual tokenization work.

Replace with TARGET_CHARS_PER_CHUNK (100,000) so chunks are sized by content rather than count. A 1MB output now produces ~10 chunks instead of ~1000, reducing worker round-trip overhead by ~99%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
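The size-based strategy described above can be sketched as follows. This is an illustrative reconstruction, not the PR's exact code; the `splitIntoChunks` helper name is an assumption.

```javascript
// Size-based chunking sketch: chunk count scales with content length,
// so a 1MB output yields ~10 worker tasks instead of ~1000.
const TARGET_CHARS_PER_CHUNK = 100_000;

function splitIntoChunks(content) {
  const chunks = [];
  // Step by the target size; the final chunk may be shorter.
  for (let i = 0; i < content.length; i += TARGET_CHARS_PER_CHUNK) {
    chunks.push(content.slice(i, i + TARGET_CHARS_PER_CHUNK));
  }
  return chunks;
}

const oneMb = "x".repeat(1_000_000);
console.log(splitIntoChunks(oneMb).length); // 10
```

With the old count-based scheme, chunk size shrank as output grew, so per-task dispatch overhead (~0.5ms) dominated; sizing by content keeps each worker task large enough to amortize that overhead.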
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1373      +/-   ##
==========================================
- Coverage   87.32%   87.32%   -0.01%
==========================================
  Files         115      115
  Lines        4378     4377       -1
  Branches     1017     1017
==========================================
- Hits         3823     3822       -1
  Misses        555      555
```

☔ View full report in Codecov by Sentry.
Code Review
This pull request optimizes the parallel processing of output metrics by switching from a fixed number of chunks to a fixed chunk size of 100KB, which reduces worker round-trip overhead. A review comment identifies a potential issue where splitting strings at fixed character counts could break Unicode surrogate pairs and provides a suggestion to ensure splits occur at valid character boundaries.
```js
for (let i = 0; i < content.length; i += TARGET_CHARS_PER_CHUNK) {
  chunks.push(content.slice(i, i + TARGET_CHARS_PER_CHUNK));
}
```
Splitting strings using a fixed character count can break Unicode surrogate pairs (e.g., emojis or certain mathematical symbols) if the split occurs between the high and low surrogates. This results in invalid UTF-16 strings being sent to the worker, which may lead to slightly inaccurate token counts or errors depending on how the tokenizer handles malformed input. While the impact is likely small given the 100KB chunk size, it is safer to ensure splits occur at valid character boundaries.
```js
for (let i = 0; i < content.length; ) {
  let end = Math.min(i + TARGET_CHARS_PER_CHUNK, content.length);
  // Back off by one if the chunk would end on a high surrogate (0xD800–0xDBFF),
  // which would strand its low surrogate at the start of the next chunk.
  if (end < content.length && content.charCodeAt(end - 1) >= 0xd800 && content.charCodeAt(end - 1) <= 0xdbff) {
    end--;
  }
  chunks.push(content.slice(i, end));
  i = end;
}
```
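The boundary adjustment can be seen on a string containing an astral-plane character. This is a small self-contained sketch (the `splitSafe` helper name is ours, not the PR's), using a tiny chunk size to force a split where the emoji sits:

```javascript
// Surrogate-aware splitting: never end a chunk on a high surrogate.
// splitSafe is a hypothetical wrapper around the logic suggested above.
function splitSafe(content, targetChars) {
  const chunks = [];
  for (let i = 0; i < content.length; ) {
    let end = Math.min(i + targetChars, content.length);
    const cu = content.charCodeAt(end - 1);
    if (end < content.length && cu >= 0xd800 && cu <= 0xdbff) {
      end--; // keep the surrogate pair together
    }
    chunks.push(content.slice(i, end));
    i = end;
  }
  return chunks;
}

// "😀" is U+1F600, stored as the surrogate pair \uD83D\uDE00 (two code units).
const s = "a😀b"; // UTF-16 length 4
console.log(splitSafe(s, 2)); // [ 'a', '😀', 'b' ]
```

A naive `s.slice(0, 2)` would end the first chunk on the lone high surrogate `\uD83D`; the back-off keeps each emitted chunk a well-formed UTF-16 string. (Note the back-off assumes `targetChars >= 2`; with a 100,000-character target that always holds.)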
Code Review

Overall: This is a clean, well-motivated performance fix. The change correctly identifies that the old constant was being used as a chunk count rather than a chunk size, turning a 1MB output into ~1000 tiny worker tasks.

The naming improvement (CHUNK_SIZE → TARGET_CHARS_PER_CHUNK) also makes the constant's intent explicit.

Suggestions

1. Consider a test with non-evenly-divisible content. The chunk-splitting test uses content that divides evenly into chunks; a case whose length leaves a remainder would also exercise the final, smaller chunk.

2. Minor: comment says "~100KB" but the constant is 100,000 characters. Line 6 says "Target ~100KB per chunk" but the constant is measured in UTF-16 code units, so a chunk equals ~100KB only for mostly-ASCII content — a harmless approximation, but worth a clarifying comment.

3. Note: token count is an approximation at chunk boundaries (pre-existing). Splitting at arbitrary character boundaries can split BPE token sequences, so summed per-chunk counts may differ slightly from tokenizing the whole output at once.

Looks good to merge! 🎉

Reviewed with Claude
The parallel token counting path in calculateOutputMetrics used CHUNK_SIZE = 1000 as the number of chunks, creating ~1KB chunks for 1MB output. Each chunk dispatched a worker task with ~0.5ms overhead (serialization, scheduling, callback resolution), totaling ~500ms of overhead that dominated the actual tokenization work (~50ms).

Replace with TARGET_CHARS_PER_CHUNK = 100_000 so chunks are sized by content rather than count. A 1MB output now produces ~10 chunks instead of ~1000, reducing worker round-trip overhead by ~99%.

Checklist
- npm run test
- npm run lint