
perf(metrics): Reduce output token counting chunks from ~1000 to ~10 #1373

Merged
yamadashy merged 1 commit into main from perf/optimize-output-token-chunk-size on Apr 3, 2026

Conversation


yamadashy (Owner) commented Apr 3, 2026

The parallel token counting path in calculateOutputMetrics used CHUNK_SIZE = 1000 as the number of chunks, creating ~1KB chunks for 1MB output. Each chunk dispatched a worker task with ~0.5ms overhead (serialization, scheduling, callback resolution), totaling ~500ms of overhead that dominated the actual tokenization work (~50ms).

Replace with TARGET_CHARS_PER_CHUNK = 100_000 so chunks are sized by content rather than count. A 1MB output now produces ~10 chunks instead of ~1000, reducing worker round-trip overhead by ~99%.

          Chunks   Overhead   Tokenization
Before    ~1000    ~500ms     ~50ms
After     ~10      ~5ms       ~50ms
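
For illustration, a minimal sketch of both chunking strategies. The "after" loop matches the three lines quoted in the Gemini review below; the "before" variant is reconstructed from this description, not copied from the diff:

const content = 'a'.repeat(1_000_000); // example ~1MB payload

// Before (sketch): CHUNK_SIZE = 1000 was treated as the *number* of chunks,
// so chunk length shrank toward ~1KB as content grew, and a 1MB string
// dispatched ~1000 tiny worker tasks.
const CHUNK_SIZE = 1000;
const chunkLength = Math.ceil(content.length / CHUNK_SIZE);
const before: string[] = [];
for (let i = 0; i < content.length; i += chunkLength) {
  before.push(content.slice(i, i + chunkLength));
}

// After: chunks have a fixed target size, so the chunk *count* scales with
// content, and a 1MB string dispatches only ~10 worker tasks.
const TARGET_CHARS_PER_CHUNK = 100_000;
const after: string[] = [];
for (let i = 0; i < content.length; i += TARGET_CHARS_PER_CHUNK) {
  after.push(content.slice(i, i + TARGET_CHARS_PER_CHUNK));
}

Each array element becomes one worker task, so the ~0.5ms per-task overhead is paid per chunk; the fix works by shrinking the array length, not the total tokenization work.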

Checklist

  • Run npm run test
  • Run npm run lint


CHUNK_SIZE was used as the number of chunks (1000), creating ~1KB chunks
for 1MB output. Each chunk dispatched a worker task with ~0.5ms overhead
for serialization, scheduling, and callback resolution, totaling ~500ms
of overhead that dominated the actual tokenization work.

Replace with TARGET_CHARS_PER_CHUNK (100,000) so chunks are sized by
content rather than count. A 1MB output now produces ~10 chunks instead
of ~1000, reducing worker round-trip overhead by ~99%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot commented Apr 3, 2026

⚡ Performance Benchmark

Latest commit: ed8deff perf(metrics): Reduce output token counting chunks from ~1000 to ~10
Status: ✅ Benchmark complete!
Ubuntu: 1.63s (±0.02s) → 1.55s (±0.03s) · -0.08s (-4.7%)
macOS: 1.34s (±0.44s) → 1.30s (±0.28s) · -0.04s (-2.8%)
Windows: 2.00s (±0.09s) → 1.94s (±0.06s) · -0.06s (-3.0%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded), interleaved execution
  • Measurement: 20 runs / 30 on macOS (median ± IQR)
  • Workflow run


coderabbitai bot commented Apr 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 48aee02c-6598-4b0e-b2dc-e68c71dd1773

📥 Commits

Reviewing files that changed from the base of the PR and between 0382898 and ed8deff.

📒 Files selected for processing (2)
  • src/core/metrics/calculateOutputMetrics.ts
  • tests/core/metrics/calculateOutputMetrics.test.ts

📝 Walkthrough

The pull request modifies the parallel chunking strategy in calculateOutputMetrics.ts, replacing dynamic chunk size calculations with fixed 100,000-character segments when processing large content. Corresponding test assertions are updated to reflect the deterministic chunking behavior and adjusted token count expectations.

Changes

Parallel Chunking Logic (src/core/metrics/calculateOutputMetrics.ts)
Replaced CHUNK_SIZE-based calculation with fixed TARGET_CHARS_PER_CHUNK (100,000-character) segments; invoked when content.length > MIN_CONTENT_LENGTH_FOR_PARALLEL. Parallel aggregation and error handling paths unchanged.

Test Updates (tests/core/metrics/calculateOutputMetrics.test.ts)
Updated parallel processing tests: token result assertions now depend on runtime chunk count; the chunk-splitting test expects exactly 11 chunks for ~1.1MB input, with each non-final chunk being precisely 100,000 characters. Original content concatenation validation retained.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • yamadashy/repomix#1350: Modifies the same calculateOutputMetrics function's encoding parameter type; sequential work on the same function warrants cross-reference.
🚥 Pre-merge checks (3 of 3 passed)

✅ Title check. The title clearly and concisely summarizes the main performance optimization: reducing token counting chunks from ~1000 to ~10, which is the core objective of the changeset.
✅ Description check. The description includes comprehensive context (problem statement, solution, performance metrics table) and completes the required checklist items, though the HTML devin-review-badge content is tangential.
✅ Docstring Coverage. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



codecov bot commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.32%. Comparing base (0382898) to head (ed8deff).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1373      +/-   ##
==========================================
- Coverage   87.32%   87.32%   -0.01%     
==========================================
  Files         115      115              
  Lines        4378     4377       -1     
  Branches     1017     1017              
==========================================
- Hits         3823     3822       -1     
  Misses        555      555              



devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.



gemini-code-assist bot left a comment


Code Review

This pull request optimizes the parallel processing of output metrics by switching from a fixed number of chunks to a fixed chunk size of 100KB, which reduces worker round-trip overhead. A review comment identifies a potential issue where splitting strings at fixed character counts could break Unicode surrogate pairs and provides a suggestion to ensure splits occur at valid character boundaries.

Comment on lines +30 to 32
for (let i = 0; i < content.length; i += TARGET_CHARS_PER_CHUNK) {
chunks.push(content.slice(i, i + TARGET_CHARS_PER_CHUNK));
}

Severity: medium

Splitting strings using a fixed character count can break Unicode surrogate pairs (e.g., emojis or certain mathematical symbols) if the split occurs between the high and low surrogates. This results in invalid UTF-16 strings being sent to the worker, which may lead to slightly inaccurate token counts or errors depending on how the tokenizer handles malformed input. While the impact is likely small given the 100KB chunk size, it is safer to ensure splits occur at valid character boundaries.

      for (let i = 0; i < content.length; ) {
        let end = Math.min(i + TARGET_CHARS_PER_CHUNK, content.length);
        // If the chunk would end on a high surrogate (0xD800-0xDBFF), pull the
        // boundary back one code unit so the surrogate pair stays together in
        // the next chunk.
        if (end < content.length && content.charCodeAt(end - 1) >= 0xd800 && content.charCodeAt(end - 1) <= 0xdbff) {
          end--;
        }
        chunks.push(content.slice(i, end));
        i = end;
      }
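
To see the failure mode this guards against: an emoji is one code point but two UTF-16 code units, so a fixed-index slice can land between them. A quick REPL-style illustration (not from the diff):

// '😀' (U+1F600) encodes as the surrogate pair \uD83D \uDE00.
const s = 'a😀';
s.length;       // 3, not 2: .length counts UTF-16 code units
s.slice(0, 2);  // 'a\uD83D', a lone high surrogate and thus a malformed string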


claude bot commented Apr 3, 2026

Code Review

Overall: This is a clean, well-motivated performance fix. The change correctly identifies that CHUNK_SIZE = 1000 was being used as a chunk count rather than a chunk size, causing ~1000 worker round-trips with negligible work each. Flipping the axis to a fixed TARGET_CHARS_PER_CHUNK = 100_000 is the right approach — let chunk count be derived from content size, not the other way around.

The naming improvement (CHUNK_SIZE → TARGET_CHARS_PER_CHUNK) makes the intent self-documenting, and the explanatory comment block is well-written. Tests are updated correctly.

Suggestions

1. Consider a test with non-evenly-divisible content

The chunk-splitting test uses 'a'.repeat(1_100_000) which divides evenly into 11 chunks of exactly 100,000 chars. Adding a test with e.g. 1_150_000 chars would exercise the "last chunk is smaller" remainder case, ensuring the final partial chunk is correctly included.
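
A hedged sketch of such a test, assuming the repo's vitest setup and using a hypothetical splitIntoChunks helper to stand in for the chunking loop under test:

import { expect, it } from 'vitest';

const TARGET_CHARS_PER_CHUNK = 100_000;

// Hypothetical stand-in mirroring the loop in calculateOutputMetrics.ts.
function splitIntoChunks(content: string): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < content.length; i += TARGET_CHARS_PER_CHUNK) {
    chunks.push(content.slice(i, i + TARGET_CHARS_PER_CHUNK));
  }
  return chunks;
}

it('keeps a smaller final chunk when content is not evenly divisible', () => {
  const content = 'a'.repeat(1_150_000);
  const chunks = splitIntoChunks(content);
  expect(chunks).toHaveLength(12);          // 11 full chunks + 1 partial
  expect(chunks[11]).toHaveLength(50_000);  // the remainder
  expect(chunks.join('')).toBe(content);    // nothing lost at boundaries
});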

2. Minor: comment says "~100KB" but the constant is 100,000 characters

Line 6 says "Target ~100KB per chunk" but the constant is 100_000 characters, not bytes. For ASCII content they're equivalent, but for multi-byte UTF-8 content they diverge. Consider "~100K characters per chunk" for precision — though this is very minor.

3. Note: token count is an approximation at chunk boundaries (pre-existing)

Splitting at arbitrary character boundaries can split BPE token sequences, so sum(count(chunk_i)) won't exactly equal count(full_content). This is a pre-existing issue (same with the old 1000-chunk approach) and at 100KB chunks the boundary effect is negligible. No action needed, just worth acknowledging if exact token counts ever become a requirement.
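
For illustration, a hedged sketch of that boundary effect, assuming the tiktoken npm package is available (the encoding and sample string are arbitrary; exact counts vary by encoder):

import { get_encoding } from 'tiktoken';

const enc = get_encoding('o200k_base');
const text = 'internationalization';
const whole = enc.encode(text).length;
const parts = enc.encode(text.slice(0, 10)).length + enc.encode(text.slice(10)).length;
// `parts` can exceed `whole`: the cut lands inside character runs that the
// BPE would otherwise merge into a single token.
console.log({ whole, parts });
enc.free(); // the WASM-backed encoder must be freed explicitly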

Looks good to merge! 🎉


