perf(metrics): Reduce output token counting chunks from ~1000 to ~10 #1373
CHUNK_SIZE was used as the number of chunks (1000), creating ~1KB chunks for 1MB output. Each chunk dispatched a worker task with ~0.5ms overhead for serialization, scheduling, and callback resolution, totaling ~500ms of overhead that dominated the actual tokenization work.

Replace with TARGET_CHARS_PER_CHUNK (100,000) so chunks are sized by content rather than count. A 1MB output now produces ~10 chunks instead of ~1000, reducing worker round-trip overhead by ~99%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
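The size-based strategy described above can be sketched as follows. This is an illustrative reconstruction, not the PR's exact code; the `splitIntoChunks` helper name is an assumption.

```javascript
// Size-based chunking sketch: chunk count scales with content length,
// so a 1MB output yields ~10 worker tasks instead of ~1000.
const TARGET_CHARS_PER_CHUNK = 100_000;

function splitIntoChunks(content) {
  const chunks = [];
  // Step by the target size; the final chunk may be shorter.
  for (let i = 0; i < content.length; i += TARGET_CHARS_PER_CHUNK) {
    chunks.push(content.slice(i, i + TARGET_CHARS_PER_CHUNK));
  }
  return chunks;
}

const oneMb = "x".repeat(1_000_000);
console.log(splitIntoChunks(oneMb).length); // 10
```

With the old count-based scheme, chunk size shrank as output grew, so per-task dispatch overhead (~0.5ms) dominated; sizing by content keeps each worker task large enough to amortize that overhead.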
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1373      +/-   ##
==========================================
- Coverage   87.32%   87.32%   -0.01%
==========================================
  Files         115      115
  Lines        4378     4377       -1
  Branches     1017     1017
==========================================
- Hits         3823     3822       -1
  Misses        555      555
```

☔ View full report in Codecov by Sentry.
Code Review
This pull request optimizes the parallel processing of output metrics by switching from a fixed number of chunks to a fixed chunk size of 100KB, which reduces worker round-trip overhead. A review comment identifies a potential issue where splitting strings at fixed character counts could break Unicode surrogate pairs and provides a suggestion to ensure splits occur at valid character boundaries.
```js
for (let i = 0; i < content.length; i += TARGET_CHARS_PER_CHUNK) {
  chunks.push(content.slice(i, i + TARGET_CHARS_PER_CHUNK));
}
```
Splitting strings using a fixed character count can break Unicode surrogate pairs (e.g., emojis or certain mathematical symbols) if the split occurs between the high and low surrogates. This results in invalid UTF-16 strings being sent to the worker, which may lead to slightly inaccurate token counts or errors depending on how the tokenizer handles malformed input. While the impact is likely small given the 100KB chunk size, it is safer to ensure splits occur at valid character boundaries.
```js
for (let i = 0; i < content.length; ) {
  let end = Math.min(i + TARGET_CHARS_PER_CHUNK, content.length);
  // Back off by one if the chunk would end on a high surrogate (0xD800–0xDBFF),
  // which would strand its low surrogate at the start of the next chunk.
  if (end < content.length && content.charCodeAt(end - 1) >= 0xd800 && content.charCodeAt(end - 1) <= 0xdbff) {
    end--;
  }
  chunks.push(content.slice(i, end));
  i = end;
}
```
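The boundary adjustment can be seen on a string containing an astral-plane character. This is a small self-contained sketch (the `splitSafe` helper name is ours, not the PR's), using a tiny chunk size to force a split where the emoji sits:

```javascript
// Surrogate-aware splitting: never end a chunk on a high surrogate.
// splitSafe is a hypothetical wrapper around the logic suggested above.
function splitSafe(content, targetChars) {
  const chunks = [];
  for (let i = 0; i < content.length; ) {
    let end = Math.min(i + targetChars, content.length);
    const cu = content.charCodeAt(end - 1);
    if (end < content.length && cu >= 0xd800 && cu <= 0xdbff) {
      end--; // keep the surrogate pair together
    }
    chunks.push(content.slice(i, end));
    i = end;
  }
  return chunks;
}

// "😀" is U+1F600, stored as the surrogate pair \uD83D\uDE00 (two code units).
const s = "a😀b"; // UTF-16 length 4
console.log(splitSafe(s, 2)); // [ 'a', '😀', 'b' ]
```

A naive `s.slice(0, 2)` would end the first chunk on the lone high surrogate `\uD83D`; the back-off keeps each emitted chunk a well-formed UTF-16 string. (Note the back-off assumes `targetChars >= 2`; with a 100,000-character target that always holds.)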
Code Review

Overall: This is a clean, well-motivated performance fix. The change correctly identifies that the old constant was being used as a chunk count rather than a chunk size, turning a 1MB output into ~1000 tiny worker tasks.

The naming improvement (CHUNK_SIZE → TARGET_CHARS_PER_CHUNK) also makes the constant's intent explicit.

Suggestions

1. Consider a test with non-evenly-divisible content. The chunk-splitting test uses content that divides evenly into chunks; a case whose length leaves a remainder would also exercise the final, smaller chunk.

2. Minor: comment says "~100KB" but the constant is 100,000 characters. Line 6 says "Target ~100KB per chunk" but the constant is measured in UTF-16 code units, so a chunk equals ~100KB only for mostly-ASCII content — a harmless approximation, but worth a clarifying comment.

3. Note: token count is an approximation at chunk boundaries (pre-existing). Splitting at arbitrary character boundaries can split BPE token sequences, so summed per-chunk counts may differ slightly from tokenizing the whole output at once.

Looks good to merge! 🎉

Reviewed with Claude
The parallel token counting path in calculateOutputMetrics used CHUNK_SIZE = 1000 as the number of chunks, creating ~1KB chunks for 1MB output. Each chunk dispatched a worker task with ~0.5ms overhead (serialization, scheduling, callback resolution), totaling ~500ms of overhead that dominated the actual tokenization work (~50ms).

Replace with TARGET_CHARS_PER_CHUNK = 100_000 so chunks are sized by content rather than count. A 1MB output now produces ~10 chunks instead of ~1000, reducing worker round-trip overhead by ~99%.

Checklist
- npm run test
- npm run lint