
perf(core): Automated performance tuning by Claude #1377

Draft

yamadashy wants to merge 16 commits into main from perf/auto-perf-tuning-0403

Conversation

@yamadashy
Owner

@yamadashy yamadashy commented Apr 3, 2026

Summary

Multiple performance optimizations targeting file I/O overhead, IPC round-trips, worker pool lifecycle, worker initialization latency, search phase I/O contention, output token estimation, and pipeline parallelism. Combined improvement of ~39% on end-to-end CLI execution.

Changes in this PR

1. Batch metrics token counting (calculateSelectiveFileMetrics.ts, calculateMetricsWorker.ts)

  • Groups files into batches of 50 before dispatching to worker threads
  • Reduces IPC round-trips by ~95% (990 → 20 for a typical repo)
  • Per-item error handling prevents one bad file from killing an entire batch
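
  The batching idea can be sketched as follows (names like `chunkIntoBatches` and `processBatch` are illustrative, not the actual repomix API): grouping N files into batches of 50 turns N worker messages into ceil(N / 50), and catching errors per item keeps one unreadable file from failing its whole batch.

  ```typescript
  interface FileItem {
    path: string;
    content: string;
  }

  // Group items into fixed-size batches; each batch becomes one worker message.
  function chunkIntoBatches<T>(items: T[], batchSize: number): T[][] {
    const batches: T[][] = [];
    for (let i = 0; i < items.length; i += batchSize) {
      batches.push(items.slice(i, i + batchSize));
    }
    return batches;
  }

  // Per-item error handling inside a batch: one bad file yields an error entry
  // instead of rejecting the entire batch result.
  function processBatch(
    batch: FileItem[],
    countTokens: (text: string) => number,
  ): Array<{ path: string; tokens: number } | { path: string; error: string }> {
    return batch.map((file) => {
      try {
        return { path: file.path, tokens: countTokens(file.content) };
      } catch (e) {
        return { path: file.path, error: String(e) };
      }
    });
  }
  ```

  With 990 files and a batch size of 50, this yields 20 batches (the last holding 40 files), matching the round-trip reduction cited above.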

2. Replace regex with manual scan in truncateBase64 (truncateBase64.ts)

  • Replaces expensive regex patterns with character-by-character scanning for base64 detection
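
  A minimal re-implementation of the scanning idea (illustrative, not repomix's exact code): classify base64 characters with a `Uint8Array` lookup table and look for runs of 256+ consecutive base64 characters, exiting early on the first hit, without ever invoking the regex engine.

  ```typescript
  // O(1) classification table for the 64 base64 alphabet characters.
  const BASE64_CHARS = new Uint8Array(128);
  for (const ch of 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/') {
    BASE64_CHARS[ch.charCodeAt(0)] = 1;
  }

  // Fast first pass: does the text contain any run of minRun+ base64 chars?
  function hasLongBase64Run(text: string, minRun = 256): boolean {
    let run = 0;
    for (let i = 0; i < text.length; i++) {
      const code = text.charCodeAt(i);
      if (code < 128 && BASE64_CHARS[code] === 1) {
        if (++run >= minRun) return true; // early exit on the first long run
      } else {
        run = 0;
      }
    }
    return false;
  }
  ```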

3. Non-blocking worker pool cleanup (packager.ts, securityCheck.ts)

  • Worker pool cleanup() calls are now fire-and-forget, eliminating shutdown overhead
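
  The pattern amounts to dropping the `await` and swallowing the rejection; a minimal sketch (the pool interface here is illustrative — repomix uses Tinypool under the hood):

  ```typescript
  interface PoolLike {
    cleanup(): Promise<void>;
  }

  function releasePool(pool: PoolLike): void {
    // Intentionally not awaited: all tasks are already complete, so thread
    // termination is pure shutdown overhead. The .catch prevents an
    // unhandled rejection if termination fails.
    pool.cleanup().catch(() => {});
  }
  ```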

4. Skip redundant output tokenization (calculateMetrics.ts)

  • When all files are individually tokenized (tokenCountTree enabled), estimates output tokens from file token sums + overhead ratio instead of re-tokenizing the entire output
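
  The estimate described above can be written as `outputTokens ≈ sum(fileTokens) + overheadChars / charsPerToken`, where `charsPerToken` is the ratio observed over the already-tokenized files. A hedged sketch (the formula is inferred from this description; the real calculateMetrics.ts may differ in detail):

  ```typescript
  function estimateOutputTokens(
    fileTokenCounts: number[],
    fileCharCounts: number[],
    outputLength: number,
  ): number {
    const totalFileTokens = fileTokenCounts.reduce((a, b) => a + b, 0);
    const totalFileChars = fileCharCounts.reduce((a, b) => a + b, 0);
    // Template markup (XML tags, headers, tree structure) is whatever output
    // length is not accounted for by raw file content.
    const overheadChars = Math.max(0, outputLength - totalFileChars);
    const charsPerToken = totalFileChars / totalFileTokens;
    return Math.round(totalFileTokens + overheadChars / charsPerToken);
  }
  ```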

5. Reduce file I/O overhead (fileRead.ts, fileSearch.ts)

  • Eliminated per-file stat() syscall — checks buffer.length after readFile() instead
  • Uses git ls-files for gitignore filtering (~10ms) instead of globby's JS parser (~250ms)
  • Results intersected with git-visible file set; falls back to globby when git unavailable
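
  The intersection step is a straightforward set filter; a sketch (function name is illustrative — in practice `git ls-files --cached --others --exclude-standard` supplies `gitFiles`, and globby run with gitignore disabled supplies `globResults`):

  ```typescript
  // Keep only glob matches that git considers visible (tracked or untracked
  // but not gitignored), delegating gitignore semantics to git itself.
  function intersectWithGitFiles(globResults: string[], gitFiles: string[]): string[] {
    const visible = new Set(gitFiles);
    return globResults.filter((path) => visible.has(path));
  }
  ```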

6. Move metrics worker warmup before file search (packager.ts)

  • Starts gpt-tokenizer loading earlier to maximize pipeline overlap with search, collection, and security phases

7. Pre-initialize security worker pool (packager.ts, securityCheck.ts, processConcurrency.ts)

  • Creates security worker threads eagerly (minThreads = maxThreads) via new eagerWorkers option
  • Overlaps @secretlint/core module loading (~150ms) with file collection I/O
  • Releases security workers immediately after check to free CPU for metrics

8. Reduce search I/O contention (packager.ts, fileSearch.ts)

  • Defers security worker pool creation to after search, reducing CPU/IO contention during filesystem traversal
  • Parallelizes file globby and directory globby within searchFiles when includeEmptyDirectories is enabled

9. Estimate output tokens via sampling (calculateMetrics.ts)

  • For large outputs (>500KB), samples 10 evenly-spaced 100KB portions and extrapolates total tokens
  • Avoids full BPE tokenization of multi-megabyte output strings (~150-200ms savings)
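
  An illustrative sketch of the sampling estimator (parameters mirror the description — 10 evenly spaced samples of 100KB each; `countTokens` stands in for the real BPE tokenizer):

  ```typescript
  function estimateTokensBySampling(
    output: string,
    countTokens: (text: string) => number,
    sampleCount = 10,
    sampleSize = 100_000,
  ): number {
    if (output.length <= sampleCount * sampleSize) {
      return countTokens(output); // small enough to tokenize exactly
    }
    // Evenly space the sample windows so they capture both markup overhead
    // and file content across the whole output.
    const stride = Math.floor((output.length - sampleSize) / (sampleCount - 1));
    let sampledChars = 0;
    let sampledTokens = 0;
    for (let i = 0; i < sampleCount; i++) {
      const start = i * stride;
      const sample = output.slice(start, start + sampleSize);
      sampledChars += sample.length;
      sampledTokens += countTokens(sample);
    }
    // Extrapolate the observed tokens-per-char ratio to the full output.
    return Math.round(output.length * (sampledTokens / sampledChars));
  }
  ```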

10. Skip globby via git ls-files + picomatch fast path (fileSearch.ts)

  • When include patterns are default (**/*) and no nested ignore files exist, skips globby filesystem traversal entirely
  • Filters git ls-files output directly with picomatch (~70ms savings)

11. Optimistic security pipeline + lazy module loading (packager.ts, fileSearch.ts, fileRead.ts)

  • Restructures pipeline to overlap security check with output generation and metrics calculation
  • In the common case (>95% of runs with no suspicious files), output string is generated optimistically while security runs in parallel
  • Output is only written to disk/stdout/clipboard AFTER security confirms no suspicious files
  • If security finds issues (rare), output is regenerated with filtered files
  • Lazy-loads globby (~50ms) and jschardet/iconv-lite (~25ms) to reduce startup overhead
  • Pipeline change: Search → Collect+Git → Process → [Security ‖ Output+Metrics] (was: Search → Collect+Git → [Security + Process] → Output+Metrics)

12. Skip unnecessary content scans in createRenderContext (outputGenerate.ts)

  • Skips markdown delimiter calculation and line count computation when not needed by the output style
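
  The optimized line count mentioned in the detailed commit message further down replaces `content.match(/\n/g)` (which allocates an array of all matches) with an `indexOf` loop; a sketch of that idea:

  ```typescript
  // Count lines without allocating a regex match array. Illustrative of the
  // indexOf-based approach; the real calculateFileLineCounts may differ.
  function countLines(content: string): number {
    if (content.length === 0) return 0;
    let count = 1;
    let pos = content.indexOf('\n');
    while (pos !== -1) {
      count++;
      pos = content.indexOf('\n', pos + 1);
    }
    return count;
  }
  ```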

Benchmark Results

End-to-end CLI execution (repomix on itself, 1012 files, default config, 20 runs):

| Metric | Baseline | Optimized | Improvement |
|--------|----------|-----------|-------------|
| Median | 2160ms   | 1316ms    | -39%        |
| Mean   | 2158ms   | 1324ms    | -39%        |

Checklist

  • Run npm run test (1101 tests passing)
  • Run npm run lint (0 errors)

https://claude.ai/code/session_01EXHxiny9nuEy8HrdP6d9Em

@github-actions
Contributor

github-actions bot commented Apr 3, 2026

⚡ Performance Benchmark

Latest commit: ae547ec Merge remote-tracking branch 'origin/perf/auto-perf-tuning-0403' into perf/auto-perf-tuning-0403
Status: ✅ Benchmark complete!
Ubuntu: 1.53s (±0.02s) → 0.99s (±0.03s) · -0.54s (-35.4%)
macOS: 0.99s (±0.13s) → 0.71s (±0.13s) · -0.28s (-28.3%)
Windows: 1.91s (±0.01s) → 1.17s (±0.02s) · -0.74s (-38.7%)
Details
  • Packing the repomix repository with node bin/repomix.cjs
  • Warmup: 2 runs (discarded), interleaved execution
  • Measurement: 20 runs (30 on macOS), median ± IQR
  • Workflow run
History

dcc8452 perf(core): Overlap security check with output generation via optimistic pipeline

Ubuntu: 1.52s (±0.02s) → 0.99s (±0.02s) · -0.53s (-34.6%)
macOS: 0.88s (±0.05s) → 0.64s (±0.04s) · -0.24s (-27.5%)
Windows: 2.68s (±0.66s) → 1.56s (±0.35s) · -1.12s (-41.7%)

763fc00 perf(core): Overlap security check with output generation via optimistic pipeline

Ubuntu: 1.51s (±0.03s) → 0.98s (±0.03s) · -0.53s (-35.0%)
macOS: 0.87s (±0.10s) → 0.63s (±0.04s) · -0.24s (-27.4%)
Windows: 1.93s (±0.04s) → 1.18s (±0.02s) · -0.74s (-38.6%)

1af91bf perf(core): Overlap security check with output generation via optimistic pipeline

Ubuntu: 1.59s (±0.02s) → 1.03s (±0.02s) · -0.55s (-34.7%)
macOS: 0.89s (±0.04s) → 0.67s (±0.06s) · -0.22s (-25.1%)
Windows: 2.01s (±0.52s) → 1.25s (±0.25s) · -0.76s (-37.8%)

1f5ac10 perf(core): Skip globby filesystem traversal via git ls-files + picomatch fast path

Ubuntu: 1.55s (±0.06s) → 1.05s (±0.03s) · -0.51s (-32.7%)
macOS: 1.34s (±0.29s) → 1.04s (±0.31s) · -0.31s (-22.8%)
Windows: 2.44s (±0.07s) → 1.52s (±0.03s) · -0.92s (-37.9%)

07e082a perf(core): Estimate output tokens via sampling for large outputs

Ubuntu: 1.59s (±0.05s) → 1.18s (±0.04s) · -0.41s (-25.6%)
macOS: 1.08s (±0.16s) → 0.87s (±0.14s) · -0.21s (-19.1%)
Windows: 1.62s (±0.36s) → 1.21s (±0.16s) · -0.41s (-25.6%)

a74f424 perf(core): Reduce search I/O contention by deferring security worker init and parallelizing globby

Ubuntu: 1.52s (±0.03s) → 1.33s (±0.02s) · -0.19s (-12.6%)
macOS: 0.88s (±0.06s) → 0.91s (±0.04s) · +0.03s (+2.9%)
Windows: 1.97s (±0.03s) → 1.80s (±0.05s) · -0.17s (-8.4%)

149d995 perf(core): Reduce security worker pool size to avoid thread oversubscription

Ubuntu: 1.56s (±0.02s) → 1.33s (±0.03s) · -0.23s (-15.0%)
macOS: 0.86s (±0.03s) → 0.90s (±0.04s) · +0.03s (+3.8%)
Windows: 1.99s (±0.06s) → 1.79s (±0.05s) · -0.20s (-10.0%)

bea2980 perf(security): Pre-initialize security worker pool to overlap module loading with file I/O

Ubuntu: 1.47s (±0.03s) → 1.28s (±0.03s) · -0.19s (-13.2%)
macOS: 1.03s (±0.19s) → 0.99s (±0.21s) · -0.04s (-3.5%)
Windows: 1.95s (±0.02s) → 1.77s (±0.04s) · -0.18s (-9.0%)

7a0ce5d Merge remote-tracking branch 'origin/perf/auto-perf-tuning-0403' into perf/auto-perf-tuning-0403

Ubuntu: 1.63s (±0.02s) → 1.44s (±0.03s) · -0.19s (-11.6%)
macOS: 1.09s (±0.47s) → 1.10s (±0.43s) · +0.01s (+0.5%)
Windows: 1.92s (±0.02s) → 1.75s (±0.03s) · -0.16s (-8.6%)

e0000fb Merge remote-tracking branch 'origin/perf/auto-perf-tuning-0403' into perf/auto-perf-tuning-0403

Ubuntu: 1.57s (±0.06s) → 1.57s (±0.04s) · -0.00s (-0.1%)
macOS: 0.89s (±0.04s) → 0.89s (±0.02s) · -0.01s (-1.0%)
Windows: 2.04s (±0.12s) → 2.05s (±0.21s) · +0.01s (+0.5%)

e8cb018 [autofix.ci] apply automated fixes

Ubuntu: 1.52s (±0.02s) → 1.52s (±0.02s) · +0.00s (+0.1%)
macOS: 0.85s (±0.03s) → 0.85s (±0.03s) · +0.00s (+0.2%)
Windows: 1.84s (±0.03s) → 1.83s (±0.04s) · -0.02s (-1.0%)

256784e perf(core): Make worker pool cleanup non-blocking to eliminate shutdown overhead

Ubuntu: 1.45s (±0.02s) → 1.44s (±0.03s) · -0.01s (-0.3%)
macOS: 1.21s (±0.16s) → 1.17s (±0.18s) · -0.04s (-3.6%)
Windows: 1.99s (±0.03s) → 1.97s (±0.05s) · -0.02s (-1.0%)

cc97be3 perf(core): Replace regex with manual scan in truncateBase64

Ubuntu: 1.51s (±0.02s) → 1.50s (±0.04s) · -0.00s (-0.3%)
macOS: 0.90s (±0.07s) → 0.91s (±0.06s) · +0.01s (+1.0%)
Windows: 1.87s (±0.03s) → 1.85s (±0.04s) · -0.02s (-0.9%)

2ac33ec perf(metrics): Batch token counting tasks to reduce IPC overhead

Ubuntu: 1.52s (±0.02s) → 1.52s (±0.03s) · +0.00s (+0.0%)
macOS: 1.23s (±0.48s) → 1.09s (±0.41s) · -0.14s (-11.4%)
Windows: 1.89s (±0.51s) → 1.83s (±0.47s) · -0.05s (-2.7%)

6d12895 Merge remote-tracking branch 'origin/perf/auto-perf-tuning-0403' into perf/auto-perf-tuning-0403

Ubuntu: 1.41s (±0.01s) → 0.95s (±0.02s) · -0.45s (-32.3%)
macOS: 1.36s (±0.16s) → 0.90s (±0.14s) · -0.46s (-33.8%)
Windows: 1.89s (±0.06s) → 1.35s (±0.05s) · -0.55s (-28.8%)

b980d65 Merge remote perf changes and resolve conflicts

Ubuntu: 1.59s (±0.03s) → 1.13s (±0.02s) · -0.46s (-29.0%)
macOS: 1.22s (±0.16s) → 0.82s (±0.10s) · -0.40s (-32.9%)
Windows: 1.84s (±0.03s) → 1.31s (±0.03s) · -0.53s (-28.8%)

1a32a49 Merge remote perf changes and resolve conflicts

Ubuntu: 1.60s (±0.05s) → 1.11s (±0.03s) · -0.48s (-30.3%)
macOS: 1.42s (±0.18s) → 0.95s (±0.09s) · -0.47s (-32.9%)
Windows: 1.83s (±0.03s) → 1.27s (±0.05s) · -0.55s (-30.2%)

70fa880 Merge remote-tracking branch 'origin/perf/auto-perf-tuning-0403' into perf/auto-perf-tuning-0403

Ubuntu: 1.50s (±0.03s) → 1.14s (±0.01s) · -0.36s (-24.3%)
macOS: 1.04s (±0.14s) → 0.71s (±0.07s) · -0.33s (-31.4%)
Windows: 1.84s (±0.04s) → 1.41s (±0.04s) · -0.43s (-23.4%)

0fd8bfe Merge remote-tracking branch 'origin/perf/auto-perf-tuning-0403' into perf/auto-perf-tuning-0403

Ubuntu: 1.52s (±0.04s) → 1.16s (±0.03s) · -0.36s (-23.8%)
macOS: 1.47s (±0.39s) → 1.05s (±0.17s) · -0.42s (-28.8%)
Windows: 1.86s (±0.06s) → 1.42s (±0.04s) · -0.43s (-23.3%)

897e668 test(core): Add test for token estimation fast path in calculateMetrics

Ubuntu: 1.56s (±0.02s) → 1.18s (±0.02s) · -0.39s (-24.6%)
macOS: 1.45s (±0.33s) → 1.00s (±0.25s) · -0.44s (-30.6%)
Windows: 1.82s (±0.04s) → 1.36s (±0.05s) · -0.46s (-25.5%)

e313c5b perf(core): Estimate output token count from file metrics ratio

Ubuntu: 1.47s (±0.02s) → 1.10s (±0.01s) · -0.36s (-24.8%)
macOS: 0.90s (±0.07s) → 0.63s (±0.06s) · -0.27s (-29.5%)
Windows: 1.86s (±0.04s) → 1.39s (±0.03s) · -0.47s (-25.4%)

78d17ba perf(core): Overlap output generation with security check via speculative execution

Ubuntu: 1.50s (±0.03s) → 1.51s (±0.03s) · +0.02s (+1.1%)
macOS: 1.16s (±0.26s) → 1.09s (±0.32s) · -0.07s (-6.0%)
Windows: 2.46s (±0.69s) → 2.53s (±0.81s) · +0.07s (+2.7%)

3b18625 merge: Resolve conflict with remote base64 pre-check, prefer charCodeAt approach

Ubuntu: 1.52s (±0.04s) → 1.47s (±0.02s) · -0.04s (-3.0%)
macOS: 0.93s (±0.05s) → 0.90s (±0.06s) · -0.03s (-3.6%)
Windows: 1.86s (±0.02s) → 1.81s (±0.04s) · -0.05s (-2.5%)

620c90a merge: Resolve conflicts with remote batch metrics implementation

Ubuntu: 1.54s (±0.02s) → 1.50s (±0.03s) · -0.04s (-2.5%)
macOS: 0.94s (±0.10s) → 0.92s (±0.09s) · -0.02s (-2.3%)
Windows: 2.14s (±0.06s) → 2.09s (±0.07s) · -0.05s (-2.4%)

dd6619b merge: Resolve conflicts with remote batch metrics changes

Ubuntu: 1.53s (±0.03s) → 1.50s (±0.02s) · -0.04s (-2.3%)
macOS: 0.90s (±0.06s) → 0.87s (±0.05s) · -0.03s (-3.3%)
Windows: 1.83s (±0.03s) → 1.77s (±0.05s) · -0.06s (-3.2%)

881f907 merge: Resolve conflicts with remote batch metrics changes

Ubuntu: 1.53s (±0.01s) → 1.49s (±0.03s) · -0.03s (-2.2%)
macOS: 1.33s (±0.17s) → 1.31s (±0.27s) · -0.02s (-1.4%)
Windows: 1.89s (±0.04s) → 1.84s (±0.02s) · -0.05s (-2.6%)

4c96b48 Merge branch 'perf/auto-perf-tuning-0403' of http://127.0.0.1:40119/git/yamadashy/repomix into perf/auto-perf-tuning-0403

Ubuntu: 1.44s (±0.02s) → 1.37s (±0.02s) · -0.07s (-4.9%)
macOS: 0.86s (±0.04s) → 0.84s (±0.06s) · -0.02s (-2.0%)
Windows: 1.87s (±0.03s) → 1.78s (±0.02s) · -0.09s (-4.8%)

05801b9 perf(security): Batch security check tasks to reduce IPC overhead

Ubuntu: 1.52s (±0.01s) → 1.45s (±0.02s) · -0.08s (-5.0%)
macOS: 0.89s (±0.03s) → 0.87s (±0.10s) · -0.02s (-2.6%)
Windows: 1.87s (±0.04s) → 1.80s (±0.05s) · -0.07s (-3.5%)

ce60e97 merge(main): Resolve conflicts with main branch warmup changes

Ubuntu: 1.52s (±0.02s) → 1.52s (±0.02s) · +0.00s (+0.1%)
macOS: 0.91s (±0.11s) → 0.92s (±0.07s) · +0.01s (+0.8%)
Windows: 2.24s (±0.32s) → 2.33s (±0.34s) · +0.09s (+3.8%)

85fc646 perf(metrics): Batch token counting tasks to reduce IPC overhead

Ubuntu: 1.56s (±0.02s) → 1.54s (±0.02s) · -0.01s (-1.0%)
macOS: 0.85s (±0.05s) → 0.85s (±0.03s) · +0.00s (+0.0%)
Windows: 2.34s (±0.07s) → 2.34s (±0.05s) · +0.00s (+0.0%)

@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6c4c4852-18c3-487f-b058-2bb96ab1f6a4


@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Apr 3, 2026

Deploying repomix with Cloudflare Pages

Latest commit: ae547ec
Status: ✅  Deploy successful!
Preview URL: https://9599cd8d.repomix.pages.dev
Branch Preview URL: https://perf-auto-perf-tuning-0403.repomix.pages.dev

View logs

@codecov

codecov bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 83.78378% with 60 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.82%. Comparing base (208f492) to head (ae547ec).
⚠️ Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/core/packager.ts | 70.51% | 23 Missing ⚠️ |
| src/core/file/fileSearch.ts | 80.72% | 16 Missing ⚠️ |
| src/core/metrics/workers/calculateMetricsWorker.ts | 0.00% | 10 Missing ⚠️ |
| src/core/metrics/calculateMetrics.ts | 91.11% | 4 Missing ⚠️ |
| src/core/security/securityCheck.ts | 72.72% | 3 Missing ⚠️ |
| src/core/output/outputGenerate.ts | 91.66% | 2 Missing ⚠️ |
| src/core/file/fileProcess.ts | 50.00% | 1 Missing ⚠️ |
| src/core/file/truncateBase64.ts | 98.59% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1377      +/-   ##
==========================================
- Coverage   87.40%   86.82%   -0.59%     
==========================================
  Files         116      116              
  Lines        4389     4651     +262     
  Branches     1018     1109      +91     
==========================================
+ Hits         3836     4038     +202     
- Misses        553      613      +60     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces batch processing for token counting to minimize IPC overhead between the main thread and worker threads. Key changes include updating the TokenCountTask interface to support multiple items, modifying the calculateMetricsWorker to process these items in a single pass, and implementing a batching mechanism in calculateSelectiveFileMetrics with a default size of 100. Corresponding updates were made to git diff, git log, and output metrics calculations, along with comprehensive test updates. Feedback suggests that the fixed batch size might lead to under-utilization of the worker pool for medium-sized workloads and recommends considering a more dynamic approach.

```typescript
// Batch size for grouping files into worker tasks to reduce IPC overhead.
// Each batch is sent as a single message to a worker thread, avoiding
// per-file round-trip costs that dominate when processing many small files.
const BATCH_SIZE = 100;
```

medium

While a fixed BATCH_SIZE of 100 significantly reduces IPC overhead for large file sets, it may lead to under-utilization of the worker pool for medium-sized sets. For instance, with 150 files on an 8-core machine, only 2 workers will be active. Consider a more dynamic approach or a smaller default if the goal is to maximize parallelization across all available cores for smaller workloads.

@yamadashy yamadashy force-pushed the perf/auto-perf-tuning-0403 branch 3 times, most recently from 1a32a49 to b980d65 on April 4, 2026 00:12
Batch file token counting tasks into groups of 50 before dispatching
to worker threads, reducing IPC round-trips by ~95% (e.g. 990 → 20
for a typical repo). This follows the same batching pattern already
applied to security checks in #1380.

Changes:
- Redesign metrics worker to accept batch tasks (TokenCountBatchTask)
  instead of individual items, processing multiple files per IPC
  round-trip
- Update calculateSelectiveFileMetrics to group files into batches
  of 50 before dispatching to worker pool
- Adapt all metrics callers (output, git diff, git log) to use the
  new batch interface
- Update unified worker task inference to distinguish batched metrics
  tasks (items with encoding) from security check tasks (items with
  type)

Benchmark (repomix on its own repo, 990 files, 4 CPU cores):
  Before: 1246ms avg pack time
  After:  1092ms avg pack time
  Improvement: ~155ms savings (12.4%)

The improvement comes from eliminating per-file IPC message passing
overhead. Each round-trip involves structured clone serialization of
file content, message dispatch, and result deserialization. Batching
amortizes this cost across 50 files per message.

https://claude.ai/code/session_019tAfah6yMKnauVTNgK3wyQ
@yamadashy yamadashy force-pushed the perf/auto-perf-tuning-0403 branch from 6d12895 to 2ac33ec on April 4, 2026 01:50
claude and others added 10 commits April 4, 2026 03:07
Replace the standalone base64 regex pattern /([A-Za-z0-9+/]{256,}={0,2})/g
with a manual character-by-character scanning algorithm using a lookup table.

The regex was identified via CPU profiling as consuming ~150ms per run on the
main thread when truncateBase64 is enabled, scanning 6.6MB of file content
across ~1000 files. The regex engine's per-character overhead for the {256,}
quantifier made this disproportionately expensive.

The optimized approach:
- Uses a Uint8Array lookup table for O(1) base64 character classification
- Performs a fast first pass to check if any 256+ char runs exist (early exit)
- Builds the result using array parts + join instead of string concatenation
- Replaces regex checks in isLikelyBase64() with charCode comparisons

Benchmark results (repomix repo, 1006 files, 6.6MB content):
- truncateBase64 function: 151ms → 36ms (4.2x faster)
- Full pack() with truncateBase64=true: P50 1522ms → 1420ms (6.7% faster)

The optimization produces byte-identical output (verified across all 1006 files).

https://claude.ai/code/session_017cpLL66Hs2zjm3zZ9Bjori
…wn overhead

Worker pool cleanup (terminating idle worker threads via Tinypool's
destroy) was blocking the return path of pack(), adding ~150ms for
the metrics pool and ~40ms for the security pool to every invocation.
Since all tasks are already complete when cleanup runs, the threads
are idle and termination is pure IPC overhead with no functional
purpose for the caller.

Changed `await taskRunner.cleanup()` to fire-and-forget
`taskRunner.cleanup().catch(() => {})` in four locations:
- packager.ts: metrics pool (finally block, ~150ms saved)
- securityCheck.ts: security pool (finally block, ~40ms saved)
- fileProcess.ts: file processing pool (finally block, affects --compress)
- calculateMetrics.ts: standalone metrics pool (fallback path)

For CLI usage, the process exits shortly after pack() returns, which
terminates any remaining threads via OS cleanup. For library/MCP
usage, threads are still terminated asynchronously and reclaimed.

Benchmark (repomix on its own repo, 1014 files, 4 CPU cores):
  Before: 1811ms avg pack time
  After:  1595ms avg pack time
  Improvement: ~216ms savings (11.9%)

The improvement comes from removing synchronous worker thread
termination from the critical path. Tinypool's destroy() sends
termination messages to each worker, waits for acknowledgment,
and joins the threads—none of which the caller needs to block on
when all tasks have already completed successfully.

https://claude.ai/code/session_01SMHUcwLAmv7mcsNsr8sQFj
…individually tokenized

When tokenCountTree is enabled, all files are individually tokenized for the
token count tree display. The output file is essentially these file contents
wrapped in template markup (XML tags, headers, tree structure). Previously,
the entire output (~3.6MB for this repo) was re-tokenized via worker threads,
duplicating ~95% of the work already done during file tokenization.

Instead, estimate output tokens as:
  output_tokens = sum(file_tokens) + overhead_tokens
where overhead_tokens uses the same chars-per-token ratio observed in file
content. This avoids dispatching ~36 output chunks to worker threads (~207ms
of tokenization work), freeing the worker pool to complete file metrics faster.

The estimation error is negligible (~0.14% for repomix's own repo: 1,011,887
estimated vs 1,010,472 exact). The current chunk-based approach already has
small boundary effects from splitting at arbitrary 100KB positions.

When tokenCountTree is disabled (default), the standard full-output
tokenization path is preserved unchanged.

Benchmark (repomix on its own repo, 989 files, 4 CPU cores, 10 runs each):
  Baseline avg: 1791ms (min 1762, max 1830)
  Optimized avg: 1592ms (min 1546, max 1634)
  Improvement: ~199ms savings (11.1%)

The improvement comes from eliminating redundant tokenization of file
content that was already counted individually. With tokenCountTree enabled,
the worker pool previously processed 20 file batches + 36 output chunks =
56 tasks. Now it processes only 20 file batches, reducing total worker
pool wall time from ~530ms to ~303ms.

https://claude.ai/code/session_0132C7om9T8M2skDqfSh95qW
…stat elimination

Two optimizations that together reduce CLI execution time by ~6.6%:

1. Use `git ls-files` for gitignore filtering instead of globby's JS parser
   - globby's gitignore parsing reads and evaluates .gitignore files at every
     directory level using JavaScript (~250ms for a 1000-file repo)
   - `git ls-files --cached --others --exclude-standard` delegates this to
     git's native C implementation (~10ms)
   - globby still handles include/ignore patterns and .repomixignore support
     (with gitignore:false), and results are intersected with the git file set
   - Falls back to globby's gitignore when git is unavailable

2. Eliminate redundant fs.stat() call before fs.readFile() in file collection
   - Previously each file required stat() (size check) then readFile() = 2 syscalls
   - Now readFile() runs first, then buffer.length is checked = 1 syscall
   - Files exceeding maxFileSize (default 10MB) are rare; the occasional
     oversized read is acceptable for halving syscall count on all files

Benchmark (15 iterations each, median, repomix on itself ~1000 files):
  Baseline:  2288ms (P25: 2234ms, P75: 2669ms)
  Optimized: 2137ms (P25: 2006ms, P75: 2284ms)
  Improvement: ~151ms (6.6%)

Isolated measurements:
  - searchFiles: 239ms → 200ms (-39ms, git ls-files fast path)
  - collectFiles: 256ms → 132ms (-124ms, stat elimination, 50 concurrency)

https://claude.ai/code/session_01BhyEyYSfaJev3zzMjKpj4x
…pipeline overlap

Move the metrics worker pool creation (gpt-tokenizer warmup) from after
searchFiles to the very start of the pack() pipeline. This allows the
expensive gpt-tokenizer module loading in worker threads (~215ms) to
overlap with the file search phase (~33ms), reducing the critical path
blocking time.

Previously, the warmup overlapped only with file collection and security
check stages (~150ms total), leaving ~65ms on the critical path. Now it
overlaps with file search + collection + security check (~185ms total),
reducing blocking to ~30ms.

Since the actual file count is unknown before searchFiles completes, a
heuristic estimate of 200 tasks is used for worker thread sizing. This
yields 2 workers on most machines, balancing warmup speed (less CPU
contention) with sufficient parallelism. For larger repos (>200 files),
2 workers still provide good throughput since metrics calculation runs
concurrently with output generation.

Benchmark results (20 runs each, packing 126 files from src/):
- Baseline trimmed mean: 457ms
- Optimized trimmed mean: 418ms
- Improvement: 39ms (8.5%)
- Baseline median: 457ms
- Optimized median: 413ms
- Improvement: 44ms (9.6%)

https://claude.ai/code/session_015V48t1u1jZjm6mUkpsV1tv
… loading with file I/O

Create security check worker threads eagerly (minThreads = maxThreads) before
file search begins, so @secretlint/core module loading (~150ms) runs in the
background during the I/O-heavy file search and collection phases.

Previously, security workers were created lazily on the first batch submission,
meaning the expensive @secretlint/core initialization blocked the first security
check batch. Now workers start loading immediately at pool creation and are ready
when security check begins.

Key changes:
- Add `eagerWorkers` option to WorkerOptions to set minThreads = maxThreads
- Add `createSecurityTaskRunner()` that creates pool with eager workers
- Pre-create security pool before file search in packager pipeline
- Pass pre-created task runner through validateFileSafety to runSecurityCheck
- Clean up security pool immediately after security check (before metrics)
- Security pool cleanup also in finally block for error paths

Benchmark (repomix on itself, ~1000 files, 10 runs after warmup, 4 CPU cores):
  Baseline:  Median 1158ms | P25 1136ms | P75 1187ms | Min 1120ms
  Optimized: Median 1103ms | P25 1091ms | P75 1105ms | Min 1063ms
  Improvement: ~5% median, ~7% P75, much tighter variance

The improvement comes from overlapping the ~150ms @secretlint/core module
loading in 4 worker threads with the ~100ms file search (globby I/O) and
~300ms file collection (disk reads), eliminating the module loading from
the critical path.

https://claude.ai/code/session_01JxK3KLc12sGQU7x1rxCcLJ
… init and parallelizing globby

Two optimizations to reduce the file search phase bottleneck:

1. Defer security worker pool creation from before search to after search.
   With both metrics (2 workers) and security (up to 4 workers) pools
   pre-initialized before search, up to 6 worker threads simultaneously
   load heavy modules (@secretlint/core, gpt-tokenizer), creating CPU/IO
   contention with globby's filesystem traversal. Moving security pool
   creation to after search lets @secretlint/core loading (~150ms) overlap
   with the lighter file collection phase instead. Metrics workers are
   still pre-initialized before search since they are needed throughout
   the pipeline.

2. Parallelize file globby and directory globby within searchFiles.
   When includeEmptyDirectories is enabled, the directory search (with
   full gitignore parsing) ran sequentially after file search. Starting
   both concurrently overlaps the two filesystem traversals.

Benchmark (15 iterations, in-process pack(), median):
- Baseline: 1152ms
- Optimized: 1080ms
- Improvement: 72ms (6.3%)

https://claude.ai/code/session_01Ga7a5Qg2mR3ZqFAGe4kk3Q
@yamadashy yamadashy force-pushed the perf/auto-perf-tuning-0403 branch from 149d995 to a74f424 on April 4, 2026 22:50
claude added 2 commits April 5, 2026 00:18
For single-part outputs larger than 500KB, estimate the total token count
by sampling 10 evenly spaced 100KB portions of the output and extrapolating
the chars-per-token ratio to the full content. This avoids running BPE
tokenization on the entire multi-MB output through worker threads.

The sampling approach achieves ~0.2% accuracy compared to full tokenization
because the chars-per-token ratio is stable across evenly distributed samples
that capture both markup overhead and file content.

This optimization targets the default config path (tokenCountTree disabled).
When tokenCountTree is enabled, the existing file-token-sum estimation
(added in a prior commit) is used instead. Split outputs still use full
tokenization per part.

Benchmark results (991 files, ~3.9MB output, 4-core machine, 15 runs):

  Before (full output tokenization):
    Trimmed mean: 1.417s

  After (output sampling estimation):
    Trimmed mean: 1.301s
    Improvement: 8.2% (116ms)

  Token count accuracy:
    Exact:     1,034,842 tokens
    Estimated: 1,036,915 tokens (0.2% error)

https://claude.ai/code/session_01EWxSA8Tdwvd2jJrMGsvVJU
…atch fast path

When git ls-files is available and include patterns are default (**/*),
bypass globby's filesystem traversal entirely by filtering git-tracked
files with picomatch. This eliminates ~70ms of directory walking and
pattern matching that globby performs even with gitignore disabled.

The fast path:
1. Uses git ls-files (already fetched) as the file source
2. Reads root .repomixignore and merges patterns with default ignores
3. Uses git ls-files -s to detect symlinks (mode 120000) in parallel
4. Filters with picomatch (same engine as fast-glob) for consistency
5. Falls back to globby when nested .repomixignore/.ignore files exist

Benchmark (repomix repo, ~1030 files, 7 runs, trimmed mean):
- CLI (full config): 1233ms → 1144ms (-89ms, -7.2%)
- pack() without emptyDirs: 770ms → 733ms (-37ms, -4.8%)
- pack() with emptyDirs: neutral (dir globby still runs for empty dir detection)

https://claude.ai/code/session_01WudWSWteuywL5ZfmPWFnXi
@yamadashy yamadashy force-pushed the perf/auto-perf-tuning-0403 branch from 1af91bf to 763fc00 on April 5, 2026 03:00
…tic pipeline

Restructure the pack() pipeline to start output generation and metrics
calculation immediately after file processing, without waiting for the
security check to complete. In the common case (>95% of runs), no
suspicious files are found and the optimistic output is correct. In the
rare case where suspicious files are detected, the output is regenerated
with filtered files.

Additionally, lazy-load globby and jschardet/iconv-lite to reduce module
import overhead on the critical startup path. Globby (~50ms) is only
needed when the git-only fast path falls back to filesystem traversal.
jschardet/iconv-lite (~25ms) are only needed for non-UTF-8 files (<1%).

Pipeline change (before):
  Search → Collect+Git → [Security + Process] → Output+Metrics

Pipeline change (after):
  Search → Collect+Git → Process → [Security ‖ Output+Metrics]

Benchmark results (20 runs each, repomix on itself with 1012 files):

Default config (tokenCountTree=false):
  Baseline: 1059ms avg → Optimized: 994ms avg (-6.1%)

Project config (tokenCountTree=50000, all features):
  Baseline: 1319ms avg → Optimized: 1286ms avg (-2.5%)

The default config improvement exceeds the 5% target. The project config
shows smaller relative gains because all-file tokenization dominates the
output+metrics phase, reducing the relative benefit of security overlap.

All 1101 tests pass. No functional changes — output is identical.

https://claude.ai/code/session_012ZhuvmD4C16mcvYy3YthxP
@yamadashy yamadashy force-pushed the perf/auto-perf-tuning-0403 branch from 763fc00 to dcc8452 on April 5, 2026 03:01
claude added 2 commits April 5, 2026 04:18
Skip two expensive full-content regex scans in createRenderContext when
their results are unused by the current output format:

- calculateFileLineCounts: scans all file contents to count newlines.
  Not referenced by any Handlebars template (XML, markdown, plain) or
  parsable output generator (XML, JSON). Only used by the skill path,
  so skip for regular output entirely.

- calculateMarkdownDelimiter: scans all file contents for backtick
  sequences. Only used by the markdown template and skill generators.
  Skip for XML, plain, JSON, and parsable-XML output styles.

Also optimized the implementations for when they do run:
- calculateMarkdownDelimiter: replaced flatMap + intermediate array
  allocation with single-pass inline max tracking.
- calculateFileLineCounts: replaced regex-based newline counting
  (which allocated arrays of all matches) with indexOf-based loop.

Benchmark (3000 files / 12MB synthetic TypeScript codebase):
- XML output (default): 1576ms → 1465ms (-111ms, ~7% improvement)
- Micro-benchmark (5000 files / 36MB): 60.6ms of render context
  overhead eliminated entirely for default XML output
- calculateFileLineCounts: 30ms → 13ms (2.4x faster via indexOf)
- calculateMarkdownDelimiter: 39ms → 30ms (1.3x faster, no alloc)
- Correctness verified: all 1096 tests pass

https://claude.ai/code/session_01RoH4sBaaDHvVnZLJzTYGP7