Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Jan 6, 2026

Rationale for this change

String operations with regex patterns (match, replace, extract) were recompiling regex patterns on every invocation. This PR implements caching to compile once and reuse.

Benchmark shows roughly 36% performance improvement (2.52s -> 1.61s for 200 operations).

// TODO Cache matcher across invocations (for regex compilation)

// TODO Cache matcher across invocations (for regex compilation)

// TODO Cache replacer across invocations (for regex compilation)

// TODO cache this once per ExtractRegexOptions

What changes are included in this PR?

  • Added CachedOptionsWrapper<T> template for kernel state with caching support
  • Updated MatchSubstringState, ReplaceState, and ExtractRegexState to use caching
  • Modified Exec() methods to call GetOrCreate<Matcher>() instead of direct Matcher::Make()

Are these changes tested?

Yes. All existing tests pass. Benchmark demonstrates measurable performance improvement when same pattern is used across multiple operations.

(Generated by ChatGPT)

Benchmark:

# Benchmark script: Compare WITH vs WITHOUT caching

# Step 1: Measure WITH caching (current implementation)
cd /.../arrow/cpp/build
/usr/bin/time -p ./debug/arrow-compute-scalar-type-test \
  --gtest_filter="TestStringKernels/0.MatchSubstringRegex" \
  --gtest_repeat=200 \
  --gtest_brief=1

# Step 2: Temporarily remove caching
cd /.../arrow/cpp
git stash push -m "Temp for benchmark" src/arrow/compute/kernels/scalar_string_ascii.cc

# Step 3: Rebuild WITHOUT caching
cd build
touch ../src/arrow/compute/kernels/scalar_string_ascii.cc
cmake --build .

# Step 4: Measure WITHOUT caching (reverted to old TODO code)
/usr/bin/time -p ./debug/arrow-compute-scalar-type-test \
  --gtest_filter="TestStringKernels/0.MatchSubstringRegex" \
  --gtest_repeat=200 \
  --gtest_brief=1

# Step 5: Restore caching
cd ..
git stash pop
cd build && touch ../src/arrow/compute/kernels/scalar_string_ascii.cc
cmake --build .

Results:

╔════════════════════════════════════════════════════════╗
║              BENCHMARK RESULTS                         ║
╠════════════════════════════════════════════════════════╣
║  WITHOUT Caching:      2.52 seconds                    ║
║  WITH Caching:         1.61 seconds                    ║
║  ─────────────────────────────────────                 ║
║  Time Saved:           0.91 seconds                    ║
║  Improvement:          36.1% FASTER                    ║
╚════════════════════════════════════════════════════════╝

Test Configuration:
  • Test: TestStringKernels/0.MatchSubstringRegex
  • Iterations: 200 repetitions
  • Pattern: Complex regex with groups/alternation
  • Per-operation: 12.6ms → 8.05ms (4.5ms saved)

Are there any user-facing changes?

No, this is an optiomization.

@github-actions
Copy link

github-actions bot commented Jan 6, 2026

⚠️ GitHub issue #48728 has been automatically assigned in GitHub to PR creator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant