feat: scalar regex match physical expr #12270


zhuliquan
Contributor

Which issue does this PR close?

Closes #11146.

Rationale for this change

This PR is the successor of PR #11455.

BinaryExpr compiles a literal regex pattern every time it evaluates a RecordBatch, and compiling a regex pattern can be expensive. In this approach, a literal regex pattern is compiled once and cached so it can be reused during execution. This avoids the repeated compile cost on each evaluation and speeds up execution.
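The difference between the two approaches can be sketched with a toy example. This is only an illustration under stated assumptions, not the code in this PR: `ToyPattern`, `eval_recompiling`, and `CachedMatchExpr` are hypothetical names, and a plain substring check stands in for a real regex engine so the sketch needs nothing beyond the standard library. A counter records how often the pattern is "compiled".

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a compiled regex. A real implementation would
/// hold a compiled regular expression; here a substring check is used instead.
struct ToyPattern {
    needle: String,
}

impl ToyPattern {
    /// "Compiles" the pattern and bumps `compiles` so calls can be observed.
    fn compile(pattern: &str, compiles: &Cell<usize>) -> Self {
        compiles.set(compiles.get() + 1);
        ToyPattern { needle: pattern.to_string() }
    }

    fn is_match(&self, value: &str) -> bool {
        value.contains(&self.needle)
    }
}

/// Per-batch approach: the pattern is compiled again on every evaluation.
fn eval_recompiling(pattern: &str, batch: &[&str], compiles: &Cell<usize>) -> Vec<bool> {
    let compiled = ToyPattern::compile(pattern, compiles);
    batch.iter().map(|v| compiled.is_match(v)).collect()
}

/// Cached approach: the expression owns the compiled pattern, so it is
/// compiled once at construction and reused for every batch.
struct CachedMatchExpr {
    compiled: ToyPattern,
}

impl CachedMatchExpr {
    fn new(pattern: &str, compiles: &Cell<usize>) -> Self {
        CachedMatchExpr { compiled: ToyPattern::compile(pattern, compiles) }
    }

    fn evaluate(&self, batch: &[&str]) -> Vec<bool> {
        batch.iter().map(|v| self.compiled.is_match(v)).collect()
    }
}
```

With three batches, `eval_recompiling` bumps the counter three times, while a single `CachedMatchExpr` bumps it once no matter how many batches it evaluates.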

What changes are included in this PR?

  1. Introduce a new physical expr, ScalarRegexMatchExpr, to handle regex matching against a literal regex pattern.
  2. Add a PhysicalScalarRegexMatchExprNode message to the proto definitions to represent ScalarRegexMatchExpr, and add the corresponding arms in parse_physical_expr and serialize_physical_expr.
  3. Change the BinaryExpr arm in create_physical_expr so that it creates a ScalarRegexMatchExpr instead of a BinaryExpr when the right-hand side is a string literal expr and the op is RegexMatch | RegexIMatch | RegexNotMatch | RegexNotIMatch.
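The dispatch rule in item 3 can be sketched roughly as follows. This is a simplified illustration, not the actual `create_physical_expr` code: the `Expr`, `Operator`, and `PhysicalPlan` enums and the `plan_binary` function are hypothetical stand-ins, and the mapping of the four operators to `negated` / `case_insensitive` flags is an assumption based on the operator names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operator {
    RegexMatch,
    RegexIMatch,
    RegexNotMatch,
    RegexNotIMatch,
    Eq, // any non-regex operator
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    StringLiteral(String),
}

/// Hypothetical stand-in for the physical expression chosen by the planner.
#[derive(Debug, PartialEq)]
enum PhysicalPlan {
    ScalarRegexMatch { input: Expr, pattern: String, negated: bool, case_insensitive: bool },
    Binary { left: Expr, op: Operator, right: Expr },
}

fn plan_binary(left: Expr, op: Operator, right: Expr) -> PhysicalPlan {
    // Map each regex operator to (negated, case_insensitive); None otherwise.
    let flags = match op {
        Operator::RegexMatch => Some((false, false)),
        Operator::RegexIMatch => Some((false, true)),
        Operator::RegexNotMatch => Some((true, false)),
        Operator::RegexNotIMatch => Some((true, true)),
        _ => None,
    };
    match (flags, right) {
        // Regex operator with a string literal on the right: specialized expr.
        (Some((negated, case_insensitive)), Expr::StringLiteral(pattern)) => {
            PhysicalPlan::ScalarRegexMatch { input: left, pattern, negated, case_insensitive }
        }
        // Everything else keeps the generic binary expression.
        (_, right) => PhysicalPlan::Binary { left, op, right },
    }
}
```

A regex operator whose right-hand side is a column rather than a literal still falls through to the generic binary path, since there is no single pattern to precompile in that case.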

Are these changes tested?

Yes, see the test module in scalar_regex_match.rs.

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate proto Related to proto crate labels Aug 31, 2024
@alamb
Contributor

alamb commented Sep 7, 2024

Thank you for this PR @zhuliquan. Have you run any benchmarks that show this approach is noticeably faster than the existing approach? It makes sense that it would be faster, as it does not re-compile the regular expression for each batch, but I think it would help to quantify this difference.

Successfully merging this pull request may close these issues.

Is pre-compile pattern string in regexp_match operation