Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: precompile literal regex pattern #11455

Closed

Conversation

zhuliquan
Copy link
Contributor

it's help for saving compile time for each evaluating of record batch

Which issue does this PR close?

#11146

Closes #.

Rationale for this change

it's help for saving compile time for each evaluating of record batch

What changes are included in this PR?

1、precompile literal regex pattern when build binary expr
2、using precompiled regex for match when evaluating.

Are these changes tested?

yes

Are there any user-facing changes?

no

@github-actions github-actions bot added the physical-expr Physical Expressions label Jul 13, 2024
Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution 👍🏼

Before regex is compiled once a batch (default 8k rows), this PR makes it once per execution.
I'm not familiar with the overhead of regular expression, if there is some case compiling a complicated regex is way more expensive than evaluating it, this PR would be necessary.

I left some minor suggestions.

datafusion/physical-expr/src/expressions/binary.rs Outdated Show resolved Hide resolved
let right = lit($B_SCALAR);

// verify that we can construct the expression
let expression = binary(left, $OP, right, &schema).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do an extra check: the regex is compiled inside the expression.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @zhuliquan and @2010YOUY01 for the review

I am concerned about special casing BinaryExpr in this way. I left some thoughts. Let me know what you think

pub struct BinaryExpr {
left: Arc<dyn PhysicalExpr>,
op: Operator,
right: Arc<dyn PhysicalExpr>,
/// Specifies whether an error is returned on overflow or not
fail_on_overflow: bool,
/// Only used when evaluating literal regex expressions. Example regex expression: c1 ~ '^a'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see why you proceeded down this route. However, I think it is somewhat strange to have something so regexp specific attached to BinaryExpr

I wonder if it is possible to convert regex expressions like ~ to function calls and then use the existing function call API https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.simplify to rewrite to a new regular expression if the argument is a constant

Here is the regexp_like function: https://github.com/apache/datafusion/blob/main/datafusion/functions/src/regex/regexplike.rs

Though now that I look at this I wonder why there are separate implementations for ~ and regexp_like 🤔

@zhuliquan
Copy link
Contributor Author

@alamb I thought a idea that we can attach lru-cache (key is regex, value is result of compiled regex) to struct RegexpLikeFunc.

#[derive(Debug)]
pub struct RegexpLikeFunc {
    lru_cache: LruCache<String, regex::Regex>,
    signature: Signature,
}
impl Default for RegexpLikeFunc {
    fn default() -> Self {
        Self::new()
    }
}

impl RegexpLikeFunc {
    pub fn new() -> Self {
        use DataType::*;
        Self {
            signature: Signature::one_of(
                vec![
                    Exact(vec![Utf8, Utf8]),
                    Exact(vec![LargeUtf8, Utf8]),
                    Exact(vec![Utf8, Utf8, Utf8]),
                    Exact(vec![LargeUtf8, Utf8, Utf8]),
                ],
                Volatility::Immutable,
            ),
            lru_cache: LruCache::new(NonZeroUsize::new(1024).unwrap()),
        }
    }
}

We can use cache like way in regexp_is_match_utf8.
https://github.com/apache/arrow-rs/blob/af40ea382275dba967bfabc1632fded07d2129b9/arrow-string/src/regexp.rs#L50
https://github.com/apache/arrow-rs/blob/af40ea382275dba967bfabc1632fded07d2129b9/arrow-string/src/regexp.rs#L83-L95

it's make full use of result of compiled regex. and this way can be applied for scalar or array two cases.

@zhuliquan zhuliquan force-pushed the feature-precompile_regex_pattern branch from de5735b to a0d3598 Compare July 25, 2024 16:27
pub struct BinaryExpr {
left: Arc<dyn PhysicalExpr>,
op: Operator,
right: Arc<dyn PhysicalExpr>,
/// Specifies whether an error is returned on overflow or not
fail_on_overflow: bool,
/// Only used when evaluating literal regex expressions. Example regex expression: c1 ~ '^a'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a bit of a leaky abstraction or hack just for this one special case (= partial evaluation of regexes). If wonder if we could do something more elegant that would also work for other similar operators:

  1. introduce a new struct CompiledRegex that holds the complied regex.
  2. implement PhysicalExpr for CompiledRegex
  3. change the regex evaluation of the binary expression to check if the RHS (= BinaryExpr::right is CompiledExpr and use this pre-compiled value
  4. change the optimizer pass to just replace the RHS instead attaching the cached data to BinaryExpr

Another approach would be to lower/pre-compile the regex expressions to a custom data type, but currently arrow-rs doesn't support that, see apache/arrow-rs#4472

@alamb alamb marked this pull request as draft August 14, 2024 21:36
@alamb
Copy link
Contributor

alamb commented Aug 14, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@github-actions github-actions bot added documentation Improvements or additions to documentation development-process Related to development process of DataFusion core Core DataFusion crate proto Related to proto crate labels Aug 30, 2024
@zhuliquan zhuliquan closed this Aug 31, 2024
@zhuliquan zhuliquan force-pushed the feature-precompile_regex_pattern branch from cb2a242 to 7db4213 Compare August 31, 2024 03:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate development-process Related to development process of DataFusion documentation Improvements or additions to documentation physical-expr Physical Expressions proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants