Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is pre-compile pattern string in regexp_match operation #11146

Open
zhuliquan opened this issue Jun 27, 2024 · 3 comments · May be fixed by #12270
Open

Is pre-compile pattern string in regexp_match operation #11146

zhuliquan opened this issue Jun 27, 2024 · 3 comments · May be fixed by #12270
Labels
enhancement New feature or request

Comments

@zhuliquan
Copy link
Contributor

Is your feature request related to a problem or challenge?

I noticed below code:

RegexMatch => binary_string_array_flag_op_scalar!(
array,
scalar,
regexp_is_match,
false,
false
),
RegexIMatch => binary_string_array_flag_op_scalar!(
array,
scalar,
regexp_is_match,
false,
true
),
RegexNotMatch => binary_string_array_flag_op_scalar!(
array,
scalar,
regexp_is_match,
true,
false
),
RegexNotIMatch => binary_string_array_flag_op_scalar!(
array,
scalar,
regexp_is_match,
true,
true
),

This looks like every time record_batch is evaluated, it will execute the compiled pattern string and use the compiled results to match arrow-array

Describe the solution you'd like

when building binary physical expr , we can pre-compile pattern string if op is regex_match

Describe alternatives you've considered

No response

Additional context

No response

@zhuliquan zhuliquan added the enhancement New feature or request label Jun 27, 2024
@alamb
Copy link
Contributor

alamb commented Jun 27, 2024

Hi @zhuliquan -- I believe you are correct that the regexp is re-compiled for each batch. This is mentioned as one of the use cases for #8051

I think it would be a nice improvement to look into pre-compiling the regexp somehow, though last time I checked the only benchmark we have (one of the clickbench queries) compiling the regexp was not a significant consumer of time

@zhuliquan
Copy link
Contributor Author

Hi @zhuliquan -- I believe you are correct that the regexp is re-compiled for each batch. This is mentioned as one of the use cases for #8051

I think it would be a nice improvement to look into pre-compiling the regexp somehow, though last time I checked the only benchmark we have (one of the clickbench queries) compiling the regexp was not a significant consumer of time

I think should pre-compiled when create binary-physical-expr, instead of evaluating record_batch.

pub fn new(
left: Arc<dyn PhysicalExpr>,
op: Operator,
right: Arc<dyn PhysicalExpr>,
) -> Self {
Self { left, op, right }
}

@alamb
Copy link
Contributor

alamb commented Jun 28, 2024

I think should pre-compiled when create binary-physical-expr, instead of evaluating record_batch.

I think this makes sense to me. Thank you

One thing to look into might be to replace the regexp match operator with a function (and have the function's implementation store the precompiled regexp)

One tricky part might be serialization (in datafusion-proto). We can probably handle it via an extension codec or something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants