Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization: make the most of Hint::AcceptsSingular when call make_scalar_function to Improve performance #10054

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

JasonLi-cn
Copy link
Contributor

Which issue does this PR close?

Closes #10053

Rationale for this change

Benchmark

overlay function

Gnuplot not found, using plotters backend
4args_with_3scalars/overlay/1024
                        time:   [36.540 µs 36.560 µs 36.587 µs]
                        change: [-12.476% -12.385% -12.284%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

4args_with_3scalars/overlay/4096
                        time:   [141.54 µs 141.65 µs 141.82 µs]
                        change: [-12.584% -12.425% -12.231%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

4args_with_3scalars/overlay/8192
                        time:   [282.77 µs 283.42 µs 284.20 µs]
                        change: [-13.658% -13.353% -12.997%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

4args_without_scalar/overlay/1024
                        time:   [37.682 µs 37.717 µs 37.757 µs]
                        change: [+1.8225% +1.9984% +2.1754%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

4args_without_scalar/overlay/4096
                        time:   [147.68 µs 147.77 µs 147.91 µs]
                        change: [+2.7595% +3.0052% +3.2531%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  4 (4.00%) high mild
  16 (16.00%) high severe

4args_without_scalar/overlay/8192
                        time:   [298.84 µs 299.63 µs 300.27 µs]
                        change: [+1.6336% +2.0972% +2.5438%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Discussion🤔

According to the benchmark, there can be about a 12% performance improvement 🚀 when there are 3 scalar args; however, when all args are arrays, there is a 2% ~ 3% performance loss 😔. Does anyone have any ideas on how to avoid this loss?

What changes are included in this PR?

Currently, this approach has only been validated based on the overlay function. If everyone thinks it's appropriate, it can be applied to other functions as well.

Are these changes tested?

Are there any user-facing changes?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JasonLi-cn -- this looks neat.

In terms of performance, I left a sugestion about how to improve this specific PR -- let me know what you think

@@ -88,10 +105,13 @@ pub fn overlay<T: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> {
let characters_array = as_generic_string_array::<T>(&args[1])?;
let pos_num = as_int64_array(&args[2])?;

let characters_array_iter = adaptive_array_iter(characters_array.iter());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the benchmark, there can be about a 12% performance improvement 🚀 when there are 3 scalar args; however, when all args are arrays, there is a 2% ~ 3% performance loss 😔. Does anyone have any ideas on how to avoid this loss?

One reason it may be getting slower is that there is overhead now during iteration (where it has to switch on Left/Right in each element

One thing to try would be to avoid the Left/Right iterator by instantiating different loops depending on the input types

So something like

fn<CI, PI> overlay3(string_array: &StringArray, characters_array_iter: CI, pos_num_iter: PI) -> Result<ArrayRef> 
where
  CI: Iterator<Item = char>,
  PI: Iterator<Item =usize>
{
    let result = string_array
                .iter()
                .zip(characters_array_iter)
                .zip(pos_num_iter)
                .map(|((string, characters), start_pos)| {
...
}

And then you need to instantiate versions based on lengths:

match (character_array.len(), pos_num.len()) { 
  (1, 1) => {
        let c_value = character_array.next().expect("Contains a value");
        let p_value = pos_num.next().expect("Contains a value");
        overlay3(character_array, std::iter::repeat(c_value, len), std;:iter::repeat(p_value, len))
  (1, _) => {
        let c_value = character_array.next().expect("Contains a value");
        overlay3(character_array, std::iter::repeat(c_value, len), pos_num.iter())
  ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb for your suggestion. However, if we were to make changes according to your suggestion, many functions would require significant modifications, such as lpad, rpad, strpos, and so on.
Here, I am seeking a universal and elegant method: based on the current function implementations, with only minimal modifications, we can optimize for the Scalar case without degrading the performance for Arrays.
(Maybe my thinking is not correct, and I need your help to provide suggestions for modification 🙏)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb for your suggestion. However, if we were to make changes according to your suggestion, many functions would require significant modifications, such as lpad, rpad, strpos, and so on.

Yes, I agree with this assesment

Here, I am seeking a universal and elegant method: based on the current function implementations, with only minimal modifications, we can optimize for the Scalar case without degrading the performance for Arrays.

Thank you -- this makes sense.

I think the only way to do this is to have compile time specialized implementations for the cases you want to optimize.

(Maybe my thinking is not correct, and I need your help to provide suggestions for modification 🙏)

I suspect the solution will look something like a scalar (rust) function that actually implements the the operation and then a macro / generic that instantiates the function in different ways depending on argument type

For example

/// do the actual operation, calling `output` for each string produced
fn overlay3<F: Fn(&str)> (output: F, input: &str, len: usize, pos: usize)  {...}

And then make a macro or other templated functin that instantiates overlay3 in different loops depending on the arguments (columns, scalars, etc).

You should be able to avoid code duplication in the source code, though we will need a copy a runtime

Hoepfully that helps

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Referring to the suggestions you provided, I attempted to make modifications, and according to the benchmark results, there is no significant change before and after optimization when the arg is an Array.
However, I think the current implementation of the macro is not very good. Please help provide better suggestions for modification.

New benchmark

Gnuplot not found, using plotters backend
4args_with_3scalars/overlay/1024
                        time:   [36.730 µs 36.969 µs 37.206 µs]
                        change: [-12.359% -11.851% -11.372%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4args_with_3scalars/overlay/4096
                        time:   [139.93 µs 140.35 µs 140.81 µs]
                        change: [-13.155% -12.784% -12.412%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

4args_with_3scalars/overlay/8192
                        time:   [284.61 µs 285.42 µs 286.43 µs]
                        change: [-12.164% -11.692% -11.219%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4args_without_scalar/overlay/1024
                        time:   [36.599 µs 36.683 µs 36.765 µs]
                        change: [-0.1637% +0.1015% +0.3769%] (p = 0.48 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4args_without_scalar/overlay/4096
                        time:   [142.05 µs 142.41 µs 142.84 µs]
                        change: [-1.5247% -1.2324% -0.9312%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4args_without_scalar/overlay/8192
                        time:   [291.79 µs 292.42 µs 293.19 µs]
                        change: [-2.9181% -1.9757% -1.1922%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @JasonLi-cn -- I will try and take another look later this week

.zip(characters_array.iter())
.zip(pos_num.iter())
.zip(characters_array_iter)
.zip(pos_num_iter)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While looking at the innter loop of this function, you can probably get a major boost by avoiding String::with_capacoty and instead reuse the same String on each element. It might also be possible to skip the copy by using StringBuilder directly rather than copying to a temp string

@alamb
Copy link
Contributor

alamb commented May 15, 2024

Marking as a draft to make it clear this PR isn't waiting on feedback anymore (I am trying to reduce the outstanding PR review queue)

@JasonLi-cn
Copy link
Contributor Author

Marking as a draft to make it clear this PR isn't waiting on feedback anymore (I am trying to reduce the outstanding PR review queue)

Is this PR not satisfactory enough for performance? Do I need to make further improvements to this PR? 🤔

@alamb
Copy link
Contributor

alamb commented May 16, 2024

Sorry @JasonLi-cn -- marking as ready for review so we can give it another look

@alamb alamb marked this pull request as ready for review May 16, 2024 18:32
@alamb
Copy link
Contributor

alamb commented May 16, 2024

@JasonLi-cn
Copy link
Contributor Author

It seems like this PR is failing a CI check: https://github.com/apache/datafusion/actions/runs/8771259487/job/24068788239?pr=10054

I originally intended to modify the Cargo.lock after it was approved. 😅

@JasonLi-cn
Copy link
Contributor Author

Sorry @JasonLi-cn -- marking as ready for review so we can give it another look

It doesn't matter. For this PR, I personally think it is valuable, but in terms of code implementation, I am not sure if there is a better solution, which needs the help of friends in the community. 🙏

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jul 17, 2024
@JasonLi-cn JasonLi-cn marked this pull request as draft July 17, 2024 02:13
@github-actions github-actions bot removed the Stale PR has not had any activity for some time label Jul 18, 2024
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimization: make the most of Hint::AcceptsSingular when call make_scalar_function to Improve performance
2 participants