Pass scalar to eq inside nullif #11697

Merged
merged 5 commits into from
Aug 5, 2024
@simonvandel simonvandel commented Jul 28, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

eq used inside the nullif has a performance specialization for scalar, but it was not used, as we never passed a Scalar into it.
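To illustrate the rationale, here is a minimal std-only sketch of an equality kernel with a scalar fast path. It is a simplified model, not the arrow-rs implementation: `Operand`, `cmp`, and this `eq` are hypothetical names, and the values are restricted to `Option<i32>` for brevity. The point is that the fast path is only reached when the caller passes the constant as a scalar; a constant broadcast into an array always takes the general path.

```rust
/// Simplified stand-in for an operand that is either a full column or a
/// single constant. Illustrative only; these are not the arrow-rs types.
#[derive(Debug, Clone, PartialEq)]
enum Operand {
    Array(Vec<Option<i32>>),
    Scalar(Option<i32>),
}

/// Compare two nullable values; the result is null if either side is null.
fn cmp(a: &Option<i32>, b: &Option<i32>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None,
    }
}

/// Element-wise equality over the simplified operands.
fn eq(lhs: &Operand, rhs: &Operand) -> Result<Vec<Option<bool>>, String> {
    match (lhs, rhs) {
        // Scalar fast path: the constant is read once per element and no
        // broadcast buffer is allocated. Equality is symmetric, so one arm
        // covers both operand orders.
        (Operand::Scalar(s), Operand::Array(a))
        | (Operand::Array(a), Operand::Scalar(s)) => {
            Ok(a.iter().map(|v| cmp(s, v)).collect())
        }
        // General path: both sides are arrays and must have equal length.
        (Operand::Array(a), Operand::Array(b)) => {
            if a.len() != b.len() {
                return Err(format!("length mismatch: {} vs {}", a.len(), b.len()));
            }
            Ok(a.iter().zip(b).map(|(x, y)| cmp(x, y)).collect())
        }
        // Two constants produce a single result.
        (Operand::Scalar(a), Operand::Scalar(b)) => Ok(vec![cmp(a, b)]),
    }
}
```

Calling `eq(&Operand::Scalar(Some(2)), &rhs)` and `eq(&Operand::Array(vec![Some(2); n]), &rhs)` gives the same result in this model; only the first call reaches the specialized arm.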

What changes are included in this PR?

  • Add benchmark for nullif
  • Convert ScalarValue to Scalar, instead of Array of size 1
nullif scalar array: 1024
                        time:   [5.3127 µs 5.3305 µs 5.3526 µs]
                        change: [-1.2396% -0.9171% -0.5696%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) high mild
  4 (4.00%) high severe

nullif scalar array: 4096
                        time:   [16.483 µs 16.577 µs 16.675 µs]
                        change: [-4.6029% -3.9615% -3.1831%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

nullif scalar array: 8192
                        time:   [31.442 µs 31.527 µs 31.620 µs]
                        change: [-4.6155% -4.3827% -4.1375%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Are these changes tested?

Are there any user-facing changes?

@simonvandel simonvandel changed the title Properly specialize nullif for scalar (4x faster) Properly specialize nullif for scalar Jul 29, 2024
@simonvandel simonvandel changed the title Properly specialize nullif for scalar Pass scalar to eq inside nullif Jul 29, 2024
@simonvandel
Contributor Author

In the first commits of this PR, I saw a 4x perf increase, but CI pointed out that a test failed. It turned out the benchmark returned an Error, which obviously was faster than doing any computation.

The error was arrow-select::nullif not being able to handle a scalar as its first argument. After fixing the error by passing in an array of the same size as rhs, the performance increase is much smaller (4%).

So in the end this PR might not give any benefit - feel free to close it if you want.
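The pitfall described above can be sketched in a few lines. This is a hypothetical illustration, not the benchmark from this PR: `kernel`, `bench_body_unchecked`, and `bench_body_checked` are made-up names, and the kernel simply rejects mismatched lengths to stand in for the scalar-first-argument error. A benchmark body that discards the `Result` will time the early error return as if it were real work, while one that unwraps fails loudly instead.

```rust
use std::hint::black_box;

/// Hypothetical kernel under test. It returns an error when the two inputs
/// differ in length, standing in for a kernel that rejects a scalar input.
fn kernel(lhs: &[i32], rhs: &[i32]) -> Result<Vec<bool>, String> {
    if lhs.len() != rhs.len() {
        return Err(format!("length mismatch: {} vs {}", lhs.len(), rhs.len()));
    }
    Ok(lhs.iter().zip(rhs).map(|(a, b)| a == b).collect())
}

/// Benchmark body that ignores the Result. If `kernel` errors, this returns
/// almost immediately and the measured time looks like a large speedup.
fn bench_body_unchecked(lhs: &[i32], rhs: &[i32]) {
    let _ = black_box(kernel(black_box(lhs), black_box(rhs)));
}

/// Benchmark body that unwraps the Result. An error panics, so the
/// benchmark cannot silently report the cost of the error path.
fn bench_body_checked(lhs: &[i32], rhs: &[i32]) -> usize {
    let out = kernel(black_box(lhs), black_box(rhs)).expect("kernel must succeed");
    black_box(out).len()
}
```

With a length-1 left-hand side, `bench_body_unchecked` completes without complaint even though no comparison ran, which is the kind of result that can masquerade as a 4x improvement.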

Contributor

@comphead comphead left a comment

Thanks @simonvandel, but if arrow-rs doesn't support this, shouldn't we have a negative test case that fails?

- let lhs = lhs.to_array_of_size(rhs.len())?;
- let array = nullif(&lhs, &eq(&lhs, &rhs)?)?;
+ let lhs_s = lhs.to_scalar()?;
+ let lhs_a = lhs.to_array_of_size(rhs.len())?;
Contributor

Maybe we could update the arrow-rs nullif kernel to have a special case for a constant (arrow-rs calls this idea "Datum" rather than ScalarValue):

https://docs.rs/arrow/latest/arrow/array/trait.Datum.html

But you can get the Datum like this: https://docs.rs/datafusion/latest/datafusion/common/enum.ScalarValue.html#method.to_scalar
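As a rough sketch of what such a special case could look like, the following std-only model defines a datum-like trait and a `nullif` that handles a constant left-hand side without materializing an array. Everything here is an assumption for illustration: `DatumLike`, `Column`, `Constant`, and this `nullif` are hypothetical, values are limited to `Option<i32>`, and the mask is a plain `&[bool]` rather than a nullable boolean array. It is not the arrow-rs kernel or its `Datum` trait.

```rust
/// Illustrative analogue of a "datum": exposes its values plus a flag that
/// says whether they represent a single constant. Hypothetical trait.
trait DatumLike {
    /// Returns the backing values and `true` when they hold one scalar.
    fn get(&self) -> (&[Option<i32>], bool);
}

/// A full column of nullable values.
struct Column(Vec<Option<i32>>);

/// A single constant, stored as a one-element array.
struct Constant([Option<i32>; 1]);

impl DatumLike for Column {
    fn get(&self) -> (&[Option<i32>], bool) {
        (&self.0[..], false)
    }
}

impl DatumLike for Constant {
    fn get(&self) -> (&[Option<i32>], bool) {
        (&self.0[..], true)
    }
}

/// nullif: keep the left value, but emit null wherever the mask is true.
fn nullif(lhs: &dyn DatumLike, mask: &[bool]) -> Result<Vec<Option<i32>>, String> {
    let (values, is_scalar) = lhs.get();
    if is_scalar {
        // Constant special case: read the value once, sized by the mask,
        // so the caller never has to broadcast the constant.
        let v = values[0];
        return Ok(mask.iter().map(|&m| if m { None } else { v }).collect());
    }
    // General case: the column and the mask must have the same length.
    if values.len() != mask.len() {
        return Err(format!("length mismatch: {} vs {}", values.len(), mask.len()));
    }
    Ok(values
        .iter()
        .zip(mask)
        .map(|(&v, &m)| if m { None } else { v })
        .collect())
}
```

In this model, `nullif(&Constant([Some(7)]), &mask)` yields one output slot per mask entry, which is the behavior a caller would otherwise get only by first expanding the constant to an array of the mask's length.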

Contributor Author

Specializing arrow-rs's nullif kernel for a constant would be the absolute best. But if others can reproduce the small % perf increase of this PR, then perhaps it can be merged in isolation?

Contributor

I think this is a good improvement regardless as it makes the DataFusion code follow the pattern so it can take advantage of the Datum special case if/when it is implemented.

Any chance you have a moment to file a ticket upstream in arrow-rs? If not, I will do so

@alamb
Contributor

alamb commented Jul 30, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@alamb alamb marked this pull request as draft July 30, 2024 19:54
Co-authored-by: Oleks V <[email protected]>
@simonvandel simonvandel marked this pull request as ready for review August 3, 2024 18:15
Contributor

@alamb alamb left a comment

Thank you very much @simonvandel and @comphead -- I think this is an improvement even if there are no measurable performance gains yet.

@simonvandel
Contributor Author

Any chance you have a moment to file a ticket upstream in arrow-rs? If not, I will do so

I created apache/arrow-rs#6193

@alamb alamb merged commit c6f0d3c into apache:main Aug 5, 2024
26 checks passed
@alamb
Contributor

alamb commented Aug 5, 2024

Thanks again @simonvandel and @comphead

This pull request was closed.