Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of DataPage statistics extraction using StringBuilder #11281

Closed
Tracked by #10922
alamb opened this issue Jul 5, 2024 · 6 comments · Fixed by #11319
Closed
Tracked by #10922

Improve performance of DataPage statistics extraction using StringBuilder #11281

alamb opened this issue Jul 5, 2024 · 6 comments · Fixed by #11319
Assignees

Comments

@alamb
Copy link
Contributor

alamb commented Jul 5, 2024

Originally posted by @efredine in #10922 (comment)

The StringBuilder pattern you suggested in #11136 (comment) does seem to materially improve performance:

Extract data page statistics for String/extract_statistics/String
                        time:   [15.368 µs 15.405 µs 15.446 µs]
                        change: [-68.672% -68.540% -68.409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

So seems like a worthwhile thing to go ahead with? I think there are several places where we can do something similar.

@alamb alamb changed the title Further to the performance discussion @alamb - the StringBuilder pattern you suggested in https://github.com/apache/datafusion/pull/11136#discussion_r1657725214 does seem to materially improve performance: Improve performance of DataPage statistics extraction using StringBuilder Jul 5, 2024
@Rachelint
Copy link
Contributor

take

@Rachelint
Copy link
Contributor

Rachelint commented Jul 7, 2024

Ok... after profiling and reading the codes again, I got it about the Strange results got from poc...

The from_iter way is hard to solve the lifetime problem, and to_string() is called to make it, that leads to the extra memory allocation.
Use the StringBuilder directly seems to be the good and easy way to improve performance.


Strange results got from my poc...
After @efredine fixed the filter_map bug in #11295 , we can use StringArray::from_iter to relace collect + StringArray::from.
And the StringArray::from_iter is actually impl like:

        let iter = iter.into_iter();
        let mut builder = GenericByteBuilder::with_capacity(iter.size_hint().0, 1024);
        builder.extend(iter);
        builder.finish()

It seems that In theory, from_iter is at least no worse than use the StringBuilder directly in #https://github.com/apache/datafusion/pull/11136#discussion_r1657725214...

However the bench results are:

  • use collect + from as origin:
Extract data page statistics for String/extract_statistics/String
                        time:   [71.441 µs 71.748 µs 72.038 µs]
                        change: [-0.3091% +0.1457% +0.6048%] (p = 0.54 > 0.05)
  • use from_iter:
Extract data page statistics for String/extract_statistics/String
                        time:   [41.303 µs 41.358 µs 41.415 µs]
  • use StringBuilder directly:
Extract data page statistics for String/extract_statistics/String
                        time:   [15.471 µs 15.489 µs 15.508 µs]

The difference I found now is that the StringBuilder::new use default 1024 as the capacity to init the buffer, but in from_iter side, use size_hint() instead... Maybe it is related to this?
Perhap some profile works needed...

@efredine
Copy link
Contributor

efredine commented Jul 7, 2024

Yes - the primary (initial) problem is that the collection needs to be built so that it owns the references to the items but we want to do that without creating any intermediate values.

I would also have naively expected that from_iter should perform comparably. I did notice that there are two implementations for GenericByteArray and I'm not clear which one would be chosen here:
https://github.com/apache/arrow-rs/blob/master/arrow-array/src/array/byte_array.rs#L534-L552

The choice depends on lifetimes, so perhaps the other one is being invoked and it's not pre-allocating capacity in as efficient a way?

In general, I think you're right that we should be able to eliminate all intermediate vectors.

@Rachelint
Copy link
Contributor

Rachelint commented Jul 7, 2024

Yes - the primary (initial) problem is that the collection needs to be built so that it owns the references to the items but we want to do that without creating any intermediate values.

I would also have naively expected that from_iter should perform comparably. I did notice that there are two implementations for GenericByteArray and I'm not clear which one would be chosen here: https://github.com/apache/arrow-rs/blob/master/arrow-array/src/array/byte_array.rs#L534-L552

The choice depends on lifetimes, so perhaps the other one is being invoked and it's not pre-allocating capacity in as efficient a way?

In general, I think you're right that we should be able to eliminate all intermediate vectors.

😂 Sorry, it seems my statement made some misunderstandings?
The reason why from_iter case slow seems due to the to_string() call after I profile it... and I did not notice such to_string call when I first started read the codes...

But still somethings confused me now... When using the builder directly, looping in flatten way is obviously slower than looping it in nested way...

Extract data page statistics for String/extract_statistics/String
                        time:   [43.454 µs 43.463 µs 43.474 µs]
Extract data page statistics for String/extract_statistics/String
                        time:   [33.108 µs 33.119 µs 33.132 µs]

I can understand some costs exist when using flatten, but it may should not be so obvious?

@alamb
Copy link
Contributor Author

alamb commented Jul 7, 2024

But still somethings confused me now... When using the builder directly, looping in flatten way is obviously slower than looping it in nested way...

Maybe the compiler is not smart enough to avoid bounds checks or something when using flatten 🤔

@Rachelint
Copy link
Contributor

But still somethings confused me now... When using the builder directly, looping in flatten way is obviously slower than looping it in nested way...

Maybe the compiler is not smart enough to avoid bounds checks or something when using flatten 🤔

Maybe it is actually due to compiler...
I found the same thing in another bench about Boolean...
See https://github.com/Rachelint/arrow-datafusion/blob/82490204e6afecc3f9e15c2ce5d35d7e407f422f/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L752-L777

  • use flatten
Extract data page statistics for Boolean/extract_statistics/Boolean
                        time:   [11.990 µs 11.993 µs 11.997 µs]
                        change: [+18.603% +18.783% +18.917%] (p = 0.00 < 0.05)
  • use nested
Extract data page statistics for Boolean/extract_statistics/Boolean
                        time:   [10.143 µs 10.146 µs 10.149 µs]
                        change: [-15.525% -15.464% -15.403%] (p = 0.00 < 0.05)

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants