Improve performance of DataPage statistics extraction using StringBuilder #11281

alamb · 2024-07-05T11:07:40Z

Originally posted by @efredine in #10922 (comment)

The StringBuilder pattern you suggested in #11136 (comment) does seem to materially improve performance:

Extract data page statistics for String/extract_statistics/String
                        time:   [15.368 µs 15.405 µs 15.446 µs]
                        change: [-68.672% -68.540% -68.409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

So seems like a worthwhile thing to go ahead with? I think there are several places where we can do something similar.

The text was updated successfully, but these errors were encountered:

Rachelint · 2024-07-05T13:29:49Z

take

Rachelint · 2024-07-07T04:37:52Z

Ok... after profiling and reading the codes again, I got it about the Strange results got from poc...

The from_iter way is hard to solve the lifetime problem, and to_string() is called to make it, that leads to the extra memory allocation.
Use the StringBuilder directly seems to be the good and easy way to improve performance.

Strange results got from my poc...
After @efredine fixed the filter_map bug in #11295 , we can use StringArray::from_iter to relace collect + StringArray::from.
And the StringArray::from_iter is actually impl like:

        let iter = iter.into_iter();
        let mut builder = GenericByteBuilder::with_capacity(iter.size_hint().0, 1024);
        builder.extend(iter);
        builder.finish()

It seems that In theory, from_iter is at least no worse than use the StringBuilder directly in #https://github.com/apache/datafusion/pull/11136#discussion_r1657725214...

However the bench results are:

use collect + from as origin:

Extract data page statistics for String/extract_statistics/String
                        time:   [71.441 µs 71.748 µs 72.038 µs]
                        change: [-0.3091% +0.1457% +0.6048%] (p = 0.54 > 0.05)

use from_iter:

Extract data page statistics for String/extract_statistics/String
                        time:   [41.303 µs 41.358 µs 41.415 µs]

use StringBuilder directly:

Extract data page statistics for String/extract_statistics/String
                        time:   [15.471 µs 15.489 µs 15.508 µs]

The difference I found now is that the StringBuilder::new use default 1024 as the capacity to init the buffer, but in from_iter side, use size_hint() instead... Maybe it is related to this?
Perhap some profile works needed...

efredine · 2024-07-07T14:33:30Z

Yes - the primary (initial) problem is that the collection needs to be built so that it owns the references to the items but we want to do that without creating any intermediate values.

I would also have naively expected that from_iter should perform comparably. I did notice that there are two implementations for GenericByteArray and I'm not clear which one would be chosen here:
https://github.com/apache/arrow-rs/blob/master/arrow-array/src/array/byte_array.rs#L534-L552

The choice depends on lifetimes, so perhaps the other one is being invoked and it's not pre-allocating capacity in as efficient a way?

In general, I think you're right that we should be able to eliminate all intermediate vectors.

Rachelint · 2024-07-07T16:22:50Z

Yes - the primary (initial) problem is that the collection needs to be built so that it owns the references to the items but we want to do that without creating any intermediate values.

I would also have naively expected that from_iter should perform comparably. I did notice that there are two implementations for GenericByteArray and I'm not clear which one would be chosen here: https://github.com/apache/arrow-rs/blob/master/arrow-array/src/array/byte_array.rs#L534-L552

The choice depends on lifetimes, so perhaps the other one is being invoked and it's not pre-allocating capacity in as efficient a way?

In general, I think you're right that we should be able to eliminate all intermediate vectors.

😂 Sorry, it seems my statement made some misunderstandings?
The reason why from_iter case slow seems due to the to_string() call after I profile it... and I did not notice such to_string call when I first started read the codes...

But still somethings confused me now... When using the builder directly, looping in flatten way is obviously slower than looping it in nested way...

flatten case
https://github.com/Rachelint/arrow-datafusion/blob/70b9f05e737e81c514259e70a8bd2e9f0ad8e725/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L890-L893

Extract data page statistics for String/extract_statistics/String
                        time:   [43.454 µs 43.463 µs 43.474 µs]

nested case
https://github.com/Rachelint/arrow-datafusion/blob/70b9f05e737e81c514259e70a8bd2e9f0ad8e725/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L909-L913

Extract data page statistics for String/extract_statistics/String
                        time:   [33.108 µs 33.119 µs 33.132 µs]

I can understand some costs exist when using flatten, but it may should not be so obvious?

alamb · 2024-07-07T19:08:31Z

But still somethings confused me now... When using the builder directly, looping in flatten way is obviously slower than looping it in nested way...

Maybe the compiler is not smart enough to avoid bounds checks or something when using flatten 🤔

Rachelint · 2024-07-08T03:17:25Z

But still somethings confused me now... When using the builder directly, looping in flatten way is obviously slower than looping it in nested way...

Maybe the compiler is not smart enough to avoid bounds checks or something when using flatten 🤔

Maybe it is actually due to compiler...
I found the same thing in another bench about Boolean...
See https://github.com/Rachelint/arrow-datafusion/blob/82490204e6afecc3f9e15c2ce5d35d7e407f422f/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L752-L777

use flatten

Extract data page statistics for Boolean/extract_statistics/Boolean
                        time:   [11.990 µs 11.993 µs 11.997 µs]
                        change: [+18.603% +18.783% +18.917%] (p = 0.00 < 0.05)

use nested

Extract data page statistics for Boolean/extract_statistics/Boolean
                        time:   [10.143 µs 10.146 µs 10.149 µs]
                        change: [-15.525% -15.464% -15.403%] (p = 0.00 < 0.05)

alamb mentioned this issue Jul 5, 2024

[EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs #10922

Closed

23 tasks

github-actions bot assigned Rachelint Jul 5, 2024

Rachelint mentioned this issue Jul 7, 2024

Improve stats convert performance for Binary/String/Boolean arrays #11319

Merged

alamb closed this as completed in #11319 Jul 8, 2024

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of DataPage statistics extraction using StringBuilder #11281

Improve performance of DataPage statistics extraction using StringBuilder #11281

alamb commented Jul 5, 2024 •

edited

Loading

Rachelint commented Jul 5, 2024

Rachelint commented Jul 7, 2024 •

edited

Loading

efredine commented Jul 7, 2024

Rachelint commented Jul 7, 2024 •

edited

Loading

alamb commented Jul 7, 2024

Rachelint commented Jul 8, 2024

Improve performance of DataPage statistics extraction using StringBuilder #11281

Improve performance of DataPage statistics extraction using StringBuilder #11281

Comments

alamb commented Jul 5, 2024 • edited Loading

Rachelint commented Jul 5, 2024

Rachelint commented Jul 7, 2024 • edited Loading

efredine commented Jul 7, 2024

Rachelint commented Jul 7, 2024 • edited Loading

alamb commented Jul 7, 2024

Rachelint commented Jul 8, 2024

alamb commented Jul 5, 2024 •

edited

Loading

Rachelint commented Jul 7, 2024 •

edited

Loading

Rachelint commented Jul 7, 2024 •

edited

Loading