Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent value for data_page_max_rows setting in DataFusion ParquetOptions and in ArrowWriterOptions #11367

Closed
alamb opened this issue Jul 9, 2024 · 6 comments · Fixed by #11656
Assignees
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Jul 9, 2024

Describe the bug

@wiedld pointed out that the default value of data_page_max_rows is different in DataFusion than in arrow-rs

https://docs.rs/datafusion/latest/datafusion/common/config/struct.ParquetOptions.html sets it to usize::max

As of apache/arrow-rs#5957 was released in 51.1.0 , arrow-rs sets it to 20,000

To Reproduce

N/A

Expected behavior

The DataFusion defaults should be the same as the arrow-rs https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriterOptions.html, unless there is a good reason to deviate

In this case, I think we should go with the arrow-rs value

Ideally a fix for this ticket would both

  1. Change the defaults of ParquetOptions to match the default in ArrowWriterOptions
  2. Write a test that verified the default values were the same

Additional context

No response

@alamb alamb added the bug Something isn't working label Jul 9, 2024
@Rachelint
Copy link
Contributor

take

@alamb
Copy link
Contributor Author

alamb commented Jul 12, 2024

#11444 is similar

@wiedld
Copy link
Contributor

wiedld commented Jul 12, 2024

Edited

My goal was to delineate the differences btwn the APIs more, but to leave the actual fixing of the defaults (& tests to enforce that it remains in sync) as this issue assigned to @Rachelint .

I was hoping #11444 could actually help make this issue/PR (setting defaults) be easier to test. I could also add in the changes to the defaults as well, if @Rachelint is ok with that?

@Rachelint
Copy link
Contributor

Edited

My goal was to delineate the differences btwn the APIs more, but to leave the actual fixing of the defaults (& tests to enforce that it remains in sync) as this issue assigned to @Rachelint .

I was hoping #11444 could actually help make this issue/PR (setting defaults) be easier to test. I could also add in the changes to the defaults as well, if @Rachelint is ok with that?

Ok, feel free to make it.

@Rachelint Rachelint removed their assignment Jul 14, 2024
@wiedld
Copy link
Contributor

wiedld commented Jul 15, 2024

take

@alamb
Copy link
Contributor Author

alamb commented Jul 17, 2024

@wiedld 's pr #11524 has a very nice table of the current state of affairs:

Here are the places where the current defaults differ:

setting_name applied default, datafusion (ParquetOptions) default, parquet (ArrowWriterOptions)
data_page_row_count_limit file usize::MAX 20_000
column_index_truncate_length file None Some(64)
compression column default zstd uncompressed
dictionary_enabled column default None or true † true
statistics_enabled column default None or page † page
max_statistics_size column default None or 4096 † 4096

† For these settings, datafusion has no default (None). However, once datafusion's ParquetOptions are used by the extern parquet (a.k.a. converted to parquet's ArrowWriterOptions) then it uses the extern parquet's defaults. Refer to the newly added tests.

.

Additionally, there are differences in the bloom filter configurations based upon partial definition (a.k.a. leaving some as default, and some as defined):

bloom_filter_on fpp ndv if build from datafusion's ParquetOptions if build from (arrow-rs) parquet's WriterPropertiesBuilder
false none none None None
TRUE none none Some(BloomFilterProperties::default) None
true SOME none Some(BloomFilterProperties: fpp,default_ndv) None
true none SOME Some(BloomFilterProperties: ndv,default_fpp) None

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants