Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(11367): parquet writer defaults #34

Conversation

wiedld
Copy link
Collaborator

@wiedld wiedld commented Jul 16, 2024

WIP: first we need consensus on the desired defaults. Second, decide if the extern parquet's defaults can (or cannot) override the datafusion default.

Which issue does this PR close?

Closes apache#11367

Rationale for this change

When we switched from using the parquet's ArrowWriter (with options) to the parallelized parquet writer (with it's own options), we ran into unintended behaviors due to different default settings.

Here are the places where the current defaults differ:

setting_name applied default, datafusion (ParquetOptions) default, parquet (ArrowWriterOptions)
data_page_row_count_limit file usize::MAX 20_000
column_index_truncate_length file None Some(64)
compression column default zstd uncompressed
dictionary_enabled column default None or true † true
statistics_enabled column default None or page † page
max_statistics_size column default None or 4096 † 4096

† For these settings, datafusion has no default (None). However, once datafusion's ParquetOptions are used by the extern parquet (a.k.a. converted to parquet's ArrowWriterOptions) then it uses the extern parquet's defaults. Refer to the newly added tests.

.

Additionally, there are differences in the bloom filter configurations based upon partial definition (a.k.a. leaving some as default, and some as defined):

bloom_filter_on fpp ndv if build from datafusion's ParquetOptions if build from (arrow-rs) parquet's WriterPropertiesBuilder
false none none None None
TRUE none none Some(BloomFilterProperties::default) None
true SOME none Some(BloomFilterProperties: fpp,default_ndv) None
true none SOME Some(BloomFilterProperties: ndv,default_fpp) None

What changes are included in this PR?

The first commit is adding tests to define and demonstrate the differences in the defaults.
After discussion and consensus, we'll add other commits (as needed) for implementing the desired changes.

Are these changes tested?

Yes.

Are there any user-facing changes?

No API changes. Only potential future changes to alleviate unintended consequences.

Copy link
Collaborator

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wiedld -- this PR looks nice. I left some comments

Once we merge apache#11444 I think this would be a great PR to send upstream

"datafusion's default is None"
);

// TODO: matches once create WriterProps, but only due to parquet's override
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment -- is there any action item here?

Copy link
Collaborator Author

@wiedld wiedld Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added another commit to re-work the code comments for this test. Hopefully that makes the misalignments more clear.

Please let me know @alamb if I should change anything else before moving this over into an upstream PR.

datafusion/common/src/file_options/parquet_writer.rs Outdated Show resolved Hide resolved
datafusion/common/src/file_options/parquet_writer.rs Outdated Show resolved Hide resolved
@wiedld
Copy link
Collaborator Author

wiedld commented Jul 17, 2024

Have opened the upstream PR.
Closing this earlier draft (in our fork).

@wiedld wiedld closed this Jul 17, 2024
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants