amoeba (Member) commented Jul 15, 2025

No description provided.

Comment on lines 126 to 131
A new feature named Content-Defined Chunking improves deduplication of Parquet
files with mostly identical contents, by choosing data page boundaries based on
actual contents rather than a number of values. For that, it uses a rolling hash
function, and the min and max chunk size can be chosen. The feature is disabled by
default and can be enabled on a per-file basis in the Parquet `WriterProperties`
(GH-45750).
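
For readers unfamiliar with the technique described in the snippet above, here is a minimal, self-contained sketch of content-defined chunking over raw bytes. It is not Arrow's implementation and does not use the Parquet writer API: the function names (`cdc_boundaries`, `split_chunks`), the gear-style rolling hash, and the `min_size`, `max_size`, and `boundary_mask` parameters are all assumptions chosen for illustration.

```python
import random

# Hypothetical 256-entry table of random 64-bit values, one per byte value.
# A fixed seed keeps chunk boundaries reproducible across runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_boundaries(data, min_size=64, max_size=1024, boundary_mask=(1 << 6) - 1):
    """Return the end offsets of content-defined chunks of `data`.

    A boundary is placed where the rolling hash matches the mask, but never
    before `min_size` bytes, and always once a chunk reaches `max_size` bytes.
    """
    boundaries = []
    start = 0  # offset where the current chunk begins
    h = 0      # rolling hash; older bytes shift out of the 64-bit state
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        if size < min_size:
            continue
        if (h & boundary_mask) == 0 or size >= max_size:
            boundaries.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        boundaries.append(len(data))  # trailing chunk may be shorter than min_size
    return boundaries


def split_chunks(data, **kwargs):
    """Split `data` into chunks at the offsets chosen by cdc_boundaries."""
    chunks, prev = [], 0
    for end in cdc_boundaries(data, **kwargs):
        chunks.append(data[prev:end])
        prev = end
    return chunks
```

Because boundaries depend on the bytes themselves rather than on a fixed count, an insertion near the start of the input tends to shift only nearby boundaries, so later chunks can still match byte-for-byte and be deduplicated by a content-addressed store.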

Member

@kszucs Do you think this is a good description?

Member Author

This looks good to me. @kszucs I may merge this as-is but if you have any edits after the merge feel free to ping me.

Member

Sorry, I hadn't noticed the ping. Yes, it is great, thank you!

amoeba changed the title from "[Website] Add blog post for Arrow 21.0.0." to "[Website] Add blog post for Arrow 21.0.0" on Jul 17, 2025

amoeba (Member, Author) commented Jul 19, 2025

Thanks all for the contributions. I'll merge this tomorrow unless anyone has more updates before then.

amoeba merged commit d8bfecf into apache:main on Jul 21, 2025
1 check passed