Truncate stats from Parquet files #254
@vgankidi, it looks like this is based on an old version of your branch. Could you update it to the current?
@edgarRd, FYI. This PR has methods to truncate binary and string values to avoid writing huge stats in table metadata. Probably something you'll want to use for ORC.
    public static Metrics footerMetrics(ParquetMetadata metadata) {
      return footerMetrics(metadata, TableProperties.WRITE_METADATA_TRUNCATE_BYTES_DEFAULT);
I don't think there is a need for this to be left. The truncate length should always be included when getting metrics.
The version that uses the default is called from two places:
- ParquetWriteAdapter that is created by Parquet.write when the internal ParquetWriter is not used
- SparkTableUtil that reads Parquet footers when converting Hive tables to Iceberg metadata
I think both should be updated. The write adapter should use the config setting, and the Spark util method can default this inline. Then we can get rid of this method.
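A minimal sketch of the suggestion above, assuming table properties are available as a plain Map<String, String>. The class, constant, and method names here are hypothetical and are not taken from the PR; only the property key and the default of 16 come from the PR description. The idea is that each caller resolves the truncate length once and passes it explicitly, so no overload with a hidden default is needed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the Iceberg API.
final class TruncateLengthConfig {
  // Property key and default value as stated in the PR description.
  static final String TRUNCATE_LENGTH_KEY = "write.metadata.truncate-length";
  static final int TRUNCATE_LENGTH_DEFAULT = 16;

  private TruncateLengthConfig() {}

  // Resolve the truncate length from table properties, falling back to the default
  // when the property is not set.
  static int truncateLength(Map<String, String> properties) {
    String value = properties.get(TRUNCATE_LENGTH_KEY);
    return value == null ? TRUNCATE_LENGTH_DEFAULT : Integer.parseInt(value.trim());
  }

  public static void main(String[] args) {
    Map<String, String> properties = new HashMap<>();
    System.out.println(truncateLength(properties)); // property unset, falls back to the default

    properties.put(TRUNCATE_LENGTH_KEY, "32");
    System.out.println(truncateLength(properties)); // uses the configured value
  }
}
```

Under this sketch, a write adapter would call truncateLength with the table's properties, while a utility that has no table configuration would pass its own default inline.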
Yes, would be good to get rid of this. I updated the PR with these changes.
Thanks for the quick fixes, @vgankidi! Merged.
Lower and upper bound values from Parquet files are not currently truncated, which takes more space than necessary in manifests. Truncating strings and binary values will probably improve performance for large tables.
This PR adds a configurable table property "write.metadata.truncate-length" with a default value of 16. By default, binary values are truncated to at most 16 bytes and strings to at most 16 Unicode characters.
Resolves #113
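The truncation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the class and method names are hypothetical, and strings are assumed to be ordered by Unicode code point. A plain prefix is always a valid lower bound, but a prefix of an upper bound may sort below the original value, so the sketch increments the last code point of the truncated upper bound to keep it a valid bound.

```java
import java.util.Arrays;

// Hypothetical sketch, not the implementation in this PR.
final class TruncateSketch {
  private TruncateSketch() {}

  // Lower bound for binary values: keep at most `length` bytes.
  static byte[] truncateBinaryMin(byte[] value, int length) {
    return value.length <= length ? value.clone() : Arrays.copyOf(value, length);
  }

  // Lower bound for strings: keep at most `length` Unicode code points,
  // so a surrogate pair is never split.
  static String truncateStringMin(String value, int length) {
    if (value.codePointCount(0, value.length()) <= length) {
      return value;
    }
    return value.substring(0, value.offsetByCodePoints(0, length));
  }

  // Upper bound for strings: truncate, then increment the last code point that can
  // be incremented. Returns null when no shorter upper bound exists.
  // Assumption: ordering is by code point.
  static String truncateStringMax(String value, int length) {
    if (value.codePointCount(0, value.length()) <= length) {
      return value;
    }
    int[] codePoints = value.codePoints().limit(length).toArray();
    for (int i = codePoints.length - 1; i >= 0; i--) {
      int next = codePoints[i] + 1;
      // Skip the surrogate range, which does not contain valid scalar values.
      if (next >= Character.MIN_SURROGATE && next <= Character.MAX_SURROGATE) {
        next = Character.MAX_SURROGATE + 1;
      }
      if (next <= Character.MAX_CODE_POINT) {
        codePoints[i] = next;
        return new String(codePoints, 0, i + 1);
      }
    }
    return null;
  }

  public static void main(String[] args) {
    String value = "abcdefghijklmnopqrstuvwxyz";
    System.out.println(truncateStringMin(value, 16)); // abcdefghijklmnop
    System.out.println(truncateStringMax(value, 16)); // abcdefghijklmnoq
  }
}
```

Counting code points rather than char values matches the "16 Unicode characters" wording: a string of emoji truncated to 2 keeps two whole characters, even though each one occupies two UTF-16 code units.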