Skip to content

Conversation

@vgankidi
Copy link
Contributor

@vgankidi vgankidi commented Jul 4, 2019

Lower and upper bound values from Parquet files are not currently truncated, which takes more space than necessary in manifests. Truncating strings and binary values will probably improve performance for large tables.

This PR adds a configurable table property "write.metadata.truncate-length" with a default value of 16. Default behavior is to truncate binary values to <= 16 bytes and strings to <= 16 unicode characters.

Resolves #113

@rdblue
Copy link
Contributor

rdblue commented Jul 5, 2019

@vgankidi, it looks like this is based on an old version of your branch. Could you update it to the current?

@rdblue
Copy link
Contributor

rdblue commented Jul 5, 2019

@edgarRd, FYI. This PR has methods to truncate binary and string values to avoid writing huge stats in table metadata. Probably something you'll want to use for ORC.

}

public static Metrics footerMetrics(ParquetMetadata metadata) {
return footerMetrics(metadata, TableProperties.WRITE_METADATA_TRUNCATE_BYTES_DEFAULT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a need for this to be left. The truncate length should always be included when getting metrics.

The version that uses the default is called from two places:

  • ParquetWriteAdapter that is created by Parquet.write when the internal ParquetWriter is not used
  • SparkTableUtil that reads Parquet footers when converting Hive tables to Iceberg metadata

I think both should be updated. The write adapter should use the config setting, and the Spark util method can default this inline. Then we can get rid of this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, would be good to get rid of this. I updated the PR with these changes.

@rdblue rdblue merged commit 3287fee into apache:master Jul 6, 2019
@rdblue
Copy link
Contributor

rdblue commented Jul 6, 2019

Thanks for the quick fixes, @vgankidi! Merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Truncate stats from Parquet files

2 participants