[core/iceberg] Added optional snapshot summary fields to iceberg metadata #6503
Purpose
Linked issue: close #6502
This PR fixes Redshift Spectrum querying for Paimon tables with Iceberg compatibility by populating optional snapshot summary fields that are required by certain Iceberg query engines.
When Paimon generates Iceberg metadata, it currently only includes the operation field in snapshot summaries. While the Iceberg specification marks most summary fields as "optional," some query engines (notably AWS Redshift Spectrum) require fields like total-records to successfully parse and query tables.
This causes Paimon+Iceberg tables to be queryable in AWS Athena but fail in Redshift Spectrum with the error:
Required field total-records missing.
Changes
Added computeSnapshotSummary() Helper Method
Aggregates statistics from IcebergManifestFileMeta objects to compute snapshot-level metrics including:
Required fields (always present):
total-records - Total number of live records
total-data-files - Total number of live data files
total-delete-files - Total number of live delete files
total-position-deletes - Total position delete records
total-equality-deletes - Always "0" (Paimon doesn't use equality deletes)
Optional fields (when non-zero):
added-data-files, added-records, added-files-size
deleted-data-files, deleted-records, deleted-files-size
total-files-size
changed-partition-count
Tests
Updated IcebergMetadataTest.java
API and Format
N/A
Documentation
Reintroduces a feature that was previously available.
Redshift Spectrum can now query the table.
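For reviewers who want a concrete picture of the aggregation described under Changes, here is a minimal sketch of how per-manifest counters could be rolled up into a snapshot summary. It is not the code in this PR: ManifestStats is a hypothetical stand-in for IcebergManifestFileMeta (its field names are assumptions modeled on the counters in the Iceberg manifest list), the method signature is illustrative, and the size and partition-count fields are left out for brevity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-manifest counters; NOT Paimon's IcebergManifestFileMeta API. */
final class ManifestStats {
    enum Content { DATA, DELETES }

    final Content content;
    final long addedSnapshotId; // snapshot that wrote this manifest
    final long addedFiles, existingFiles, deletedFiles;
    final long addedRows, existingRows, deletedRows;

    ManifestStats(Content content, long addedSnapshotId,
                  long addedFiles, long existingFiles, long deletedFiles,
                  long addedRows, long existingRows, long deletedRows) {
        this.content = content;
        this.addedSnapshotId = addedSnapshotId;
        this.addedFiles = addedFiles;
        this.existingFiles = existingFiles;
        this.deletedFiles = deletedFiles;
        this.addedRows = addedRows;
        this.existingRows = existingRows;
        this.deletedRows = deletedRows;
    }
}

final class SnapshotSummarySketch {

    /** Illustrative signature only; the real helper may take different arguments. */
    static Map<String, String> computeSnapshotSummary(
            String operation, long snapshotId, List<ManifestStats> manifests) {
        long totalRecords = 0, totalDataFiles = 0;
        long totalDeleteFiles = 0, totalPositionDeletes = 0;
        long addedDataFiles = 0, addedRecords = 0;
        long deletedDataFiles = 0, deletedRecords = 0;

        for (ManifestStats m : manifests) {
            // Live entries are those marked added or existing in the manifest.
            long liveFiles = m.addedFiles + m.existingFiles;
            long liveRows = m.addedRows + m.existingRows;
            if (m.content == ManifestStats.Content.DATA) {
                totalDataFiles += liveFiles;
                totalRecords += liveRows;
                // Only manifests written by this snapshot contribute to the deltas.
                if (m.addedSnapshotId == snapshotId) {
                    addedDataFiles += m.addedFiles;
                    addedRecords += m.addedRows;
                    deletedDataFiles += m.deletedFiles;
                    deletedRecords += m.deletedRows;
                }
            } else {
                totalDeleteFiles += liveFiles;
                totalPositionDeletes += liveRows;
            }
        }

        Map<String, String> summary = new LinkedHashMap<>();
        summary.put("operation", operation);
        // Totals are always written, even when zero.
        summary.put("total-records", Long.toString(totalRecords));
        summary.put("total-data-files", Long.toString(totalDataFiles));
        summary.put("total-delete-files", Long.toString(totalDeleteFiles));
        summary.put("total-position-deletes", Long.toString(totalPositionDeletes));
        summary.put("total-equality-deletes", "0");
        // Deltas are written only when non-zero.
        putIfNonZero(summary, "added-data-files", addedDataFiles);
        putIfNonZero(summary, "added-records", addedRecords);
        putIfNonZero(summary, "deleted-data-files", deletedDataFiles);
        putIfNonZero(summary, "deleted-records", deletedRecords);
        return summary;
    }

    private static void putIfNonZero(Map<String, String> summary, String key, long value) {
        if (value != 0) {
            summary.put(key, Long.toString(value));
        }
    }
}
```

Writing the totals unconditionally is the part that matters for engines that treat them as required; the delta fields can stay absent when they are zero without affecting that check.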