@0dunay0 (Contributor) commented Oct 31, 2025

Purpose

Linked issue: close #6502

This PR fixes Redshift Spectrum querying for Paimon tables with Iceberg compatibility by populating optional snapshot summary fields that are required by certain Iceberg query engines.

When Paimon generates Iceberg metadata, it currently only includes the operation field in snapshot summaries. While the Iceberg specification marks most summary fields as "optional," some query engines (notably AWS Redshift Spectrum) require fields like total-records to successfully parse and query tables.

As a result, Paimon+Iceberg tables can be queried in AWS Athena but fail in Redshift Spectrum with the error: Required field total-records missing.

Changes

Added computeSnapshotSummary() Helper Method

Aggregates statistics from IcebergManifestFileMeta objects to compute snapshot-level metrics including:

Required fields (always present):

  • total-records - Total number of live records
  • total-data-files - Total number of live data files
  • total-delete-files - Total number of live delete files
  • total-position-deletes - Total position delete records
  • total-equality-deletes - Always "0" (Paimon doesn't use equality deletes)

Optional fields (when non-zero):

  • added-data-files, added-records, added-files-size
  • deleted-data-files, deleted-records, deleted-files-size
  • total-files-size
  • changed-partition-count
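
A minimal sketch of the aggregation described above, assuming a simplified stand-in for the per-manifest statistics. ManifestStats, its fields, and putIfPositive are hypothetical names used for illustration only; they are not the actual IcebergManifestFileMeta API or the code in this PR. The sketch follows the rule stated in the lists above: the total-* fields are always written, and the other fields only when non-zero.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SnapshotSummarySketch {

    /** Hypothetical per-manifest statistics (a stand-in, not IcebergManifestFileMeta). */
    static final class ManifestStats {
        final boolean deleteManifest; // true if the manifest tracks delete files
        final long liveFiles;         // files still live in this snapshot
        final long liveRows;          // rows contained in those live files
        final long addedFiles;        // files added by this snapshot
        final long addedRows;         // rows added by this snapshot

        ManifestStats(boolean deleteManifest, long liveFiles, long liveRows,
                      long addedFiles, long addedRows) {
            this.deleteManifest = deleteManifest;
            this.liveFiles = liveFiles;
            this.liveRows = liveRows;
            this.addedFiles = addedFiles;
            this.addedRows = addedRows;
        }
    }

    static Map<String, String> computeSnapshotSummary(
            String operation, List<ManifestStats> manifests) {
        long totalRecords = 0;
        long totalDataFiles = 0;
        long totalDeleteFiles = 0;
        long totalPositionDeletes = 0;
        long addedDataFiles = 0;
        long addedRecords = 0;

        // Data manifests and delete manifests feed different totals.
        for (ManifestStats m : manifests) {
            if (m.deleteManifest) {
                totalDeleteFiles += m.liveFiles;
                totalPositionDeletes += m.liveRows;
            } else {
                totalDataFiles += m.liveFiles;
                totalRecords += m.liveRows;
                addedDataFiles += m.addedFiles;
                addedRecords += m.addedRows;
            }
        }

        Map<String, String> summary = new LinkedHashMap<>();
        summary.put("operation", operation);

        // Fields that are always written, even when zero.
        summary.put("total-records", Long.toString(totalRecords));
        summary.put("total-data-files", Long.toString(totalDataFiles));
        summary.put("total-delete-files", Long.toString(totalDeleteFiles));
        summary.put("total-position-deletes", Long.toString(totalPositionDeletes));
        summary.put("total-equality-deletes", "0"); // no equality deletes assumed

        // Fields written only when non-zero (subset shown for brevity).
        putIfPositive(summary, "added-data-files", addedDataFiles);
        putIfPositive(summary, "added-records", addedRecords);
        return summary;
    }

    private static void putIfPositive(Map<String, String> summary, String key, long value) {
        if (value > 0) {
            summary.put(key, Long.toString(value));
        }
    }
}
```

All values are stored as strings because the Iceberg snapshot summary is a string-to-string map.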

Tests

Updated IcebergMetadataTest.java

API and Format

N/A

Documentation

Reintroduces a feature that was previously available.

aws s3 cp s3://some-bucket/paimon/warehouse/somedb.db/some_table/metadata/v190.metadata.json - | jq '.snapshots[0].summary'

{
  "added-data-files": "2",
  "total-equality-deletes": "0",
  "added-records": "83282",
  "deleted-data-files": "0",
  "deleted-records": "0",
  "total-records": "83282",
  "deleted-files-size": "0",
  "changed-partition-count": "1",
  "total-position-deletes": "0",
  "added-files-size": "4683766",
  "total-delete-files": "0",
  "total-files-size": "4683766",
  "total-data-files": "2",
  "operation": "append"
}

Redshift Spectrum can now query the table.
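
A check like the one done manually with jq above could be sketched in a test as follows. This is only an illustration: RequiredSummaryFieldsCheck and missingFields are hypothetical names, and the key list is taken from the "always present" fields in this description. The error message quoted above names only total-records, so treating all five as required by Redshift Spectrum is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class RequiredSummaryFieldsCheck {

    // Keys this PR always writes; assumed (not confirmed) to be what strict readers expect.
    static final List<String> ALWAYS_PRESENT = Arrays.asList(
            "total-records",
            "total-data-files",
            "total-delete-files",
            "total-position-deletes",
            "total-equality-deletes");

    /** Returns the always-present keys that are absent from the given snapshot summary. */
    static List<String> missingFields(Map<String, String> summary) {
        List<String> missing = new ArrayList<>();
        for (String key : ALWAYS_PRESENT) {
            if (!summary.containsKey(key)) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

A summary containing only operation, as generated before this change, would report all five keys as missing; the summary shown above would report none.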

@rajagopal-ravikumar left a comment

looks good to me

@0dunay0 (Contributor, Author) commented Nov 3, 2025

@JingsongLi Can you take a look at your convenience please?

@0dunay0 force-pushed the feature/populate-iceberg-snapshot-summary branch from cf9c152 to 788570d on November 3, 2025 12:25
.sum();
}

private long computeDeleteRowCount(List<IcebergManifestFileMeta> manifestMetas) {
Member

computeLiveRowCount is now identical to computeDeleteRowCount; please check them.

long totalPositionDeletes = Math.max(0, metrics.totalPositionDeletes);
long totalEqualityDeletes = Math.max(0, metrics.totalEqualityDeletes);

summary.put("added-data-files", Long.toString(addedDataFiles));
Member

It would be better to define constants for these metric names.
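
One possible shape for such constants, as a sketch only: the class name IcebergSummaryKeys and the constant names are hypothetical and are not taken from the Paimon code base. The string values are the summary keys listed in the PR description.

```java
/** Hypothetical holder for Iceberg snapshot summary keys (names are illustrative). */
final class IcebergSummaryKeys {

    private IcebergSummaryKeys() {} // constants only, no instances

    static final String OPERATION = "operation";

    static final String TOTAL_RECORDS = "total-records";
    static final String TOTAL_DATA_FILES = "total-data-files";
    static final String TOTAL_DELETE_FILES = "total-delete-files";
    static final String TOTAL_POSITION_DELETES = "total-position-deletes";
    static final String TOTAL_EQUALITY_DELETES = "total-equality-deletes";
    static final String TOTAL_FILES_SIZE = "total-files-size";

    static final String ADDED_DATA_FILES = "added-data-files";
    static final String ADDED_RECORDS = "added-records";
    static final String ADDED_FILES_SIZE = "added-files-size";

    static final String DELETED_DATA_FILES = "deleted-data-files";
    static final String DELETED_RECORDS = "deleted-records";
    static final String DELETED_FILES_SIZE = "deleted-files-size";

    static final String CHANGED_PARTITION_COUNT = "changed-partition-count";
}
```

With this in place, the line under review would read summary.put(IcebergSummaryKeys.ADDED_DATA_FILES, Long.toString(addedDataFiles)); and the tests could reference the same constants instead of repeating string literals.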

Development

Successfully merging this pull request may close these issues.

[Bug] Regression in Redshift Spectrum Querying Iceberg Table
