Verify state of Iceberg TableMetadata & Snapshot vs Nessie on-reference-state #2312

@snazy

The current way to track only the Iceberg snapshot-ID as the on-reference-state via IcebergTable is insufficient. There are several issues with it:

  1. Schema changes in different branches are not properly tracked.
  2. The Iceberg schema-ID, partitionSpec-ID and sortOrder-ID are not properly tracked.
  3. The snapshot-log in TableMetadata is not maintained.

Although Iceberg's Snapshot class has a field schemaId, it is not enough to track schema changes on branches, because a schema change does not produce a new snapshot in Iceberg. This means that the snapshot on the Nessie reference still carries the schema-ID that was current when that snapshot was written, not the effective schema-ID after the DDL. For example:

CREATE TABLE foo (col STRING);
-- table is created in Nessie, snapshotId == -1, tableMetadata.currentSchemaId == 0
INSERT INTO foo VALUES ('hello');
-- snapshot created, snapshotId == 42, tableMetadata.currentSchemaId == 0, snapshot.schemaId == 0
ALTER TABLE foo ADD COLUMN other STRING;
-- no new snapshot, snapshotId == 42, tableMetadata.currentSchemaId == 1, snapshot.schemaId == 0
INSERT INTO foo VALUES ('bar', 'baz');
-- the above statement fails, because tableMetadata.currentSchemaId is reset to the snapshot's schemaId (0)

Therefore, we need to track at least the schemaId, and probably also the partitionSpecId and sortOrderId, in the on-reference-state in Nessie.
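To make the gap concrete, here is a minimal Python sketch of the first example. It is a toy model, not Nessie or Iceberg code: `OnRefState`, `Snapshot` and `effective_schema_id` are hypothetical names, and the fallback to the snapshot's schema ID is an assumption that mirrors the behaviour described above.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    schema_id: int  # schema that was current when the snapshot was written


@dataclass(frozen=True)
class OnRefState:
    """Hypothetical per-reference state; today only snapshot_id is tracked."""
    snapshot_id: int
    schema_id: Optional[int] = None
    spec_id: Optional[int] = None
    sort_order_id: Optional[int] = None


def effective_schema_id(state: OnRefState, snapshots: Dict[int, Snapshot]) -> int:
    """Resolve the schema ID a writer on this reference would use."""
    if state.schema_id is not None:
        return state.schema_id  # explicitly tracked on the reference
    snap = snapshots.get(state.snapshot_id)
    # Assumed fallback: derive the schema ID from the snapshot itself.
    return snap.schema_id if snap is not None else 0


snapshots = {42: Snapshot(snapshot_id=42, schema_id=0)}

# Snapshot-only state: ALTER TABLE creates no snapshot, so nothing changes here.
snapshot_only = OnRefState(snapshot_id=42)
stale = effective_schema_id(snapshot_only, snapshots)

# With the schema ID tracked, the DDL is recorded without a new snapshot.
tracked = replace(OnRefState(snapshot_id=42, schema_id=0), schema_id=1)
fresh = effective_schema_id(tracked, snapshots)
```

In this sketch `stale` resolves to the pre-DDL schema while `fresh` reflects the `ALTER TABLE`, which is the difference the extra fields are meant to capture.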

A related issue is that TableMetadata.snapshotLog is validated to contain TableMetadata.currentSnapshotId as its last entry. This becomes a problem when the above example is extended to work on multiple branches:

CREATE TABLE foo (col STRING);
-- schemaId == 0
INSERT INTO foo VALUES ('hello');
-- snapshotId == 1

CREATE BRANCH branch_1;
CREATE BRANCH branch_2;

USE BRANCH branch_1;
-- snapshotId == 1
ALTER TABLE foo ADD COLUMN added_1 STRING;
-- schemaId == 1
INSERT INTO foo VALUES ('a', 'b');
-- snapshotId == 2

USE BRANCH branch_2;
-- snapshotId == 1
ALTER TABLE foo ADD COLUMN added_2 STRING;
-- fails in TableMetadata.removeSnapshotLogEntries(), because snapshotId == 1, but last one in TableMetadata.snapshotLog == 2
INSERT INTO foo VALUES ('c', 'd');
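
The branch scenario can be reproduced with a small Python sketch. This is a simplified stand-in, not Iceberg's implementation: `check_snapshot_log`, `refs` and `is_consistent` are hypothetical names, and the only rule modelled is the one stated above, that the last snapshot-log entry must match the current snapshot ID.

```python
from typing import Dict, List


def check_snapshot_log(current_snapshot_id: int, snapshot_log: List[int]) -> None:
    """Simplified check: the snapshot log must end at the current snapshot."""
    if snapshot_log and snapshot_log[-1] != current_snapshot_id:
        raise ValueError(
            f"last snapshot-log entry {snapshot_log[-1]} "
            f"!= current snapshot {current_snapshot_id}"
        )


# The table metadata last written (from branch_1) carries the log [1, 2].
shared_snapshot_log: List[int] = [1, 2]

# On-reference-state: branch_1 is at snapshot 2, branch_2 is still at snapshot 1.
refs: Dict[str, int] = {"branch_1": 2, "branch_2": 1}


def is_consistent(ref: str) -> bool:
    """Return True if the shared log is valid for the given reference."""
    try:
        check_snapshot_log(refs[ref], shared_snapshot_log)
        return True
    except ValueError:
        return False


branch_1_ok = is_consistent("branch_1")
branch_2_ok = is_consistent("branch_2")
```

Here the shared log is valid only for `branch_1`. For `branch_2` the log ends at a snapshot that the branch has never seen, which corresponds to the failure in the example.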
