Description
The current way to track only the Iceberg snapshot-ID as the on-reference-state via IcebergTable is insufficient. There are several issues with it:
- Schema changes in different branches are not properly tracked.
- Iceberg schema-ID, partitionSpec-ID and sortOrder-ID are not properly tracked
- The snapshot-log in TableMetadata is not maintained
Although Iceberg's Snapshot class has a schemaId field, that is not enough to track schema changes on branches, because a schema change does not produce a new snapshot in Iceberg. This means that the snapshot on the Nessie reference still carries the schema-ID that was current when the snapshot was written, not the effective schema-ID after the DDL. For example:
CREATE TABLE foo (col STRING);
-- table is created in Nessie, snapshotId == -1, tableMetadata.currentSchemaId == 0
INSERT INTO foo VALUES ('hello');
-- snapshot created, snapshotId == 42, tableMetadata.currentSchemaId == 0, snapshot.schemaId == 0
ALTER TABLE foo ADD COLUMN other STRING;
-- no new snapshot, snapshotId == 42, tableMetadata.currentSchemaId == 1, snapshot.schemaId == 0
INSERT INTO foo VALUES ('bar', 'baz');
-- above statement fails, because tableMetadata.currentSchemaId is set to the snapshot's schemaId (0)
Therefore, we need to track at least the schemaId, and probably also the partitionSpecId and sortOrderId, in the on-reference-state in Nessie.
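The gap can be illustrated with a small, self-contained model. This is only a sketch: the class and function names (`SnapshotOnlyState`, `ExtendedState`, `effective_schema_id_*`) are hypothetical and are not Nessie's or Iceberg's actual API. It replays the state after the ALTER TABLE in the example and compares what each kind of on-reference state can recover.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    schema_id: int  # schema that was current when the snapshot was written


@dataclass(frozen=True)
class SnapshotOnlyState:
    """Hypothetical on-reference state that records only the snapshot ID."""
    snapshot_id: int


@dataclass(frozen=True)
class ExtendedState:
    """Hypothetical on-reference state that also records the other IDs."""
    snapshot_id: int
    schema_id: int
    spec_id: int
    sort_order_id: int


def effective_schema_id_snapshot_only(
    state: SnapshotOnlyState, snapshots: Dict[int, Snapshot]
) -> int:
    # With only a snapshot ID, the best available guess is the schema ID
    # stored on that snapshot (or 0 if the table has no snapshot yet).
    snap = snapshots.get(state.snapshot_id)
    return snap.schema_id if snap is not None else 0


def effective_schema_id_extended(state: ExtendedState) -> int:
    # The schema ID is part of the state, so DDL without a snapshot survives.
    return state.schema_id


# State after: CREATE TABLE, INSERT (snapshot 42, schema 0), ALTER TABLE (schema 1).
snapshots = {42: Snapshot(snapshot_id=42, schema_id=0)}

lost = effective_schema_id_snapshot_only(SnapshotOnlyState(42), snapshots)
kept = effective_schema_id_extended(
    ExtendedState(snapshot_id=42, schema_id=1, spec_id=0, sort_order_id=0)
)
print(lost, kept)
```

In this model the snapshot-only state falls back to schema 0 after the ALTER TABLE, while the extended state preserves schema 1, which is the behavior the second INSERT needs.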
A related issue is that TableMetadata.snapshotLog is validated to contain TableMetadata.currentSnapshotId as its last entry. This becomes a problem when the above example is extended to work on multiple branches:
CREATE TABLE foo (col STRING);
-- schemaId == 0
INSERT INTO foo VALUES ('hello');
-- snapshotId == 1
CREATE BRANCH branch_1;
CREATE BRANCH branch_2;
USE BRANCH branch_1;
-- snapshotId == 1
ALTER TABLE foo ADD COLUMN added_1 STRING;
-- schemaId == 1
INSERT INTO foo VALUES ('a', 'b');
-- snapshotId == 2
USE BRANCH branch_2;
-- snapshotId == 1
ALTER TABLE foo ADD COLUMN added_2 STRING;
-- fails in TableMetadata.removeSnapshotLogEntries(), because snapshotId == 1, but last one in TableMetadata.snapshotLog == 2
INSERT INTO foo VALUES ('c', 'd');
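The multi-branch failure can be reproduced with a simplified model of the snapshot-log check. This is a sketch under assumptions: `check_snapshot_log` is a hypothetical stand-in for the validation described above, not Iceberg's actual code, and the per-reference log at the end is only one conceivable way to keep the check satisfied, not a decided design.

```python
from typing import Dict, List


def check_snapshot_log(snapshot_log: List[int], current_snapshot_id: int) -> None:
    """Simplified stand-in for the validation described above: the newest
    snapshot-log entry must be the current snapshot."""
    if snapshot_log and snapshot_log[-1] != current_snapshot_id:
        raise ValueError(
            f"last snapshot-log entry {snapshot_log[-1]} "
            f"is not the current snapshot {current_snapshot_id}"
        )


# A single snapshot log shared by all branches, after branch_1's INSERT.
shared_log = [1, 2]

# branch_1 points at snapshot 2, so the check passes.
check_snapshot_log(shared_log, current_snapshot_id=2)

# The second branch still points at snapshot 1, so the same log is rejected.
try:
    check_snapshot_log(shared_log, current_snapshot_id=1)
    shared_log_ok_on_second_branch = True
except ValueError:
    shared_log_ok_on_second_branch = False

# Assumption for illustration: keep one log per reference instead.
per_reference_log: Dict[str, List[int]] = {
    "branch_1": [1, 2],
    "branch_2": [1],
}
current_snapshot: Dict[str, int] = {"branch_1": 2, "branch_2": 1}

for ref, log in per_reference_log.items():
    check_snapshot_log(log, current_snapshot[ref])  # passes for both branches
per_reference_ok = True
```

The shared log fails for the branch that has not advanced, while a log scoped to each reference passes for both.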