-
Notifications
You must be signed in to change notification settings - Fork 170
Fix global and on-reference state #2313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ac753a4 to
61f8ec8
Compare
Codecov Report
@@ Coverage Diff @@
## main #2313 +/- ##
============================================
- Coverage 84.18% 84.14% -0.05%
Complexity 1888 1888
============================================
Files 241 241
Lines 10664 10655 -9
Branches 758 759 +1
============================================
- Hits 8978 8966 -12
- Misses 1378 1380 +2
- Partials 308 309 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
nastra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes LGTM, but we also need to adjust python code
6a7bdbd to
7513851
Compare
90cb31c to
2936890
Compare
2936890 to
d93dcfe
Compare
| message IcebergSnapshot { | ||
| int64 snapshot_id = 1; | ||
| message IcebergGlobal { | ||
| string id_generators = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for overall design.
why are we not using latest committed metadata_location for global state ?
Can you please explain what is the drawback with that approach (Approach-5 in the design doc)?
I see two drawbacks with current id string approach
a) Iceberg may not want us to track their inner spec (so, if they reject iceberg side PR, we have to rework again)
may be have a meeting with them before concluding on this design.
b) Iceberg may add new fields that we may forget to track or add. This also need a code change each time iceberg adds new fields (last_XXX)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ajantha-bhat exactly the proposed approach was discussed yesterday during the call w/ the Iceberg people and generally +1'd. As the object in here is an Iceberg thing and opaque to Nessie.
d93dcfe to
0cea4d2
Compare
| IcebergTableMetadata iceberg_table_metadata = 1; | ||
| IcebergSnapshot iceberg_snapshot = 2; | ||
| IcebergMetadataPointer iceberg_metadata_pointer = 1; | ||
| IcebergGlobal iceberg_global = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to handle the case of ICEBERG_GLOBAL in TableCommitMetaStoreWorker? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do.
(Note: global's not an on-reference state, in case you're confused b/c toStoreOnReferenceState)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I got it now, indeed I was confused with I overlooked toStoreGlobalState 👍🏽
omarsmak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| public abstract String getIdGenerators(); | ||
|
|
||
| public static IcebergTable of(String metadataLocation, long snapshotId) { | ||
| public static IcebergTable of(String metadataLocation, String idGenerators) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened to snapshotId do we no longer track it in Nessie?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we handle going back to an old snapshot now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either you refer to a specific Nessie commit (as before) or you use Iceberg's time-travel functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The table-metadata's tracked now (again) - so it's currentSnapshotId is the one
servers/store/src/main/java/org/projectnessie/server/store/TableCommitMetaStoreWorker.java
Outdated
Show resolved
Hide resolved
…tate
The current way to track only the Iceberg snapshot-ID as the on-reference-state via `IcebergTable` is insufficient. There are several issues with it:
1. Schema changes in different branches are not properly tracked.
2. Iceberg schema-ID, partitionSpec-ID and sortOrder-ID are not properly tracked
3. The snapshot-log in `TableMetadata` is not maintained
Although Iceberg's `Snapshot` class has a field `schemaId`, it is not enough to track schema changes on branches, because a schema change does _not_ produce a new snapshot in Iceberg. This means, that the snapshot on the Nessie reference still contains the schema-ID for that snapshot, but not the effective schema-ID after the DDL. For example:
```sql
CREATE TABLE foo (col STRING);
-- table is created in Nessie, snapshotId == -1, tableMetadata.schemaId == 0
INSERT INTO foo ('hello');
-- snapshot created, snapshotId == 42, tableMetadata.currentSchemaId == 0, snapshot.schemaId == 0
ALTER TABLE foo ADD COLUMN other STRING;
-- no new snapshot, snapshotId == 42, tableMetadata.currentSchemaId == 1, snapshot.schemaId == 0
INSERT INTO foo ('bar', 'baz');
-- above statement fails, because tableMetadata.currentSchemaId is set to the snapshot's schemaId
```
Therefore we need to track at least the schemaId and probably also the partitionSpecId and sortOrderId in the on-reference-state in Nessie.
Related issue is that `TableMetadata.snapshotLog` is validated to contain `TableMetadata.currentSnapshotId` as the last entry. This becomes a problem when the above example is "enhanced" w/ working on multiple branches:
```sql
CREATE TABLE foo (col STRING);
-- schemaId == 0
INSERT INTO foo ('hello');
-- snapshotId == 1
CREATE BRANCH branch_1;
CREATE BRANCH branch_2;
USE BRANCH branch_1;
-- snapshotId == 1
ALTER TABLE foo ADD COLUMN added_1 STRING;
-- schemaId == 1
INSERT INTO foo VALUES ('a', 'b');
-- snapshotId == 2
USE BRANCH branch_1;
-- snapshotId == 1
ALTER TABLE foo ADD COLUMN added_2 STRING;
-- fails in TableMetadata.removeSnapshotLogEntries(), because snapshotId == 1, but last one in TableMetadata.snapshotLog == 2
INSERT INTO foo VALUES ('c', 'd');
```
Fixes projectnessie#2312
0cea4d2 to
ab61461
Compare
46cc543 to
32b0c36
Compare
servers/store/src/main/java/org/projectnessie/server/store/TableCommitMetaStoreWorker.java
Outdated
Show resolved
Hide resolved
32b0c36 to
6df5378
Compare
nastra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall LGTM, I just find the name TableIdGenerators (the plural form) rather confusing
…,spec,sort-order ids as on-ref state This change basically reverts projectnessie#2313. It moves the Iceberg able-metadata-pointer back to Nessie's global state and snapshot id plus other relevant ids (schema-id, partition-spec-id, sort-order-id) to Nessie's on-reference state.
…,spec,sort-order ids as on-ref state This change basically reverts projectnessie#2313. It moves the Iceberg able-metadata-pointer back to Nessie's global state and snapshot id plus other relevant ids (schema-id, partition-spec-id, sort-order-id) to Nessie's on-reference state.
Change the way Iceberg table information is maintained:
Fixes #2312