Fix global and on-reference state #2313

snazy · 2021-10-10T10:15:00Z

Change the way Iceberg table information is maintained:

- The pointer to the table-metadata is moved (back) to the on-reference state, that is, into the commit itself.
- Information from Iceberg that influences newly written data files (think: last-column-id and last-partition-id) is retrieved from Iceberg (approach pending consensus) as a JSON object that is opaque to Nessie.
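
To make the split described above concrete, here is a minimal sketch, assuming a toy in-memory store. The class `StateSplitSketch` and all of its methods are hypothetical and are not Nessie's API; the file names and JSON values are made up for illustration. It only shows that the metadata pointer is kept per branch while the opaque id-generators JSON is kept once per table.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; class and method names are hypothetical, not Nessie's API. */
final class StateSplitSketch {
  // On-reference state: one metadata pointer per (branch, table), recorded with the commit.
  private final Map<String, String> metadataPointers = new HashMap<>();
  // Global state: one opaque id-generators JSON string per table, shared by all branches.
  private final Map<String, String> idGenerators = new HashMap<>();

  void commit(String branch, String table, String metadataLocation, String idGeneratorsJson) {
    metadataPointers.put(branch + "/" + table, metadataLocation);
    // Stored verbatim: the sketch never parses the JSON, it stays opaque.
    idGenerators.put(table, idGeneratorsJson);
  }

  String metadataLocation(String branch, String table) {
    return metadataPointers.get(branch + "/" + table);
  }

  String idGenerators(String table) {
    return idGenerators.get(table);
  }

  public static void main(String[] args) {
    StateSplitSketch store = new StateSplitSketch();
    store.commit("main", "foo", "meta-1.json", "{\"last-column-id\":1}");
    store.commit("dev", "foo", "meta-2.json", "{\"last-column-id\":2}");
    // Each branch sees its own pointer; both see the same id generators.
    System.out.println(store.metadataLocation("main", "foo"));
    System.out.println(store.metadataLocation("dev", "foo"));
    System.out.println(store.idGenerators("foo"));
  }
}
```

In this sketch a commit on `dev` does not disturb the pointer on `main`, but it does advance the shared id generators, so new column IDs handed out on either branch cannot collide.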

codecov · 2021-10-10T11:48:11Z

Codecov Report

Merging #2313 (6df5378) into main (f37eb40) will decrease coverage by 0.04%.
The diff coverage is 89.18%.

@@             Coverage Diff              @@
##               main    #2313      +/-   ##
============================================
- Coverage     84.18%   84.14%   -0.05%     
  Complexity     1888     1888              
============================================
  Files           241      241              
  Lines         10664    10655       -9     
  Branches        758      759       +1     
============================================
- Hits           8978     8966      -12     
- Misses         1378     1380       +2     
- Partials        308      309       +1

| Flag | Coverage Δ | |
|---|---|---|
| java | `83.93% <91.42%> (-0.05%)` | ⬇️ |
| javascript | `86.61% <ø> (ø)` | |
| python | `85.82% <50.00%> (ø)` | |

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/pynessie/model.py | `90.54% <50.00%> (ø)` | |
| ...essie/server/store/TableCommitMetaStoreWorker.java | `72.82% <75.00%> (-2.46%)` | ⬇️ |
| ...ain/java/org/projectnessie/model/IcebergTable.java | `100.00% <100.00%> (ø)` | |
| .../org/projectnessie/jaxrs/AbstractResteasyTest.java | `100.00% <100.00%> (ø)` | |
| ...java/org/projectnessie/jaxrs/AbstractTestRest.java | `94.99% <100.00%> (ø)` | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

nastra

changes LGTM, but we also need to adjust python code

model/src/main/java/org/projectnessie/model/IcebergTable.java

ajantha-bhat · 2021-10-19T11:55:44Z

servers/store/src/main/proto/table.proto

-message IcebergSnapshot {
-  int64 snapshot_id = 1;
+message IcebergGlobal {
+  string id_generators = 1;


This is for overall design.

Why are we not using the latest committed metadata_location for the global state?
Can you please explain the drawback of that approach (Approach-5 in the design doc)?

I see two drawbacks with the current ID-string approach:
a) Iceberg may not want us to track their inner spec (so if they reject the Iceberg-side PR, we have to rework this again). Maybe have a meeting with them before concluding on this design.
b) Iceberg may add new fields that we may forget to track or add. This also needs a code change each time Iceberg adds new fields (last_XXX).

@ajantha-bhat exactly this proposed approach was discussed yesterday during the call w/ the Iceberg people and generally +1'd, as the object in here is an Iceberg thing and opaque to Nessie.

omarsmak · 2021-10-19T12:59:16Z

servers/store/src/main/proto/table.proto

-    IcebergTableMetadata iceberg_table_metadata = 1;
-    IcebergSnapshot iceberg_snapshot = 2;
+    IcebergMetadataPointer iceberg_metadata_pointer = 1;
+    IcebergGlobal iceberg_global = 2;


Do we need to handle the case of ICEBERG_GLOBAL in TableCommitMetaStoreWorker? 🤔

We do.
(Note: global state is not an on-reference state, in case you're confused because of toStoreOnReferenceState)

Ah, I got it now. I was indeed confused: I overlooked toStoreGlobalState 👍🏽
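
The distinction discussed in this thread can be sketched as follows. This is a minimal illustration, not the real `TableCommitMetaStoreWorker`: the method names `toStoreOnReferenceState` and `toStoreGlobalState` come from the comments above, but their signatures, the `Kind` enum, and the `Stored`, `Table` and `fromStore` helpers are assumptions made for the example.

```java
/** Illustrative sketch only; not the real TableCommitMetaStoreWorker. */
final class StoreWorkerSketch {
  // Hypothetical tags mirroring the two proto fields in the diff above.
  enum Kind { ICEBERG_METADATA_POINTER, ICEBERG_GLOBAL }

  /** Hypothetical stand-in for a serialized store object. */
  static final class Stored {
    final Kind kind;
    final String payload;

    Stored(Kind kind, String payload) {
      this.kind = kind;
      this.payload = payload;
    }
  }

  /** Hypothetical stand-in for the IcebergTable model object. */
  static final class Table {
    final String metadataLocation;
    final String idGenerators;

    Table(String metadataLocation, String idGenerators) {
      this.metadataLocation = metadataLocation;
      this.idGenerators = idGenerators;
    }
  }

  // The on-reference part: only the metadata pointer goes into the commit.
  static Stored toStoreOnReferenceState(Table table) {
    return new Stored(Kind.ICEBERG_METADATA_POINTER, table.metadataLocation);
  }

  // The global part: only the opaque id generators are shared across references.
  static Stored toStoreGlobalState(Table table) {
    return new Stored(Kind.ICEBERG_GLOBAL, table.idGenerators);
  }

  // Reassembles the model object and rejects objects stored under the wrong kind.
  static Table fromStore(Stored onRef, Stored global) {
    if (onRef.kind != Kind.ICEBERG_METADATA_POINTER || global.kind != Kind.ICEBERG_GLOBAL) {
      throw new IllegalArgumentException("unexpected stored object kind");
    }
    return new Table(onRef.payload, global.payload);
  }

  public static void main(String[] args) {
    Table table = new Table("meta-1.json", "{\"last-column-id\":1}");
    Table back = fromStore(toStoreOnReferenceState(table), toStoreGlobalState(table));
    System.out.println(back.metadataLocation + " " + back.idGenerators);
  }
}
```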

omarsmak

LGTM

dimas-b · 2021-10-19T13:38:15Z

model/src/main/java/org/projectnessie/model/IcebergTable.java

+  public abstract String getIdGenerators();

-  public static IcebergTable of(String metadataLocation, long snapshotId) {
+  public static IcebergTable of(String metadataLocation, String idGenerators) {


What happened to snapshotId? Do we no longer track it in Nessie?

How do we handle going back to an old snapshot now?

Either you refer to a specific Nessie commit (as before) or you use Iceberg's time-travel functionality.

The table-metadata is tracked now (again), so its currentSnapshotId is the one.

servers/store/src/main/java/org/projectnessie/server/store/TableCommitMetaStoreWorker.java

…tate

The current way to track only the Iceberg snapshot-ID as the on-reference-state via `IcebergTable` is insufficient. There are several issues with it:

1. Schema changes in different branches are not properly tracked.
2. Iceberg schema-ID, partitionSpec-ID and sortOrder-ID are not properly tracked.
3. The snapshot-log in `TableMetadata` is not maintained.

Although Iceberg's `Snapshot` class has a field `schemaId`, it is not enough to track schema changes on branches, because a schema change does _not_ produce a new snapshot in Iceberg. This means that the snapshot on the Nessie reference still contains the schema-ID for that snapshot, but not the effective schema-ID after the DDL. For example:

```sql
CREATE TABLE foo (col STRING);
-- table is created in Nessie, snapshotId == -1, tableMetadata.schemaId == 0
INSERT INTO foo VALUES ('hello');
-- snapshot created, snapshotId == 42, tableMetadata.currentSchemaId == 0, snapshot.schemaId == 0
ALTER TABLE foo ADD COLUMN other STRING;
-- no new snapshot, snapshotId == 42, tableMetadata.currentSchemaId == 1, snapshot.schemaId == 0
INSERT INTO foo VALUES ('bar', 'baz');
-- above statement fails, because tableMetadata.currentSchemaId is set to the snapshot's schemaId
```

Therefore we need to track at least the schemaId and probably also the partitionSpecId and sortOrderId in the on-reference-state in Nessie.

A related issue is that `TableMetadata.snapshotLog` is validated to contain `TableMetadata.currentSnapshotId` as the last entry. This becomes a problem when the above example is "enhanced" w/ working on multiple branches:

```sql
CREATE TABLE foo (col STRING);
-- schemaId == 0
INSERT INTO foo VALUES ('hello');
-- snapshotId == 1
CREATE BRANCH branch_1;
CREATE BRANCH branch_2;
USE BRANCH branch_1;
-- snapshotId == 1
ALTER TABLE foo ADD COLUMN added_1 STRING;
-- schemaId == 1
INSERT INTO foo VALUES ('a', 'b');
-- snapshotId == 2
USE BRANCH branch_2;
-- snapshotId == 1
ALTER TABLE foo ADD COLUMN added_2 STRING;
-- fails in TableMetadata.removeSnapshotLogEntries(), because snapshotId == 1, but last one in TableMetadata.snapshotLog == 2
INSERT INTO foo VALUES ('c', 'd');
```

Fixes projectnessie#2312
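
The first failure mode in this commit message can be reproduced with a toy model. The following is a minimal sketch, assuming a drastically simplified stand-in for Iceberg's table metadata; `SnapshotOnlyTrackingSketch`, `Metadata` and `schemaIdFromSnapshotOnly` are hypothetical names, not Iceberg or Nessie classes. It shows that after a DDL statement, a reader that only knows the snapshot ID recovers a stale schema ID.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; a toy model, not Iceberg's TableMetadata. */
final class SnapshotOnlyTrackingSketch {
  static final class Metadata {
    int currentSchemaId = 0;
    long currentSnapshotId = -1;
    // Schema ID that was current when each snapshot was written.
    final Map<Long, Integer> snapshotSchemaIds = new HashMap<>();

    // A data write produces a new snapshot that records the current schema ID.
    void insert(long newSnapshotId) {
      snapshotSchemaIds.put(newSnapshotId, currentSchemaId);
      currentSnapshotId = newSnapshotId;
    }

    // A schema change bumps the schema ID but does not produce a snapshot.
    void addColumn() {
      currentSchemaId++;
    }
  }

  // What can be recovered when only the snapshot ID is tracked on the reference.
  static int schemaIdFromSnapshotOnly(Metadata metadata, long trackedSnapshotId) {
    Integer schemaId = metadata.snapshotSchemaIds.get(trackedSnapshotId);
    return schemaId == null ? 0 : schemaId;
  }

  public static void main(String[] args) {
    Metadata metadata = new Metadata();
    metadata.insert(42L);  // INSERT: snapshot 42 written with schema 0
    metadata.addColumn();  // ALTER TABLE: schema 1, still snapshot 42
    System.out.println("effective schema: " + metadata.currentSchemaId);
    System.out.println("schema via snapshot: " + schemaIdFromSnapshotOnly(metadata, 42L));
  }
}
```

The two printed values differ, which is the mismatch that makes the second `INSERT` in the first SQL example fail.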

servers/store/src/main/java/org/projectnessie/server/store/TableCommitMetaStoreWorker.java

nastra

overall LGTM, I just find the name TableIdGenerators (the plural form) rather confusing

…,spec,sort-order ids as on-ref state This change basically reverts projectnessie#2313. It moves the Iceberg table-metadata-pointer back to Nessie's global state and moves the snapshot id plus other relevant ids (schema-id, partition-spec-id, sort-order-id) to Nessie's on-reference state.

…,spec,sort-order ids as on-ref state (#2626) This change basically reverts #2313. It moves the Iceberg table-metadata-pointer back to Nessie's global state and moves the snapshot id plus other relevant ids (schema-id, partition-spec-id, sort-order-id) to Nessie's on-reference state.
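
For comparison, the layout that the follow-up change describes can be sketched like this. It is a minimal illustration under stated assumptions: `RevertedLayoutSketch`, `OnRefState` and `describe` are hypothetical names, and the field types are guesses; only the set of fields (metadata pointer as global state; snapshot, schema, partition-spec and sort-order IDs as on-reference state) comes from the commit message above.

```java
/** Illustrative sketch only; names and field types are assumptions, not Nessie's model. */
final class RevertedLayoutSketch {
  /** Per-branch IDs that select a view into the shared table metadata. */
  static final class OnRefState {
    final long snapshotId;
    final int schemaId;
    final int specId;
    final int sortOrderId;

    OnRefState(long snapshotId, int schemaId, int specId, int sortOrderId) {
      this.snapshotId = snapshotId;
      this.schemaId = schemaId;
      this.specId = specId;
      this.sortOrderId = sortOrderId;
    }
  }

  // Combines the single global pointer with one branch's on-reference IDs.
  static String describe(String globalMetadataLocation, OnRefState state) {
    return globalMetadataLocation
        + " snapshot=" + state.snapshotId
        + " schema=" + state.schemaId
        + " spec=" + state.specId
        + " sortOrder=" + state.sortOrderId;
  }

  public static void main(String[] args) {
    String global = "meta-3.json"; // one pointer for the table, shared by all branches
    System.out.println(describe(global, new OnRefState(2L, 1, 0, 0)));
    System.out.println(describe(global, new OnRefState(1L, 0, 0, 0)));
  }
}
```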

snazy requested a review from nastra October 10, 2021 10:15

snazy force-pushed the track-iceberg-more branch from ac753a4 to 61f8ec8 Compare October 10, 2021 11:17

nastra suggested changes Oct 11, 2021

snazy marked this pull request as draft October 11, 2021 08:45

snazy force-pushed the track-iceberg-more branch 2 times, most recently from 6a7bdbd to 7513851 Compare October 12, 2021 14:22

snazy changed the title ~~Also track schemaId, specId, sortOrderId for Iceberg's on-reference-state~~ Fix global and on-reference state Oct 12, 2021

snazy force-pushed the track-iceberg-more branch 3 times, most recently from 90cb31c to 2936890 Compare October 19, 2021 07:26

snazy requested a review from nastra October 19, 2021 07:26

snazy marked this pull request as ready for review October 19, 2021 07:26

snazy added pr-jackson pr-native run native test labels Oct 19, 2021

snazy requested a review from omarsmak October 19, 2021 07:26

snazy force-pushed the track-iceberg-more branch from 2936890 to d93dcfe Compare October 19, 2021 07:51

snazy added this to the 0.11.0 milestone Oct 19, 2021

omarsmak reviewed Oct 19, 2021

model/src/main/java/org/projectnessie/model/IcebergTable.java

ajantha-bhat reviewed Oct 19, 2021

snazy force-pushed the track-iceberg-more branch from d93dcfe to 0cea4d2 Compare October 19, 2021 12:36

omarsmak reviewed Oct 19, 2021

omarsmak previously approved these changes Oct 19, 2021

dimas-b reviewed Oct 19, 2021

snazy added 3 commits October 19, 2021 15:52

Change global + on-ref state tracking in Nessie

4d3b36a

use id-generators object

ab61461

snazy dismissed omarsmak’s stale review via ab61461 October 19, 2021 13:52

snazy force-pushed the track-iceberg-more branch from 0cea4d2 to ab61461 Compare October 19, 2021 13:52

omarsmak previously approved these changes Oct 19, 2021

snazy dismissed omarsmak’s stale review via 46cc543 October 19, 2021 13:59

omarsmak previously approved these changes Oct 19, 2021

snazy dismissed omarsmak’s stale review via 32b0c36 October 19, 2021 14:00

snazy force-pushed the track-iceberg-more branch from 46cc543 to 32b0c36 Compare October 19, 2021 14:00

dimas-b reviewed Oct 19, 2021

servers/store/src/main/java/org/projectnessie/server/store/TableCommitMetaStoreWorker.java (outdated)

review

6df5378

snazy force-pushed the track-iceberg-more branch from 32b0c36 to 6df5378 Compare October 19, 2021 14:07

omarsmak approved these changes Oct 19, 2021

dimas-b approved these changes Oct 19, 2021

nastra approved these changes Oct 19, 2021

snazy merged commit 4c303a6 into projectnessie:main Oct 19, 2021

snazy deleted the track-iceberg-more branch October 19, 2021 15:23

ajantha-bhat mentioned this pull request Oct 26, 2021

Update Nessie spec as per #2312 #2501

Closed

snazy mentioned this pull request Nov 9, 2021

Move table-metadata-pointer back to global state, use snapshot,schema,spec,sort-order ids as on-ref state #2626

Merged

5 participants
