Spec: Update row lineage requirements for upgrading tables #12781

rdblue · 2025-04-12T00:09:49Z

This updates the spec to better handle upgrading tables to v3 and to be simpler.

Before, the spec stated that row IDs are assigned the first time a row is modified. Now row IDs are assigned for all added and existing rows in the first snapshot created after a table is upgraded to v3. This ensures that rows have IDs after the first commit to a branch.

In order to assign row IDs to existing files, this updates the inheritance rules:

Any live data file without first_row_id will be assigned one (ADDED or EXISTING) via inheritance
The first_row_id assigned for a data file is the manifest's first_row_id plus sum(record_count) for any other data file assigned before the data file
Any manifest without first_row_id in the manifest list must be assigned one at write time
The first_row_id assigned to a manifest is the snapshot's first-row-id plus sum(existing_row_count + added_row_count) for any other manifest assigned before the manifest

I think that leaving ID space for both existing and added rows makes the feature simpler: any unassigned data file or data manifest will be assigned. That way we don't need to try to track whether existing files were inadvertently assigned a first_row_id when only added data files should be.

RussellSpitzer · 2025-04-12T00:52:38Z

format/spec.md

 #### First Row ID Assignment

-When adding a new data manifest file, its `first_row_id` field is assigned the value of the snapshot's `first_row_id` plus the sum of `added_rows_count` for all data manifests that preceded the manifest in the manifest list.
+When adding a new data manifest file, its `first_row_id` field is assigned the value of the snapshot's `first_row_id` plus the sum of `added_rows_count` and `existing_rows_count` for all data manifests that preceded the manifest in the manifest list and were assigned a `first_row_id`.


Above you said "had a null ..." and i think that's a little bit clearer than "were assigned a"

Updated, but this is a bit tricky because these are simply new manifests. Having a null first_row_id is an implementation detail that is not written into metadata. I've updated it to say both:

When adding a new data manifest file, its first_row_id field is assigned the value of the snapshot's first_row_id plus the sum of added_rows_count and existing_rows_count for all new data manifests that preceded it in the manifest list; that is, those that had a null first_row_id and were assigned one.

I also reformatted this section a little.

RussellSpitzer · 2025-04-12T00:58:13Z

format/spec.md

-The `first_row_id` is only inherited for added data files. The inherited value must be written into the data file metadata for existing and deleted entries. The value of `first_row_id` for delete files is always `null`.
+The inherited value of `first_row_id` must be written into the data file metadata when creating existing and deleted entries. The value of `first_row_id` for delete files is always `null`.
+
+In most cases, only added files will be assigned a new `first_row_id` via inheritance, but any unassigned `first_row_id` must be assigned to handle manifests in upgraded tables that have not yet assigned `first_row_id` for existing entries.


This sentence is a little difficult, I think there is a redundancy which is making a it a little confusing with "any unassigned ... that have not yet assigned"

Maybe we can shorten this to.:

Suggested change

In most cases, only added files will be assigned a new `first_row_id` via inheritance, but any unassigned `first_row_id` must be assigned to handle manifests in upgraded tables that have not yet assigned `first_row_id` for existing entries.

Assignment of `first_row_id` usually only applies to newly added files but during table format version upgrades existing files will also have a null value for `first_row_id` and must also be assigned.

?

I updated this to:

Any null (unassigned) first_row_id must be assigned via inheritance, even if the data file is existing. This ensures that row IDs are assigned to existing data files in upgraded tables in the first commit after upgrading to v3.

There's no need to talk about "usually" and make assumptions.

RussellSpitzer · 2025-04-12T01:10:11Z

format/spec.md

-fields. The values for `_row_id` and `_last_updated_sequence_number` must always return null and when these rows are copied, 
-null must be explicitly written. After this point, rows are treated as if they were just created 
-and assigned `row_id` and `_last_updated_sequence_number` as if they were new rows.
+* Snapshots from before upgrading to v3 do not have row IDs.


So does this mean that we aren't going to do any work on upgrade to assign new row Id's to existing snapshots? We'll only do this on the creation of future snapshots created after the upgrade?

I feel like we can handle this now by just setting next-row-id on all snapshots using the logic below. Then all previous snapshots can get row_id

I think how existing snapshots are handled can probably can be left up to implementations? But I agree that in the reference implementation it makes more sense to just do it on upgrade

I think that's fine, I realize that actually adding post-facto row lineage does require a manifest list rewrite for past snapshots in order to differentiate between existing and added manifests

Yes, I don't think that we should add row lineage for older snapshots because it creates a lot of problems that we don't have good ways to solve and doesn't create much, if any, value.

Here are some of the problems:

To add row IDs to older snapshots, we would need to change the snapshot metadata. This could be cached so changing it may cause strange issues, but the larger problem is that modifying it would require coordination (to avoid losing the changes) and changes to the REST protocol to replace an older snapshot

If we were to add row IDs to older snapshots, then those row IDs would not be very useful without an expensive operation that rewrites the whole metadata tree.

If we assigned new row IDs independently in each snapshot, then a data file would have different row IDs across versions -- making the IDs useless.

That means we would need to track data files and update the metadata tree with consistent first_row_id values.

Even with the expensive rewrite of the whole metadata tree, the _row_id column would be missing from data files so the lineage of individual rows is not available.

I think the best approach is to not modify older snapshots. The goal with this update is to ensure that we have good row lineage from the next snapshot after upgrade and forward, when we can write _row_id. That's why I updated the first_row_id assignment to include any unassigned data file or data manifest, and changed the assignment strategy to leave space for existing rows. With those changes we have created an invariant: new snapshots in v3 tables always have IDs assigned to all rows.

Branching is still a little difficult because we don't want to analyze branch history (which may be lost) and attempt to assign consistently. With this update, branches get separate IDs but it leave open the possibility to do some external analysis and assign the same IDs for the same data files.

I replied here before I saw a comment on the implementation PR that was similar. Here's the link to my response there: #12672 (comment)

I think I summarized the issues better on that one:

I think the argument against filling in historical row IDs is that it is extra work to think through the issues and build an algorithm to modify old data. And when we've done that extra work, we don't actually gain anything: you still can't reason about the IDs of records between branches or even within a branch.

My conclusion is that we should just expose the data that we know, which is that the row ID was null until the row was assigned an ID in a branch.

Thanks for the detailed explanation, yeah I'm convinced now that rewriting the older snapshots adds questionable value at the cost of a lot of complexity.

amogh-jahagirdar · 2025-04-12T15:53:43Z

format/spec.md

+When adding a new data manifest file, its `first_row_id` field is assigned the value of the snapshot's `first_row_id` plus the sum of `added_rows_count` and `existing_rows_count` for all data manifests that preceded the manifest in the manifest list and were assigned a `first_row_id`.

-The `first_row_id` is only assigned for new data manifests. Values for existing manifests must be preserved when writing a new manifest list. The value of `first_row_id` for delete manifests is always `null`.
+The `first_row_id` is only assigned for data manifests that do not have a non-null `first_row_id`. Values for existing manifests must be preserved when writing a new manifest list. The value of `first_row_id` for delete manifests is always `null`.


Should we change that do not have a non-null to that have a null? Feels easier to read

I updated this to be more clear:

The first_row_id is only assigned for new data manifests that do not have a first_row_id.

That avoids assuming null because null isn't written for those manifests.

amogh-jahagirdar · 2025-04-12T16:23:09Z

format/spec.md

-fields. The values for `_row_id` and `_last_updated_sequence_number` must always return null and when these rows are copied, 
-null must be explicitly written. After this point, rows are treated as if they were just created 
-and assigned `row_id` and `_last_updated_sequence_number` as if they were new rows.
+* Snapshots from before upgrading to v3 do not have row IDs.


I think how existing snapshots are handled can probably can be left up to implementations? But I agree that in the reference implementation it makes more sense to just do it on upgrade

format/spec.md

rdblue · 2025-04-14T23:00:37Z

format/spec.md


 The snapshot's `first-row-id` is the starting `first_row_id` assigned to manifests in the snapshot's manifest list.

-The snapshot's `added-rows` is the sum of all the  [`added_rows_count`](#manifest-lists) in all added manifests.


@RussellSpitzer, I've removed added-rows because I think that it is misleading and makes row ID range assignment more complicated.

In this PR, row ID range assignment is based on the total number of existing or added rows in new manifests. That leaves room for any data files that are missing first_row_id even if the files are existing and not added. That may not be the total number of added rows so I don't think it makes sense to have added-rows in the snapshot.

That's fine, we only had this added in order to pass the information from the snapshot into the table metadata. I believe now that logic has moved we don't have that issue.

rdblue · 2025-04-16T20:14:22Z

format/spec.md

 #### Snapshot Row IDs

-A snapshot's `first-row-id` is assigned to the table's current `next-row-id` on each commit attempt. If a commit is retried, the `first-row-id` must be reassigned. If a commit contains no new rows, `first-row-id` should be omitted.
+A snapshot's `first-row-id` is assigned to the table's current `next-row-id` on each commit attempt. If a commit is retried, the `first-row-id` must be reassigned based on the table's current `next-row-id`. The `first-row-id` field is required even if a commit does not assign any ID space.


Always writing first-row-id distinguishes snapshots that added 0 rows from pre-v3 snapshots.

RussellSpitzer · 2025-04-16T21:37:34Z

format/spec.md

+The `first_row_id` for existing manifests must be preserved when writing a new manifest list. The value of `first_row_id` for delete manifests is always `null`. The `first_row_id` is only assigned for data manifests that do not have a `first_row_id`. Assignment must account for data files that will be assigned `first_row_id` values when the manifest is read.

-The `first_row_id` is only assigned for new data manifests. Values for existing manifests must be preserved when writing a new manifest list. The value of `first_row_id` for delete manifests is always `null`.
+The first manifest without a `first_row_id` is assigned a value that is greater than or equal to the `first_row_id` of the snapshot. Subsequent manifests without a `first_row_id` are assigned one based on the previous manifest to be assigned a `first_row_id`. Each assigned `first_row_id` must increase by the row count of all files that will be assigned a `first_row_id` via inheritance in the last assigned manifest. That is, each `first_row_id` must be greater than or equal to the last assigned `first_row_id` plus the total record count of data files with a null `first_row_id` in the last assigned manifest.


Suggested change

The first manifest without a `first_row_id` is assigned a value that is greater than or equal to the `first_row_id` of the snapshot. Subsequent manifests without a `first_row_id` are assigned one based on the previous manifest to be assigned a `first_row_id`. Each assigned `first_row_id` must increase by the row count of all files that will be assigned a `first_row_id` via inheritance in the last assigned manifest. That is, each `first_row_id` must be greater than or equal to the last assigned `first_row_id` plus the total record count of data files with a null `first_row_id` in the last assigned manifest.

The first manifest without a `first_row_id` is assigned a value that is greater than or equal to the `first_row_id` of the snapshot. Subsequent manifests without a `first_row_id` are assigned a value by summing the previously assigned `first_row_id` and the row count of all files that will be assigned `first_row_id` via inheritance in that previously assigned manifest. Each `first_row_id` must be greater than or equal to the last assigned `first_row_id` plus the total record count of data files with a null `first_row_id` in the last assigned manifest.

Tried to simplify this a bit, not sure if I succeeded

The second sentence is intended to clarify how to interpret the requirement (the "must be") by stating that the first_row_id that is assigned is based on the last manifest to be assigned a first_row_id -- without saying how it is "based on" that manifest. My intent was to make the "how" sentence easier to understand, rather than stating the same complicated thing twice.

The complication I'm trying to avoid is the distinction between the manifest that precedes the one being assigned a first_row_id and the last manifest to be assigned a first_row_id.

Yeah I think the Last sentence is clear, but the second sentence just sounds more complicated to me. I'm fine with this as is though, i think the examples make this clear

RussellSpitzer · 2025-04-16T21:40:47Z

format/spec.md

-fields. The values for `_row_id` and `_last_updated_sequence_number` must always return null and when these rows are copied, 
-null must be explicitly written. After this point, rows are treated as if they were just created 
-and assigned `row_id` and `_last_updated_sequence_number` as if they were new rows.
+* Snapshots created before upgrading to v3 do not have row IDs.


I think we should add a note that, when reading from these snapshots all rows should return null (without inheritance as their value for "row_id" or "_last_updated_sequence_number"

I think I did above in the first paragraph because it is part of the specification rather than a consequence of the rules:

For such snapshots without first-row-id, first_row_id values for data files and data manifests are null, and values for _row_id are read as null for all rows. When first_row_id is null, inherited row ID values are also null.

I think we've had some confusion here (at least amongst folks I've talked to) about when "reading as null" and "reading null as something else".

RussellSpitzer · 2025-04-17T17:42:13Z

format/spec.md

 The snapshot then populates the total number of `added-rows` based on the sum of all added rows in the manifests: 100 (50 + 50)

-When the new snapshot is committed, the table's `next-row-id` must also be updated (even if the new snapshot is not in the main branch). Because 225 rows were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 1,000 + 225 = 1,225:
+When the new snapshot is committed, the table's `next-row-id` must also be updated (even if the new snapshot is not in the main branch). Because 375 rows were in data files in manifests that were assigned a `first_row_id` (`added1` 100+25, `added2` 0+100, `added3` 125+25) the new value is 1,000 + 375 = 1,375.


Nit, but we have spaces before and after + in all the other examples

szehon-ho

Looks good to me, left a comment to understand part of it better

szehon-ho · 2025-04-17T17:42:19Z

format/spec.md

 #### Snapshot Row IDs

-A snapshot's `first-row-id` is assigned to the table's current `next-row-id` on each commit attempt. If a commit is retried, the `first-row-id` must be reassigned. If a commit contains no new rows, `first-row-id` should be omitted.
+A snapshot's `first-row-id` is assigned to the table's current `next-row-id` on each commit attempt. If a commit is retried, the `first-row-id` must be reassigned based on the table's current `next-row-id`. The `first-row-id` field is required even if a commit does not assign any ID space.


Note: this seems a good clarification, if i understand this correctly, the first-row-id could be the same value even on the next attempt (if next-row-id is not changed), before it seemed like it implied it needed to be different.

szehon-ho · 2025-04-17T17:50:06Z

format/spec.md

-null must be explicitly written. After this point, rows are treated as if they were just created 
-and assigned `row_id` and `_last_updated_sequence_number` as if they were new rows.
+* Snapshots created before upgrading to v3 do not have row IDs.
+* After upgrading, new snapshots in different branches will assign disjoint ID ranges to existing data files, based on the table's `next-row-id` when the snapshot is committed. For a data file in multiple branches, a writer may write the `first_row_id` from another branch or may assign a new `first_row_id` to the data file (to avoid large metadata rewrites).


if i understand, this means if same data file exists on different branch, can have same first_row_id? It seems the two sentence contradict (first sentence specifies disjoint Id ranges). Would it be more clear:

After upgrading, new snapshots in different branches will assign disjoint ID ranges to existing distinct data files, based on the table's next-row-id when the snapshot is committed. For the same data file in multiple branches...

During upgrade it is possible that an existing row can exist on two different branches with different row ids, after upgrade this will not be possible for new rows.

What this is saying is that when another branch is updated, all of the files in that branch must be assigned IDs by the v3 snapshot, and unless the writer does some additional work to find and write the row IDs for the same data file in other branches, the IDs will be assigned for the branch.

This also says (the last sentence) that the writer can choose to do that extra work, find the first_row_id used in another branch for the file, and use it to have consistent IDs across the branches. This is optional.

I think makes sense. I still read the two sentence as contradict (as the first sentence specifies 'disjoint', but second sentence says we can optionally re-use the other branch first_row_id), hence my suggestion if it makes sense

rdblue · 2025-04-20T00:13:40Z

The vote passed. Thanks for the reviews, everyone!

Spec: Update row lineage requirements for upgrading tables.

d46c6f9

rdblue requested review from RussellSpitzer and danielcweeks April 12, 2025 00:09

github-actions bot added the Specification Issues that may introduce spec changes. label Apr 12, 2025

RussellSpitzer reviewed Apr 12, 2025

View reviewed changes

amogh-jahagirdar reviewed Apr 12, 2025

View reviewed changes

rdblue commented Apr 14, 2025

View reviewed changes

format/spec.md Outdated Show resolved Hide resolved

rdblue commented Apr 14, 2025

View reviewed changes

rdblue added 3 commits April 14, 2025 16:12

Update for review comments.

0db277a

Use "increase by" wording rather than strict first_row_id assignment.

46527d4

Fix typo.

d0c9704

rdblue commented Apr 16, 2025

View reviewed changes

RussellSpitzer reviewed Apr 16, 2025

View reviewed changes

Doc: Make next-row-id required in v3

261e8f1

RussellSpitzer reviewed Apr 17, 2025

View reviewed changes

szehon-ho approved these changes Apr 17, 2025

View reviewed changes

RussellSpitzer approved these changes Apr 17, 2025

View reviewed changes

amogh-jahagirdar approved these changes Apr 17, 2025

View reviewed changes

stevenzwu approved these changes Apr 17, 2025

View reviewed changes

rdblue mentioned this pull request Apr 17, 2025

Spec: Make next-row-id required in v3 #12757

Closed

aihuaxu approved these changes Apr 17, 2025

View reviewed changes

huaxingao approved these changes Apr 17, 2025

View reviewed changes

rdblue mentioned this pull request Apr 18, 2025

Spark: Add _row_id and _last_updated_sequence_number readers #12836

Merged

rdblue merged commit edc7f2a into apache:main Apr 20, 2025
2 checks passed

	In most cases, only added files will be assigned a new `first_row_id` via inheritance, but any unassigned `first_row_id` must be assigned to handle manifests in upgraded tables that have not yet assigned `first_row_id` for existing entries.
	Assignment of `first_row_id` usually only applies to newly added files but during table format version upgrades existing files will also have a null value for `first_row_id` and must also be assigned.


		The snapshot's `first-row-id` is the starting `first_row_id` assigned to manifests in the snapshot's manifest list.

		The snapshot's `added-rows` is the sum of all the [`added_rows_count`](#manifest-lists) in all added manifests.

Spec: Update row lineage requirements for upgrading tables #12781

Spec: Update row lineage requirements for upgrading tables #12781

Uh oh!

Conversation

rdblue commented Apr 12, 2025

Uh oh!

RussellSpitzer Apr 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdblue Apr 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RussellSpitzer Apr 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

szehon-ho left a comment

Choose a reason for hiding this comment

Uh oh!

szehon-ho Apr 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdblue commented Apr 20, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

RussellSpitzer Apr 12, 2025 •

edited

Loading

rdblue Apr 14, 2025 •

edited

Loading

RussellSpitzer Apr 12, 2025 •

edited

Loading

szehon-ho Apr 17, 2025 •

edited

Loading