Skip to content

Conversation

@rdblue
Copy link

@rdblue rdblue commented Sep 26, 2024

Not all of these changes will go in, but I fixed issues and clarified as I went through the PR.

I also moved certain sections closer to where the information is relevant. I added "First Row ID Inheritance" and "First Row ID Assignment" sections, for example, attached to the data file metadata and manifest file metadata sections.

and explicitly written when the row is moved to a new file.
* `_last_update` the sequence number of the commit which last updated this row. The value is computed via inheritance for
rows in their original file or in files where the row was modified.
* `_row_id` a unique long for every row. Assigned via inheritance when a row is added and the existing value is explicitly written when the row is written to a new file.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think "added" here is ambigous, which is why I had newly created. Perhaps "first added to the table".

I think the other half of the sentence is also a bit confusing grammatically to me. I'll try to do a little rewording here.

sum of the `total_records` of every previous added `data_file` entry summed with the`first_row_id` of the manifest.
Row lineage fields are written when row lineage is enabled. When not enabled, row lineage fields (`_row_id` and `_last_update`) must not be written to data files. The rest of this section applies when row lineage is enabled.

When a row is added or modified, the `_last_update` field is set to `null` so that it is inherited when reading. Similarly, the `_row_id` field for an added row is set to `null` and assigned when reading.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires always adding the fields when writing and I was convinced that this is probably not necessary for what we are doing. If we don't have this requirement it becomes very easy for writers which generate parquet files without these fields to participate in row lineage and only metadata writers need to be modified (except in the case of row level operations). The other benefit is that files can be added to the table from another source (like hive) and still ahve row lineage work.

On read, if `_last_update` is `null` it is assigned the `sequence_number` of the data file's manifest entry. The data sequence number of a data file is documented in [Sequence Number Inheritance](#sequence-number-inheritance).

Given a original metadata
When `null`, a row's `_row_id` field is assigned to the `start_row_id` from its containing data file plus the row position in that data file (`_pos`). A data file's `start_row_id` field is assigned using inheritance and is documented in [First Row ID Inheritance](#first-row-id-inheritance). A manifest's `start_row_id` is assigned when writing the manifest list for a snapshot and is documented in [First Row ID Assignment](#first-row-id-assignment). A snapshot's `start-row-id` is to the table's `next-row-id` and is documented in [Snapshot Row IDs](#snapshot-row-ids).
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a verb I think in the last sentence. I think assigned belongs there but I can make the change.

"last-row-id": 1000
}
```
Values for `_row_id` and `_last_update` are either read from a data file or assigned at read time. As a result, rows in a table always have non-null values for these fields when lineage is enabled.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gonna swap "a data file" for "the data file".

@RussellSpitzer RussellSpitzer merged commit bcfd8f7 into RussellSpitzer:RowLineage Sep 27, 2024
RussellSpitzer added a commit that referenced this pull request Sep 23, 2025
# This is the 1st commit message:

Parquet, Core: Allows Internal Parquet Readers to use Custom Types

# This is the commit message #2:

Remove some test code that is no longer in use

# This is the commit message #3:

Add Preconditions for double setting Parquet

# This is the commit message #4:

Reviewer Comments

# This is the commit message #5:

Reviewer Comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants