Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Mar 20, 2025

This removes sections of #11948 that allow row lineage to be enabled or disabled. Instead, row lineage is always on for v3 tables.

Note: This is a work-in-progress and merging should wait until after the discussion about making v3 row lineage always-on concludes.

@github-actions github-actions bot added the core label Mar 20, 2025
boolean rowLineageEnabled,
long nextRowId) {
long nextRowId,
List<MetadataUpdate> changes) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that this PR also restores the convention that changes are passed last to the TableMetadata constructor. This is because changes are accumulated in TableMetadata.Builder and are not part of the metadata definition. I'd prefer not to mix changes into the actual table metadata.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly correct to me except for the noted test removals which need to be pared back a bit.

@github-actions github-actions bot added the API label Mar 24, 2025
@TestTemplate
public void testEnableRowLineageViaPropertyAtTableCreation() {
assumeThat(formatVersion).isGreaterThanOrEqualTo(TableMetadata.MIN_FORMAT_VERSION_ROW_LINEAGE);
public void testReplace() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this test because DataOperation.REPLACE row counts are now handled correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rolled this back. I'll fix it when I update how first-row-id gets assigned.

@TestTemplate
public void testReplace() {
assumeThat(formatVersion).isGreaterThanOrEqualTo(TableMetadata.MIN_FORMAT_VERSION_ROW_LINEAGE);
public void testPositionDeletes() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer, I added tests for position and equality deletes. These were previously failing because the record counts from delete manifests were also used. Please take a look at the new test cases and the fix.

.hasMessageContaining("Cannot use row lineage");
}
@Test
public void testSnapshotRowIDValidation() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checks the new validations in Snapshot.

"Cannot add a snapshot with a null 'added-rows' field when row lineage is enabled");
Preconditions.checkArgument(
snapshot.addedRows() >= 0,
"Cannot decrease 'last-row-id'. 'last-row-id' must increase monotonically. Snapshot reports %s added rows");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer, I moved these validations in to Snapshot rather than leaving them here. Now this only validates that first-row-id is set for any Snapshot in a v3 table and that the first-row-id is increasing. Both of those can't be checked by the Snapshot itself.

Also, I thought about keeping the checks here rather than moving them to Snapshot so that we can always read metadata, even when written incorrectly. However, when a snapshot is missing first-row-id, it isn't safe to read the snapshot so I think it is okay to fail when reading metadata rather than when using the metadata.

String operation, Map<String, String> summary, List<ManifestFile> manifests) {
if (summary != null) {
long addedRecords =
PropertyUtil.propertyAsLong(summary, SnapshotSummary.ADDED_RECORDS_PROP, 0L);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the summary is defined, we trust the added-records and deleted-records values, which should match the counts calculated below.

I also added a check to handle the DataOperations.REPLACE case. For a replace, the added records should always be 0. If we want, we can also add checks for data file metadata that assert there are no null values for the _row_id column. What do you think @RussellSpitzer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've actually been thinking about this a bit more and I'm not sure we can do this. If Summary != null we can't assume that Added_Records_Prop is defined. I think if it is null we probably should still fall back to manifest inspection

Copy link
Contributor Author

@rdblue rdblue Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can trust the summary to be correct given that the operation produces it. However, I think it is best not to rely on this and to use actual rows assigned by the writers instead. I'll follow up with a fix.

For now, I've reverted these changes.

this.v1ManifestLocations = null;
this.firstRowId = firstRowId;
this.addedRows = addedRows;
this.addedRows = firstRowId != null ? addedRows : null;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be slightly more permissive, I'm allowing addedRows to be passed, but it is only propagated if firstRowId is non-null.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a few style comments but change looks good. Thanks for fixing the DeleteFile handling

@RussellSpitzer
Copy link
Member

Pending Tests and Spec Change vote of course :)

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rdblue , beyond the mailing list discussion, I think the remaining aspect is some of the assertions need to be restored , but looks like this is known!

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the proposed changes LGTM. I actually thought that the Row Lineage stuff went into main and that we could just revert the metadata update API and the TableMetadata updates without breaking anything, but it looks like it went into 1.8.0 (1fcd6a9), so we'll need to address the RevAPI failures and also figure out how we want to deal with the removal of the EnableRowLineage metadata update

@rdblue rdblue closed this Mar 25, 2025
@rdblue rdblue reopened this Mar 25, 2025
}

// otherwise this is a no-op
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have expected that this method would return this instead of null as otherwise library users using this method would now get a NPE

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I would have expected TableMetadata#enableRowLineage() to return this instead of null, which would potentially cause a NPE for library users already using this method

@rdblue rdblue merged commit 7d7aeb8 into apache:main Apr 7, 2025
77 of 86 checks passed
@rdblue
Copy link
Contributor Author

rdblue commented Apr 7, 2025

Now that the vote has passed, I'll merge this. I'm following up in #12672 to add more core functionality so I'll fix the TableMetadata return value in that PR. Thanks for the reviews, @RussellSpitzer, @amogh-jahagirdar, and @nastra!

Fokko added a commit to Fokko/iceberg-python that referenced this pull request Apr 8, 2025
With apache/iceberg#12593 Row lineage is required
and Spark uses Java 1.8.0 that does not write the required `start-row-id`
field in the snapshot.

Therefore, I think it would be good to just use the SNAPSHOT for now
until it gets released.
Fokko added a commit to apache/iceberg-python that referenced this pull request Apr 8, 2025
With apache/iceberg#12593 Row lineage is
required and Spark uses Java 1.8.0 that does not write the required
`first-row-id` field in the snapshot.

Therefore, I think it would be good to just use the SNAPSHOT for now
until it gets released. We can easily revert the PR once that's done.

Closes #1898

<!--
Thanks for opening a pull request!
-->

<!-- In the case this PR will resolve an issue, please replace
${GITHUB_ISSUE_ID} below with the actual Github issue id. -->
<!-- Closes #${GITHUB_ISSUE_ID} -->

# Rationale for this change

# Are these changes tested?

# Are there any user-facing changes?

<!-- In the case of user-facing changes, please add the changelog label.
-->
gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025
With apache/iceberg#12593 Row lineage is
required and Spark uses Java 1.8.0 that does not write the required
`first-row-id` field in the snapshot.

Therefore, I think it would be good to just use the SNAPSHOT for now
until it gets released. We can easily revert the PR once that's done.

Closes apache#1898

<!--
Thanks for opening a pull request!
-->

<!-- In the case this PR will resolve an issue, please replace
${GITHUB_ISSUE_ID} below with the actual Github issue id. -->
<!-- Closes #${GITHUB_ISSUE_ID} -->

# Rationale for this change

# Are these changes tested?

# Are there any user-facing changes?

<!-- In the case of user-facing changes, please add the changelog label.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants