-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Core: Enable row lineage for all v3 tables #12593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| boolean rowLineageEnabled, | ||
| long nextRowId) { | ||
| long nextRowId, | ||
| List<MetadataUpdate> changes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that this PR also restores the convention that changes are passed last to the TableMetadata constructor. This is because changes are accumulated in TableMetadata.Builder and are not part of the metadata definition. I'd prefer not to mix changes into the actual table metadata.
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly correct to me except for the noted test removals which need to be pared back a bit.
| @TestTemplate | ||
| public void testEnableRowLineageViaPropertyAtTableCreation() { | ||
| assumeThat(formatVersion).isGreaterThanOrEqualTo(TableMetadata.MIN_FORMAT_VERSION_ROW_LINEAGE); | ||
| public void testReplace() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this test because DataOperation.REPLACE row counts are now handled correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rolled this back. I'll fix it when I update how first-row-id gets assigned.
| @TestTemplate | ||
| public void testReplace() { | ||
| assumeThat(formatVersion).isGreaterThanOrEqualTo(TableMetadata.MIN_FORMAT_VERSION_ROW_LINEAGE); | ||
| public void testPositionDeletes() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer, I added tests for position and equality deletes. These were previously failing because the record counts from delete manifests were also used. Please take a look at the new test cases and the fix.
| .hasMessageContaining("Cannot use row lineage"); | ||
| } | ||
| @Test | ||
| public void testSnapshotRowIDValidation() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This checks the new validations in Snapshot.
| "Cannot add a snapshot with a null 'added-rows' field when row lineage is enabled"); | ||
| Preconditions.checkArgument( | ||
| snapshot.addedRows() >= 0, | ||
| "Cannot decrease 'last-row-id'. 'last-row-id' must increase monotonically. Snapshot reports %s added rows"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer, I moved these validations in to Snapshot rather than leaving them here. Now this only validates that first-row-id is set for any Snapshot in a v3 table and that the first-row-id is increasing. Both of those can't be checked by the Snapshot itself.
Also, I thought about keeping the checks here rather than moving them to Snapshot so that we can always read metadata, even when written incorrectly. However, when a snapshot is missing first-row-id, it isn't safe to read the snapshot so I think it is okay to fail when reading metadata rather than when using the metadata.
| String operation, Map<String, String> summary, List<ManifestFile> manifests) { | ||
| if (summary != null) { | ||
| long addedRecords = | ||
| PropertyUtil.propertyAsLong(summary, SnapshotSummary.ADDED_RECORDS_PROP, 0L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the summary is defined, we trust the added-records and deleted-records values, which should match the counts calculated below.
I also added a check to handle the DataOperations.REPLACE case. For a replace, the added records should always be 0. If we want, we can also add checks for data file metadata that assert there are no null values for the _row_id column. What do you think @RussellSpitzer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've actually been thinking about this a bit more and I'm not sure we can do this. If Summary != null we can't assume that Added_Records_Prop is defined. I think if it is null we probably should still fall back to manifest inspection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can trust the summary to be correct given that the operation produces it. However, I think it is best not to rely on this and to use actual rows assigned by the writers instead. I'll follow up with a fix.
For now, I've reverted these changes.
| this.v1ManifestLocations = null; | ||
| this.firstRowId = firstRowId; | ||
| this.addedRows = addedRows; | ||
| this.addedRows = firstRowId != null ? addedRows : null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be slightly more permissive, I'm allowing addedRows to be passed, but it is only propagated if firstRowId is non-null.
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have a few style comments but change looks good. Thanks for fixing the DeleteFile handling
|
Pending Tests and Spec Change vote of course :) |
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rdblue , beyond the mailing list discussion, I think the remaining aspect is some of the assertions need to be restored , but looks like this is known!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the proposed changes LGTM. I actually thought that the Row Lineage stuff went into main and that we could just revert the metadata update API and the TableMetadata updates without breaking anything, but it looks like it went into 1.8.0 (1fcd6a9), so we'll need to address the RevAPI failures and also figure out how we want to deal with the removal of the EnableRowLineage metadata update
| } | ||
|
|
||
| // otherwise this is a no-op | ||
| return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have expected that this method would return this instead of null as otherwise library users using this method would now get a NPE
nastra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I would have expected TableMetadata#enableRowLineage() to return this instead of null, which would potentially cause a NPE for library users already using this method
|
Now that the vote has passed, I'll merge this. I'm following up in #12672 to add more core functionality so I'll fix the |
With apache/iceberg#12593 Row lineage is required and Spark uses Java 1.8.0 that does not write the required `start-row-id` field in the snapshot. Therefore, I think it would be good to just use the SNAPSHOT for now until it gets released.
With apache/iceberg#12593 Row lineage is required and Spark uses Java 1.8.0 that does not write the required `first-row-id` field in the snapshot. Therefore, I think it would be good to just use the SNAPSHOT for now until it gets released. We can easily revert the PR once that's done. Closes #1898 <!-- Thanks for opening a pull request! --> <!-- In the case this PR will resolve an issue, please replace ${GITHUB_ISSUE_ID} below with the actual Github issue id. --> <!-- Closes #${GITHUB_ISSUE_ID} --> # Rationale for this change # Are these changes tested? # Are there any user-facing changes? <!-- In the case of user-facing changes, please add the changelog label. -->
With apache/iceberg#12593 Row lineage is required and Spark uses Java 1.8.0 that does not write the required `first-row-id` field in the snapshot. Therefore, I think it would be good to just use the SNAPSHOT for now until it gets released. We can easily revert the PR once that's done. Closes apache#1898 <!-- Thanks for opening a pull request! --> <!-- In the case this PR will resolve an issue, please replace ${GITHUB_ISSUE_ID} below with the actual Github issue id. --> <!-- Closes #${GITHUB_ISSUE_ID} --> # Rationale for this change # Are these changes tested? # Are there any user-facing changes? <!-- In the case of user-facing changes, please add the changelog label. -->
This removes sections of #11948 that allow row lineage to be enabled or disabled. Instead, row lineage is always on for v3 tables.
Note: This is a work-in-progress and merging should wait until after the discussion about making v3 row lineage always-on concludes.