-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Flink: Preserve row lineage in RewriteDataFiles #14149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
...flink/src/main/java/org/apache/iceberg/flink/maintenance/operator/DataFileRewriteRunner.java
Show resolved
Hide resolved
| return createTable("2"); | ||
| return createTable("3"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to keep tests for V2 too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revert this change and only use the table version V3 in lineage test case.
But I have a question, now OperatorTestBase.createTable() use version 2 for default. Should we test the all UT for V3 too ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe in another PR. We need to come up with a set of tests which we want to run against multiple table versions to have test coverage for all of the supported spec versions
7cc884c to
0ce3976
Compare
|
Rebase on main since reader #14148 merged. |
|
|
||
| protected static Table createTable() { | ||
| // only test V2 tables as compaction doesn't support V3 with row lineage | ||
| return createTable("2"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you take a look at TestHelpers.V3_AND_ABOVE, then we need to transition these tests to that in the long run.
Maybe the "2" should be an int instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the long run, it does need to be changed to TestHelpers.V3_AND_ABOVE, because there are quite a few test cases involved. Moreover, some features of v3 are actually not supported in Flink and may require compatibility handling. Is it possible to open a separate PR to handle this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, that we need to do it in a separate PR. The point here is just to use an int instead of a string to store the version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, have changed now.
|
@mxm: Could you please take a look? |
|
Mark ci fail https://github.com/apache/iceberg/actions/runs/18073439985/job/51426507107?pr=14149 |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
Keep it alive |
| } | ||
|
|
||
| @Test | ||
| void testRewriteUnpartitionedPreserveLineage() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to add tests for TestDataFileRewriteRunner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have add a ut for the v3 table in TestDataFileRewriteRunner
e9a593e to
456d673
Compare
|
Merged to main. |
When RewriteDataFiles executes rewrite tasks for a PlannedGroup, if the table is detected to support RowLineage, it rewrites the schema to add
ROW_IDandLAST_UPDATED_SEQUENCE_NUMBER. It then reads the newly addedROW_IDandLAST_UPDATED_SEQUENCE_NUMBERfields and writes the lineage information into the merged DataFiles.This pr is relay on the pr #14148