-
Notifications
You must be signed in to change notification settings - Fork 3k
Flink: fix out of order appends in upsert UT #4532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
kbendick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great. Thank you to all involved @singhpk234 @yittg @openinx
I left some initial comments and will be reviewing more tomorrow.
|
|
||
| @Test | ||
| public void testPrimaryKeyEqualToPartitionKey() { | ||
| // This is an SQL based reproduction of TestFlinkIcebergSinkV2#testUpsertOnDataKey |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this comment as well while you're updating this? It's no longer applicable really and not hepful.
Thank you!
| "(4, 'a', DATE '2022-03-02')," + | ||
| "(5, 'b', DATE '2022-03-02')," + | ||
| "(1, 'b', DATE '2022-03-02')", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we addd a comment about the usage of DATE vs TO_DATE, given that we're relying on DATE to affect the generated job graph differently than using TO_DATE does, which affects ordering etc?
A short comment and then directing to this comment might be sufficent: #4515 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the commit message is a more proper place to explain it.
I'll split this PR into two parts as you mentioned, and describe the reason in the commit message of the first.
WDYT @kbendick ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. As well as a proper description in the PR comment (which can be the same). As those are the two most commonly searched places when reviewing older things.
Thanks again for the great work.
kbendick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yittg Would you mind spitting this into two PRs, one that has the minimum changes required to fix the issue (changing the use of TO_DATE to just DATE and anything else needed), and then another PR to add the dynamic data generation and things?
It would be very beneficial in the long term to have a more clear record of the fix that is needed, without the extra changes. The differences in the two plans generated will be helpful to have a good long term record of outside of just the issue.
Then a second PR can be opened to clean up and better normalize the tests.
Or at the least if you can add a good summary of the changes to the PR description, that would be highly appreciated. 🙂
In Flink, only VALUES with literal tuples can be converted to a single
LogicalValues node. VALUES with other type of tuples will be converted
to a chain with multiple LogicalProject nodes, like following:
LogicalValues -> (# number or rows)LogicalProject -> LogicalUnion.
The `TO_DATE('xxxx-xx-xx')` expression is not a literal node, so the
VALUES used in upsert test cases will be converted in the latter way.
The order of tuples can not be guaranteed, which can lead to unstable
test results.
Changing `TO_DATE('xxxx-xx-xx')` to `DATE 'xxxx-xx-xx'` can make it a
literal node and ensure that the VALUES expression can be converted to a
single LogicalValues node.
|
@kbendick i've changed this PR to contain only the first part. I'll create another PR for the second part after this being merged, if necessary. If the current test data is already enough, we can just drop the second part. :) |
openinx
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me ! Thanks @yittg for the contribution !
In Flink, only VALUES with literal tuples can be converted to a single
LogicalValues node. VALUES with other type of tuples will be converted
to a chain with multiple LogicalProject nodes, like following:
The
TO_DATE('xxxx-xx-xx')expression is not a literal node, so theVALUES used in upsert test cases will be converted in the latter way.
The order of tuples can not be guaranteed, which can lead to unstable
test results.
Changing
TO_DATE('xxxx-xx-xx')toDATE 'xxxx-xx-xx'can make it aliteral node and ensure that the VALUES expression can be converted to a
single LogicalValues node.
fixes #4515