
Conversation

@nastra (Contributor) commented May 16, 2022

This backports #4417 / #3834 (because #4364 depends on it) / #4364 / #4189 (because CI was failing without this test fix) to the 0.13.x branch

@nastra (Contributor, Author) commented May 16, 2022

@kbendick could you please double-check if there are any other Flink changes that need to be backported?


```java
@Override
protected StructLike asStructLikeKey(RowData data) {
  return wrapper.wrap(data);
}
```

@nastra (Contributor, Author) commented on this snippet:
@kbendick should this throw an exception like

```java
throw new UnsupportedOperationException("Not implemented for Flink 1.12 during PR review");
```

or does it even matter what we return here for Flink 1.12?

Reply from a Contributor:

I would keep it this way upon initial inspection.

It needs to be implemented for the API and it will possibly get called. Keeping it the same as asStructLike seems safest for backwards compatibility.
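A minimal, self-contained sketch of the pattern being discussed (toy stand-ins, not the actual Iceberg/Flink types): `asStructLikeKey` delegating to the same wrapper as `asStructLike`, so both return the same wrapped view of the row.

```java
// Toy stand-in for Iceberg's RowDataWrapper: wrap() stores the row and
// returns the wrapper itself, so callers see a reusable wrapped view.
class ToyWrapper {
    private Object wrapped;

    ToyWrapper wrap(Object row) {
        this.wrapped = row;
        return this;
    }

    Object get() {
        return wrapped;
    }
}

// Toy stand-in for BaseDeltaTaskWriter, illustrating the reviewed choice:
// asStructLikeKey behaves exactly like asStructLike for the 1.12 backport.
class ToyDeltaTaskWriter {
    private final ToyWrapper wrapper = new ToyWrapper();

    Object asStructLike(Object data) {
        return wrapper.wrap(data);
    }

    // Same behavior as asStructLike, as suggested above for Flink 1.12.
    Object asStructLikeKey(Object data) {
        return wrapper.wrap(data);
    }
}
```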

@nastra nastra added this to the Iceberg 0.13.2 Release milestone May 17, 2022
@nastra nastra force-pushed the flink-upsert-delete-file-metadata-backports branch 3 times, most recently from 34edff3 to 275dce6 on May 17, 2022 15:05
@kbendick (Contributor) left a comment:

+1. I wanted to take some time to verify that some of the changes in core didn't cause issues because we didn't backport them.

TL;DR: it shouldn't (and we're emphatically alerting people to upgrade anyway).

@kbendick (Contributor) commented:

I'm not sure what the deal is with the Flink 1.13 tests. Maybe the heap change needs to be merged in first? It seems, though, that the 1.13 tests are just... waiting.

[screenshot: CI checks showing the Flink 1.13 tests still pending]

@rdblue (Contributor) commented May 17, 2022

The test issue looks like a GitHub problem. I tried to cancel the run, but it looks like I can't right now. I'll check back in a few hours; it should time out.

@nastra (Contributor, Author) commented May 18, 2022

Yes, the job is timing out, but it's unclear why because the run itself doesn't show any output. Locally it helped to increase the heap size for Gradle (since the build didn't progress locally either).

@kbendick (Contributor) commented:

GitHub Actions uses the workflow settings from master for security purposes.

Maybe merging in the heap increase first will help? I doubt the tests are running with the heap increase if it's only in this PR. Just a guess, but the tests didn't complete without the increased heap for me either.

Also, since this is a release branch, feel free to make the heap slightly bigger and/or tune the forked JVM’s GC settings in the settings.gradle (or wherever).

But I think running with the heap settings already merged in will help the most. I think you can try it in your fork if you run the tests there.

@kbendick (Contributor) commented May 18, 2022

Oh, and the other thing would be to add class unloading to the test JVM settings. I helped tune the JVM for CI in Spark, and class unloading, a fixed MaxMetaspaceSize (to force class unloading), and somewhat larger heaps helped quite a bit in a situation that had gotten really out of hand.

Also, this was my only meaningful contribution to the Spark repo, and the credit got misattributed because I never added that email to my GitHub 😂

But class unloading really helps due to Flink's inverted class loader. And our test JVM prior to this change (and even still, really) is very small in my opinion.

Again, since this is a release branch, and very likely not one we'll ever work off of again, feel free to be liberal in updating the JVM settings for test purposes! They just need to run well enough to make the release!
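As a sketch of the kind of tuning suggested here (values and flags are illustrative, not the settings that were actually merged into the release branch), the forked test JVM could be configured in the Gradle build along these lines:

```groovy
// Hypothetical example for a test task; heap size and flags are
// illustrative, not the values merged into 0.13.x.
test {
    maxHeapSize = '2g'
    jvmArgs '-XX:+UseG1GC',
            // A bounded metaspace forces class unloading, which matters with
            // Flink's inverted (child-first) class loading in tests.
            '-XX:MaxMetaspaceSize=512m',
            '-XX:+ClassUnloadingWithConcurrentMark'
}
```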

@kbendick (Contributor) commented:

The only other issue I can think of is a GitHub issue with the JVM cache, as the problem is sticky for Flink 1.13.

But the most recent attempt did run somewhat so I don’t think that’s it. I think it’s just poor memory utilization / availability combined with some somewhat heavy Flink SQL upsert tests.

@nastra nastra force-pushed the flink-upsert-delete-file-metadata-backports branch from 275dce6 to 5ea6b7f on May 18, 2022 10:26
@nastra (Contributor, Author) commented May 18, 2022

I think I found the issue. It can easily be debugged locally by adding --no-parallel --debug to the Gradle invocation:

```shell
./gradlew -DsparkVersions= -DhiveVersions= -DflinkVersions=1.13 \
  :iceberg-flink:iceberg-flink-1.13:check \
  :iceberg-flink:iceberg-flink-runtime-1.13:check \
  -Pquick=true -x javadoc --no-parallel --debug
```

It looks like something got messed up during conflict resolution: BaseDeltaTaskWriter#asStructLike(..) was accidentally returning keyWrapper.wrap(data) instead of wrapper.wrap(data). This caused TestChangeLogTable#testChangeLogOnDataKey() to run indefinitely in a retry loop because of an underlying exception.
For the future, I think it would be good to have a timeout on tests, as issues like these are quite difficult to debug on CI without modifying how the test is executed (e.g. by adding --debug to the Gradle task).
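One way to get the kind of timeout suggested here is Gradle's task-level `timeout` property (a sketch with an illustrative value; whether this was adopted is not shown in this thread):

```groovy
// Illustrative sketch: fail the test task after 30 minutes instead of
// letting a hung test run until the CI job itself times out.
test {
    timeout = java.time.Duration.ofMinutes(30)
}
```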

@kbendick (Contributor) commented May 19, 2022

> [quotes @nastra's previous comment in full]

+1 to better timeout resolution.

It's possible there's something else we need to cherry-pick, given the underlying exception for that one test case. I'll run locally and we'll see. We can also try to schedule time to talk if you'd like. I'm very interested in getting this out the door.

Though tests do seem to be passing now, so potentially you already got to it. 🙂

@rdblue rdblue merged commit f72a15b into apache:0.13.x May 19, 2022
@nastra nastra deleted the flink-upsert-delete-file-metadata-backports branch September 27, 2022 06:19