
Conversation

@nastra (Contributor) commented May 16, 2022

This backports #4417 / #3834 (because #4364 depends on it) / #4364 / #4189 (because CI was failing without this test fix) to the 0.13.x branch

@nastra (Contributor, Author) commented May 16, 2022

@kbendick could you please double-check if there are any other Flink changes that need to be backported?


```java
@Override
protected StructLike asStructLikeKey(RowData data) {
  return wrapper.wrap(data);
}
```

@nastra (Contributor, Author) commented on this snippet:
@kbendick should this throw an exception like

```java
throw new UnsupportedOperationException("Not implemented for Flink 1.12 during PR review");
```

or does it even matter what we return here for Flink 1.12?

Reply from a Contributor:

I would keep it this way upon initial inspection.

It needs to be implemented for the API and it will possibly get called. Keeping it the same as asStructLike seems safest for backwards compatibility.
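A minimal, self-contained sketch of the pattern being discussed (toy stand-ins, not the actual Iceberg/Flink types): `asStructLikeKey` delegating to the same wrapper as `asStructLike`, so both return the same wrapped view of the row.

```java
// Toy stand-in for Iceberg's RowDataWrapper: wrap() stores the row and
// returns the wrapper itself, so callers see a reusable wrapped view.
class ToyWrapper {
    private Object wrapped;

    ToyWrapper wrap(Object row) {
        this.wrapped = row;
        return this;
    }

    Object get() {
        return wrapped;
    }
}

// Toy stand-in for BaseDeltaTaskWriter, illustrating the reviewed choice:
// asStructLikeKey behaves exactly like asStructLike for the 1.12 backport.
class ToyDeltaTaskWriter {
    private final ToyWrapper wrapper = new ToyWrapper();

    Object asStructLike(Object data) {
        return wrapper.wrap(data);
    }

    // Same behavior as asStructLike, as suggested above for Flink 1.12.
    Object asStructLikeKey(Object data) {
        return wrapper.wrap(data);
    }
}
```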

@nastra nastra added this to the Iceberg 0.13.2 Release milestone May 17, 2022
@nastra nastra force-pushed the flink-upsert-delete-file-metadata-backports branch 3 times, most recently from 34edff3 to 275dce6 on May 17, 2022 15:05
@kbendick (Contributor) left a comment:

+1. I wanted to take some time to verify that some of the changes in core didn't cause issues because we didn't backport them.

TL;DR: it shouldn't (and we're emphatically alerting people to upgrade anyway).

@kbendick (Contributor) commented:

I'm not sure what the deal is with the Flink 1.13 tests. Maybe the heap change needs to be merged in first? It seems, though, that the 1.13 tests are just... waiting.

[screenshot: CI checks showing the Flink 1.13 tests still pending]

@rdblue (Contributor) commented May 17, 2022

The test issue looks like a GitHub problem. I tried to cancel the run, but it looks like I can't right now. I'll check back in a few hours; it should time out.

@nastra (Contributor, Author) commented May 18, 2022

Yes, the job is timing out, but it's unclear why because the run itself doesn't show any output. Locally it helped to increase the heap size for Gradle (since the build didn't progress locally either).

@kbendick (Contributor) commented:

GitHub Actions uses the workflow settings from master for security purposes.

Maybe merging in the heap increase first will help? I doubt the tests are running with the heap increase if it's only in this PR. Just a guess, but the tests didn't complete without the increased heap for me either.

Also, since this is a release branch, feel free to make the heap slightly bigger and/or tune the forked JVM’s GC settings in the settings.gradle (or wherever).

But I think running with the heap settings already merged in will help the most. I think you can try it in your fork if you run the tests there.

@kbendick (Contributor) commented May 18, 2022

Oh, and the other thing would be to add class unloading to the test JVM settings. I helped tune the JVM for CI in Spark, and class unloading, a fixed MaxMetaspaceSize (to force class unloading), and somewhat larger heaps helped quite a bit in a situation that had gotten really out of hand.

Also, this was my only meaningful contribution to the Spark repo, and the credit got misattributed because I never added that email to my GitHub 😂

But class unloading really helps due to Flink's inverted class loader. And our test JVM prior to this change (and even still, really) is very small in my opinion.

Again, since this is a release branch, and very likely not one we'll ever work off of again, feel free to be liberal in updating the JVM settings for test purposes! They just need to run well enough to make the release!
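As a sketch of the kind of tuning suggested here (values and flags are illustrative, not the settings that were actually merged into the release branch), the forked test JVM could be configured in the Gradle build along these lines:

```groovy
// Hypothetical example for a test task; heap size and flags are
// illustrative, not the values merged into 0.13.x.
test {
    maxHeapSize = '2g'
    jvmArgs '-XX:+UseG1GC',
            // A bounded metaspace forces class unloading, which matters with
            // Flink's inverted (child-first) class loading in tests.
            '-XX:MaxMetaspaceSize=512m',
            '-XX:+ClassUnloadingWithConcurrentMark'
}
```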

@kbendick (Contributor) commented:

The only other issue I can think of is a GitHub issue with the JVM cache, as the problem is sticky for Flink 1.13.

But the most recent attempt did run somewhat so I don’t think that’s it. I think it’s just poor memory utilization / availability combined with some somewhat heavy Flink SQL upsert tests.

@nastra nastra force-pushed the flink-upsert-delete-file-metadata-backports branch from 275dce6 to 5ea6b7f on May 18, 2022 10:26
@nastra (Contributor, Author) commented May 18, 2022

I think I found the issue. It can easily be debugged locally by adding --no-parallel --debug to the Gradle invocation:

```shell
./gradlew -DsparkVersions= -DhiveVersions= -DflinkVersions=1.13 \
  :iceberg-flink:iceberg-flink-1.13:check \
  :iceberg-flink:iceberg-flink-runtime-1.13:check \
  -Pquick=true -x javadoc --no-parallel --debug
```

It looks like something got messed up during conflict resolution: BaseDeltaTaskWriter#asStructLike(..) was accidentally returning keyWrapper.wrap(data) instead of wrapper.wrap(data). This caused TestChangeLogTable#testChangeLogOnDataKey() to run indefinitely in a retry loop because of an underlying exception.
For the future, I think it would be good to have a timeout on tests, as issues like these are quite difficult to debug on CI without modifying how the test is executed (e.g. by adding --debug to the Gradle task).
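One way to get the kind of timeout suggested here is Gradle's task-level `timeout` property (a sketch with an illustrative value; whether this was adopted is not shown in this thread):

```groovy
// Illustrative sketch: fail the test task after 30 minutes instead of
// letting a hung test run until the CI job itself times out.
test {
    timeout = java.time.Duration.ofMinutes(30)
}
```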

@kbendick (Contributor) commented May 19, 2022

> [quotes @nastra's previous comment in full]

+1 to better timeout resolution.

It's possible there's something else we need to cherry-pick, given the underlying exception for that one test case. I'll run locally and we'll see. We can also try to schedule time to talk if you'd like. I'm very interested in getting this out the door.

Though tests do seem to be passing now, so potentially you already got to it. 🙂

@rdblue rdblue merged commit f72a15b into apache:0.13.x May 19, 2022
@nastra nastra deleted the flink-upsert-delete-file-metadata-backports branch September 27, 2022 06:19