[SPARK-28903][STREAMING][PYSPARK][TESTS] Fix AWS JDK version conflict that breaks Pyspark Kinesis tests #25559
Closed
Commits (12)
- b283338 (srowen): Test updating Kinesis deps and current state of Kinesis Python tests
- fdceda1 (srowen): Undo Kinesis version change
- 3cc5f64 (srowen): More debugging
- 0fbfdf7 (srowen): Oops
- d8d0e59 (srowen): Try without pypy
- 361c90d (srowen): A few more guesses
- 6ae413d (srowen): Revert many changes and try excluding aws-java-sdk
- e760ebb (srowen): Revert default
- 101c4ce (srowen): Try just dependencyManagement
- 78c4e62 (srowen): Try managing version in dependencies
- 24bea1e (srowen): Retry excluding the AWS SDK
- 36830b4 (srowen): Remove TODOs
Preemptively CCing @steveloughran for a look at this. The TL;DR is that hadoop-cloud is bringing in an old aws-java-sdk dependency to the assembly, and it interferes with the Kinesis dependencies, which are newer. Excluding these is a bit extreme, but the aws-java-sdk dependency brings in something like 20 other AWS JARs. I'm not clear whether that's the intent anyway.
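A minimal sketch of the exclusion approach under discussion; the module it lands in and the exact coordinates are illustrative here, not a quote of the PR's actual pom.xml:

```xml
<!-- Hypothetical pom.xml fragment: exclude the old aws-java-sdk that the
     hadoop-cloud / hadoop-aws dependency pulls in transitively, so it
     cannot leak into the assembly and clash with the Kinesis SDK. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <exclusion>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

The trade-off raised in the thread: anyone relying on the excluded artifact at runtime would then have to re-add it themselves.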
Won't you break the hadoop-cloud profile by doing this? The Kinesis integration is not packaged as part of the Spark distribution (when you enable its profile), while hadoop-cloud is.
Yeah, this is the thing. Right now we only pull in the core aws-java-sdk JAR. If I include aws-java-sdk as an explicit dependency, it pulls in tons of other dependencies that seem irrelevant to Spark. Hm, maybe I need to use <dependencyManagement> to more narrowly manage up the version of aws-java-sdk without affecting the transitive dependency resolution? Well, if this change works, at least we are on to the cause, and then I'll try that.
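The `<dependencyManagement>` idea mentioned above would look roughly like this; the version property name is an illustrative placeholder, not taken from the PR:

```xml
<!-- Hypothetical pom.xml fragment: pin the aws-java-sdk version via
     dependencyManagement. This overrides the version picked during
     transitive resolution without declaring aws-java-sdk (and hence
     its own transitive dependencies) as a direct dependency. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
      <version>${aws.java.sdk.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Note that dependencyManagement only pins versions of artifacts that are already on the dependency graph; per the commit history ("Try just dependencyManagement"), this alone did not resolve the conflict here.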
1.7.4 is a really old version; Hadoop 2.9+ uses a (fat) shaded JAR which has a consistent Kinesis SDK in with it; 2.8 is on a 1.10.x, I think.
Go on, move off Hadoop 2.7 as a baseline. It's many years old, EOL/unsupported, and never actually qualified against Java 8.
Thanks @steveloughran -- so, given that we are, for better or worse, still on Hadoop 2.7 here (because I think I need to back-port this to 2.4 at least), is it safe to exclude the whole aws-java-sdk dependency? It doesn't seem so, as it would mean the user has to re-include it. But is it safe to assume they would be running this on Hadoop anyway?
Sounds like you are saying that in Hadoop 2.9, this dependency wouldn't exist or could be excluded.
So, excluding it definitely worked to solve the problem. Right now I'm seeing what happens if we explicitly manage its version up as a direct dependency, because just managing it up with <dependencyManagement> wasn't enough. The downside is probably that the assembly brings in everything the aws-java-sdk depends on, which is a lot of stuff. We don't distribute the assembly per se (right?), so it doesn't really mean more careful checks of the licenses of all the dependencies. Still, if somehow it were fine to exclude this dependency, that's the tidiest from Spark's perspective. Does that fly for Hadoop 2.7, or does it pretty well break the point of hadoop-cloud?
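The direct-dependency variant being tried (the "Try managing version in dependencies" commit) would look roughly like this; the coordinates and version property are illustrative assumptions, not copied from the PR:

```xml
<!-- Hypothetical pom.xml fragment: declare aws-java-sdk as a direct
     dependency so the newer, explicitly chosen version wins over the
     old one that hadoop-cloud drags in. The cost, as noted above, is
     that aws-java-sdk's own transitive dependencies now land in the
     assembly as well. -->
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk</artifactId>
  <version>${aws.java.sdk.version}</version>
</dependency>
```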
+1 to excluding the AWS dependency. It is not actually something you can bundle into ASF releases anyway: https://issues.apache.org/jira/browse/HADOOP-13794. But it'd be good for a spark-hadoop-cloud artifact to be published with that dependency for downstream users, or at least for the things you have to add to be documented somewhere.
FWIW, I do build and test the Spark Kinesis module as part of my AWS SDK update process, one that actually went pretty smoothly for a change last time. No regressions, no new error messages in logs, shaded JARs really are shaded, etc. This is progress and means that backporting is something we should be doing.
See https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md#-qualifying-an-aws-sdk-update for the runbook there.