5 changes: 5 additions & 0 deletions hadoop-cloud/pom.xml
@@ -100,6 +100,11 @@
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
</exclusion>
<!-- Keep old SDK out of the assembly to avoid conflict with Kinesis module -->
<exclusion>
Member Author:
Preemptively CCing @steveloughran for a look at this. The TL;DR is that hadoop-cloud is bringing an old aws-java-sdk dependency into the assembly, and it interferes with the Kinesis dependencies, which are newer. Excluding these is a bit extreme, but the aws-java-sdk dependency brings in something like 20 other AWS JARs. I'm not clear whether that's the intent anyway.

Contributor:
Won't you break the hadoop-cloud profile by doing this?

The Kinesis integration is not packaged as part of the Spark distribution (when you enable its profile), while hadoop-cloud is.

Member Author:
Yeah, this is the thing. Right now we only pull in the core aws-java-sdk JAR. If I include aws-java-sdk as an explicit dependency, it pulls in tons of other dependencies that seem irrelevant to Spark. Hm, maybe I need to use <dependencyManagement> to more narrowly manage up the version of aws-java-sdk without affecting the transitive dependency resolution? Well, if this change works, at least we are on to the cause, and then I'll try that.
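(For illustration only, a minimal sketch of what such a <dependencyManagement> entry could look like; the 1.11.x version below is just a placeholder, not a version this PR settled on:

<dependencyManagement>
  <dependencies>
    <!-- Hypothetical: pin a single aws-java-sdk version for any module that
         pulls it in transitively, without adding it as a direct dependency -->
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
      <version>1.11.271</version>
    </dependency>
  </dependencies>
</dependencyManagement>
)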

@steveloughran (Contributor), Aug 27, 2019:
1.7.4 is a really old version; Hadoop 2.9+ uses a (fat) shaded JAR which has a consistent Kinesis SDK in with it; 2.8 is on 1.10.x, I think.

Go on, move off Hadoop 2.7 as a baseline. It's many years old, EOL/unsupported, and never actually qualified against Java 8.
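(For context on the shaded JAR mentioned above: on Hadoop 2.9+ the hadoop-aws module depends on the single shaded aws-java-sdk-bundle artifact instead of the individual SDK JARs. A rough sketch of that dependency, with an illustrative version only:

<!-- Hypothetical: the shaded SDK bundle that hadoop-aws pulls in on Hadoop 2.9+ -->
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-bundle</artifactId>
  <version>1.11.271</version>
</dependency>
)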

Member Author:

Thanks @steveloughran. So, given that we are, for better or worse, still on Hadoop 2.7 here (because I think I need to backport this to 2.4 at least), is it safe to exclude the whole aws-java-sdk dependency? It doesn't seem so, as it would mean the user has to re-include it. But is it safe to assume they would be running this on Hadoop anyway?

Sounds like you are saying that in Hadoop 2.9, this dependency wouldn't exist or could be excluded.

So, excluding it definitely worked to solve the problem. Right now I'm seeing what happens if we explicitly manage its version up as a direct dependency, because just managing it up with <dependencyManagement> wasn't enough. The downside is probably that the assembly brings in everything aws-java-sdk depends on, which is a lot of stuff. We don't distribute the assembly per se (right?), so it shouldn't mean more careful license checks of all those dependencies.

Still, if somehow it were fine to exclude this dependency, that's the tidiest option from Spark's perspective. Does that fly for Hadoop 2.7, or does it pretty well break the point of hadoop-cloud?
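(If the explicit-dependency route is what sticks, a rough sketch of that alternative in hadoop-cloud/pom.xml might look like the following; again, the version is only a placeholder:

<!-- Hypothetical: declare the SDK directly so a newer version wins over
     the old 1.7.4 pulled in transitively through the Hadoop 2.7 dependencies -->
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk</artifactId>
  <version>1.11.271</version>
</dependency>
)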

Contributor:

+1 to excluding the AWS dependency. It is not actually something you can bundle into ASF releases anyway (https://issues.apache.org/jira/browse/HADOOP-13794). But it'd be good for a spark-hadoop-cloud artifact to be published with that dependency for downstream users, or at least to have the things you need to add documented somewhere.

FWIW, I do build and test the Spark Kinesis module as part of my AWS SDK update process, one that actually went pretty smoothly for a change last time: no regressions, no new error messages in logs, shaded JARs really are shaded, etc. This is progress and means that backporting is something we should be doing.

See https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md#-qualifying-an-aws-sdk-update for the runbook there.

<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>