Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 20 additions & 4 deletions pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -249,9 +249,22 @@
<parquet.test.deps.scope>test</parquet.test.deps.scope>

<!--
These default to Hadoop 3.x shaded client/minicluster jars, but are switched to hadoop-client
when the Hadoop profile is hadoop-2.7, because these are only available in 3.x. Note that,
as result we have to include the same hadoop-client dependency multiple times in hadoop-2.7.
These default to Hadoop 3.x shaded client/minicluster jars, but since the shaded jars are
only available in 3.x, we have to switch to non-shaded Hadoop client jars when the active
profile is hadoop-2.7.

To make the above work, in the hadoop-2.7 profile section we bind hadoop-client-api to
hadoop-client and hadoop-client-runtime to hadoop-yarn-api, which is a dependency of the
former. This effectively moves the hadoop-yarn-api to the same level as hadoop-client, but
should be fine since both dependencies are required. We cannot use hadoop-client for both
of these because the maven enforcer plugin bans duplicate dependencies.

We still leave hadoop-client-minicluster to use hadoop-client because it is of test scope,
and is only used by spark-yarn module. To avoid the duplicate dependency issue we only
declare the dependency when we're using hadoop-3.2 profile, in
resource-managers/yarn/pom.xml.

Please check SPARK-36835 for more details.
-->
<hadoop-client-api.artifact>hadoop-client-api</hadoop-client-api.artifact>
<hadoop-client-runtime.artifact>hadoop-client-runtime</hadoop-client-runtime.artifact>
Expand Down Expand Up @@ -3272,8 +3285,11 @@
<hadoop.version>2.7.4</hadoop.version>
<curator.version>2.7.1</curator.version>
<commons-io.version>2.4</commons-io.version>
<!--
the declaration site above of these variables explains why we need to re-assign them here
-->
<hadoop-client-api.artifact>hadoop-client</hadoop-client-api.artifact>
<hadoop-client-runtime.artifact>hadoop-client</hadoop-client-runtime.artifact>
<hadoop-client-runtime.artifact>hadoop-yarn-api</hadoop-client-runtime.artifact>
Copy link
Contributor

@JoshRosen JoshRosen Sep 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, this is a clever fix:

Instead of the hadoop-2.7 profile resulting in a duplicate direct dependency on hadoop-client, we now just declare an explicit dependency on one of hadoop-client's transitive dependencies (hadoop-yarn-api in this case). Anything which depends on hadoop-client-runtime.artifact must also depend on hadoop-client-api.artifact, so this doesn't end up changing the set of dependencies pulled in.

It looks like we didn't need to do that for hadoop-client-minicluster.artifact because that's only used in the resource-managers/yarn POM and that's already using Maven profiles to control the dependency selection (so the other workaround is fairly non-invasive in that context). In principle, though, I guess we could have changed that to some other transitive dep.


Could you maybe add a one or two line comment above these Hadoop 2.7 lines to explain what's going on? And maybe edit the comment at

spark/pom.xml

Lines 251 to 255 in d73562e

<!--
These default to Hadoop 3.x shaded client/minicluster jars, but are switched to hadoop-client
when the Hadoop profile is hadoop-2.7, because these are only available in 3.x. Note that,
as result we have to include the same hadoop-client dependency multiple times in hadoop-2.7.
-->
to reflect this change? This fix is clever but a little subtle, so I think a comment calling it out (and maybe mentioning SPARK-36835 might help future readers.

Edit: could you also update the PR description to reflect this final fix?

Copy link
Member Author

@sunchao sunchao Sep 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look. Yes I think it's better to apply the same for hadoop-client-minicluster.artifact. Let me try that, and perhaps we won't need the changes in YARN's pom.xml with this.

The side effect for this is seems to be that it affects the distance of these dependencies to the root module and thus may make a difference when maven tries to resolve a dependency with multiple versions (see here for reference). I was using hadoop-common (which carries lots of dependencies) instead of hadoop-yarn-api and it was not able to compile.

Will update PR description and the comment in the above pom.xml.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it may not be so useful to change hadoop-client-minicluster.artifact since it is test scope while the other two are compile scope by default. For some reason it also changes dev/deps/spark-deps-hadoop-2.7-hive-2.3 when I set it to something like hadoop-mapreduce-client-jobclient.

<hadoop-client-minicluster.artifact>hadoop-client</hadoop-client-minicluster.artifact>
</properties>
</profile>
Expand Down
24 changes: 12 additions & 12 deletions resource-managers/yarn/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,18 @@
<activeByDefault>true</activeByDefault>
</activation>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>${hadoop-client-runtime.artifact}</artifactId>
<version>${hadoop.version}</version>
<scope>${hadoop.deps.scope}</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>${hadoop-client-minicluster.artifact}</artifactId>
<version>${hadoop.version}</version>
<scope>test</scope>
</dependency>
<!-- Used by MiniYARNCluster -->
<dependency>
<groupId>org.bouncycastle</groupId>
Expand Down Expand Up @@ -127,18 +139,6 @@
<artifactId>${hadoop-client-api.artifact}</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>${hadoop-client-runtime.artifact}</artifactId>
<version>${hadoop.version}</version>
<scope>${hadoop.deps.scope}</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>${hadoop-client-minicluster.artifact}</artifactId>
<version>${hadoop.version}</version>
<scope>test</scope>
</dependency>

<!-- Explicit listing of transitive deps that are shaded. Otherwise, odd compiler crashes. -->
<dependency>
Expand Down