[SPARK-4809] Rework Guava library shading. #3658

vanzin · 2014-12-10T02:03:57Z

The current way of shading Guava is a little problematic. Code that
depends on "spark-core" does not see the transitive dependency, yet
classes in "spark-core" actually depend on Guava. So it's a little
tricky to run unit tests that use spark-core classes, since you need
a compatible version of Guava in your dependencies when running the
tests. This can become a little tricky, and is kind of a bad user
experience.

This change modifies the way Guava is shaded so that it's applied
uniformly across the Spark build. This means Guava is shaded inside
spark-core itself, so that the dependency issues above are solved.
Aside from that, all Spark sub-modules have their Guava references
relocated, so that they refer to the relocated classes now packaged
inside spark-core. Before, this was only done by the time the assembly
was built, so projects that did not end up inside the assembly (such
as streaming backends) could still reference the original location
of Guava classes.

The Guava classes are added to the "first" artifact Spark generates
(network-common), so that all downstream modules have the needed
classes available. Since "network-common" is a dependency of spark-core,
all Spark apps should get the relocated classes automatically.

The current way of shading Guava is a little problematic. Code that depends on "spark-core" does not see the transitive dependency, yet classes in "spark-core" actually depend on Guava. So it's a little tricky to run unit tests that use spark-core classes, since you need a compatible version of Guava in your dependencies when running the tests. This can become a little tricky, and is kind of a bad user experience. This change modifies the way Guava is shaded so that it's applied uniformly across the Spark build. This means Guava is shaded inside spark-core itself, so that the dependency issues above are solved. Aside from that, all Spark sub-modules have their Guava references relocated, so that they refer to the relocated classes now packaged inside spark-core. Before, this was only done by the time the assembly was built, so project that did not end up inside the assembly (such as streaming backends) could still reference the original location of Guava classes). This relocation does not apply to the sub-modules under network/, though. For those cases, we want to keep the Guava dependency alive, since we want to use the same Guava as the rest of the Yarn NM when deploying the auxiliary shuffle service. For this reason, also, the network/ dependencies are shaded into the spark-core artifact too, so that the raw Guava dependency doesn't leak.

vanzin · 2014-12-10T02:04:51Z

I ran the maven build + unit tests, and also audited the jar files in the build using the tool I wrote:
https://gist.github.com/vanzin/bd9057fadf4a296220b7

I'll run some actual jobs tomorrow.

SparkQA · 2014-12-10T03:34:03Z

Test build #24285 has finished for PR 3658 at commit 4a4ed42.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ash211 · 2014-12-10T04:17:39Z

cc @mccheah

vanzin · 2014-12-10T04:49:20Z

BTW, credit where credit is due, I got this idea from @arahuja

SparkQA · 2014-12-25T00:28:18Z

Test build #24792 has finished for PR 3658 at commit 40b8723.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Conflicts: streaming/pom.xml

SparkQA · 2015-01-07T04:48:04Z

Test build #25133 has finished for PR 3658 at commit 8053dd4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Conflicts: examples/pom.xml

SparkQA · 2015-01-13T02:05:35Z

Test build #25442 has finished for PR 3658 at commit f78c48a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-01-13T19:34:42Z

Test build #25477 has finished for PR 3658 at commit 941848f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- val classServer = new HttpServer(conf, outputDir, new SecurityManager(conf), classServerPort, "HTTP class server")

vanzin · 2015-01-13T21:49:02Z

@pwendell would appreciate some input here. Thanks!

vanzin · 2015-01-23T18:07:28Z

Ping.

pwendell · 2015-01-26T21:17:41Z

core/pom.xml

Is this going to inline other modules (spark-network-common and spark-network-shuffle) in the published spark-core jar?

Yes. From the first comment:

For this reason, also, the network/ dependencies are shaded into the spark-core artifact too, so that the raw Guava dependency doesn't leak.

pwendell · 2015-01-26T21:42:26Z

Hey so a couple questions about this. This does make things simpler overall in terms of the build. Some higher level questions:

This adds 1,500 class files and about 2 MB or a 30% in file size to the Spark core jar. Does this suggest any limits on how many overall dependencies we can shade in Spark longer term (i.e. in addition to guava)?
Inlining two of our own sub-projects into the spark-core jar seems a bit strange. Why not just have the shaded guava in the little assembly jar we create for YARN? Is there a reason we must use the guava dependency that is provided on the cluster? If we did take the current approach, does this mean that Spark core as advertised in maven central won't depend on Spark's network modules... would we still publish the network modules then?

vanzin · 2015-01-26T21:51:07Z

Does this suggest any limits on how many overall dependencies we can shade in Spark longer term (i.e. in addition to guava)?

If we choose relocation as the official way of fixing potential library version conflicts, we'll end up with a growing spark-core. I don't think there's a limit to how many classes you can add (SPARK-1520 notwithstanding), but it doesn't look very pretty, I agree. I did this mainly to fix some issues with the current way Guava is shaded, though, not as a blueprint for how to fix dependency issues going forward.

(It also can confuse certain IDEs that automatically add import statements...)

Why not just have the shaded guava in the little assembly jar we create for YARN?

That's definitely an option. Should be pretty easy to do if that's the preferred way, but I remember that it was a conscious choice to depend on the Yarn-provided Guava.

would we still publish the network modules then?

Do they provide any public APIs? If they do, then yes. Otherwise, there would be no need to publish them.

pwendell · 2015-01-26T22:15:59Z

Yeah, so I think it would be better to just shade guava in YARN as well. The main dependency-related constraint was that we didn't want Spark's yarn shuffle service to have any external dependencies (i.e. it should work well out of the box with YARN). But having an inlined shaded dependency doesn't matter in this regard. So I don't see any other compelling reason to use YARN's guava.

Overall is easier to understand what is going on if we don't have a special case there. And inlining things that we expose as semi-public classes (there are some developer API's in those modules, IIRC), it's a bit off from the normal maven model. So for all these reasons I think it would be good to just shade it in that jar as well.

SparkQA · 2015-01-26T23:05:49Z

Test build #26116 has finished for PR 3658 at commit b3104fc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

pwendell · 2015-01-27T01:42:26Z

Hey @vanzin mind bringing this up to date now (I merged #3874).

vanzin · 2015-01-27T19:25:17Z

Hey @pwendell, I'll try to get to this soon. But I wanted to get your feedback on my idea for fixing the network/ dependencies thing before I try to implement it.

The way I see it, the cleanest way is to do the Guava shading in the earliest artifact possible; that would be network/common. So that artifact would have the honor of providing all the relocated Guava classes to everyone. Since spark-core depends on it, everything should work out.

The only downside I see to that is that network/common would now expose Optional and friends when it's not really its fault (spark-core demands it).

What do you think?

pwendell · 2015-01-27T23:03:26Z

@vanzin yeah this is a fair point, this would mean that network/common would expose the (un-shaded) Guava classes... a bit clunky. If those classes change in the future we could get into trouble. Still, I think this is the best possible solution, provided it's well documented. If we find that Guava changes this in the future we can consider other options.

Conflicts: assembly/pom.xml core/pom.xml

This makes all Spark-generated artifacts use the Guava defined in the build file at runtime too. The leaked classes ("Optional" and friends) are now exposed in a few places where they weren't before, but hopefully that won't cause issues.

pwendell · 2015-01-28T01:36:31Z

LGTM pending tests. @vanzin do you have more work you'd like to do on this or are you happy with it?

SparkQA · 2015-01-28T02:16:56Z

Test build #26199 has finished for PR 3658 at commit 3c93e42.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2015-01-28T03:09:37Z

No, I'm done with it. Thanks for taking a look.

caneGuy · 2017-07-21T02:47:34Z

streaming/pom.xml

+
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-shade-plugin</artifactId>


why add this?

vanzin mentioned this pull request Dec 17, 2014

[SPARK-4034]change the scope of guava to compile #2876

Closed

Merge branch 'master' into SPARK-4809

40b8723

Marcelo Vanzin added 2 commits January 6, 2015 19:09

Add fix for SPARK-5052 (PR apache#3874).

107d7da

Merge branch 'master' into SPARK-4809

8053dd4

Conflicts: streaming/pom.xml

Merge branch 'master' into SPARK-4809

f78c48a

Conflicts: examples/pom.xml

Merge branch 'master' into SPARK-4809

941848f

pwendell reviewed Jan 26, 2015
View reviewed changes

Add comment.

b3104fc

pwendell mentioned this pull request Jan 26, 2015

[SPARK-5052] Add common/base classes to fix guava methods signatures. #3874

Closed

Marcelo Vanzin added 2 commits January 27, 2015 16:14

Merge branch 'master' into SPARK-4809

5d69ec9

Conflicts: assembly/pom.xml core/pom.xml

Shade Guava in the network-common artifact.

3c93e42

This makes all Spark-generated artifacts use the Guava defined in the build file at runtime too. The leaked classes ("Optional" and friends) are now exposed in a few places where they weren't before, but hopefully that won't cause issues.

asfgit closed this in 37a5e27 Jan 28, 2015

vanzin deleted the SPARK-4809 branch February 12, 2015 00:01

caneGuy reviewed Jul 21, 2017

View reviewed changes

[SPARK-4809] Rework Guava library shading. #3658

[SPARK-4809] Rework Guava library shading. #3658

Uh oh!

Conversation

vanzin commented Dec 10, 2014

Uh oh!

vanzin commented Dec 10, 2014

Uh oh!

SparkQA commented Dec 10, 2014

Uh oh!

ash211 commented Dec 10, 2014

Uh oh!

vanzin commented Dec 10, 2014

Uh oh!

SparkQA commented Dec 25, 2014

Uh oh!

SparkQA commented Jan 7, 2015

Uh oh!

SparkQA commented Jan 13, 2015

Uh oh!

SparkQA commented Jan 13, 2015

Uh oh!

vanzin commented Jan 13, 2015

Uh oh!

vanzin commented Jan 23, 2015

Uh oh!

pwendell Jan 26, 2015

Choose a reason for hiding this comment

Uh oh!

vanzin Jan 26, 2015

Choose a reason for hiding this comment

Uh oh!

pwendell commented Jan 26, 2015

Uh oh!

vanzin commented Jan 26, 2015

Uh oh!

pwendell commented Jan 26, 2015

Uh oh!

SparkQA commented Jan 26, 2015

Uh oh!

pwendell commented Jan 27, 2015

Uh oh!

vanzin commented Jan 27, 2015

Uh oh!

pwendell commented Jan 27, 2015

Uh oh!

pwendell commented Jan 28, 2015

Uh oh!

SparkQA commented Jan 28, 2015

Uh oh!

vanzin commented Jan 28, 2015

Uh oh!

caneGuy Jul 21, 2017

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants