Skip to content

Conversation

@massdosage
Copy link
Contributor

This draft PR shows one possible way of solving #934 - the second proposal described there but as an internal subproject rather than an external one. It is based on the approach shown at https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml but it also relocates the Guava classes so that they won't clash when used on classpaths which contain another version of Guava.

At the moment I've only applied the changes to the iceberg-api subproject and I also haven't re-sorted the imports into the proper alphabetical order so to verify it works you need to do something like ./gradlew -x checkstyleTest -x checkstyleMain :iceberg-api:build. I'd obviously fix all this and roll it out across the whole project but would only do that if the community likes this approach.

com.google.errorprone:error_prone_annotations:2.3.3 (2 constraints: 161a2544)
com.google.flatbuffers:flatbuffers-java:1.9.0 (2 constraints: e5199714)
com.google.guava:failureaccess:1.0.1 (1 constraints: 140ae1b4)
com.google.guava:guava:28.0-jre (21 constraints: 88453dad)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out how to have Guava removed entirely from the versions.lock file so I removed it manually. I tried the following unsuccessfully:

  • Mark guava in iceberg-relocate-external-dependencies as compileOnly (which to my limited understanding of Gradle is the equivalent to Maven's provided) - this unfortunately results in the classes not being seen by the shadow plugin so they don't end up in the jar and Guava 28 still ends up in versions.lock`
  • Only apply com.palantir.consistent-versions to other sub-projects (this doesn't seem to work as the plugin refuses to be set anywhere other than on the root project)

Is there some other way to tell the versions plugin to ignore the Guava dependency in iceberg-relocate-external-dependencies? Ideally we'd "ban" Guava entirely to indicate to developers they need to use the relocated version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ./gradlew --write-locks. But I'm not sure that we need to exclude this from our locks. We should only need to exclude it from direct dependencies. As long as it isn't showing up as a transitive dependency of iceberg-relocate-external-dependencies, we should be okay.

We will also need to add a LICENSE and NOTICE file for the shaded Jar, the same way that we do for iceberg-spark-runtime

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more things:

For modules that need a dependency on a different guava version, does excluding the transitive dependency work?

What do you think about keeping guava dependencies for modules that haven't been moved? Then we could migrate one module at a time, which would cut down on commit conflicts.

Copy link
Contributor Author

@massdosage massdosage Apr 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will also need to add a LICENSE and NOTICE file for the shaded Jar, the same way that we do for iceberg-spark-runtime

That's happening on lines 120-123 in build.gradle no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ./gradlew --write-locks. But I'm not sure that we need to exclude this from our locks. We should only need to exclude it from direct dependencies. As long as it isn't showing up as a transitive dependency of iceberg-relocate-external-dependencies, we should be okay.

That just re-writes the locks, and I think the plugin then sees that this module is using Guava 28 so it puts that in because it considers it a transitive dependency like you say. If in iceberg-hive we put Guava 14 then it still chooses 28 for versions.lock and this overrides the version used in iceberg-hive. My understanding is that this consistent versions thing doesn't let you choose a version per subproject put perhaps there is some magic that can allow that?

Also, this might work if I apply this change to all the subprojects and ensure that none of them is transitively pulling in Guava 28 and then set iceberg-hive to the expected version. So perhaps let's sort out the other questions on this PR about the package name etc. and I can then apply that everywhere and we can take it from there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more things:

For modules that need a dependency on a different guava version, does excluding the transitive dependency work?

In order to get this working in our other branch we had to do both. First we shaded Guava everywhere so that the consistent versions plugin wouldn't choose a higher version. Second we then also had to exclude transitive versions of any other version of Guava. Then finally it would end up with the correct one for the iceberg-hive classpath. This was very simple to do with Maven (we tried in a separate project) and didn't require many of these tricks so maybe we're missing some feature in Gradle or consistent versions that allows you to do what you describe?

What do you think about keeping guava dependencies for modules that haven't been moved? Then we could migrate one module at a time, which would cut down on commit conflicts.

I think the problem with the above is that the consistent versions plugin will look at any Guava dependency and then choose the highest version and then enforce that across the entire project and subprojects. We could roll this change out one module at a time but it won't be usable in iceberg-hive until it's been applied everywhere and we can confirm that Guava 28 really is out of the picture.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the other shaded module, we shade the compileOnly configuration and add the dependencies to shade there. That way, the dependencies are not listed in the final artifact. You could try that if you want, but the current setup looks fine to me. I don't see guava in the other modules.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I expect to have a guava entry in versions.lock because it is listed as a dependency. As long as it is the expected version, we're okay.

build.gradle Outdated
apply from: 'tasks.gradle'
apply from: 'jmh.gradle'

project(':iceberg-relocate-external-dependencies') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is minimized, we can't use it for Jackson. What about separate modules for both? So this would be iceberg-shaded-guava and the other would be iceberg-shaded-jackson.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth the effort to minimize? If we just put all of Guava (and Jackson in the future) then this could just produce one relocated jar file. It also means one doesn't have to have the extra step of updating the GuavaClasses file every time you rely on a new Guava class.

Alternatively we could create a subproject per "problematic" external library I'm just concerned we'd end up with quite a few of them but on the other hand it would allow you to pick and choose them in other projects based on which ones that project actually uses.

You guys choose, if you want to keep it minimized and one per library then I can rename this accordingly to be just guava for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference the minimized jar file produced here is currently 1.5MB (with just the classes needed by iceberg-api in it, it will grow bigger with the rest added). Guava total jar file is 2.6MB. Not sure the minimize buys you that much TBH.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with not being minimized.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the shaded approach because so many projects have to do this with Guava. Avro and Parquet already do this, so we're already asking people to include 3 copies of Guava.

I also don't think we are going to have to do this very often. Once this is set up we shouldn't be changing it much. And we should avoid adding new libraries that require it. If we end up with just Jackson and Guava, I think it will be fine.

Bytes.class.getName();
MoreExecutors.class.getName();
ThreadFactoryBuilder.class.getName();
Iterables.class.getName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that if we want to use some other Classes, we should first add the class name here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we want the Guava jar to be minimised and only contain the classes that are used by Iceberg. The alternative would be to shade and relocate all of Guava and then get rid of this. Since Guava isn't that big I'd personally vote for shading the whole thing and then removing this as an additional step that people have to do when using a new Guava class but the above was the approach suggsted by @rdblue so perhaps there are good reasons to favour keeping the jars smaller.

@massdosage massdosage changed the title Internal relocated version of Guava [WIP] Internal relocated version of Guava Apr 24, 2020
# Conflicts:
#	api/src/main/java/org/apache/iceberg/events/Listeners.java
@massdosage
Copy link
Contributor Author

@rdblue @rdsr I've just pushed a change that makes the base package for the relocated Guava classes be "com.google.common.shaded" as you suggested. I've only done it for the iceberg-api package to get approval on approach before rolling it out to the other subprojects. The plus side is that the import changes are smaller, the down side is that I have had to modify the checkstyle rules banning imports of shaded classes as the package name now triggers those to go off. Let me know what you think for next steps.

@rdblue
Copy link
Contributor

rdblue commented Apr 29, 2020

Sorry I haven't gotten back to you on this yet, I've been distracted by the release. I should be able to take a look soon. Thanks for your patience!

build.gradle Outdated
apply from: 'tasks.gradle'
apply from: 'jmh.gradle'

project(':iceberg-relocate-external-dependencies') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with not being minimized.

build.gradle Outdated
include 'NOTICE'
}

relocate 'com.google.common', 'com.google.common.shaded'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to trigger a big set of changes, but just wanted to hear what people think about this. Is com.google.common.shaded too generic and can conflict? Should everything we shade be something like iceberg.[original.package.name]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think com.google.common.shaded is too generic. I would go with org.apache.iceberg.[original.package.name] like I originally had and then the classes all line up together in the same top-level folder in the jar file etc. Alternatively one could just do iceberg as the prefix but I think it looks a bit strange as it doesn't really follow the "reverse TLD" packaging pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you feel strongly about packaging, then I think it's fine to put the classes under the org.apache.iceberg. namespace.

* Swap to shaded guava in common and core modules
* Shade in arrow module
@massdosage
Copy link
Contributor Author

With the recent commit from @cmathiesen this now works across the whole project but isn't actually correct since many subprojects appear to be depending on Guava transitively so this would need to change. We can move onto that once we have agreement on whether we want to proceed with this approach and what the relocated package name should be.

@massdosage
Copy link
Contributor Author

Could we get another look at this from the committers please? I think we just need answers on two questions:

  1. What package to use for the relocated code (see comments and discussion above). My vote goes for org.apache.iceberg.[original.package.name].
  2. Whether to minimise or not. My vote goes for not.

Once we have a decision on those we can apply the changes to the entire project and check whether this resolves the issues we see when trying to run the InputFormat on Hive.

@jerryshao
Copy link
Contributor

I’m in favor of org.apache.iceberg.[original.package.name], or o.a.i.shaded.[orginal.package.name].

<module name="IllegalImport"> <!-- Java Coding Guidelines: Import the canonical package -->
<property name="id" value="BanShadedClasses"/>
<property name="illegalPkgs" value=".*\.(repackaged|shaded|thirdparty)"/>
<property name="illegalPkgs" value=".*\.(repackaged|thirdparty)"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep this the same and use a different name for the package. That way the rule still applies to anything else that uses "shaded" in the name. What about org.apache.iceberg.relocated.com.google.common...?

build.gradle Outdated
tasks.install.dependsOn tasks.shadowJar

dependencies {
compile('com.google.guava:guava:28.0-jre') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't specify the version. That should be in versions.props instead.

@rdblue
Copy link
Contributor

rdblue commented May 8, 2020

Overall, this looks good to me. Here's a summary of my other comments:

  • I'm fine using org.apache.iceberg.com.google.guava. Let's not include shaded so we by-pass the checkstyle rule and don't have to modify it.
  • I strongly prefer minimizing guava so that we don't over-use Guava (we may want to remove it later) and so that we don't add yet another copy to the runtime classpath
  • I'd like to rename the module to iceberg-bundled-guava or something
  • I don't see Guava in the declared dependencies when I look at ./gradlew iceberg-core:dependencies --configuration compile so I think this does the right thing.
  • The versions.lock file should list Guava since it's a locked version that comes from the new module.

@rdblue
Copy link
Contributor

rdblue commented May 8, 2020

One more: while this adds LICENSE and NOTICE in configuration, those files don't exist at (module-dir)/LICENSE and so they aren't included. The LICENSE should contain the Apache License and an entry for Guava, like the one for spark-runtime. No other content is needed because Guava is the only library in the Jar.

@massdosage
Copy link
Contributor Author

@rdblue (or anyone else who understands the intricacies of Gradle dependencies and the consistent-versions plugin). I think what we're aiming for here is to have Guava completely removed from the compile classpath so that developers can't import anything from com.google directly but have to go via the new relocated jar file instead. However, even with the changes in this PR up to now it seems one can still refer to the Guava classes directly without the compile step failing (see https://github.com/ExpediaGroup/incubator-iceberg/blob/relocated-external-dependencies/core/src/main/java/org/apache/iceberg/avro/AvroIterable.java#L22). From what I can see if the consistent versions plugin and/or Gradle itself sees any subproject using Guava, it adds this to the project classpath for all the subprojects. Is there any way to prevent this? Ultimately what we want is for bundled-guava to depend on Guava 28, all other subprojects to depend on bundled-guava and then mr (or wherever we put the Hive unit tests) to depend on Guava 14.0 (with any other version on the classpath they will fail).

@rdblue rdblue marked this pull request as ready for review May 15, 2020 15:59
cmathiesen and others added 4 commits May 15, 2020 17:36
# Conflicts:
#	api/src/main/java/org/apache/iceberg/Filterable.java
#	api/src/main/java/org/apache/iceberg/types/TypeUtil.java
#	core/src/main/java/org/apache/iceberg/FilteredManifest.java
#	core/src/main/java/org/apache/iceberg/ManifestReader.java
#	core/src/main/java/org/apache/iceberg/SchemaUpdate.java
#	core/src/main/java/org/apache/iceberg/avro/AvroIterable.java
#	core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java
#	core/src/main/java/org/apache/iceberg/util/ParallelIterable.java
#	data/src/main/java/org/apache/iceberg/data/parquet/GenericParquetWriter.java
#	mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
#	parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
#	spark/src/main/java/org/apache/iceberg/spark/data/SparkAvroWriter.java
#	spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetWriters.java
#	spark/src/main/java/org/apache/iceberg/spark/source/BaseDataReader.java
#	spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java
#	spark/src/test/java/org/apache/iceberg/TestDataFileSerialization.java
@massdosage
Copy link
Contributor Author

This branch now contains all the changes required to use the relocated version of Guava across all subprojects.

The only remaining problem is that Guava 28 still shows up on the classpath of subprojects other than iceberg-bundled-guava - it should only be visible on this subproject's classpath, none of the other subprojects should be able to "see" Guava (to prevent developers inadvertently using Guava classes directly). Whatever subproject we put the Hive unit tests in also needs to be able to have a different version of Guava on its classpath. As mentioned in the previous comment, we'd appreciate some advice how to achieve this as the consistent-versions plugin seems to be preventing this.

@rdblue
Copy link
Contributor

rdblue commented May 20, 2020

I'm taking a look at this. I think we should try to get this in without worrying about the consistent versions plugin just yet. We will very likely need to disable that for both Spark 3 and Hive modules.

I think the problem with the Guava dependency leaking might be caused by depending on the shadow configuration. This still produces the default Jar (no classifier) with normal dependencies. I'm going to ask our Gradle team about this tomorrow when they are in to see if they have recommendations.

I'll also open a PR to fix licensing in this.

@massdosage
Copy link
Contributor Author

I'm taking a look at this. I think we should try to get this in without worrying about the consistent versions plugin just yet. We will very likely need to disable that for both Spark 3 and Hive modules.

I think the problem with the Guava dependency leaking might be caused by depending on the shadow configuration. This still produces the default Jar (no classifier) with normal dependencies. I'm going to ask our Gradle team about this tomorrow when they are in to see if they have recommendations.

I'll also open a PR to fix licensing in this.

Sounds good. I'm fine with merging these changes in (once we have Guava excluded from the other subprojects) and then using a separate PR to deal with the consistent versions issue. We can raise a small PR which reproduces the problem with a simple HiveRunner unit test and then work on fixing that.

Good point on why Guava might be leaking out due to the "default" jar, we'll try take a look but if you get some info from the Gradle team that would be great.

@rdblue
Copy link
Contributor

rdblue commented May 21, 2020

I was able to get publishToMavenLocal working by making the GuavaClasses public and removing the lines that suppress Javadoc. We have to publish Javadoc for each Jar for an Apache release anyway. Then looking at the maven publication, it looks like the transitive dependency is being correctly suppressed. No dependency on Guava in the POM.

Next, I took a look at publishing the Jar without a classifier (classifer null in shadowJar). That broke downstream modules in the build, but I eventually found that it's because a null classifier causes a conflict with the default artifact. Changing the classifier of the default Jar fixes transitive dependencies in the build. Now, IntelliJ can see the relocated packages and only the relocated packages. Here's the block to update the Jar for bundled Guava:

  jar {
    classifier 'empty'
    enabled = false
  }

The only problem is that changing the default classifier broke publishing. The maven plugin was previously publishing the shadow Jar because its name was the same as the default publication. So all we should need to do is figure out how to publish the shadow Jar in place of the default.

Also, I checked the LICENSE and it looks good. I didn't realize you'd updated it. Thanks! We also need to include a NOTICE if you get a chance to add it.

@massdosage
Copy link
Contributor Author

massdosage commented May 21, 2020

Thanks @rdblue I think I managed to "reverse engineer" the changes you described (fa32ad2). I've also given you write access to our repo in case you want to push changes directly to the branch.

So is this the error we now need to fix:

Invalid publication 'apache': artifact file does not exist: '/home/awoodhead/temp/iceberg-forks/expediagroup/incubator-iceberg/bundled-guava/build/libs/iceberg-bundled-guava-fa32ad2d-empty.jar'

?

@rdblue
Copy link
Contributor

rdblue commented May 21, 2020

We need to fix the publication for this module so that it publishes the shadowJar artifact (no classifier) instead of the default artifact ("empty" classifier).

@massdosage
Copy link
Contributor Author

@rdblue does this do the trick? Looks OK to me and the build works...

@rdblue
Copy link
Contributor

rdblue commented May 22, 2020

I tried that option, but IntelliJ doesn't seem to pick it up.

The problem with this is that it creates the output of the jar task in shadowJar. Both tasks create the same output file. Then it is ambiguous to dependency management what is happening, I think. By using empty for the jar task's classifier, we avoid the problem and IntelliJ is able to correctly build a classpath. But then publishing is broken because we need to turn off publication for jar and turn on publication for shadowJar.

@massdosage
Copy link
Contributor Author

Ah, OK, I'm not using IntelliJ so I don't see the problem you're seeing. This is actually better than the above and seems to achieve the same thing but I get the feeling that's what you already tried with the null classifier so will probably have the same problem in IntelliJ. Hmmm...

@massdosage
Copy link
Contributor Author

Is there something that can be done in the Gradle maven-publish plugin to do what you describe? I know very little about it I'm afraid...

@rdblue
Copy link
Contributor

rdblue commented May 22, 2020

Yeah, I'm using classifier null for shadowJar and classifier 'empty' for jar. That works, but I need to find the right way to configure publishing. I think I'm close to a solution, but Gradle is hard to work with if you need to do anything off of the paved path.

@massdosage
Copy link
Contributor Author

Tell me about it.

OK, will leave it with you for now. Feel free to push into this branch, you should have the rights.

@rdblue
Copy link
Contributor

rdblue commented May 22, 2020

@massdosage, I couldn't push to the branch, so I opened a PR against it: ExpediaGroup#10

The main problem was that I was trying to detect use of the shadow plugin before it was applied. I just had to move the deploy.gradle script below the projects. That required linking the shadowJar task to a task that exists before publications are configured. Then I also fixed the iceberg-spark-runtime Jar to use the same publication config for consistency.

I also fixed a couple other minor things, like merging master and the NOTICE file.

Merge master, tidy up, fix shadow jar publishing
@rdblue
Copy link
Contributor

rdblue commented May 26, 2020

@massdosage, looks like there were more changes since I merged master. Could you merge and push? I'll try to get this in quickly so that we don't need to keep fixing it.

@rdblue rdblue mentioned this pull request May 26, 2020
@rdblue rdblue closed this in #1068 May 26, 2020
@rdblue
Copy link
Contributor

rdblue commented May 26, 2020

Since this changes so many files, I went ahead and fixed the conflicts and committed it in #1068. That way you don't have to keep modifying it, only to have it broken by the next commit that gets merged.

Thanks for fixing this, @massdosage and @cmathiesen!

@massdosage
Copy link
Contributor Author

Thanks @rdblue - makes sense to merge it in as the pain of merge conflicts hasn't been much fun when keeping this up to date. We might need to raise a separate issue and PR for some specifics with Guava when trying to test Hive but we'll tackle that next.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants