-
Notifications
You must be signed in to change notification settings - Fork 3k
[WIP] Internal relocated version of Guava #935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Internal relocated version of Guava #935
Conversation
| com.google.errorprone:error_prone_annotations:2.3.3 (2 constraints: 161a2544) | ||
| com.google.flatbuffers:flatbuffers-java:1.9.0 (2 constraints: e5199714) | ||
| com.google.guava:failureaccess:1.0.1 (1 constraints: 140ae1b4) | ||
| com.google.guava:guava:28.0-jre (21 constraints: 88453dad) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't figure out how to have Guava removed entirely from the versions.lock file so I removed it manually. I tried the following unsuccessfully:
- Mark guava in
iceberg-relocate-external-dependenciesascompileOnly(which to my limited understanding of Gradle is the equivalent to Maven'sprovided) - this unfortunately results in the classes not being seen by the shadow plugin so they don't end up in the jar and Guava 28 still ends up inversions.lock` - Only apply
com.palantir.consistent-versionsto other sub-projects (this doesn't seem to work as the plugin refuses to be set anywhere other than on the root project)
Is there some other way to tell the versions plugin to ignore the Guava dependency in iceberg-relocate-external-dependencies? Ideally we'd "ban" Guava entirely to indicate to developers they need to use the relocated version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is ./gradlew --write-locks. But I'm not sure that we need to exclude this from our locks. We should only need to exclude it from direct dependencies. As long as it isn't showing up as a transitive dependency of iceberg-relocate-external-dependencies, we should be okay.
We will also need to add a LICENSE and NOTICE file for the shaded Jar, the same way that we do for iceberg-spark-runtime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two more things:
For modules that need a dependency on a different guava version, does excluding the transitive dependency work?
What do you think about keeping guava dependencies for modules that haven't been moved? Then we could migrate one module at a time, which would cut down on commit conflicts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will also need to add a LICENSE and NOTICE file for the shaded Jar, the same way that we do for
iceberg-spark-runtime
That's happening on lines 120-123 in build.gradle no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is
./gradlew --write-locks. But I'm not sure that we need to exclude this from our locks. We should only need to exclude it from direct dependencies. As long as it isn't showing up as a transitive dependency oficeberg-relocate-external-dependencies, we should be okay.
That just re-writes the locks, and I think the plugin then sees that this module is using Guava 28 so it puts that in because it considers it a transitive dependency like you say. If in iceberg-hive we put Guava 14 then it still chooses 28 for versions.lock and this overrides the version used in iceberg-hive. My understanding is that this consistent versions thing doesn't let you choose a version per subproject put perhaps there is some magic that can allow that?
Also, this might work if I apply this change to all the subprojects and ensure that none of them is transitively pulling in Guava 28 and then set iceberg-hive to the expected version. So perhaps let's sort out the other questions on this PR about the package name etc. and I can then apply that everywhere and we can take it from there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two more things:
For modules that need a dependency on a different guava version, does excluding the transitive dependency work?
In order to get this working in our other branch we had to do both. First we shaded Guava everywhere so that the consistent versions plugin wouldn't choose a higher version. Second we then also had to exclude transitive versions of any other version of Guava. Then finally it would end up with the correct one for the iceberg-hive classpath. This was very simple to do with Maven (we tried in a separate project) and didn't require many of these tricks so maybe we're missing some feature in Gradle or consistent versions that allows you to do what you describe?
What do you think about keeping guava dependencies for modules that haven't been moved? Then we could migrate one module at a time, which would cut down on commit conflicts.
I think the problem with the above is that the consistent versions plugin will look at any Guava dependency and then choose the highest version and then enforce that across the entire project and subprojects. We could roll this change out one module at a time but it won't be usable in iceberg-hive until it's been applied everywhere and we can confirm that Guava 28 really is out of the picture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the other shaded module, we shade the compileOnly configuration and add the dependencies to shade there. That way, the dependencies are not listed in the final artifact. You could try that if you want, but the current setup looks fine to me. I don't see guava in the other modules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I expect to have a guava entry in versions.lock because it is listed as a dependency. As long as it is the expected version, we're okay.
build.gradle
Outdated
| apply from: 'tasks.gradle' | ||
| apply from: 'jmh.gradle' | ||
|
|
||
| project(':iceberg-relocate-external-dependencies') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is minimized, we can't use it for Jackson. What about separate modules for both? So this would be iceberg-shaded-guava and the other would be iceberg-shaded-jackson.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth the effort to minimize? If we just put all of Guava (and Jackson in the future) then this could just produce one relocated jar file. It also means one doesn't have to have the extra step of updating the GuavaClasses file every time you rely on a new Guava class.
Alternatively we could create a subproject per "problematic" external library I'm just concerned we'd end up with quite a few of them but on the other hand it would allow you to pick and choose them in other projects based on which ones that project actually uses.
You guys choose, if you want to keep it minimized and one per library then I can rename this accordingly to be just guava for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference the minimized jar file produced here is currently 1.5MB (with just the classes needed by iceberg-api in it, it will grow bigger with the rest added). Guava total jar file is 2.6MB. Not sure the minimize buys you that much TBH.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with not being minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the shaded approach because so many projects have to do this with Guava. Avro and Parquet already do this, so we're already asking people to include 3 copies of Guava.
I also don't think we are going to have to do this very often. Once this is set up we shouldn't be changing it much. And we should avoid adding new libraries that require it. If we end up with just Jackson and Guava, I think it will be fine.
| Bytes.class.getName(); | ||
| MoreExecutors.class.getName(); | ||
| ThreadFactoryBuilder.class.getName(); | ||
| Iterables.class.getName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that if we want to use some other Classes, we should first add the class name here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if we want the Guava jar to be minimised and only contain the classes that are used by Iceberg. The alternative would be to shade and relocate all of Guava and then get rid of this. Since Guava isn't that big I'd personally vote for shading the whole thing and then removing this as an additional step that people have to do when using a new Guava class but the above was the approach suggsted by @rdblue so perhaps there are good reasons to favour keeping the jars smaller.
# Conflicts: # api/src/main/java/org/apache/iceberg/events/Listeners.java
"com.google.common.shaded"
|
@rdblue @rdsr I've just pushed a change that makes the base package for the relocated Guava classes be "com.google.common.shaded" as you suggested. I've only done it for the |
|
Sorry I haven't gotten back to you on this yet, I've been distracted by the release. I should be able to take a look soon. Thanks for your patience! |
build.gradle
Outdated
| apply from: 'tasks.gradle' | ||
| apply from: 'jmh.gradle' | ||
|
|
||
| project(':iceberg-relocate-external-dependencies') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with not being minimized.
build.gradle
Outdated
| include 'NOTICE' | ||
| } | ||
|
|
||
| relocate 'com.google.common', 'com.google.common.shaded' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to trigger a big set of changes, but just wanted to hear what people think about this. Is com.google.common.shaded too generic and can conflict? Should everything we shade be something like iceberg.[original.package.name]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think com.google.common.shaded is too generic. I would go with org.apache.iceberg.[original.package.name] like I originally had and then the classes all line up together in the same top-level folder in the jar file etc. Alternatively one could just do iceberg as the prefix but I think it looks a bit strange as it doesn't really follow the "reverse TLD" packaging pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you feel strongly about packaging, then I think it's fine to put the classes under the org.apache.iceberg. namespace.
* Swap to shaded guava in common and core modules * Shade in arrow module
|
With the recent commit from @cmathiesen this now works across the whole project but isn't actually correct since many subprojects appear to be depending on Guava transitively so this would need to change. We can move onto that once we have agreement on whether we want to proceed with this approach and what the relocated package name should be. |
|
Could we get another look at this from the committers please? I think we just need answers on two questions:
Once we have a decision on those we can apply the changes to the entire project and check whether this resolves the issues we see when trying to run the InputFormat on Hive. |
|
I’m in favor of |
.baseline/checkstyle/checkstyle.xml
Outdated
| <module name="IllegalImport"> <!-- Java Coding Guidelines: Import the canonical package --> | ||
| <property name="id" value="BanShadedClasses"/> | ||
| <property name="illegalPkgs" value=".*\.(repackaged|shaded|thirdparty)"/> | ||
| <property name="illegalPkgs" value=".*\.(repackaged|thirdparty)"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to keep this the same and use a different name for the package. That way the rule still applies to anything else that uses "shaded" in the name. What about org.apache.iceberg.relocated.com.google.common...?
build.gradle
Outdated
| tasks.install.dependsOn tasks.shadowJar | ||
|
|
||
| dependencies { | ||
| compile('com.google.guava:guava:28.0-jre') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't specify the version. That should be in versions.props instead.
|
Overall, this looks good to me. Here's a summary of my other comments:
|
|
One more: while this adds LICENSE and NOTICE in configuration, those files don't exist at |
…diaGroup/incubator-iceberg.git into relocated-external-dependencies
|
@rdblue (or anyone else who understands the intricacies of Gradle dependencies and the |
# Conflicts: # api/src/main/java/org/apache/iceberg/Filterable.java # api/src/main/java/org/apache/iceberg/types/TypeUtil.java # core/src/main/java/org/apache/iceberg/FilteredManifest.java # core/src/main/java/org/apache/iceberg/ManifestReader.java # core/src/main/java/org/apache/iceberg/SchemaUpdate.java # core/src/main/java/org/apache/iceberg/avro/AvroIterable.java # core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java # core/src/main/java/org/apache/iceberg/util/ParallelIterable.java # data/src/main/java/org/apache/iceberg/data/parquet/GenericParquetWriter.java # mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java # parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java # spark/src/main/java/org/apache/iceberg/spark/data/SparkAvroWriter.java # spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetWriters.java # spark/src/main/java/org/apache/iceberg/spark/source/BaseDataReader.java # spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java # spark/src/test/java/org/apache/iceberg/TestDataFileSerialization.java
|
This branch now contains all the changes required to use the relocated version of Guava across all subprojects. The only remaining problem is that Guava 28 still shows up on the classpath of subprojects other than |
|
I'm taking a look at this. I think we should try to get this in without worrying about the consistent versions plugin just yet. We will very likely need to disable that for both Spark 3 and Hive modules. I think the problem with the Guava dependency leaking might be caused by depending on the shadow configuration. This still produces the default Jar (no classifier) with normal dependencies. I'm going to ask our Gradle team about this tomorrow when they are in to see if they have recommendations. I'll also open a PR to fix licensing in this. |
Sounds good. I'm fine with merging these changes in (once we have Guava excluded from the other subprojects) and then using a separate PR to deal with the consistent versions issue. We can raise a small PR which reproduces the problem with a simple HiveRunner unit test and then work on fixing that. Good point on why Guava might be leaking out due to the "default" jar, we'll try take a look but if you get some info from the Gradle team that would be great. |
|
I was able to get Next, I took a look at publishing the Jar without a classifier ( The only problem is that changing the default classifier broke publishing. The maven plugin was previously publishing the shadow Jar because its name was the same as the default publication. So all we should need to do is figure out how to publish the shadow Jar in place of the default. Also, I checked the LICENSE and it looks good. I didn't realize you'd updated it. Thanks! We also need to include a NOTICE if you get a chance to add it. |
|
Thanks @rdblue I think I managed to "reverse engineer" the changes you described (fa32ad2). I've also given you write access to our repo in case you want to push changes directly to the branch. So is this the error we now need to fix: ? |
|
We need to fix the publication for this module so that it publishes the shadowJar artifact (no classifier) instead of the default artifact ("empty" classifier). |
|
I tried that option, but IntelliJ doesn't seem to pick it up. The problem with this is that it creates the output of the |
|
Ah, OK, I'm not using IntelliJ so I don't see the problem you're seeing. This is actually better than the above and seems to achieve the same thing but I get the feeling that's what you already tried with the |
|
Is there something that can be done in the Gradle |
|
Yeah, I'm using |
|
Tell me about it. OK, will leave it with you for now. Feel free to push into this branch, you should have the rights. |
|
@massdosage, I couldn't push to the branch, so I opened a PR against it: ExpediaGroup#10 The main problem was that I was trying to detect use of the shadow plugin before it was applied. I just had to move the deploy.gradle script below the projects. That required linking the shadowJar task to a task that exists before publications are configured. Then I also fixed the iceberg-spark-runtime Jar to use the same publication config for consistency. I also fixed a couple other minor things, like merging master and the NOTICE file. |
|
@massdosage, looks like there were more changes since I merged master. Could you merge and push? I'll try to get this in quickly so that we don't need to keep fixing it. |
|
Since this changes so many files, I went ahead and fixed the conflicts and committed it in #1068. That way you don't have to keep modifying it, only to have it broken by the next commit that gets merged. Thanks for fixing this, @massdosage and @cmathiesen! |
|
Thanks @rdblue - makes sense to merge it in as the pain of merge conflicts hasn't been much fun when keeping this up to date. We might need to raise a separate issue and PR for some specifics with Guava when trying to test Hive but we'll tackle that next. |
This draft PR shows one possible way of solving #934 - the second proposal described there but as an internal subproject rather than an external one. It is based on the approach shown at https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml but it also relocates the Guava classes so that they won't clash when used on classpaths which contain another version of Guava.
At the moment I've only applied the changes to the
iceberg-apisubproject and I also haven't re-sorted the imports into the proper alphabetical order so to verify it works you need to do something like./gradlew -x checkstyleTest -x checkstyleMain :iceberg-api:build. I'd obviously fix all this and roll it out across the whole project but would only do that if the community likes this approach.