-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-3817] shade parquet dependency for hudi-hadoop-mr-bundle #5250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-3817] shade parquet dependency for hudi-hadoop-mr-bundle #5250
Conversation
…mpile hudi using -Dspark3
|
@xushiyan do you have time to look at this pr? |
| <properties> | ||
| <checkstyle.skip>true</checkstyle.skip> | ||
| <main.basedir>${project.parent.basedir}</main.basedir> | ||
| <parquet.version>${hive.parquet.version}</parquet.version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be put under hudi-hadoop-mr/pom.xml instead ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parquet-avro is defined as provided scope in parent pom (hudi-hadoop-mr/pom.xml -> hudi/pom.xml). it needs to be defined here for the compile time version to take effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed with @xushiyan this will make the packaging inconsistent. With this change,
hudi-hadoop-mrwill have parquet-avro version 1.12hudi-hadoop-mr-bundlewill have version 1.10hudi-spark3-bundlewill have version 1.12
if all of them are in the classpath, there could be conflict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For spark3.2, parquet version 1.12 is a must.
@RexXiong Can you run mvn dependency:tree for all the above modules with this change and confirm the versions of parquet-avro in each of them? Our goal is to be as consistent as possible. perhaps, we could create a separate profile for hive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@codope The results of mvn dependency:tree are consistent with the results you discussed, so for classpath consistency, I agree with the idea of creating a separate profile for hive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this does solve problem for hadoop-mr-bundle for hive query, i concern about the spark-bundle itself, which also includes hadoop-mr.
For spark-bundle built with spark3 profile, we use parquet 1.12, should we shade the parquet-avro within hadoop-mr-bundle and just include hadoop-mr-bundle in spark-bundle?
|
There is a separate effort to address Hudi's compatibility with Hadoop, Hive, and Spark 3.x altogether. Please check this branch which is WIP: https://github.com/rahil-c/hudi/commits/rchertar/hdp-3-spark-3 |
|
cc @rahil-c |
|
I think for my pr https://github.com/rahil-c/hudi/commits/rchertar/hdp-3-spark-3 I had discussed with @yihua the approach of defining parquet 0.10.x in hudi hadoop mr and hudi hadoop mr bundle because thats what hive 3.x expects https://github.com/apache/hive/blob/rel/release-3.1.2/pom.xml#L191 |
…r hudi-hadoop-mr-bundle
…r hudi-hadoop-mr-bundle
xushiyan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as discussed
…e#5250) Co-authored-by: lvshuang.xjs <[email protected]>
Co-authored-by: lvshuang.xjs <[email protected]>
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xushiyan Bundling parquet for just the one engine, is bit inconsistent with our model so far. which has been to not bundle spark, hadoop, parquet
|
|
||
| <!-- Parquet --> | ||
| <include>org.apache.parquet:parquet-avro</include> | ||
| <include>org.apache.parquet:parquet-hadoop-bundle</include> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change means that we are now bundling parquet for hive. not just parquet-avro. Should we fix the Mr bundle's parquet-avro version alone instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two proposals:
1)The read and write engines (according to spark parquet-avro version)use the same version which is the meaning of this patch
2)The read engines such as hive use their own parquet-* version
for hive2 parquet-hadoop version is 1.8.1, hive3 parquet-hadoop version is 1.10.0, which is not compatible with the version of parquet-avro.
So the second solution may take 1.8.1 for hive2, 1.10.0 for hive3 ,but has also a bit inconsistent with the write engines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RexXiong if we enforce parquet-avro version using a different variable say hive.parquet.version to always 1.10.x, and shade it in hadoop-mr-bundle, this should be ok ? Would parquet-avro 1.10.x work with parquet-hadoop 1.8.1 ? if so, we don't have to create hive profiles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xushiyan test seems parquet-avro 1.10.x is compatible with parquet-hadoop 1.8.1. So I will specify the parquet version of parquet-avro for hadoop-mr-bundle, and this solution was also the first proposed.
What is the purpose of the pull request
This PR specify parquet version for hudi-hadoop-mr-bundle module, used to solve the conflict problem for hive that hudi-hadoop-mr-bundle will include 1.12.2 version of parquet-avro when -Dspark3 is used
Brief change log
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.