Contributor

@RexXiong RexXiong commented Apr 7, 2022

What is the purpose of the pull request

This PR specifies the parquet version for the hudi-hadoop-mr-bundle module. It resolves a conflict for Hive: hudi-hadoop-mr-bundle includes the 1.12.2 version of parquet-avro when -Dspark3 is used.

Brief change log

  • Specify the parquet version for hudi-hadoop-mr-bundle when compiling Hudi with -Dspark3

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Contributor Author

RexXiong commented Apr 7, 2022

@xushiyan do you have time to look at this pr?

@xushiyan xushiyan self-assigned this Apr 7, 2022
<properties>
<checkstyle.skip>true</checkstyle.skip>
<main.basedir>${project.parent.basedir}</main.basedir>
<parquet.version>${hive.parquet.version}</parquet.version>
Member

Shouldn't this be put under hudi-hadoop-mr/pom.xml instead?

Member

parquet-avro is defined with provided scope in the parent pom (hudi-hadoop-mr/pom.xml -> hudi/pom.xml). It needs to be defined here for the compile-time version to take effect.

Member

As discussed with @xushiyan, this will make the packaging inconsistent. With this change:

  1. hudi-hadoop-mr will have parquet-avro version 1.12
  2. hudi-hadoop-mr-bundle will have version 1.10
  3. hudi-spark3-bundle will have version 1.12

If all of them are on the classpath, there could be a conflict.

Member

For Spark 3.2, parquet version 1.12 is a must.
@RexXiong Can you run mvn dependency:tree for all the above modules with this change and confirm the version of parquet-avro in each of them? Our goal is to be as consistent as possible. Perhaps we could create a separate profile for hive?

Contributor Author

@codope The results of mvn dependency:tree match the versions you listed, so for classpath consistency I agree with the idea of creating a separate profile for hive.

Contributor Author

As discussed with @xushiyan, we lean toward shading parquet-* (including parquet-avro, parquet-column, parquet-common, ...) within hadoop-mr-bundle, because adding a hive profile would also increase complexity for users. @codope

@xushiyan xushiyan added the priority:blocker Production down; release blocker label Apr 7, 2022
Member

@xushiyan xushiyan left a comment

While this does solve the problem for hadoop-mr-bundle for hive queries, I am concerned about the spark-bundle itself, which also includes hadoop-mr.

For a spark-bundle built with the spark3 profile, we use parquet 1.12. Should we shade parquet-avro within hadoop-mr-bundle and just include hadoop-mr-bundle in spark-bundle?

Let me know your thoughts @RexXiong @codope

Contributor

yihua commented Apr 8, 2022

There is a separate effort to address Hudi's compatibility with Hadoop, Hive, and Spark 3.x altogether. Please check this branch which is WIP: https://github.com/rahil-c/hudi/commits/rchertar/hdp-3-spark-3

Contributor

yihua commented Apr 8, 2022

cc @rahil-c

Collaborator

rahil-c commented Apr 9, 2022

For my PR https://github.com/rahil-c/hudi/commits/rchertar/hdp-3-spark-3, I had discussed with @yihua the approach of defining parquet 1.10.x in hudi-hadoop-mr and hudi-hadoop-mr-bundle, because that's what hive 3.x expects: https://github.com/apache/hive/blob/rel/release-3.1.2/pom.xml#L191
However, if a person is using hive 2.x, the expected parquet version is 1.8.x (https://github.com/apache/hive/blob/rel/release-2.3.1/pom.xml#L182), so instead of defining a version, I think shading should be fine.

@RexXiong
Contributor Author

@xushiyan I have updated the pr for shading parquet-* within hudi-hadoop-mr-bundle cc @rahil-c @codope
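
For readers unfamiliar with shading, the fragment below is a minimal sketch of how maven-shade-plugin can bundle and relocate parquet-avro inside a bundle jar. It is not the actual change in this PR: the relocated prefix `org.apache.hudi.` is a hypothetical choice, and only parquet-avro is shown for brevity.

```xml
<!-- Sketch only: fragment of a bundle pom.xml. The shaded prefix is an assumption. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <artifactSet>
          <includes>
            <!-- Copy the parquet-avro classes into the bundle jar -->
            <include>org.apache.parquet:parquet-avro</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <!-- Rename the package so the bundled copy cannot clash with
                 the parquet classes that Hive already has on its classpath -->
            <pattern>org.apache.parquet.avro.</pattern>
            <shadedPattern>org.apache.hudi.org.apache.parquet.avro.</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With a relocation like this, bundle code that referenced `org.apache.parquet.avro` is rewritten at package time to use the relocated package, so the version on the engine's classpath no longer matters for those classes.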

@RexXiong RexXiong changed the title from "[HUDI-3817] specify parquet version for hudi-hadoop-mr-bundle when compile hudi using -Dspark3" to "[HUDI-3817] shade parquet dependency for hudi-hadoop-mr-bundle" Apr 11, 2022
Member

@xushiyan xushiyan left a comment

LGTM as discussed

@xushiyan xushiyan merged commit 5c41e30 into apache:master Apr 11, 2022
rahil-c pushed a commit to rahil-c/hudi that referenced this pull request Apr 11, 2022
xushiyan pushed a commit that referenced this pull request Apr 14, 2022
Member

@vinothchandar vinothchandar left a comment

@xushiyan Bundling parquet for just the one engine is a bit inconsistent with our model so far, which has been to not bundle spark, hadoop, or parquet.


<!-- Parquet -->
<include>org.apache.parquet:parquet-avro</include>
<include>org.apache.parquet:parquet-hadoop-bundle</include>
Member

This change means that we are now bundling parquet for hive, not just parquet-avro. Should we fix the MR bundle's parquet-avro version alone instead?

Contributor Author

There are two proposals:
1) The read and write engines use the same version (following Spark's parquet-avro version), which is what this patch does.
2) The read engines, such as hive, use their own parquet-* version.

For hive2 the parquet-hadoop version is 1.8.1 and for hive3 it is 1.10.0, neither of which is compatible with the parquet-avro version from the spark3 build.

So the second solution could take 1.8.1 for hive2 and 1.10.0 for hive3, but it would also be a bit inconsistent with the write engines.

Member

@RexXiong If we pin the parquet-avro version to 1.10.x using a different variable, say hive.parquet.version, and shade it in hadoop-mr-bundle, would this be OK? Would parquet-avro 1.10.x work with parquet-hadoop 1.8.1? If so, we don't have to create hive profiles.

Contributor Author

@xushiyan Testing suggests that parquet-avro 1.10.x is compatible with parquet-hadoop 1.8.1. So I will specify the parquet-avro version for hadoop-mr-bundle, which was also the first solution proposed.
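
As an illustration of the approach agreed on here, the fragment below sketches how the bundle's pom.xml could pin parquet-avro independently of the spark3 profile. The property name `hive.parquet.version` comes from this thread; the concrete value `1.10.1` and the explicit `compile` scope are assumptions, since the thread only says "1.10.x".

```xml
<!-- Sketch only: fragment of the hudi-hadoop-mr-bundle pom.xml -->
<properties>
  <!-- Assumed value; the discussion only specifies "1.10.x" -->
  <hive.parquet.version>1.10.1</hive.parquet.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <!-- Pinned separately from the parquet.version selected by -Dspark3 -->
    <version>${hive.parquet.version}</version>
    <!-- Overrides the provided scope inherited from the parent pom
         so the classes can be packaged into the bundle -->
    <scope>compile</scope>
  </dependency>
</dependencies>
```

Running mvn dependency:tree on the bundle module after such a change is one way to confirm which parquet-avro version is actually resolved.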


Labels

priority:blocker Production down; release blocker

7 participants