@vburenin
Contributor

What is the purpose of the pull request

The parquet-avro 1.10.1 library has a bug that causes complex Parquet schemas to be converted into Avro schemas incorrectly.
See https://issues.apache.org/jira/browse/HUDI-1602 for details.

Brief change log

The parquet-avro library has been upgraded from 1.10.1 to 1.11.1.

Verify this pull request

This pull request is a trivial rework without any test coverage.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

@codecov-io

Codecov Report

Merging #2601 (0e6d6b0) into master (77ba561) will decrease coverage by 41.48%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master   #2601       +/-   ##
============================================
- Coverage     51.17%   9.69%   -41.49%     
+ Complexity     3226      48     -3178     
============================================
  Files           438      53      -385     
  Lines         20089    1929    -18160     
  Branches       2068     230     -1838     
============================================
- Hits          10281     187    -10094     
+ Misses         8961    1729     -7232     
+ Partials        847      13      -834     
| Flag | Coverage Δ | Complexity Δ |
|---|---|---|
| hudicli | ? | ? |
| hudiclient | 100.00% <ø> (ø) | 0.00 <ø> (ø) |
| hudicommon | ? | ? |
| hudiflink | ? | ? |
| hudihadoopmr | ? | ? |
| hudisparkdatasource | ? | ? |
| hudisync | ? | ? |
| huditimelineservice | ? | ? |
| hudiutilities | 9.69% <ø> (-59.67%) | 0.00 <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...va/org/apache/hudi/utilities/IdentitySplitter.java | 0.00% <0.00%> (-100.00%) | 0.00% <0.00%> (-2.00%) |
| ...va/org/apache/hudi/utilities/schema/SchemaSet.java | 0.00% <0.00%> (-100.00%) | 0.00% <0.00%> (-3.00%) |
| ...a/org/apache/hudi/utilities/sources/RowSource.java | 0.00% <0.00%> (-100.00%) | 0.00% <0.00%> (-4.00%) |
| .../org/apache/hudi/utilities/sources/AvroSource.java | 0.00% <0.00%> (-100.00%) | 0.00% <0.00%> (-1.00%) |
| .../org/apache/hudi/utilities/sources/JsonSource.java | 0.00% <0.00%> (-100.00%) | 0.00% <0.00%> (-1.00%) |
| ...rg/apache/hudi/utilities/sources/CsvDFSSource.java | 0.00% <0.00%> (-100.00%) | 0.00% <0.00%> (-10.00%) |
| ...g/apache/hudi/utilities/sources/JsonDFSSource.java | 0.00% <0.00%> (-100.00%) | 0.00% <0.00%> (-4.00%) |
| ...apache/hudi/utilities/sources/JsonKafkaSource.java | 0.00% <0.00%> (-100.00%) | 0.00% <0.00%> (-6.00%) |
| ...pache/hudi/utilities/sources/ParquetDFSSource.java | 0.00% <0.00%> (-100.00%) | 0.00% <0.00%> (-5.00%) |
| ...lities/schema/SchemaProviderWithPostProcessor.java | 0.00% <0.00%> (-100.00%) | 0.00% <0.00%> (-4.00%) |

... and 415 more

@vburenin
Contributor Author

@vc I wonder how we want to proceed here. It looks like the 1.10.1 dependency is baked into Spark, so I am struggling to make this work properly, as I am potentially getting class conflicts with 1.11.1. Any ideas?
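
One way to confirm a class conflict of this kind is to print which JAR each Parquet class is actually loaded from at runtime. The following is only an illustrative sketch and is not part of this PR: `ClasspathProbe` and `locate` are hypothetical names, and the two probed class names are assumed to be representative of parquet-avro and parquet-column respectively.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of Hudi or Spark).
final class ClasspathProbe {

    // Returns the location (usually a JAR URL) a class is loaded from,
    // or a placeholder string when it cannot be determined.
    static String locate(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathProbe.class.getClassLoader();
        }
        try {
            // 'false' avoids running static initializers of the probed class.
            Class<?> cls = Class.forName(className, false, loader);
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "(bootstrap or unknown source)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Assumed representative classes; adjust to the classes that fail.
        String[] names = {
            "org.apache.parquet.avro.AvroSchemaConverter",
            "org.apache.parquet.schema.MessageType"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If the two classes resolve to JARs with different Parquet versions (for example, one from the Hudi bundle and one from the Spark distribution), that would be consistent with the conflict described above.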

@vinothchandar
Member

@vburenin Yes. The general principle here is to keep the Parquet version aligned with the Spark version, so this is definitely trickier. If we upgrade parquet-avro alone, it could lead to issues as well.

It almost seems like we would need to maintain our own parquet-avro kind of layer? (Again, that seems like a tall order.)

@vburenin
Contributor Author

vburenin commented Mar 1, 2021

One thing we could do is not include the Parquet library in the fat jar, since Spark already comes with it. Then it would be just a trivial JAR swap.

@nsivabalan added the priority:critical (Production degraded; pipelines stalled) label on Mar 2, 2021

@vinothchandar
Member

@vburenin We only bundle parquet-avro, not all of Parquet. So the culprit here is our dependency on parquet-avro.

Can you try building with a different parquet-avro version?

In general, I think we should build differently for Spark 3 and Spark 2, since they may have different Parquet versions and hence need different parquet-avro versions. Is that what you are running into? What Spark version are you on?

@vburenin
Contributor Author

vburenin commented Mar 2, 2021

@vinothchandar I am still running on Spark 2; however, I am using a custom Docker image in which I replaced the older Parquet libraries with newer ones. Technically a hack.

@vinothchandar
Member

@vburenin Got it. As far as Hudi is concerned, we want to keep the Parquet version matched with the Spark version. Spark 3.1 still seems to be on 1.10.1:

https://github.com/apache/spark/blob/branch-3.1/pom.xml#L138

Once #2625 has landed, let's see what Spark land is saying about 1.11.1.

@vinothchandar added the priority:high (Significant impact; potential bugs) label and removed the priority:critical (Production degraded; pipelines stalled) label on Mar 15, 2021

@nsivabalan
Contributor

CC @li36909

@vinothchandar self-assigned this on Sep 7, 2021

@hudi-bot
Collaborator

hudi-bot commented Nov 5, 2021

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

This pull request was closed.

5 participants