Skip to content

Conversation

@nsivabalan
Copy link
Contributor

@nsivabalan nsivabalan commented Mar 3, 2021

What is the purpose of the pull request

Looks like we don't generate spark3 bundles at all(all artifacts in maven is for spark2). this patch is looking to fix it.
As of now, we are required to pass command line args while packaging as below.

mvn clean package -DskipTests -Dspark3 -Dscala-2.12

but the issue with this approach is that, spark bundle that gets created for both spark3_2.12 and spark2_2.12 has the same name. so, we can't differentiate in maven.

hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar

In this patch, modifying the bundle names depending on spark version.

hudi-spark2-bundle_2.12-0.8.0-SNAPSHOT.jar
hudi-spark3-bundle_2.12-0.8.0-SNAPSHOT.jar

respectively for spark2 and spark3.

So, during release process, RM might have to deploy artifacts thrice.

./release/deploy_staging_jars.sh --scala_version=2.11 --spark_version=2
./release/deploy_staging_jars.sh --scala_version=2.12 --spark_version=2
./release/deploy_staging_jars.sh --scala_version=2.12 --spark_version=3

Running the deploy thrice should be fine. Bcoz, even today we run it twice once for scala 11 and once for scala 12. so, common artifacts (which does not depend on scala version) will get overridden 2nd time, but its same as previous run as well. Only those artifacts which has _12 suffix will be added as new artifacts when deploy is called 2nd time.

To be discussed:

  1. I have introduced a new variable called sparkbundle.version to inject "2" or "3" with bundle name. If we were to resort to spark.version, artifact generated will be as follows. Not sure if we really need to reflect the actual spark version in artifact name.
hudi-spark2.4.4.-bundle_2.12-0.8.0-SNAPSHOT.jar
hudi-spark3.0.0-bundle_2.12-0.8.0-SNAPSHOT.jar
  1. Do we need to stop uploading hudi-spark_2.12-0.7.0.jar since we also upload
    hudi-spark2_2.12-0.7.0.jar and hudi-spark3_2.12-0.7.0-jar.
    Basically we have all these for 0.7.0 in maven right now.
    hudi-spark_2.11-0.7.0.jar
    hudi-spark_2.12-0.7.0.jar
    hudi-spark2_2.11-0.7.0.jar
    hudi-spark2_2.12-0.7.0.jar
    hudi-spark3_2.12-0.7.0.jar

Proposal is to avoid uploading first 2 in the above list.

Brief change log

  • Fixed bundle names depending on spark versions.
  • Fixed release script to take in spark version.

Verify this pull request

  • Ran quick start utils for MOR table and read succeeded

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan
Copy link
Contributor Author

nsivabalan commented Mar 3, 2021

@codecov-io
Copy link

codecov-io commented Mar 3, 2021

Codecov Report

Merging #2625 (d3823b4) into master (527175a) will increase coverage by 0.81%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2625      +/-   ##
============================================
+ Coverage     50.94%   51.76%   +0.81%     
- Complexity     3169     3596     +427     
============================================
  Files           433      475      +42     
  Lines         19814    22553    +2739     
  Branches       2034     2406     +372     
============================================
+ Hits          10095    11674    +1579     
- Misses         8900     9861     +961     
- Partials        819     1018     +199     
Flag Coverage Δ Complexity Δ
hudicli 37.01% <ø> (+0.11%) 0.00 <ø> (ø)
hudiclient 100.00% <ø> (ø) 0.00 <ø> (ø)
hudicommon 50.92% <ø> (-0.51%) 0.00 <ø> (ø)
hudiflink 54.28% <ø> (+11.06%) 0.00 <ø> (ø)
hudihadoopmr 33.44% <ø> (+0.28%) 0.00 <ø> (ø)
hudisparkdatasource 70.93% <ø> (+1.20%) 0.00 <ø> (ø)
hudisync 45.70% <ø> (-2.92%) 0.00 <ø> (ø)
huditimelineservice 64.36% <ø> (-2.13%) 0.00 <ø> (ø)
hudiutilities 69.94% <ø> (+0.49%) 0.00 <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
.../java/org/apache/hudi/common/util/CommitUtils.java 43.58% <0.00%> (-28.42%) 6.00% <0.00%> (ø%)
...java/org/apache/hudi/common/util/CleanerUtils.java 47.72% <0.00%> (-15.91%) 6.00% <0.00%> (ø%)
...ache/hudi/common/fs/inline/InMemoryFileSystem.java 79.31% <0.00%> (-10.35%) 15.00% <0.00%> (-1.00%)
...g/apache/hudi/timeline/service/RequestHandler.java 73.37% <0.00%> (-4.36%) 30.00% <0.00%> (+1.00%) ⬇️
...i/common/table/timeline/HoodieDefaultTimeline.java 80.51% <0.00%> (-4.21%) 59.00% <0.00%> (ø%)
...src/main/scala/org/apache/hudi/DefaultSource.scala 84.14% <0.00%> (-4.09%) 17.00% <0.00%> (+2.00%) ⬇️
...hudi/utilities/sources/helpers/KafkaOffsetGen.java 85.84% <0.00%> (-2.94%) 20.00% <0.00%> (+4.00%) ⬇️
...che/hudi/common/table/log/HoodieLogFileReader.java 66.09% <0.00%> (-1.77%) 23.00% <0.00%> (+1.00%) ⬇️
...g/apache/hudi/common/model/WriteOperationType.java 51.42% <0.00%> (-1.70%) 2.00% <0.00%> (ø%)
...e/hudi/common/table/log/HoodieLogFormatWriter.java 78.12% <0.00%> (-1.57%) 26.00% <0.00%> (ø%)
... and 118 more

@nsivabalan nsivabalan changed the title [1568] Adding hudi-spark3-bundle [1568] Fixing spark3 bundles Mar 5, 2021
@garyli1019 garyli1019 added the priority:blocker Production down; release blocker label Mar 8, 2021
@vinothchandar vinothchandar self-assigned this Mar 10, 2021
@vinothchandar vinothchandar added the priority:critical Production degraded; pipelines stalled label Mar 10, 2021
@vinothchandar
Copy link
Member

vinothchandar commented Mar 14, 2021

@nsivabalan thanks for this. I think generating thrice is okay. Don't see any other way, to avoid this.

Whenever we can, we should avoid any breaking changes . So can we not change name of the existing bundle name for spark2. i.e hudi-spark2-bundle_2.12-0.8.0-SNAPSHOT.jar ? We can just generate one for spark3 called hudi-spark3-bundle_2.12-0.8.0-SNAPSHOT.jar

Having sparkbundle.version is okay. can we set this through the spark3 maven profile though? That way if you run a run for spark3, the bundle is called hudi-spark3... ? Keeping this in-sync with spark.version inside pom is a better approach. less variables to remember at the command line . wdyt?

Do we need to stop uploading hudi-spark_2.12-0.7.0.jar

So this is what users who write spark jobs depend on today. We should keep this as-is? They will ultimately compile their application against the right spark version anyway.

Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nsivabalan nsivabalan merged commit 55a489c into apache:master Mar 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:blocker Production down; release blocker priority:critical Production degraded; pipelines stalled

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants