Conversation

@alexeykudinkin
Contributor

@alexeykudinkin alexeykudinkin commented Aug 31, 2022

Change Logs

As part of adding support for Spark 3.3 in Hudi 0.12, a lot of the logic from the Spark 3.2 module was simply copied over.

This PR rectifies that by

  1. Creating a new module "hudi-spark3.2plus-common" (shared across Spark 3.2 and Spark 3.3)
  2. Moving the shared components under "hudi-spark3.2plus-common"

Impact

Low

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@alexeykudinkin alexeykudinkin changed the title [WIP][HUDI-4690] Cleaning up duplicated classes in Spark 3.3 module [HUDI-4690] Cleaning up duplicated classes in Spark 3.3 module Aug 31, 2022
@alexeykudinkin alexeykudinkin changed the title [HUDI-4690] Cleaning up duplicated classes in Spark 3.3 module [HUDI-4691] Cleaning up duplicated classes in Spark 3.3 module Sep 1, 2022
@codope codope self-assigned this Sep 1, 2022
@codope
Member

codope commented Sep 1, 2022

@alexeykudinkin The build for the spark3 profile failed for me locally. Did it pass for you?
It fails for all Spark 3 profiles.

[ERROR] Failed to execute goal on project hudi-utilities_2.12: Could not resolve dependencies for project org.apache.hudi:hudi-utilities_2.12:jar:0.13.0-SNAPSHOT: Failure to find io.streamnative.connectors:pulsar-spark-connector_2.12:jar:2.4.5 in https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of Maven Central has elapsed or updates are forced

@alexeykudinkin
Contributor Author

alexeykudinkin commented Sep 1, 2022

@codope I've realized what the issue you've stumbled upon is: you need to specify the -Dscala-2.12 profile explicitly to make sure the correct version of the Pulsar connector is picked up.

I'm gonna address this issue in this PR as well.
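
For example, a local build invocation with the Scala profile made explicit would look like this (an illustrative command; pick the Spark profile matching your target version):

mvn clean package -DskipTests -Dspark3.2 -Dscala-2.12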

</dependency>

<!-- Hoodie -->
<dependency>
Contributor Author

These are not necessary -- we can just depend on the terminal "hudi_spark" module

<dependency>
<groupId>org.apache.hudi</groupId>
- <artifactId>hudi-common</artifactId>
+ <artifactId>hudi-spark_${scala.binary.version}</artifactId>
Contributor Author

GH really weirdly shows re-ordering of some of these.

Changes:

  • Removed unnecessary deps (see comment above)
  • Re-ordered Spark deps to be grouped together (for easier discovery)


} else if (HoodieSparkUtils.gteqSpark3_1) {
- val spark31ResolveAlterTableCommandsClass = "org.apache.spark.sql.hudi.Spark312ResolveHudiAlterTableCommand"
+ val spark31ResolveAlterTableCommandsClass = "org.apache.spark.sql.hudi.Spark31ResolveHudiAlterTableCommand"
Member

Is this a problem if not all writers and readers use the same Hudi version? Maybe not in this case, but just calling it out to think through the change of class names. I had encountered one issue while doing the HBase upgrade (however, that was because we actually wrote the KV comparator class name in data files).

Contributor Author

Yeah, in this case, class renames don't have any impact.

sparkAdapter.getCatalogUtils.asInstanceOf[HoodieSpark3CatalogUtils]
.unapplyBucketTransform(t)
}

Member

nit: extra newline

* </ol>
*/
def unapplyBucketTransform(t: Transform): Option[(Int, Seq[NamedReference], Seq[NamedReference])]

Member

nit: extra newline

override def output: Seq[Attribute] = Nil
override def unapplyBucketTransform(t: Transform): Option[(Int, Seq[NamedReference], Seq[NamedReference])] =
t match {
case BucketTransform(numBuckets, ref) => Some(numBuckets, Seq(ref), Seq.empty)
Member

Why is sorted refs an empty sequence? Did BucketTransform not support a sorted refs argument in Spark 3.1 or 3.2?

Contributor Author

Correct, for whatever reason they just submit a single ref.


// TODO: we should remove this file when we support datasourceV2 for hoodie on spark3.1x
- case class AlterTableCommand312(table: CatalogTable, changes: Seq[TableChange], changeType: ColumnChangeID) extends RunnableCommand with Logging {
+ case class Spark31AlterTableCommand(table: CatalogTable, changes: Seq[TableChange], changeType: ColumnChangeID) extends RunnableCommand with Logging {
Member

I see that the refactoring has been done with the assumption that Spark won't break things between patch versions. I think it's a fair assumption. However, it cannot be guaranteed. Just something to be cautious about in future.

Contributor Author

We did refine our Spark compatibility model in 0.11: we now promise we'd stay compatible w/ ALL versions w/in a minor branch.

import org.apache.spark.sql.internal.SQLConf.LegacyBehaviorPolicy
import org.apache.spark.util.Utils

object Spark32DataSourceUtils {
Member

Perhaps due to the removal of this, we need that legacy behavior policy in the Avro deserializer. It would be good to move some of the comments there.

Contributor Author

@alexeykudinkin alexeykudinkin Sep 2, 2022

Nope, not propagating the config was a miss before (b/c of the duplication of the classes, it was handled in 3.2, but not in 3.1).
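
For reference, a minimal sketch of the propagation being fixed here, mirroring the hunk shown further down in this review (this sits inside the deserializer wrapper where rootAvroType and rootCatalystType are fields; AvroDeserializer's constructor differs across Spark versions, so the exact trailing argument is illustrative):

import org.apache.spark.sql.avro.AvroDeserializer
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf.LegacyBehaviorPolicy

private val avroDeserializer = {
  // Resolve the datetime rebase mode from the session config rather than
  // falling back to the deserializer's defaults (the miss described above)
  val avroRebaseModeInRead = LegacyBehaviorPolicy.withName(
    SQLConf.get.getConf(SQLConf.LEGACY_AVRO_REBASE_MODE_IN_READ))
  new AvroDeserializer(rootAvroType, rootCatalystType, avroRebaseModeInRead.toString)
}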

* Please check out HUDI-4178 for more details
*/
- class Spark3DefaultSource extends DefaultSource with DataSourceRegister /* with TableProvider */ {
+ class Spark32PlusDefaultSource extends DefaultSource with DataSourceRegister /* with TableProvider */ {
Member

We can remove all the commented-out parts in this class.

Contributor Author

The plan is to restore it once we migrate to DSv2

pom.xml Outdated
<hudi.spark.module>hudi-spark3.2.x</hudi.spark.module>
<hudi.spark.common.module>hudi-spark3-common</hudi.spark.common.module>
<!-- This glob has to include hudi-spark3-common, hudi-spark3.2plus-common -->
<hudi.spark.common.modules.glob>hudi-spark3*-common</hudi.spark.common.modules.glob>
Member

Does it make sense to avoid the glob pattern and define separate parameters like hudi.spark2.common.module, hudi.spark3.common.module, hudi.spark32plus.common.module, hudi.spark33plus.common.module (in the future)? It makes it easier to cherry-pick; otherwise we need to maintain the glob pattern every so often.

Contributor Author

@alexeykudinkin alexeykudinkin Sep 2, 2022

Yeah, not a big fan of globbing (it's quite brittle). Let me try to have a separate property (the only reason I opted for globbing initially was b/c I wasn't sure if Maven would be able to handle an empty clause, since, for ex, this parameter would be empty for Spark 2 and 3.1).
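
A hypothetical shape of the explicit-property approach (profile and property names here are illustrative, not necessarily what was ultimately merged):

<!-- Default in the parent pom: empty for profiles w/o such a module (e.g. Spark 2, Spark 3.1) -->
<hudi.spark32plus.common.module></hudi.spark32plus.common.module>

<!-- Overridden in the profiles that do have it -->
<profile>
  <id>spark3.2</id>
  <properties>
    <hudi.spark.common.module>hudi-spark3-common</hudi.spark.common.module>
    <hudi.spark32plus.common.module>hudi-spark3.2plus-common</hudi.spark32plus.common.module>
  </properties>
</profile>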

Contributor

+1 prefer to explicitly add

@nsivabalan nsivabalan self-assigned this Sep 3, 2022
@yihua yihua added engine:spark Spark integration priority:blocker Production down; release blocker labels Sep 5, 2022
Contributor

@nsivabalan nsivabalan left a comment

Few minor comments; mostly looks OK to me.
I have 2 asks though:

  • Let's run integration tests for all diff Spark versions and ensure a green report.
  • Let's test out slim bundles as well (as I see some changes in the slim bundle poms as well).

val spark3Analysis: RuleBuilder =
session => ReflectionUtils.loadClass(spark3AnalysisClass, session).asInstanceOf[Rule[LogicalPlan]]

val spark3ResolveReferencesClass = "org.apache.spark.sql.hudi.analysis.HoodieSpark3ResolveReferences"
Contributor

Maybe not related to the changes in this patch, but "HoodieSpark3Analysis" actually refers to 3.2 or greater, right? The naming is not right; I don't see it being used for 3.1 below.

Contributor

Maybe you can name the variables in L78, 79 accordingly.

Contributor Author

This class is actually deleted



- org.apache.hudi.Spark3xDefaultSource
+ org.apache.hudi.Spark31DefaultSource
Contributor

Should we keep it as Spark31xDefaultSource?

Contributor Author

We're using 2-digit nomenclature to designate the Spark version in class names (Spark31, Spark32, etc.).


- private val avroDeserializer = new AvroDeserializer(rootAvroType, rootCatalystType)
+ private val avroDeserializer = {
+ val avroRebaseModeInRead = LegacyBehaviorPolicy.withName(SQLConf.get.getConf(SQLConf.LEGACY_AVRO_REBASE_MODE_IN_READ))
Contributor

As per master, I see this conf is of interest only for Spark 3.2 and above. But now we are also adding it for Spark 3.1. Is that intentional?

case IdentityTransform(FieldReference(Seq(col))) =>
identityCols += col

case MatchBucketTransform(numBuckets, col, sortCol) =>
Contributor

Is this some bug fix? It doesn't look like pure refactoring.

Contributor Author

This is a fix for BucketTransform having changed b/w Spark 3.2 and 3.3.
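
For context, Spark 3.3 extended BucketTransform to carry both the bucket columns and the sorted columns, so the 3.3-side extractor destructures three fields instead of two. A sketch, assuming Spark 3.3's BucketTransform(numBuckets: Literal[Int], columns: Seq[NamedReference], sortedColumns: Seq[NamedReference]):

import org.apache.spark.sql.connector.expressions.{BucketTransform, NamedReference, Transform}

def unapplyBucketTransform(t: Transform): Option[(Int, Seq[NamedReference], Seq[NamedReference])] =
  t match {
    // Spark 3.3: bucket and sorted columns are both sequences; numBuckets is a Literal[Int]
    case BucketTransform(numBuckets, cols, sortedCols) => Some((numBuckets.value, cols, sortedCols))
    case _ => None
  }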

import java.net.URI
import java.util
- import scala.collection.JavaConverters.{mapAsScalaMapConverter, setAsJavaSetConverter}
+ import scala.jdk.CollectionConverters.{mapAsScalaMapConverter, setAsJavaSetConverter}
Contributor

Was this an intentional change? I see we widely use scala.collection.JavaConverters. If you prefer to replace all usages, can we file a tracking JIRA and do it separately?

Contributor Author

JavaConverters are not available in Scala 2.11
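
For reference, a minimal usage sketch of the new-style import as used in this diff (on Scala 2.12, the scala.jdk.CollectionConverters package is supplied by the scala-collection-compat shim this PR adds as a dependency; the example mirrors the converter names from the import above):

import scala.jdk.CollectionConverters.mapAsScalaMapConverter

val javaProps = new java.util.HashMap[String, String]()
javaProps.put("hoodie.table.name", "trips")
// asScala wraps the Java map in place rather than copying it
val tableName = javaProps.asScala("hoodie.table.name")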

<exclude>ch.qos.logback:logback-classic</exclude>
<!-- NOTE: We're banning any HBase deps versions other than the approved ${hbase.version},
which is aimed at preventing the classpath collisions w/ transitive deps usually) -->
<exclude>org.apache.hbase:hbase-common:*</exclude>
Contributor

@yihua: can you review the HBase-related changes? They mostly should be good, just wanted to double-confirm.

Contributor

LGTM. In terms of the actual impact, does this only resolve issues for Spark 3 due to transitive dependencies of Hive? I assume without this change, Spark 2 should still work.

Contributor Author

These should resolve issues for all Spark versions


<dependency>
<groupId>org.apache.hudi</groupId>
<artifactId>hudi-spark-common_${scala.binary.version}</artifactId>
<artifactId>hudi-common</artifactId>
Contributor

hudi-spark pulls in hudi-common as well, right? Do we need to explicitly depend on it here?

Member

@nsivabalan the note above explained it


variables:
- BUILD_PROFILES: '-Dscala-2.11 -Dspark2 -Dflink1.14'
+ BUILD_PROFILES: '-Dscala-2.11 -Dspark2.4 -Dflink1.14'
Member

Shall we update the docs/readme as well?

<artifactId>hive-service</artifactId>
<version>${hive.version}</version>
<scope>${utilities.bundle.hive.scope}</scope>
<exclusions>
Member

hive-* and hbase-* dependencies are already shaded. Wouldn't that be sufficient? Why do we need to exclude them here?

Contributor Author

Hive brings its own HBase deps that collide w/ the ones we depend on (we need 2.4.9, Hive brings 1.1.1).
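
A sketch of the exclusion on the hive-service dependency shown above (the actual PR may exclude more HBase artifacts than just hbase-common):

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-service</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>
  <exclusions>
    <!-- Keep Hive's transitive HBase 1.1.1 off the classpath; we pin ${hbase.version} (2.4.9) -->
    <exclusion>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-common</artifactId>
    </exclusion>
  </exclusions>
</dependency>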

Member

@xushiyan xushiyan left a comment

LGTM, except the modules.glob part. Also, it's hard to spot any further issues from the pom diff; this requires more bundle testing with different combinations.

@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Member

@codope codope left a comment

LGTM. Thanks for addressing the comments.

@codope codope merged commit 8c296e0 into apache:master Sep 14, 2022

<dependency>
<groupId>org.scala-lang.modules</groupId>
<artifactId>scala-collection-compat_${scala.binary.version}</artifactId>
Contributor

Hey, these 2 dependencies affect the hudi-flink module, and the local tests cannot run now because Flink 1.15.x requires Scala 2.12.x. Can you fix that?
