
Conversation

Contributor

@alexeykudinkin alexeykudinkin commented Feb 6, 2023

Change Logs

NOTE: This is a squashed version of the original PR #6361

This PR cleans up a considerable amount of Spark's (internal) resolution logic that has been copied into Hudi components even though there's no actual need for it: instead, we can rely on Spark itself to resolve most of these constructs and plug in our custom implementations (post hoc) after resolution has been performed.

Issues this will be addressing (among others):

Impact

Resolves many Spark SQL issues stemming from the inconsistent implementation of Hudi's bespoke resolution rules.

Risk level (write none, low, medium or high below)

Medium

This is being verified through a combination of:

  • CI (GH/Azure)
  • Long running Jenkins job (verifying Spark DS aspect of it)
  • Benchmarking

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

* of the [[LogicalRelation]] resolving into Hudi table. Note that, it's a safe operation since we
* actually need to ignore these values anyway
*/
case class AdaptIngestionTargetLogicalRelations(spark: SparkSession) extends Rule[LogicalPlan] {
Contributor Author

This is the core of the change:

  • Here we remove the implementation of the bespoke resolution rules for Hudi components and instead rely on Spark to resolve these constructs (most of them don't carry any custom logic relative to vanilla Spark SQL and are therefore perfectly well served by the standard Spark resolution rules)
  • These rules are repurposed to serve as a conversion point from Spark's standard constructs (like MergeIntoTable) into Hudi's counterparts that implement Hudi-specific semantics (MergeIntoHoodieTableCommand). Note that we require these constructs to be fully resolved before we attempt to convert them; see the sketch after this list
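A minimal sketch of what such a post-resolution conversion rule looks like (rule name and matching logic are simplified here; the actual rule also verifies that the target is a Hudi table):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, MergeIntoTable}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch only: convert Spark's already-resolved MergeIntoTable into the
// Hudi-specific command instead of re-implementing resolution for it
case class ConvertIntoHoodieCommands(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
    // Only convert once Spark has fully resolved the statement
    case mit: MergeIntoTable if mit.resolved =>
      MergeIntoHoodieTableCommand(mit)
  }
}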

*
* @param sparkSession
*/
case class HoodieResolveReferences(sparkSession: SparkSession) extends Rule[LogicalPlan]
Contributor Author

This is an example of a custom rule that has been removed completely.

*
* TODO explain workflow for MOR tables
*/
case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends HoodieLeafRunnableCommand
Contributor Author

Changes to this command were required due to

  • Switching from bespoke resolution to Spark's standard one: we need to abide by Spark's semantics and had to drop some of the customizations implemented previously
  • Cleaning up the code considerably, simplifying the implementation by getting rid of custom utilities and replacing them with Spark's standard ones (for example, to resolve and bind expressions); a sketch follows this list
  • Adding documentation to elaborate on the overall workflow
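For instance, binding a resolved expression against a plan's output can lean on Spark's standard utilities rather than a bespoke helper. A hedged sketch, not the exact code from the command:

import org.apache.spark.sql.catalyst.expressions.{Attribute, BindReferences, Expression}

// Sketch: rely on Spark's standard BindReferences instead of a custom binding
// utility; `expr` must already be resolved against `output`
def bindExpression(expr: Expression, output: Seq[Attribute]): Expression =
  BindReferences.bindReference(expr, output)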

private val resolver = spark.sessionState.conf.resolver

override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
case m @ MergeIntoTable(targetTable, sourceTable, _, _, _)
Contributor Author

These rules were borrowed from Spark 3.1.3 to bring support for the MERGE INTO statement to Spark 2.x, which doesn't have it out of the box.


public static ColumnUpdateChange get(InternalSchema schema) {
return new ColumnUpdateChange(schema);
private ColumnUpdateChange(InternalSchema schema) {
Contributor Author

These changes are necessary to properly handle various case-sensitivity settings
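The general pattern is simply to pick the field-name comparator based on that setting. A generic, hypothetical illustration (not Hudi's actual API):

// Hypothetical sketch of case-sensitivity handling: choose how field names are
// compared depending on the caseSensitive flag
def fieldNameMatcher(caseSensitive: Boolean): (String, String) => Boolean =
  if (caseSensitive) (a, b) => a == b
  else (a, b) => a.equalsIgnoreCase(b)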

@alexeykudinkin alexeykudinkin removed the priority:blocker Production down; release blocker label Feb 7, 2023
@codope codope added the priority:high Significant impact; potential bugs label Feb 7, 2023
@alexeykudinkin alexeykudinkin added priority:critical Production degraded; pipelines stalled and removed priority:high Significant impact; potential bugs labels Feb 7, 2023
@xiarixiaoyao xiarixiaoyao self-assigned this Feb 8, 2023
}

object HoodieCatalystExpressionUtils {
object HoodieCatalystExpressionUtils extends SparkAdapterSupport {
Member

Why does it need to extend SparkAdapterSupport? Is there something that changes across Spark versions?

Contributor Author

Yes, the adapter is needed to match the Cast expression (the MatchCast object below).
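For context, a simplified sketch of what such an extractor can look like; the real MatchCast goes through the SparkAdapter so that cast variants whose shape differs across Spark releases (for example ANSI casts) are also covered:

import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.types.DataType

// Simplified sketch (not the actual Hudi implementation): match cast-like
// expressions uniformly regardless of the Spark version's Cast arity
object MatchCast {
  def unapply(expr: Expression): Option[(Expression, DataType)] = expr match {
    case c: Cast => Some((c.child, c.dataType))
    case _       => None
  }
}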

*/
def createInsertInto(table: LogicalPlan, partition: Map[String, Option[String]],
query: LogicalPlan, overwrite: Boolean, ifPartitionNotExists: Boolean): LogicalPlan
// TODO scala-docs
Member

add docs or remove the comment?
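For illustration only, a hedged sketch of the kind of scaladoc that could fill that TODO; the parameter descriptions are inferred from the signature, not the author's wording:

/**
 * Creates a Spark version-appropriate INSERT INTO logical plan.
 *
 * @param table                 resolved logical plan of the target table
 * @param partition             static partition spec (partition column -> optional literal value)
 * @param query                 logical plan producing the rows to insert
 * @param overwrite             whether existing data (or matching partitions) should be overwritten
 * @param ifPartitionNotExists  insert only if the target static partition does not already exist
 */
def createInsertInto(table: LogicalPlan, partition: Map[String, Option[String]],
                     query: LogicalPlan, overwrite: Boolean, ifPartitionNotExists: Boolean): LogicalPlan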

*/
public static <T, U> U reduce(Collection<T> c, U identity, BiFunction<U, T, U> reducer) {
return c.stream()
.sequential()
Member

Does it have to be strictly sequential? I mean, the elements of the collection should be independent of each other. Is there any value in parameterizing this behavior, say by adding a boolean shouldReduceParallelly?

Contributor Author

Reducing is an inherently sequential operation. In this case I'm just creating a convenience wrapper that enforces this in exchange for simplifying the API provided by streams (if someone wants to do it in parallel, they can use the Streams API directly).
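Put in Scala terms, a reduce with an identity and an accumulator of type (U, T) => U (and no combiner) is just a left fold, which makes the sequential, order-preserving nature explicit. A hedged analogue, not the Java helper itself:

// Scala analogue of the Java helper above: without a combiner, reduction
// degenerates to a left fold and is inherently sequential
def reduce[T, U](c: Iterable[T], identity: U)(reducer: (U, T) => U): U =
  c.foldLeft(identity)(reducer)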

protected final boolean caseSensitive;

BaseColumnChange(InternalSchema schema) {
this(schema, false);
Member

Why is the default caseSensitive false?

Contributor Author

To keep things compatible with how they are today.

val resolver = sparkSession.sessionState.analyzer.resolver
val metaFieldMetadata = sparkAdapter.createCatalystMetadataForMetaField

// TODO elaborate
Member

I think the code below speaks for itself, but please remove the TODO and add a comment if you think it needs a description.

//updateActions.foreach(update =>
// assert(update.assignments.length == targetTableSchema.length,
// s"The number of update assignments[${update.assignments.length}] must equal to the " +
// s"targetTable field size[${targetTableSchema.length}]"))
Member

remove the commented-out part?

}

def toStructType(attrs: Seq[Attribute]): StructType =
StructType(attrs.map(a => StructField(a.qualifiedName.replace('.', '_'), a.dataType, a.nullable, a.metadata)))
Member

Why do we need to replace periods with underscores?

Contributor Author

Here we combine the output of the joining operation back into a schema so that we can assert that the records we receive in ExpressionPayload still adhere to the expected schema. We need to replace the dots because:

  • names could be qualified (in the case of nested structs; note that this schema is essentially a flattened one)
  • field names can't contain dots (see the illustration below)
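A hedged illustration of the renaming (the attribute and its qualifier are made up):

import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical example: an attribute qualified as `t.price` has
// qualifiedName == "t.price"; after the replacement the resulting schema
// contains a field named "t_price" instead of a dotted name that would be
// read as a nested-field reference
val attr = AttributeReference("price", DoubleType)(qualifier = Seq("t"))
val schema = StructType(Seq(
  StructField(attr.qualifiedName.replace('.', '_'), attr.dataType, attr.nullable, attr.metadata)))
// schema.fieldNames => Array("t_price")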


// NOTE: We need to set "spark.testing" property to make sure Spark can appropriately
// recognize environment as testing
System.setProperty("spark.testing", "true")
Member

Interesting! So is this property there for all Spark versions, and what are the benefits of setting it?

Contributor Author

It lets us set some configs that aren't configurable in a production environment.

*
* PLEASE REFRAIN MAKING ANY CHANGES TO THIS CODE UNLESS ABSOLUTELY NECESSARY
*/
object HoodieSpark2Analysis {
Member

I thought MIT (MERGE INTO) is supported in Spark 2.x. We have unit tests for MIT that run for Spark 2.x as well, don't we? If it's not supported, should we make that clear in the quickstart guide as well? And should we add this support in a separate PR?

Contributor Author

It's supported.

The difference is that previously Hudi's bespoke MIT resolution logic was applied to all Spark versions, not only Spark 2.x. Now:

  • For Spark 3.x we rely on Spark's own resolution logic
  • For Spark 2.x we use the code back-ported from Spark 3.1.x

*
* Check out HUDI-4178 for more details
*/
case class HoodieDataSourceV2ToV1Fallback(sparkSession: SparkSession) extends Rule[LogicalPlan]
Member

Can you help me understand why this is only needed for Spark 3.2.x?

Contributor Author

Because we're using DSv2 only in HoodieCatalog, which is only available for Spark >= 3.2.

Member

@codope codope left a comment

@alexeykudinkin Overall, looks good to me. Thanks for responding to my queries. Please address some of the minor comments.
@xiarixiaoyao Could you also review this PR?

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-runs the last Azure build

@codope codope merged commit 812950b into apache:master Feb 23, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…7871)

Cleans up a considerable amount of Spark's (internal) resolution logic
that has been copied into Hudi components even though there's no actual
need for it: instead, we can rely on Spark itself to resolve most of
these constructs and plug in our custom implementation (post hoc) after
resolution has been performed.

Labels

  • area:sql SQL interfaces
  • engine:spark Spark integration
  • priority:critical Production degraded; pipelines stalled
