Conversation

@southernriver southernriver commented Aug 19, 2022

Change Logs

NA

Impact

NA

Risk level (write none, low, medium or high below)

medium

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@southernriver southernriver changed the title Spark3.2: Support DataSourceV2 Read [HUDI-4449]: Support DataSourceV2 Read for Spark3.2 Aug 19, 2022
@southernriver southernriver changed the title [HUDI-4449]: Support DataSourceV2 Read for Spark3.2 [HUDI-4449] Support DataSourceV2 Read for Spark3.2 Aug 19, 2022
@XuQianJin-Stars XuQianJin-Stars self-assigned this Aug 19, 2022
}

override protected def test(testName: String, testTags: Tag*)(testFun: => Any /* Assertion */)(implicit pos: source.Position): Unit = {
override protected def test(testName: String, testTags: Tag*)(testFun: => Any /* Assertion */)(implicit pos: org.scalactic.source.Position): Unit = {
Contributor

unnecessary change?

Author

Since I have introduced the org.apache.spark.sql.hudi.source package directory for HoodieBatchScan, this is just to avoid a conflict.

test("Test Query None Partitioned Table") {
withTempDir { tmp =>
val tableName = generateTableName
spark.conf.set("hoodie.datasource.v2.read.enable", "true")
Contributor

@leesf leesf Aug 20, 2022

do we also support spark.sql("set hoodie.datasource.v2.read.enable = true")?

Author

going to support this.
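
(For reference, a minimal sketch of why both paths can work, assuming the flag is simply read from the session conf; the object and method names below are illustrative, not from this PR. Both spark.conf.set and the SQL SET command write into the session's runtime conf, so a single lookup covers both.)

import org.apache.spark.sql.SparkSession

object V2ReadFlag {
  // Key name taken from this PR; treating "unset" as disabled is an assumption.
  val Key = "hoodie.datasource.v2.read.enable"

  def isV2ReadEnabled(spark: SparkSession): Boolean =
    spark.conf.getOption(Key).exists(_.trim.equalsIgnoreCase("true"))
}

// Either of these should then enable the flag for the current session:
//   spark.conf.set("hoodie.datasource.v2.read.enable", "true")
//   spark.sql("set hoodie.datasource.v2.read.enable = true")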

Seq(2, "a2", 12.0, 1000)
)

assertThrows[HoodieDuplicateKeyException] {
Contributor

this is unnecessary?

}

override def pruneColumns(structType: StructType): Unit = {
// TODO support prune columns.
Contributor

able to support column pruning in this PR?

Author

I will open another PR to support this.
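
(A rough sketch of the direction such a follow-up could take, using Spark's SupportsPushDownRequiredColumns mix-in; everything here other than the Spark interfaces is illustrative, not this PR's code.)

import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

class PruningScanBuilderSketch(fullSchema: StructType) extends ScanBuilder
    with SupportsPushDownRequiredColumns {

  // Schema the scan will actually read; starts out as the full table schema.
  private var prunedSchema: StructType = fullSchema

  override def pruneColumns(requiredSchema: StructType): Unit = {
    // Spark passes only the columns the query plan actually needs.
    prunedSchema = requiredSchema
  }

  override def build(): Scan = buildScan(prunedSchema)

  // Placeholder for building something like HoodieBatchScan over the narrowed schema.
  private def buildScan(schema: StructType): Scan = ???
}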

}

override def createReaderFactory(): PartitionReaderFactory = {
hadoopConf.set(ParquetInputFormat.READ_SUPPORT_CLASS, classOf[ParquetReadSupport].getName)
Contributor

not support ORC?

Author

In my general testing, I found that upstream Spark SQL does not currently support the ORC format here; it may need another PR to support this.

options: Map[String, String],
@transient hadoopConf: Configuration) extends Batch with Scan {

override def planInputPartitions(): Array[InputPartition] = {
Contributor

is the code below copied from the Spark codebase?

Author

For the COW Parquet datasource, we can reuse part of this code.
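
(For orientation, a rough sketch of what a V2 planInputPartitions for a COW Parquet table can boil down to, written against Spark 3.2 internals; this is not the PR's code, and it does not split large files by block.)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connector.read.InputPartition
import org.apache.spark.sql.execution.datasources.{FilePartition, HadoopFsRelation, PartitionDirectory, PartitionedFile}

// List files through the relation's FileIndex and bin-pack them into FilePartitions,
// the same grouping the V1 FileSourceScanExec uses. One file per PartitionedFile.
def planInputPartitionsSketch(spark: SparkSession, relation: HadoopFsRelation): Array[InputPartition] = {
  val selected: Seq[PartitionDirectory] = relation.location.listFiles(Nil, Nil)
  val maxSplitBytes = FilePartition.maxSplitBytes(spark, selected)
  val files: Seq[PartitionedFile] = selected.flatMap { dir =>
    dir.files.map(f => PartitionedFile(dir.values, f.getPath.toUri.toString, 0, f.getLen))
  }
  FilePartition.getFilePartitions(spark, files, maxSplitBytes).toArray[InputPartition]
}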

@leesf leesf self-assigned this Aug 20, 2022
Contributor

leesf commented Aug 20, 2022

is this PR only aimed at supporting Spark 3.2.0 and not Spark 3.2.1/3.2.2 or 3.3.0? Or are there follow-up tasks to support other versions?

@southernriver
Author

@leesf thanks for your time reviewing this. I found that there are several problems with this PR, since upstream has reverted the V2 datasource:

class Spark3DefaultSource extends DefaultSource with DataSourceRegister /* with TableProvider */ {
I'll update soon.

@yihua yihua added the engine:spark Spark integration label Aug 23, 2022
Contributor

yihua commented Aug 23, 2022

@alexeykudinkin FYI

* Convert Filter to Catalyst Expression. If the conversion succeeds, return a non-empty
* Option[Expression]; otherwise return None.
*/
def convertToCatalystExpression(filter: Filter, tableSchema: StructType): Option[Expression] = {
Contributor

Let's make sure we're not duplicating utilities we already have (there's HoodieCatalystExpressionUtils doing exactly the same thing)

Author

fixed.
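
(For context, a small illustration of what such a Filter-to-Catalyst translation does for two simple filter types; this is a sketch only, and the PR now delegates to the existing HoodieCatalystExpressionUtils instead.)

import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression, GreaterThan, Literal}
import org.apache.spark.sql.sources
import org.apache.spark.sql.types.StructType

// Translate a handful of V1 source filters into Catalyst expressions; anything
// unsupported falls through as None so the caller can keep it as a post-scan filter.
def convertToCatalystExpressionSketch(filter: sources.Filter, tableSchema: StructType): Option[Expression] =
  filter match {
    case sources.EqualTo(attr, value) if tableSchema.fieldNames.contains(attr) =>
      Some(EqualTo(UnresolvedAttribute(attr), Literal(value)))
    case sources.GreaterThan(attr, value) if tableSchema.fieldNames.contains(attr) =>
      Some(GreaterThan(UnresolvedAttribute(attr), Literal(value)))
    case _ => None
  }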

/**
* Create Like expression.
*/
def createLike(left: Expression, right: Expression): Expression
Contributor

What do we need this one for?
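
(For context, a plausible Spark 3.2 body for such an adapter method, sketched here rather than taken from the PR; Spark 3.x's Like expression takes an explicit escape character, which is presumably why a factory helper is handy.)

import org.apache.spark.sql.catalyst.expressions.{Expression, Like}

// Sketch: build a Catalyst Like expression using the default backslash escape character.
def createLike(left: Expression, right: Expression): Expression =
  Like(left, right, '\\')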

* NOTE: This method is specific to Spark 3.2.0
*/
private def createParquetFilters(args: Any*): ParquetFilters = {
def createParquetFilters(args: Any*): ParquetFilters = {
Contributor

Why are we down-grading these?

Author

For SparkBatch, I have reused some functions from outside the object Spark32PlusHoodieParquetFileFormat.

Contributor

Sorry, I don't think I understand. Can you please elaborate? Where are these used?

override def build(): Scan = {
val relation = new DefaultSource().createRelation(new SQLContext(spark), options)
relation match {
case HadoopFsRelation(location, partitionSchema, dataSchema, _, _, options) =>
Contributor

The whole idea of the V2 integration is bypassing V1 concepts such as HadoopFsRelation, which unfortunately are very limiting for Hudi features.

What's our plan here?

Contributor

Agree, it is still a V1 scan here.

Author

It's really a huge amount of work to reimplement the DataSourceV2 APIs for all Hudi tables. My initial idea is to use the lower-level V2 read interface first, to support features like SupportsPushDownFilters and SupportsRuntimeFiltering, and I will continue to follow up on the implementation of a new HudiTableScan.
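
(A rough sketch of that incremental direction: a ScanBuilder that accepts filter push-down while the rest of the read path stays as it is. Only the Spark interfaces are real here; the class and the placeholder toHoodieScan are illustrative, not this PR's code.)

import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
import org.apache.spark.sql.sources.Filter

class HoodieFilterPushDownSketch extends ScanBuilder with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    // Remember everything and hand it all back as post-scan filters, i.e. the
    // push-down is advisory only; a real implementation would split the set.
    pushed = filters
    filters
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def build(): Scan = toHoodieScan(pushed)

  // Placeholder for whatever Scan the eventual implementation builds.
  private def toHoodieScan(filters: Array[Filter]): Scan = ???
}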

Contributor

Agreed, it's no small effort, and we should definitely think about how we can approach it incrementally.
But we also need to be mindful that we can't keep things in a transitory state for long -- we can't have the integration spread b/w V1 and V2; we have to commit and migrate fully either way, w/ a clear plan for deprecating the one we move away from.

I'd suggest that for a feature of this scale we should actually write a proper RFC to cover all of these concerns. What do you think?

Contributor

BTW this will be greatly helped by RFC-64, which we're currently working on, so we should sync and sequence our efforts to make sure there's minimal to no duplication in what we do.

// TODO: if you move this into the closure it reverts to the default values.
// If true, enable using the custom RecordReader for parquet. This only works for
// a subset of the types (no complex types).
val resultSchema: StructType = StructType(partitionSchema.fields ++ requiredSchema.fields)
Contributor

Do we really need to clone the whole of Hudi's custom ParquetFileFormat impl here?

Author

There is a minor change in how createReaderFactory is constructed; we may extend it to support more accurate task planning.

Contributor

Sorry, I'm still not sure I understand what we're copying this for. Can you please elaborate on what exactly you're trying to modify here?

We should copy code from Spark only in exceptional circumstances when there's just no other way around it. Otherwise we should avoid doing that at all costs.


spark.sql(s"set hoodie.datasource.v2.read.enable=true")

val query =String.format("SELECT f.id, f.price, f.ts, f.dt, f.name FROM %s f JOIN dim d ON f.name = d.name AND d.id = 1 ORDER BY id", tableName)
Contributor

= String

@southernriver
Author

@leesf @alexeykudinkin @XuQianJin-Stars Thanks for your time. I've been busy with other things these days, sorry for the late reply.

Collaborator

hudi-bot commented Nov 5, 2022

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

new HoodieBatchScanBuilder(spark, hoodieCatalogTable, scanOptions)
}

private def buildHoodieScanConfig(caseInsensitiveStringMap: CaseInsensitiveStringMap,
Contributor

We can inline this method (it's trivial)

@alexeykudinkin alexeykudinkin changed the title [HUDI-4449] Support DataSourceV2 Read for Spark3.2 [WIP][HUDI-4449] Support DataSourceV2 Read for Spark3.2 Nov 10, 2022
@vinothchandar vinothchandar self-assigned this Feb 16, 2023
@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 26, 2024
Contributor

@yihua yihua left a comment

Closing this stale PR. We're revisiting Spark Datasource V2 support in a new effort. cc @geserdugarov

@yihua yihua closed this Dec 13, 2025
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Hudi PR Support Dec 13, 2025

Labels

  • big-needle-movers
  • engine:spark (Spark integration)
  • priority:critical (Production degraded; pipelines stalled)
  • size:L (PR with lines of changes in (300, 1000])
