diff --git a/rfc/README.md b/rfc/README.md
index 4c9396a8cce16..647c4fd0748a6 100644
--- a/rfc/README.md
+++ b/rfc/README.md
@@ -133,3 +133,4 @@ The list of all RFCs can be found here.
 | 95 | Hudi Flink Source | `UNDER REVIEW` |
 | 96 | Introduce Unified Bucket Index | `UNDER REVIEW` |
 | 97 | Deprecate Hudi Payload Class Usage | `UNDER REVIEW` |
+| 98 | [Spark Datasource V2 Read](./rfc-98/rfc-98.md) | `UNDER REVIEW` |
diff --git a/rfc/rfc-98/initial_integration_with_Spark.jpg b/rfc/rfc-98/initial_integration_with_Spark.jpg
new file mode 100755
index 0000000000000..cc21c283c3ae6
Binary files /dev/null and b/rfc/rfc-98/initial_integration_with_Spark.jpg differ
diff --git a/rfc/rfc-98/rfc-98.md b/rfc/rfc-98/rfc-98.md
new file mode 100644
index 0000000000000..aea34cb5887f9
--- /dev/null
+++ b/rfc/rfc-98/rfc-98.md
@@ -0,0 +1,76 @@
+
+# RFC-98: Spark Datasource V2 Read
+
+## Proposers
+
+- @geserdugarov
+
+## Approvers
+
+- @
+
+## Status
+
+Umbrella ticket: [HUDI-4449](https://issues.apache.org/jira/browse/HUDI-4449)
+
+## Abstract
+
+Data source is one of the foundational APIs in Spark, with two major versions known as "V1" and "V2".
+The representation of a read in the physical plan differs depending on the API version used.
+Adopting the V2 API is essential for finer control over the data source, deeper integration with the Spark optimizer, and better overall performance.
+
+The first steps towards integrating Spark Datasource V2 were taken in [RFC-38](../rfc-38/rfc-38.md).
+However, advertising a Hudi table as V2 without actually implementing certain parts of the API, and falling back to the V1 API through a custom relation rule, caused multiple issues.
+As a result, the current implementation of `HoodieCatalog` and `Spark3DefaultSource` returns a `V1Table` instead of `HoodieInternalV2Table`,
+in order to [address a performance regression](https://github.com/apache/hudi/pull/5737).
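+
+For context, implementing the V2 read path means providing the full chain of Spark 3.x `org.apache.spark.sql.connector` interfaces. A rough, non-authoritative sketch of that chain (the `Hoodie*Sketch` names are illustrative, not existing classes):
+
+```scala
+// Sketch of the DSv2 read chain: Table -> ScanBuilder -> Scan -> Batch.
+import java.util
+import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability}
+import org.apache.spark.sql.connector.read._
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+class HoodieV2TableSketch(tableSchema: StructType) extends Table with SupportsRead {
+  override def name(): String = "hudi_table"
+  override def schema(): StructType = tableSchema
+  override def capabilities(): util.Set[TableCapability] =
+    util.EnumSet.of(TableCapability.BATCH_READ)
+  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
+    new HoodieScanBuilderSketch(tableSchema)
+}
+
+// ScanBuilder is where the optimizer pushes work into the source, via mixins
+// such as SupportsPushDownFilters / SupportsPushDownRequiredColumns.
+class HoodieScanBuilderSketch(schema: StructType) extends ScanBuilder {
+  override def build(): Scan = new Scan with Batch {
+    override def readSchema(): StructType = schema
+    override def toBatch: Batch = this
+    // A real implementation would map Hudi file groups to input partitions and
+    // return a factory producing PartitionReader[InternalRow] instances.
+    override def planInputPartitions(): Array[InputPartition] = Array.empty
+    override def createReaderFactory(): PartitionReaderFactory =
+      throw new UnsupportedOperationException("sketch only")
+  }
+}
+```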
+
+There was [an attempt](https://github.com/apache/hudi/pull/6442) to implement Spark Datasource V2 read functionality as a regular task,
+but it failed due to the scope of the work required.
+Therefore, this RFC proposes to discuss the design of the Spark Datasource V2 integration in advance and to continue working on it accordingly.
+
+## Background
+
+The current state of the Spark Datasource V2 integration is presented in the diagram below:
+
+![Current integration with Spark](initial_integration_with_Spark.jpg)
+
+## Implementation
+
+### Read
+
+### Table services
+
+## Rollout/Adoption Plan
+
+## Test Plan