[WIP][HUDI-4449] Support DataSourceV2 Read for Spark3.2 #6442
Conversation
}

- override protected def test(testName: String, testTags: Tag*)(testFun: => Any /* Assertion */)(implicit pos: source.Position): Unit = {
+ override protected def test(testName: String, testTags: Tag*)(testFun: => Any /* Assertion */)(implicit pos: org.scalactic.source.Position): Unit = {
unnecessary change?
Since I have introduced the org.apache.spark.sql.hudi.source package directory for HoodieBatchScan, this is just to avoid a conflict.
| test("Test Query None Partitioned Table") { | ||
| withTempDir { tmp => | ||
| val tableName = generateTableName | ||
| spark.conf.set("hoodie.datasource.v2.read.enable", "true") |
Do we support spark.sql("set hoodie.datasource.v2.read.enable = true")?
Going to support this.
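For reference, a minimal sketch (the SparkSession setup is illustrative, not the PR's test code) of how both ways of flipping the flag land in the same session conf, assuming the datasource reads `hoodie.datasource.v2.read.enable` from there:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative setup; the real tests use the shared test session.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("v2-read-flag-sketch")
  .getOrCreate()

// Option 1: programmatic runtime config.
spark.conf.set("hoodie.datasource.v2.read.enable", "true")

// Option 2: the SQL SET command, which writes to the same runtime conf.
spark.sql("set hoodie.datasource.v2.read.enable = true")

// Either way, code that consults the session conf sees the flag.
val enabled = spark.conf.getOption("hoodie.datasource.v2.read.enable").contains("true")
println(s"v2 read enabled: $enabled")
```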
  Seq(2, "a2", 12.0, 1000)
)

assertThrows[HoodieDuplicateKeyException] {
Is this unnecessary?
}

override def pruneColumns(structType: StructType): Unit = {
  // TODO support prune columns.
Are we able to support column pruning in this PR?
I will open another PR to support this.
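For context, a rough sketch of how column pruning could later be hooked in through Spark's SupportsPushDownRequiredColumns mix-in; the class name and wiring are illustrative, not this PR's code:

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// Hypothetical builder name, shown only to illustrate the pruneColumns contract.
class PruningScanBuilderSketch(fullSchema: StructType) extends ScanBuilder
  with SupportsPushDownRequiredColumns {

  // Start from the full table schema and narrow it when Spark calls pruneColumns.
  private var requiredSchema: StructType = fullSchema

  override def pruneColumns(requiredSchema: StructType): Unit = {
    // Spark passes only the columns the query actually references.
    this.requiredSchema = requiredSchema
  }

  override def build(): Scan = {
    // The pruned schema is what the Scan reports, so readers only materialize those columns.
    val prunedSchema = requiredSchema
    new Scan {
      override def readSchema(): StructType = prunedSchema
    }
  }
}
```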
}

override def createReaderFactory(): PartitionReaderFactory = {
  hadoopConf.set(ParquetInputFormat.READ_SUPPORT_CLASS, classOf[ParquetReadSupport].getName)
No ORC support?
In my general testing, I found that upstream Spark SQL does not currently support the ORC format here; we may need another PR to support this.
    options: Map[String, String],
    @transient hadoopConf: Configuration) extends Batch with Scan {

  override def planInputPartitions(): Array[InputPartition] = {
Is the code below copied from the Spark codebase?
For the COW parquet datasource, we can reuse part of this code.
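To make that concrete, here is a very reduced sketch of what the planning step boils down to for a copy-on-write table. The HoodieFileSplit class and the split-size default are made up for illustration; the actual code reuses Spark's file-partitioning utilities instead.

```scala
import org.apache.spark.sql.connector.read.InputPartition

// One split = one (file, offset, length) range that a single reader task scans.
// This case class is hypothetical; the PR reuses Spark's FilePartition/PartitionedFile.
case class HoodieFileSplit(path: String, start: Long, length: Long) extends InputPartition

def planInputPartitionsSketch(files: Seq[(String, Long)], // (path, fileSizeInBytes)
                              maxSplitBytes: Long = 128L * 1024 * 1024): Array[InputPartition] = {
  files.flatMap { case (path, size) =>
    // Chop each base file into ranges of at most maxSplitBytes.
    (0L until size by maxSplitBytes).map { offset =>
      HoodieFileSplit(path, offset, math.min(maxSplitBytes, size - offset))
    }
  }.toArray[InputPartition]
}
```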
Is this PR aimed only at supporting Spark 3.2.0, not Spark 3.2.1/3.2.2 or 3.3.0? Or are there follow-up tasks to support other versions?
@leesf thanks for taking the time to review this. I found that there are several problems with this PR, since upstream has reverted the V2 datasource: hudi/hudi-spark-datasource/hudi-spark3.2.x/src/main/scala/org/apache/hudi/Spark3DefaultSource.scala, line 28 in 3adb571
@alexeykudinkin FYI
Force-pushed from 30a035f to 208824b
Force-pushed from 208824b to 8156bda
Force-pushed from a585bf1 to a42568c
Force-pushed from a42568c to 087d98c
 * Convert a Filter to a Catalyst Expression. If the conversion succeeds, return a non-empty
 * Option[Expression]; otherwise return None.
 */
def convertToCatalystExpression(filter: Filter, tableSchema: StructType): Option[Expression] = {
Let's make sure we're not duplicating utilities we already have (there's HoodieCatalystExpressionUtils doing exactly the same thing)
fixed.
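For readers following along, a small illustrative sketch of what such a Filter-to-Expression conversion does for a single predicate, and why the result is an Option; this is a stand-in for explanation, not HoodieCatalystExpressionUtils itself:

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo => CatalystEqualTo, Expression, Literal}
import org.apache.spark.sql.sources.{EqualTo, Filter}
import org.apache.spark.sql.types.StructType

def convertEqualToSketch(filter: Filter, tableSchema: StructType): Option[Expression] = filter match {
  case EqualTo(attribute, value) =>
    // The conversion can fail, e.g. when the referenced column is not in the schema,
    // which is why the utility returns Option[Expression] rather than Expression.
    tableSchema.fields.find(_.name == attribute).map { field =>
      CatalystEqualTo(
        AttributeReference(field.name, field.dataType, field.nullable)(),
        Literal.create(value, field.dataType))
    }
  case _ => None // other Filter shapes omitted in this sketch
}
```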
/**
 * Create Like expression.
 */
def createLike(left: Expression, right: Expression): Expression
What do we need this one for?
 * NOTE: This method is specific to Spark 3.2.0
 */
- private def createParquetFilters(args: Any*): ParquetFilters = {
+ def createParquetFilters(args: Any*): ParquetFilters = {
Why are we down-grading these?
For SparkBatch, I have reused some functions from outside the object Spark32PlusHoodieParquetFileFormat.
Sorry, I don't think I understand. Can you please elaborate? Where are these used?
override def build(): Scan = {
  val relation = new DefaultSource().createRelation(new SQLContext(spark), options)
  relation match {
    case HadoopFsRelation(location, partitionSchema, dataSchema, _, _, options) =>
The whole idea of the V2 integration is bypassing V1 concepts such as HadoopFsRelation, which unfortunately are very limiting for Hudi features.
What's our plan here?
Agree, it is still a V1 scan here.
It's really a huge amount of work to reimplement the DataSourceV2 APIs for all Hudi tables. My initial idea is to use the lower-level V2 read interfaces first, to support features like SupportsPushDownFilters and SupportsRuntimeFiltering, and I will continue to follow up on the implementation of the new HudiTableScan.
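As a rough sketch of that incremental direction (the class name and the accepted-filter logic are illustrative, not this PR's code), a ScanBuilder could expose the V2 filter-pushdown hook while the actual read path stays as it is for now:

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
import org.apache.spark.sql.sources.{Filter, IsNotNull}
import org.apache.spark.sql.types.StructType

class PushDownScanBuilderSketch(schema: StructType) extends ScanBuilder with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    // Accept a subset we know how to evaluate (here, trivially, IsNotNull) and hand the
    // rest back to Spark so it still applies them after the scan.
    val (supported, unsupported) = filters.partition(_.isInstanceOf[IsNotNull])
    pushed = supported
    unsupported
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def build(): Scan = new Scan {
    override def readSchema(): StructType = schema
  }
}
```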
Agreed, it's no small effort, and we should definitely think about how we can approach it incrementally.
But we also need to be mindful that we can't keep things in transitory state for long -- we can't have the integration spread b/w V1 and V2 and we have to commit and migrate fully either way, w/ a clear plan for the deprecation of the one.
I'd suggest for feature of this scale we should actually write a proper RFC to cover all of these concerns. What do you think?
BTW this will be greatly helped by RFC-64 we're currently working on, so we'd sync and sequence our efforts to make sure there's minimal to no duplication in what we do
// TODO: if you move this into the closure it reverts to the default values.
// If true, enable using the custom RecordReader for parquet. This only works for
// a subset of the types (no complex types).
val resultSchema: StructType = StructType(partitionSchema.fields ++ requiredSchema.fields)
Do we really need to clone the whole of Hudi's custom ParquetFileFormat impl here?
There is a minor change to how createReaderFactory is constructed; we may extend it with a suitable action to support more accurate task planning.
Sorry, I'm still not sure I understand what we're copying this for. Can you please elaborate on what exactly you're trying to modify here?
We should copy code from Spark only in exceptional circumstances when there's just no other way around it. Otherwise we should avoid doing that at all costs.
Force-pushed from 087d98c to 19b2ab9
spark.sql(s"set hoodie.datasource.v2.read.enable=true")

val query =String.format("SELECT f.id, f.price, f.ts, f.dt, f.name FROM %s f JOIN dim d ON f.name = d.name AND d.id = 1 ORDER BY id", tableName)
nit: `= String` (space after `=`)
@leesf @alexeykudinkin @XuQianJin-Stars Thanks for your time. I've been busy with other things these days, sorry for the late reply.
  new HoodieBatchScanBuilder(spark, hoodieCatalogTable, scanOptions)
}

private def buildHoodieScanConfig(caseInsensitiveStringMap: CaseInsensitiveStringMap,
We can inline this method (it's trivial)
yihua left a comment
Closing this stale PR. We're revisiting Spark Datasource V2 support in a new effort. cc @geserdugarov
Change Logs
NA
Impact
NA
Risk level (write none, low medium or high below)
medium
Contributor's checklist