[HUDI-69] Support Spark Datasource for MOR table #1722
Conversation
@umehrot2 take a look as well?

vinothchandar left a comment:
@garyli1019 Took an initial pass.. Overall, I'd like to understand if we can wrap rather than reuse code from other places.. Want to ensure COW/Snapshot is not affected in any way.
For the merging itself, I had a slightly different approach in mind, where we wrap ParquetFileFormat and use it for reading the base parquet file using Spark's own record reader, and we simply read the logs using our code... then the merge can be implemented as another Iterator..
High level flow looks good.. and very exciting to see this PR
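A minimal sketch of that idea, assuming Spark 2.4's FileFormat API; `readLogRecords` and `MergeIterator` below are hypothetical placeholders for Hudi's log-scanning and merge code, not the final design:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

class HoodieRealtimeFileFormat extends ParquetFileFormat {
  override def buildReaderWithPartitionValues(
      sparkSession: SparkSession,
      dataSchema: StructType,
      partitionSchema: StructType,
      requiredSchema: StructType,
      filters: Seq[Filter],
      options: Map[String, String],
      hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
    // The base parquet file is read by Spark's own record reader.
    val readBaseFile = super.buildReaderWithPartitionValues(
      sparkSession, dataSchema, partitionSchema, requiredSchema, filters, options, hadoopConf)
    file => {
      // Hypothetical: scan the delta log files for this file slice with Hudi's
      // log reader; the merge is then just another Iterator over the base rows.
      val logRecords = readLogRecords(file)
      new MergeIterator(readBaseFile(file), logRecords)
    }
  }

  // Placeholder: a real implementation would use Hudi's log scanner here.
  private def readLogRecords(file: PartitionedFile): Map[String, InternalRow] = Map.empty
}

// Placeholder merge: a real implementation would resolve rows by record key.
class MergeIterator(base: Iterator[InternalRow],
                    logs: Map[String, InternalRow]) extends Iterator[InternalRow] {
  override def hasNext: Boolean = base.hasNext
  override def next(): InternalRow = base.next()
}
```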
One thing to ensure is that COW/Snapshot will use Spark's native parquet reader.. I think we push the same path filter above and reuse the Spark source here, so it should work.. But good to test once with a few queries and ensure there is no change in perf.
Yea we can do this once we have the benchmarking framework ready.
is this code re-used from some place?
Yes, this is from Spark 2.4: https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L445
This horrible erasure hack was updated in the latest master, though...
@vinothchandar Thanks for the feedback. Your approach makes sense to me. If we can do it that way, we can reduce some maintenance overhead and be more flexible for future Spark upgrades. I will try to see if I can do it that way.
(force-pushed 444c71a to d90d4f3)
Successfully got rid of those.
Codecov Report
@@ Coverage Diff @@
## master #1722 +/- ##
============================================
- Coverage 62.82% 59.79% -3.04%
- Complexity 3437 3610 +173
============================================
Files 401 439 +38
Lines 17091 19183 +2092
Branches 1698 1946 +248
============================================
+ Hits 10737 11470 +733
- Misses 5623 6924 +1301
- Partials 731 789 +58
This will likely have performance implications and be slower than Spark's listing mechanism, which uses the Spark context to parallelize listing (across the cluster) along with other optimizations. Also, we have already known HoodieROTablePathFilter to be really expensive, especially on S3, and it is still a bottleneck for read performance in read-optimized queries, where it is used with the parquet relation and applied sequentially at the driver to all the files.
It might be worth checking out my implementation as well, where I am re-using the InMemoryFileIndex implementation of Spark to perform the listing, so that all the listing optimizations of Spark are retained. Also, instead of passing this expensive filter, which gets applied sequentially to each and every file, I am using fsView.getLatestBaseFiles to get the latest view of the files.
Also, does this handle glob paths like s3://bucket/huditable/*/* ?
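A rough sketch of the listing approach described above, assuming Spark 2.4 / Hudi 0.6-era constructors (exact signatures vary by version):

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import org.apache.hudi.common.table.HoodieTableMetaClient
import org.apache.hudi.common.table.view.HoodieTableFileSystemView

def latestBaseFilePaths(spark: SparkSession,
                        globbedPaths: Seq[Path],
                        metaClient: HoodieTableMetaClient): Seq[String] = {
  // Spark parallelizes this listing across the cluster when the path count is large.
  val fileIndex = new InMemoryFileIndex(spark, globbedPaths, Map.empty, None)
  val allFiles = fileIndex.allFiles().toArray
  // Resolve the latest base file per file group from the file-system view,
  // instead of path-filtering every file sequentially at the driver.
  val fsView = new HoodieTableFileSystemView(
    metaClient,
    metaClient.getActiveTimeline.getCommitsTimeline.filterCompletedInstants,
    allFiles)
  fsView.getLatestBaseFiles.iterator().asScala.map(_.getPath).toSeq
}
```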
This is great to know! We definitely can reuse your implementation there. At this point, I don't know if there is any other way to list all the file statuses without using HoodieROTablePathFilter. RFC-15 will make this more efficient as well.
No, it doesn't support glob paths right now. I think we could put all the path handling into your HudiSparkUtils.
Does this not support predicate pushdown or column pruning? Seems not, because we would need to implement PrunedFilteredScan instead of TableScan to support these and pass them down to the parquet file format for reading.
I think it does. I checked the physical plan of a filter query and the filter was pushed down.
IIUC, TableScan can automatically push down the filter and column pruning, like what we did in the incremental relation. PrunedFilteredScan is usually used by Spark connectors for other databases to push Spark filters down into the database SQL query.
I found an example in spark-cassandra-connector:
https://github.com/garyli1019/spark-cassandra-connector/blob/master/connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraSourceRelation.scala#L367
I was wrong here. We need PrunedFilteredScan to push down the filter and projection.
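For reference, a minimal sketch of the Spark API in question: a TableScan relation returns all rows, while a PrunedFilteredScan relation receives the pruned column list and the pushable filters in buildScan:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

class ExampleRelation(val sqlContext: SQLContext, val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  // Spark hands the relation the projected columns and the filters it could
  // push down; a plain TableScan relation would get neither.
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}
```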
I am really not sure how this will play out with huge amounts of data containing thousands of files. We would be serializing and passing this huge map to each and every task, even though each Spark task is supposed to work on only one or a few file splits. We need to test how this plays out with an extremely large number of files.
I agree here. Other possible implementations could be:
- Add it to the Hadoop config, but it could cause the same issue as using the option hashmap.
- Concatenate the logPath to the parquetPath, then split the path on the executor before doing the file split. (EDIT: Doesn't work; in the FileFormat, the file is already a PartitionedFile type.)
- Search for logs based on the parquetPath on each executor. This could put pressure on the name node as well. (EDIT: This probably doesn't work either. Getting the meta client and file system view is very expensive.)
I personally prefer the second way if we can efficiently handle all the file listing on the driver side. I will see if it's possible to implement it.
We should refrain from listing on the executors. @umehrot2 do you have any suggestions here?
FWIW the amount of data would be proportional to the number of partitions globbed, which should not be as bad?
Just something to think about: Spark forms file splits by default when you use a FileFormat-type datasource. It can potentially split these parquet files (with multiple row groups) in the middle and send the pieces to tasks for reading. Does this play out well with merging splits with corresponding log files? I do think it should be okay in this case, but wanted to point it out in case you guys see any potential issues.
This is similar to the Hive file split we have. If we split one parquet file into two splits, we will do the log scanning for the same logfile twice.
@garyli1019 true.. but if the logs are small w.r.t parquet base, this might be okay actually.. For workloads with high write amplification, people tend to have much smaller file sizes anyway.. We should support the row group level parallel splitting for sure, otherwise, performance would be an issue.. btw Hive achieves this today.. We wrap each file split into a realtime file split..
Like @vinothchandar, I do agree with the high-level approach here, and thanks for putting out this PR 👍 However, I would highly recommend that both of you check out #1702, which is along similar lines and solves some of the issues I see in this PR.
Happy to have more in-depth discussion on this and help get this to completion.
Thanks for the detailed comments, @umehrot2! Will process and also look at your other PR. On some of the perf issues you pointed out: please raise 0.6.0 JIRAs with blocker priority, so we can get them fixed in the next major release.
@umehrot2 Thank you for your insightful comments! I will definitely take a closer look at your PR.
- I think HoodieRealtimeFileFormat is necessary here, because we need a certain level of customization, e.g. HoodieParquetRecordReaderIterator and HoodieRealtimeFileSplit, that is not exposed to the user. But I do agree that it will give us some overhead on maintenance. Regarding the performance side, if most of the data files are stand-alone parquet files, the performance difference wouldn't be too large.
- For HoodieRealtimeFileFormat, we can support predicate pushdown, but not the vectorized reader. (EDIT: column pruning still works when not using the vectorized reader.) Because we need to merge on rows, we need to extract values from at least certain columns to do the comparison. We need the correct schema for the log file, a way to convert those fake UnsafeRows to the HoodieRecord type to merge with the log, etc. I think this could be quite complicated but still possible to do in the future.
- Yes, I will try to find a better way to send the paths to executors without broadcasting the whole map.
A few major concerns here:
Major follow-ups after this PR:
Just finished reviewing the bootstrap PR, which blocks Udit's PR :). Will read up and review both your and Udit's PRs shortly.
vinothchandar left a comment:
We still need more work IMO.. let's keep hashing this out.. I will spend time on the Spark datasource APIs more deeply this week and can come back with a concrete plan for this as well. Meantime, something to think about.
For COW, snapshot is the default anyway.
Why would globbing like that error out?
Don’t think this change is necessary, right? RO view does map to snapshot query for cow. We may need to have two maps for cow and mor
I got confused by the naming sometimes...
So for COW table, snapshot view = read optimized view
for MOR, snapshot view and read optimized view are different things.
With bootstrap, we will have one more view.
Can we say: read optimized view -> parquet only (including bootstrap); snapshot view -> parquet (with bootstrap) merged with the log, regardless of table type?
Let's keep this mapping because we should be able to do RO view on MOR.
No.. there are no more views.. we did a renaming exercise to clear things up as "query types".. with that there should be no confusion.. our docs are consistent with this as well.. On COW there is in fact no RO view.. so this change has to be done differently, if you need it for MOR..
Sorry, my previous comments are confusing; let me rephrase.
What I'm trying to do here is to not change the query behavior. Since we didn't support snapshot queries for MOR before, the RO and snapshot query types behaved the same, regardless of COW or MOR.
If we don't change this mapping, users will see different behavior after upgrading to the next release. If they are using VIEW_TYPE_READ_OPTIMIZED_OPT_VAL (deprecated) on MOR in their code, after upgrading to the next release the code will run a snapshot query instead of an RO query. This could surprise users even though this key was deprecated.
> If they are using VIEW_TYPE_READ_OPTIMIZED_OPT_VAL (deprecated) on MOR in their code, after upgrading to the next release, the code will run a snapshot query instead of an RO query.

We have been logging a warning for some time on the use of the deprecated configs, so I think it's fair to do the right thing here moving forward and call this out in the release notes. Let me push some changes..
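A sketch of the compatibility shim being discussed, using the option keys and values named in this thread (assumed; verify against DataSourceReadOptions):

```scala
object QueryTypeCompat {
  val ViewTypeKey  = "hoodie.datasource.view.type"  // deprecated key
  val QueryTypeKey = "hoodie.datasource.query.type" // current key

  // Translate a deprecated view type into a query type, warning on use.
  def resolveQueryType(opts: Map[String, String]): String =
    opts.get(ViewTypeKey) match {
      case Some(viewType) =>
        println(s"'$ViewTypeKey' is deprecated; use '$QueryTypeKey' instead")
        // "realtime" becomes the "snapshot" query type; "read_optimized" and
        // "incremental" keep their names.
        if (viewType == "realtime") "snapshot" else viewType
      case None =>
        opts.getOrElse(QueryTypeKey, "snapshot") // snapshot is the default
    }
}
```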
Actually this is what we use for both COW and MOR.. see comment above. Would be best to keep the behavior for COW the same.
This is not as efficient as the default DataSource.apply().resolveRelation() yet, since filter pushdown and column pruning are not supported. But we will get there soon...
Nit: indentation.
This changes behavior for COW/snapshot. I feel we should have two separate methods: createCopyOnWriteRelation (no change in behavior) and createMergeOnReadRelation (changes for this PR).
This is hard without a path handler. Now we tell the user to add a glob pattern, load(basePath + "/*/*/*/*"), for COW tables, and the glob will be in the basePath. Spark's default DataSource.apply().resolveRelation() is able to handle the glob, but our custom relation is not. This is why our incremental relation requires .load(basePath) only. Udit's PR has a path handler, so we will have a unified place to handle all the paths.
I think we can change the default option to READ_OPTIMIZED, so the user side sees no impact. Currently the Spark datasource only supports snapshot on COW anyway, which is the same as READ_OPTIMIZED. We can switch back later with the path handler: https://github.com/apache/hudi/pull/1702/files#diff-683cf2c70477ed6cc0a484a2ae494999R72
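For illustration, the two load styles being contrasted, assuming a SparkSession named spark and a hypothetical bucket path:

```scala
// COW today: the user supplies a glob over the partition directories, which
// Spark's default DataSource.apply().resolveRelation() expands.
val cowDf = spark.read.format("hudi")
  .load("s3://bucket/huditable/*/*/*/*")

// MOR snapshot today: the custom relation needs the table basePath only.
val morDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load("s3://bucket/huditable")
```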
Can we avoid all this copied code from ParquetFileFormat by wrapping or overriding? This should be doable?
I am concerned about the amount of borrowed code in this file..
Also, we have re-implemented the merge by hand again in this class.. I was trying to explore if we can reuse the existing HoodieRecordReader classes by templatizing for Row instead of ArrayWritable.. That's at least the longer-term option to pursue..
this is a single parquet file?
Could be only a portion of a file. From Spark's description: "A part (i.e. 'block') of a single file that should be read, along with partition column values that need to be prepended to each row."
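For reference, Spark 2.4's definition of the split being quoted (abridged):

```scala
import org.apache.spark.sql.catalyst.InternalRow

// start/length delimit a byte range, so one parquet file with several row
// groups can surface as multiple PartitionedFiles.
case class PartitionedFile(
    partitionValues: InternalRow, // partition values prepended to each row
    filePath: String,
    start: Long,
    length: Long,
    locations: Array[String] = Array.empty)
```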
garyli1019 left a comment:
Thanks for reviewing. I will address the comments soon.
This is like the incremental relation. We need the base path to get the Hudi table view. We need a smarter path handler to handle the glob path, and Udit's PR has this.
This goes back to the MapReduce v1 vs v2 API. Hive uses v1 and Spark uses v2. Spark also wraps the v2 reader (from parquet) into its own record reader. I couldn't find a way to make Hive and Spark share one record reader. The vectorized reader would be another story as well.
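For reference, stub implementations of the two incompatible Hadoop RecordReader APIs (shapes only; the stubs do nothing):

```scala
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, TaskAttemptContext}

// v1 (org.apache.hadoop.mapred): Hive's path. The caller owns the key/value holders.
class V1StubReader extends org.apache.hadoop.mapred.RecordReader[NullWritable, Text] {
  def next(key: NullWritable, value: Text): Boolean = false
  def createKey(): NullWritable = NullWritable.get
  def createValue(): Text = new Text()
  def getPos: Long = 0L
  def getProgress: Float = 0f
  def close(): Unit = ()
}

// v2 (org.apache.hadoop.mapreduce): Spark's parquet path. The reader owns the cursor.
class V2StubReader extends org.apache.hadoop.mapreduce.RecordReader[NullWritable, Text] {
  def initialize(split: InputSplit, context: TaskAttemptContext): Unit = ()
  def nextKeyValue(): Boolean = false
  def getCurrentKey: NullWritable = NullWritable.get
  def getCurrentValue: Text = new Text()
  def getProgress: Float = 0f
  def close(): Unit = ()
}
```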
(force-pushed 610e727 to c80e26f)
@vinothchandar Thanks for reviewing! I created tickets for the follow-up work. All the file listing and globbing can be improved after @umehrot2's PR is merged.
(force-pushed a5ee2fb to d2daeb2)
```scala
// TODO: Smarter path handling
val metaClient = try {
  val conf = sqlContext.sparkContext.hadoopConfiguration
  Option(new HoodieTableMetaClient(conf, path.get, true))
```
Wouldn't this be problematic if path is a glob and not the actual basePath of the table? A COW/Snapshot query can e.g. do this, and I think we should handle the same for MOR as well.
At this point we have:
- RO and snapshot queries for COW: support glob and basePath
- Snapshot for MOR: only supports basePath
- Incremental: only supports basePath

What I am trying to do here is:
- If the path contains a glob, fall back to RO. This is the current behavior. Creating the metaClient will throw an Exception, but that is handled below.
- If the path is the basePath, we create the metaClient. For a COW table, go to the RO relation; for MOR, go to the snapshot relation.
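A minimal sketch of that fallback, reusing the meta-client constructor from the quoted snippet; isMorBasePath is a hypothetical helper name:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hudi.common.model.HoodieTableType
import org.apache.hudi.common.table.HoodieTableMetaClient
import org.apache.hudi.exception.HoodieException

def isMorBasePath(conf: Configuration, path: String): Boolean = {
  val metaClient =
    try Option(new HoodieTableMetaClient(conf, path, true))
    catch {
      // A glob (not the table basePath) fails meta-client creation; treat it
      // as the existing read-optimized behavior instead of erroring out.
      case _: HoodieException => None
    }
  metaClient.exists(_.getTableType == HoodieTableType.MERGE_ON_READ)
}
// Callers: if isMorBasePath(...), build the snapshot relation; otherwise keep
// the current read-optimized (COW) relation.
```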
> Snapshot for MOR: only supports basePath

Let me think about this more. We need to support some form of globbing for the MOR snapshot query.
Udit's PR has this path handling. Should we merge part of his PR first? https://github.com/apache/hudi/pull/1702/files#diff-9a21766ebf794414f94b302bcb968f41R31
With this, we can let the user .load(basePath) or .load(basePath + "/*/*") for COW, MOR, and incremental.
```scala
  val conf = sqlContext.sparkContext.hadoopConfiguration
  Option(new HoodieTableMetaClient(conf, path.get, true))
} catch {
  case e: HoodieException => Option.empty
```
Can we just error out there?
I used this as a flag that the path is not basePath. This is a temporary solution to not change the query behavior.
This will be handled better with: https://github.com/apache/hudi/pull/1702/files#diff-9a21766ebf794414f94b302bcb968f41R31
```scala
override def shortName(): String = "hudi"
```

```scala
private def getReadOptimizedView(sqlContext: SQLContext,
```
we can rename to something like getFilteredBaseFileRelation(). Again, don't want to bring back view nomenclature into the code.
sure, will do
```scala
 * This class is an extension of ParquetFileFormat from Spark SQL.
 * The file split, record reader, record reader iterator are customized to read Hudi MOR table.
 */
class HoodieParquetRealtimeFileFormat extends ParquetFileFormat {
```
Still mulling if this file can be simpler.. working on it..
Heads up: if we have to re-use code, we need to attribute it in NOTICE/LICENSE as well.
If we use the FileFormat approach, we probably can't avoid copying some Spark code. For Datasource V2, we would need to copy even more code, since Spark 3 uses case classes for all the FileFormats.
I will try to use Udit's RDD approach: https://github.com/apache/hudi/pull/1702/files#diff-809772c649e85ffb321055d9871e37e0R39
I think that's doable. With that approach we can get rid of this, but we need to try it after his PR is merged. We will need to reuse a lot of code from that PR.
@garyli1019 it should be possible to call super.buildReader..(), get the Iterator[InternalRow] for the parquet base file alone, and then use the class above to merge, right?
It would avoid a ton of code in this file..
I think this could come with a schema issue. super.buildReader..() could use a vectorized reader that only reads a few columns, and right now I didn't implement merging Avro with InternalRow across different schemas. Right now we can only merge row by row with the same schema.
We will get rid of this FileFormat approach. The RDD approach is cleaner: https://github.com/apache/hudi/pull/1702/files#diff-809772c649e85ffb321055d9871e37e0R75
```scala
 * Custom payload is not supported yet. This combining logic is matching with [OverwriteWithLatestAvroPayload]
 * @param rowReader ParquetRecordReader
 */
class HoodieMergedParquetRowIterator(private[this] var rowReader: ParquetRecordReader[UnsafeRow],
```
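A toy sketch of the combining semantics named in that doc comment, simplified to string payloads (the real iterator works on UnsafeRow and Avro records): the latest log record per key overwrites the base row, matching OverwriteWithLatestAvroPayload, and log-only records are emitted once the base file is exhausted.

```scala
class MergedIteratorSketch(
    baseRows: Iterator[(String, String)], // (recordKey, payload) from the parquet base file
    logRecords: Map[String, String]       // latest payload per key scanned from the logs
) extends Iterator[(String, String)] {
  private val seenKeys = scala.collection.mutable.HashSet[String]()
  // Base pass: a log record for the same key overwrites the base-file row.
  private val merged = baseRows.map { case (key, payload) =>
    seenKeys += key
    key -> logRecords.getOrElse(key, payload)
  }
  // Log-only records (inserts). Lazy, so it is only built after `merged` is
  // exhausted and `seenKeys` is complete.
  private lazy val logOnly =
    logRecords.iterator.filterNot { case (key, _) => seenKeys(key) }

  override def hasNext: Boolean = merged.hasNext || logOnly.hasNext
  override def next(): (String, String) =
    if (merged.hasNext) merged.next() else logOnly.next()
}
```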
@garyli1019 @vinothchandar: Yes, I am planning to address the bootstrap PR comments and also give review comments for @umehrot2's changes by this weekend. @umehrot2: I know this is not ideal, and apologies for not being able to give a review quickly.
@garyli1019 actually @umehrot2's approach there is wrapping the FileFormat as opposed to extending it, which is great as well.. If you can wait, we can land the bootstrap changes and rework this on top. But I feel what you will be doing here is a very different merge than @umehrot2's PR, and the code to wrap ParquetFileFormat is not that large.. So you can also just proceed in parallel? Do you just want the two util classes landed, SparkUtils/TablePathUtils? Your call, lmk what you think.
@vinothchandar I agree we should use @umehrot2's RDD approach.
Yes, I will change this PR in parallel.
Yes, @umehrot2 do you mind making a separate PR for these two classes?
@vinothchandar @garyli1019 thank you guys for reviewing the PR. Will look at the comments and start addressing. As for the

@garyli1019 should we close this in favor of #1848?
What is the purpose of the pull request
This PR implements the Spark Datasource for the MOR table.
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.