[HUDI-69] Support Spark Datasource for MOR table - RDD approach #1848
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1848 +/- ##
============================================
+ Coverage 62.32% 63.33% +1.01%
- Complexity 3508 3522 +14
============================================
Files 420 410 -10
Lines 17817 17400 -417
Branches 1754 1729 -25
============================================
- Hits 11105 11021 -84
+ Misses 5959 5630 -329
+ Partials 753 749 -4
Flags with carried forward coverage won't be shown.
PrunedFilteredScan changes the behavior of ParquetRecordReader inside ParquetFileFormat even when we are not using the vectorized reader. Still trying to figure out why... I will follow up on PrunedFilteredScan in a separate PR.
Can we file a JIRA for this? And is it possible to target this for 0.6.0 as well?
@vinothchandar @umehrot2 Ready for review. Thanks!
vinothchandar
left a comment
Made a first pass. Looks reasonable, but we need some changes before we can land this.
Thanks for the persistent efforts, @garyli1019 !
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/config/HadoopSerializableConfiguration.java
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java
Should we just make a new datasource-level config for this and translate it? Mixing raw InputFormat-level configs in here feels a bit problematic in terms of long-term maintenance.
Agree. Maybe something like SNAPSHOT_READ_STRATEGY? So we can control the logic for merge, unmerge, mergeWithBootstrap, etc.
Added MERGE_ON_READ_PAYLOAD_KEY and MERGE_ON_READ_ORDERING_KEY. Then we use the payload to do all the merging.
The payload is actually stored inside hoodie.properties. Let me take a stab at how to handle this.
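For context, a rough sketch of how the payload class could be resolved from hoodie.properties via the table config instead of being wired through a raw InputFormat config; the constructor and method names are my assumption of the Hudi APIs at the time and may differ:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hudi.common.table.HoodieTableMetaClient

// Sketch only: hoodie.properties under <basePath>/.hoodie backs the table config,
// which already records the payload class used to write the table.
def resolvePayloadClass(conf: Configuration, basePath: String): String = {
  val metaClient = new HoodieTableMetaClient(conf, basePath)
  metaClient.getTableConfig.getPayloadClass
}
```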
This seems to indicate that we will keep scanning the remaining entries in the log and hand them out if dataFileIterator runs out. We need to be careful about how it interplays with split generation; specifically, it only works if each base file has only one split.
HudiMergeOnReadFileSplit(partitionedFile, logPaths, latestCommit,
metaClient.getBasePath, maxCompactionMemoryInBytes, skipMerge)
Here partitionedFile has to be a single file and not an input split; otherwise we will face an issue where a log entry belongs to a different split and the final query will have duplicates.
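For illustration, a minimal sketch of the split structure under discussion, with one complete base file plus its log paths per split; the field names are assumptions modeled on the snippet above, not the exact class in this PR:

```scala
// Sketch: one split = one complete base file + the log files of the same file group.
// If a base file were ever broken into multiple splits, its log records could be
// merged into more than one split and the query would return duplicates.
case class MergeOnReadFileSplitSketch(
    baseFilePath: String,              // a single, complete base parquet file
    logPaths: List[String],            // log files belonging to the same file group
    latestCommit: String,
    tableBasePath: String,
    maxCompactionMemoryInBytes: Long,
    skipMerge: Boolean)
```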
Yes, this HudiMergeOnReadFileSplit is based on us packing one baseFile into one partitionedFile during buildFileIndex. I believe Spark won't split this secretly somewhere...
FileFormat could break up a file into splits I think. We can wait for @umehrot2 also to chime in here.
One partitioned file will go into one split if I understand this correctly: https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L355
It would be difficult to handle if Spark split one parquet file into two splits somewhere else.
@vinothchandar @garyli1019 Spark's FileFormat can split the parquet files up if there are multiple row groups in the file. In Spark's implementation one PartitionedFile is not a complete file; that is why you can see the start position and length in it.
But in both this implementation and in the bootstrap case we are packaging the complete file as a split, to avoid the complexity of dealing with partial splits.
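As a concrete illustration of packaging a complete file as a split, a PartitionedFile can be built to span the whole file (start = 0, length = file length). The PartitionedFile signature used here is my reading of the Spark 2.4 API and should be treated as an assumption:

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

// Sketch: wrap an entire base file into a single PartitionedFile so it is never
// broken up at row-group boundaries.
def asWholeFileSplit(status: FileStatus): PartitionedFile =
  PartitionedFile(
    partitionValues = InternalRow.empty,
    filePath = status.getPath.toString,
    start = 0,
    length = status.getLen)
```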
hudi-spark/src/main/scala/org/apache/hudi/HudiMergeOnReadRDD.scala
Is the remove needed? This map is often spillable... we should just make sure the remove does not incur additional I/O or something.
This is tricky. We need something like a skippable iterator.
- The `hudiLogRecords` are immutable, so I am making a copy of the `keySet()` to handle the remove, and I iterate through it to read back from `hudiLogRecords`.
- It's possible that the parquet file has duplicate keys and we need to merge against a single `logRecord` multiple times. This is how compaction handles duplicate keys.
Same comment here about double-checking whether this is actually OK.
This implicitly assumes OverwriteWithLatestAvroPayload? Can we just convert the parquet row to Avro as well and then perform the merge by actually calling the right API, HoodieRecordPayload#combineAndGetUpdateValue()? This is a correctness issue we need to resolve in this PR.
Ideally, adding a test case to go along with this would be good as well.
Will do.
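To make the suggestion above concrete, a hedged sketch of merging through the payload API rather than assuming overwrite-with-latest semantics; only HoodieRecordPayload#combineAndGetUpdateValue is the API referenced in the comment, and the surrounding method and conversion step are illustrative:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericRecord, IndexedRecord}
import org.apache.hudi.common.model.HoodieRecordPayload

// Sketch: convert the parquet row to Avro first (not shown), then let the payload
// decide the merge result instead of always taking the log-record value.
def mergeWithPayload(baseAvro: GenericRecord,
                     logPayload: HoodieRecordPayload[_],
                     schema: Schema): Option[IndexedRecord] = {
  // An empty result from combineAndGetUpdateValue means the record was deleted.
  val merged = logPayload.combineAndGetUpdateValue(baseAvro, schema)
  if (merged.isPresent) Some(merged.get) else None
}
```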
@umehrot2 can you also please make a quick second pass.
Force-pushed a4c7069 to 41d1d05.
garyli1019
left a comment
Addressed the comments and added support for custom payload. Added tests to demonstrate delete operation with OverwriteWithLatestAvroPayload
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/config/HadoopSerializableConfiguration.java
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
Force-pushed f38c76c to 5b051e5.
@vinothchandar @garyli1019 Sorry for the late response. Plan to review it later today.
umehrot2
left a comment
Took a look and have some high level comments. Will take another more thorough line by line pass of the code and unit tests once we have addressed the main points.
Shouldn't we call it something like MergeOnReadSnapshotRelation?
Instead of doing this, you can just do:
val tableSchema = schemaUtil.getTableAvroSchema
good to know. thanks
Instead of using the prefix Hoodie for all the newly defined classes, shouldn't we be using Hudi? Isn't that where the community is headed?
This problem is tricky. Both sides make sense to me. It seems impossible to completely switch from hoodie to hudi everywhere. Should we define a standard for the naming convention? Like class name -> hoodie, package name -> hudi. @vinothchandar
Possibly something like HoodieLogFileFormat might make sense in the future, as it would clearly extract out the log file reading logic.
We should try to support filter push-down. Again, for the more straightforward scenarios which I mentioned in my other comment, it should be possible to just pass the user filters down to the reader.
However, for the log file merging scenario we may have to take care of the following (see the sketch after this list):
- Reading the base file filtered out, say, `Row X` because of filter push-down, but `Row X` had been updated in the log file and has an entry there.
- While merging we need some way to tell that `Row X` should be filtered out from the log file as well; otherwise we may end up still returning it in the result, because based on the merging logic I see that any remaining rows in the log file which did not have a corresponding key in the base file are just appended and returned in the result.
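A small sketch of the guard described in the last bullet: whichever filters were pushed into the base-file reader would also be applied to log-only records before they are appended. The Filter-to-predicate translation is left abstract and the names are assumptions:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter

// Sketch: re-apply the pushed-down filters to records that exist only in the log,
// so a row filtered out of the base file cannot reappear via the log side.
def filterLogOnlyRecords(logOnlyRows: Iterator[Row],
                         pushedFilters: Seq[Filter],
                         evaluate: (Filter, Row) => Boolean): Iterator[Row] =
  logOnlyRows.filter(row => pushedFilters.forall(f => evaluate(f, row)))
```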
Just for my understanding, what is the use-case where we want to return un-merged rows? Do we do something similar for MOR queries through the input format where we want to return un-merged rows?
Yes, we have this option for Hive: https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeUnmergedRecordReader.java
The performance will be better without merging; we can at least avoid the type conversion Row -> Avro -> merge -> Avro -> Row.
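For illustration, the unmerged path described here amounts to chaining the two iterators with no key lookup or Avro round-trip for the base rows (a conceptual sketch, not the PR's actual iterator classes):

```scala
import org.apache.spark.sql.Row

// Sketch: "skip merge" hands out every base-file row as-is, followed by every
// log record, without pairing them by record key.
def unmergedIterator(baseRows: Iterator[Row], logRows: Iterator[Row]): Iterator[Row] =
  baseRows ++ logRows
```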
Does this need explicit casting?
I followed the HadoopRDD implementation here https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L120
@vinothchandar This seems like something we can consider using at other places in Hudi code like AvroConversionHelper to convert Avro Records to Rows, instead of maintaining the conversion code in-house.
agree here. Our AvroConversionHelper is handling Row, which is not an extension of InternalRow. If we don't need Row specifically, I think we can adapt to the Spark Internal serializer.
- Instead of looping over `baseFileIterator` and checking whether each key exists in `logRecords`, would it be more efficient to do it the other way round: loop over `logRecords` and perform the merge, then append all the remaining base file rows at the end?
- Also, is it good practice to perform the actual fetching in the `hasNext` function instead of `next`?
Actually, on further thought, the first point may not be possible, since if we iterate over log records it will be difficult to find the corresponding base parquet record using the record key (for merging).
- The log scanner scans all the log files in one batch and handles the merging internally. The output is a hashmap we can use directly. This `logRecordsIterator` just loops through the hashmap and doesn't load rows one by one like the `baseFileIterator`.
- This is a little bit tricky. If `hasNext` returns true but `next()` doesn't return a value, Spark will throw an exception. In our logic, we don't know whether `hasNext` will be true or false until we find a qualifying record to read. For example, with 100 records in the base file and 100 delete records in the log file, we will read 0 rows and `hasNext` should return false on the first call, but we have iterated through the whole base file already. There is a test case for this example (a sketch of this pattern is given below).
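A minimal sketch of the pattern being described, assuming hypothetical names: hasNext keeps scanning until it either buffers a record that survives the merge (i.e. was not deleted) or exhausts the input, and next only hands out the buffered value:

```scala
import org.apache.spark.sql.Row

// Sketch: pre-fetch in hasNext so that false is only reported once we are sure
// no more qualifying (non-deleted) records exist.
class PrefetchingMergeIterator(baseRows: Iterator[Row],
                               mergeOrSkip: Row => Option[Row]) extends Iterator[Row] {
  private var buffered: Option[Row] = None

  override def hasNext: Boolean = {
    // mergeOrSkip returns None for rows deleted by the log, Some(merged) otherwise.
    while (buffered.isEmpty && baseRows.hasNext) {
      buffered = mergeOrSkip(baseRows.next())
    }
    buffered.isDefined
  }

  override def next(): Row = {
    if (!hasNext) throw new NoSuchElementException("no more merged records")
    val result = buffered.get
    buffered = None
    result
  }
}
```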
garyli1019
left a comment
@umehrot2 Thanks for reviewing! I clarified some questions in the comments.
When I push down nothing and pass the full schema as the user-requested schema, simply changing from TableScan to PrunedFilteredScan changes the behavior of the parquet reader so it no longer reads the correct schema. I need to dig deeper here.
Let's focus on making the basic functionality work in this PR. I will figure this out in a follow-up PR.
Missing the filter for the log file is probably OK, because Spark will apply the filter again on top of the pushed-down filters. Description: https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L268
Added support for
Force-pushed d8ef9d3 to 4b05f0f.
@garyli1019: I took this patch and ran it in EMR (Spark 2.4.5-amzn-0). I got the following exceptions when loading an S3 dataset. I am using hudi-spark-bundle in the Spark session.
Setting default log level to "WARN".
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
scala> val dfh = spark.read.format("hudi").load("s3a://hudi.streaming.perf/orders_stream_hudi_mor_4//")
scala>
Have you seen this issue before?
@bvaradar Thanks for trying this out.
@bvaradar I tested on a Spark 2.4.0 CDH release with a small dataset, and found a broadcast configuration issue. Pushed a new commit with the fix. Now this works fine on my cluster. I will test a larger dataset tomorrow.
Tested on a 100GB MOR table. A few partitions have 100% duplicate upsert log files; the others have parquet files only.
@garyli1019 thanks for testing this! Will review again. Can we resolve the addressed comments in this PR, and also do a small summary of what the follow-up work here is?
@vinothchandar For the 0.6.0 release, the only item left is incremental pulling. I am currently working on it and will probably get it done by tomorrow. Maybe we can wait until tomorrow to review the whole thing.
Yes, will let the
@garyli1019 mind rebasing again please? I think you had some conflicts with #1752
Force-pushed a053967 to fc5425d.
@bvaradar @garyli1019 I laid out the data from the test output here: https://gist.github.com/vinothchandar/28cee665c36f806b562efc88821c9061 If you notice, the compaction here succeeded; I can see marker files (but why, though? Shouldn't they be deleted if the commit is successful? I guess that's a question for me). But there are no new base files. Seems like a legit issue to me; I cannot imagine how it can happen though.
rerun tests |
Force-pushed fc5425d to dcd2663.
@vinothchandar Looks like some unrelated change was added during the rebase. Maybe this is related to the issue.
Never mind, this was intended based on previous comments, to keep consistency with the others.
This was something that I should have caught in the earlier PR. I renamed it here, thinking we would be landing this quickly. Oh well, let's just get the test to pass.
@bvaradar Locally, I can see all 12 files generated as expected. I cannot imagine why this can happen in CI alone.
Force-pushed a9533a0 to f70258d.
@garyli1019 I am afraid this has something to do with the changes we made. Maybe the path filter is getting set but not being removed, or something? I can now see all 12 files present, as expected, but spark.read.parquet() picks up just 6, the latest parquet files.
Force-pushed f70258d to 52a4532.
@garyli1019 this is failing now. See my last fix. Just unsetting the path filter after resolving the relation seems to help get over the issue. So the root issue there was the path filter still kicking in for
@vinothchandar Agree. I can see the InMemoryFileIndex warning on every test following the SparkStreaming test, but I don't see it if I run a COW test alone. I will try to change the test setup to clean the Spark context properly.
- This PR implements Spark Datasource for MOR table in the RDD approach.
- Implemented SnapshotRelation.
- Implemented HudiMergeOnReadRDD.
- Implemented separate iterators to handle the merged and unmerged record readers.
- Added TestMORDataSource to verify this feature.
  options = optParams)
  .resolveRelation()

sqlContext.sparkContext.hadoopConfiguration.unset("mapreduce.input.pathFilter.class")
@bvaradar Keeping this code, actually. Given that we are now adding support for the MOR snapshot query, if we don't unset this, then doing a read-optimized query followed by a snapshot query would filter out everything except the latest base files, which will be problematic.
cc @garyli1019 does this make sense? Let me see if I can add a test case for this.
Yes, that makes sense. We unset this before the incremental query. When we only had two datasource query types, it was fine to unset it before running the incremental query, but right now we have to unset it here. Still wondering why the local build is able to pass though...
- We can now revert the change made in DefaultSource.
@garyli1019 I wrote a test for this. Seems like this is actually not a problem, so I reverted the unset for now. Please check my last commit. Now, if CI passes this time, we can land (fingers crossed).
@garyli1019 Thanks for being so patient, amazing, and persevering! This is now landed! Congrats!
I am seeing this exception with the latest Hudi; how do we get it resolved?
I see Hudi uses spark-sql_2.11-2.4.4.jar and EMR uses /usr/lib/spark/jars/spark-sql_2.11-2.4.5-amzn-0.jar
@luffyd Thanks for reporting. I created a ticket to track this: https://issues.apache.org/jira/browse/HUDI-1270
What is the purpose of the pull request
This PR implements Spark Datasource for MOR table in the RDD approach.
ParquetFileFormat approach PR: #1722
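For reference, a minimal usage sketch of the new read path, matching the snapshot read shown earlier in this thread; the table path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: reading a MOR table snapshot through the Hudi datasource.
// "/path/to/mor_table" is a placeholder base path.
val spark = SparkSession.builder().appName("hudi-mor-read").getOrCreate()
val df = spark.read.format("hudi").load("/path/to/mor_table")
df.show(10)
```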
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.