Conversation

@garyli1019
Member

What is the purpose of the pull request

This PR implements Spark Datasource for MOR table in the RDD approach.

ParquetFileFormat approach PR: #1722

Brief change log

  • Implemented SnapshotRelation
  • Implemented HudiMergeOnReadRDD
  • Implemented separate iterators to handle the merged and unmerged record readers.

Verify this pull request

This change added tests and can be verified as follows:

  • Added TestMORDataSource to verify this feature.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-commenter

Codecov Report

Merging #1848 into master will increase coverage by 1.01%.
The diff coverage is 77.77%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1848      +/-   ##
============================================
+ Coverage     62.32%   63.33%   +1.01%     
- Complexity     3508     3522      +14     
============================================
  Files           420      410      -10     
  Lines         17817    17400     -417     
  Branches       1754     1729      -25     
============================================
- Hits          11105    11021      -84     
+ Misses         5959     5630     -329     
+ Partials        753      749       -4     
| Flag | Coverage Δ | Complexity Δ |
|------|------------|--------------|
| #hudicli | ? | ? |
| #hudiclient | 79.25% <ø> (+0.02%) ⬆️ | 1257.00 <ø> (-1.00) |
| #hudicommon | 54.78% <ø> (ø) | 1509.00 <ø> (ø) |
| #hudihadoopmr | 38.72% <0.00%> (?) | 163.00 <0.00> (?) |
| #hudihivesync | 72.25% <ø> (ø) | 121.00 <ø> (ø) |
| #hudispark | 53.51% <85.36%> (+4.27%) | 146.00 <22.00> (+24.00) |
| #huditimelineservice | 63.47% <ø> (ø) | 47.00 <ø> (ø) |
| #hudiutilities | 74.64% <ø> (-0.07%) | 279.00 <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ | Complexity Δ |
|----------------|------------|--------------|
| ...hadoop/config/HadoopSerializableConfiguration.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |
| ...i/hadoop/utils/HoodieRealtimeInputFormatUtils.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |
| ...ain/scala/org/apache/hudi/HudiMergeOnReadRDD.scala | 77.77% <77.77%> (ø) | 11.00 <11.00> (?) |
| .../main/scala/org/apache/hudi/SnapshotRelation.scala | 91.30% <91.30%> (ø) | 10.00 <10.00> (?) |
| ...main/scala/org/apache/hudi/DataSourceOptions.scala | 93.54% <100.00%> (ø) | 0.00 <0.00> (ø) |
| ...src/main/scala/org/apache/hudi/DefaultSource.scala | 78.26% <100.00%> (+7.67%) | 10.00 <1.00> (+3.00) |
| ...in/scala/org/apache/hudi/IncrementalRelation.scala | 72.41% <100.00%> (ø) | 13.00 <0.00> (-1.00) |
| ...he/hudi/table/action/commit/UpsertPartitioner.java | 95.71% <0.00%> (-1.41%) | 33.00% <0.00%> (-1.00%) |
| ...g/apache/hudi/utilities/sources/AvroDFSSource.java | 0.00% <0.00%> (ø) | 0.00% <0.00%> (ø%) |
| ...che/hudi/utilities/sources/HiveIncrPullSource.java | 0.00% <0.00%> (ø) | 0.00% <0.00%> (ø%) |
| ... and 71 more | | |

@garyli1019 garyli1019 changed the title WIP[HUDI-69] Support Spark Datasource for MOR table - RDD approach [HUDI-69] Support Spark Datasource for MOR table - RDD approach Jul 21, 2020
Member Author

PrunedFilteredScan will change the behavior of ParquetRecordReader inside ParquetFileFormat even when we are not using the vectorized reader. Still trying to figure out why... I will follow up on PrunedFilteredScan in a separate PR.
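For context, a minimal sketch (illustrative names only, not this PR's actual code) of the relation shape once we switch to PrunedFilteredScan; Spark then hands the relation the pruned columns and candidate push-down filters, which is where the Parquet reader setup starts to differ from the plain TableScan path:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Hypothetical skeleton only; SnapshotRelation in this PR stays TableScan-based
// and the PrunedFilteredScan variant is deferred to a follow-up PR.
class MorSnapshotRelationSketch(val sqlContext: SQLContext, tableSchema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = tableSchema

  // Spark passes the pruned columns and push-down filters here; the required
  // schema handed to the underlying Parquet reader is derived from them.
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // build a HoodieMergeOnReadRDD configured for `requiredColumns` / `filters`
    ???
  }
}
```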

Member

Can we file a JIRA for this? And is it possible to target this for 0.6.0 as well?


@garyli1019
Member Author

@vinothchandar @umehrot2 Ready for review. Thanks!

Member

@vinothchandar vinothchandar left a comment

Made a first pass. Looks reasonable, but we need some changes before we can land this.

Thanks for the persistent efforts, @garyli1019 !

Member

Should we just make a new datasource-level config for this and translate it? Mixing raw InputFormat-level configs here feels a bit problematic in terms of long-term maintenance.

Member Author

Agree. Maybe something like SNAPSHOT_READ_STRATEGY? Then we can control the logic for merge, unmerge, mergeWithBootstrap, etc.

Member Author

Added MERGE_ON_READ_PAYLOAD_KEY and MERGE_ON_READ_ORDERING_KEY. Then we use the payload to do all the merging.
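As an illustration only, assuming a spark-shell session with spark and basePath in scope, a usage sketch for such read options. The constant names are the ones proposed in this thread; the string keys below are placeholders, not the keys that actually shipped:

```scala
// Hypothetical keys for the sketch; the real option names live in DataSourceOptions.scala.
val MERGE_ON_READ_PAYLOAD_KEY  = "hoodie.datasource.read.payload.class"   // placeholder
val MERGE_ON_READ_ORDERING_KEY = "hoodie.datasource.read.ordering.field"  // placeholder

val df = spark.read
  .format("hudi")
  .option(MERGE_ON_READ_PAYLOAD_KEY, "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload")
  .option(MERGE_ON_READ_ORDERING_KEY, "ts")
  .load(basePath)
```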

Member

The payload class is actually stored inside hoodie.properties. Let me take a stab at how to handle this.

Member

This seems to indicate that we will keep scanning the remaining entries in the log and hand them out if dataFileIterator runs out. We need to be careful about how this interplays with split generation; specifically, it only works if each base file has only one split.

HudiMergeOnReadFileSplit(partitionedFile, logPaths, latestCommit,
         metaClient.getBasePath, maxCompactionMemoryInBytes, skipMerge)

Here partitionedFile has to be a single file and not an input split; otherwise we will face an issue where a log entry belongs to a different split and the ultimate query will have duplicates.
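A sketch of the one-base-file-per-PartitionedFile packing being discussed (not the PR's exact buildFileIndex code): the file is covered from offset 0 to its full length, so its log records cannot end up associated with another split.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

// Sketch: each Hudi base file becomes exactly one PartitionedFile covering the
// whole file, which is what keeps the merge with its log files duplicate-free.
def toSinglePartitionedFile(path: String, fileLen: Long): PartitionedFile =
  PartitionedFile(InternalRow.empty, path, start = 0L, length = fileLen)
```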

Member Author

Yes, this HudiMergeOnReadFileSplit is based on packing one baseFile into one partitionedFile during buildFileIndex. I believe Spark won't split this secretly somewhere...

Member

FileFormat could break up a file into splits I think. We can wait for @umehrot2 also to chime in here.

Member Author

One partitioned file will go into one split if I understand this correctly: https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L355
That would be difficult to handle if Spark splits one parquet file into two splits somewhere else.

Contributor

@vinothchandar @garyli1019 Spark's FileFormat can split the parquet files up if there are multiple row groups in the file. In Spark's implementation one PartitionedFile is not necessarily a complete file; that is why you can see the start position and length in it.

But in both this implementation and the bootstrap case we are packaging the complete file as a split, to avoid the complexity of dealing with partial splits.

Member

Is the remove needed? This map is often spillable... we should just make sure the remove does not incur additional I/O or something.

Member Author

@garyli1019 garyli1019 Jul 21, 2020

This is tricky. We need something like a skippableIterator (see the sketch below).

  • The hudiLogRecords map is immutable, so I am making a copy of the keySet() to handle the removes and iterating through it to read back from hudiLogRecords.
  • It's possible that the parquet has duplicate keys and we need to merge against a single logRecord multiple times. This matches how compaction handles duplicate keys.
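A minimal sketch of that bookkeeping, with illustrative names and types (not the PR's code): look keys up in the immutable log-record map as often as needed, but track the keys still to be emitted in a separate mutable set.

```scala
import scala.collection.mutable

// logRecords: record key -> merged log record from the log scanner; treated as immutable.
def sketchMerge(logRecords: Map[String, AnyRef], baseFileKeys: Iterator[String]): Iterator[String] = {
  // Copy of the key set so we can "remove" keys without touching logRecords itself.
  val keysToEmit = mutable.LinkedHashSet(logRecords.keys.toSeq: _*)

  baseFileKeys.foreach { recordKey =>
    // The base file may contain duplicate keys, so lookups against logRecords can
    // happen several times per key; removal from keysToEmit is harmless if repeated.
    if (logRecords.contains(recordKey)) keysToEmit -= recordKey
  }

  // Whatever is left are log-only records that still need to be handed out.
  keysToEmit.iterator
}
```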

Member

Same comment here about double-checking whether this is actually OK.

Member

This implicitly assumes OverwriteWithLatestAvroPayload? Can we just convert the parquet row to Avro as well and then perform the merge by actually calling the right API, HoodieRecordPayload#combineAndGetUpdateValue()? This is a correctness issue we need to resolve in this PR.
Ideally, adding a test case to go along with this would be good.
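A hedged sketch of the merge being asked for here, assuming the parquet row has already been converted to Avro by some helper (that conversion step and the names below are hypothetical); the payload call itself is the HoodieRecordPayload#combineAndGetUpdateValue API mentioned above:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.IndexedRecord
import org.apache.hudi.common.model.HoodieRecordPayload

// Sketch: let the configured payload decide the merged value instead of
// hard-coding overwrite-with-latest semantics.
def mergeWithPayload(baseAvroRecord: IndexedRecord,
                     logPayload: HoodieRecordPayload[_],
                     avroSchema: Schema): Option[IndexedRecord] = {
  val merged = logPayload.combineAndGetUpdateValue(baseAvroRecord, avroSchema) // Hudi Option
  if (merged.isPresent) Some(merged.get) else None                             // None => record deleted
}
```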

Member Author

Will do.

@vinothchandar
Member

@umehrot2 can you also please make a quick second pass.

@garyli1019 garyli1019 force-pushed the HUDI-69-RDD branch 2 times, most recently from a4c7069 to 41d1d05 Compare July 22, 2020 06:02
Member Author

@garyli1019 garyli1019 left a comment

Addressed the comments and added support for custom payloads. Added tests to demonstrate the delete operation with OverwriteWithLatestAvroPayload.

@garyli1019 garyli1019 force-pushed the HUDI-69-RDD branch 2 times, most recently from f38c76c to 5b051e5 Compare July 22, 2020 17:43
@umehrot2
Contributor

@umehrot2 can you also please make a quick second pass.

@vinothchandar @garyli1019 Sorry for the late response. Plan to review it later today.

Contributor

@umehrot2 umehrot2 left a comment

Took a look and have some high-level comments. Will take another, more thorough line-by-line pass of the code and unit tests once we have addressed the main points.

Contributor

Shouldn't we call it something like MergeOnReadSnapshotRelation?

Contributor

Instead of doing this, you can just do:

val tableSchema = schemaUtil.getTableAvroSchema

Member Author

good to know. thanks

Contributor

Instead of using the prefix Hoodie for all the newly defined classes, shouldn't we be using Hudi? Isn't that where the community is heading?

Member Author

This problem is tricky. Both sides make sense to me. It seems impossible to completely switch from hoodie to hudi everywhere. Should we define a standard for the naming convention? Like class name -> hoodie, package name -> hudi @vinothchandar

Contributor

Possibly something like HoodieLogFileFormat might make sense to do in future, as it will clearly extract out the Log files reading logic.

Contributor

@umehrot2 umehrot2 Jul 23, 2020

We should try to support filter push-down. Again, for the more straightforward scenarios which I mentioned in my other comment, it should be possible to just pass the user filters down to the reader.

However, for the log-file merging scenario we may have to take care of the following (see the sketch after this list):

  • Reading the base file filtered out, say, Row X because of filter push-down.
  • Row X had been updated in the log file and has an entry there.
  • While merging we need some way to tell that Row X should be filtered out from the log file as well; otherwise we may still return it in the result, because in the merging logic any remaining log-file rows that had no corresponding key in the base file are simply appended to the result.
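A minimal sketch of such a re-check on log-only records (illustrative; only a couple of filter types are handled, and Spark re-evaluates the full predicate above the scan anyway):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{EqualTo, Filter, IsNotNull}

// Sketch: rows that come purely from the log side (no matching base-file key)
// get the pushed-down filters re-applied, so a row pruned out of the base file
// cannot reappear through the log merge. Unknown filters are kept conservatively.
def keepLogOnlyRow(row: Row, pushedFilters: Seq[Filter]): Boolean =
  pushedFilters.forall {
    case EqualTo(attr, value) => row.getAs[Any](attr) == value
    case IsNotNull(attr)      => row.getAs[Any](attr) != null
    case _                    => true
  }
```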

Comment on lines 73 to 75
Contributor

Just for my understanding, what is the use-case where we want to return un-merged rows? Do we do something similar for MOR queries through the input format, where we want to return un-merged rows?

Member Author

Yes, we have this option for Hive: https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeUnmergedRecordReader.java
The performance will be better without merging; we can at least avoid the type conversion Row -> Avro -> merge -> Avro -> Row.

Comment on lines 60 to 62
Contributor

Does this need explicit casting?

Contributor

@vinothchandar This seems like something we can consider using at other places in Hudi code like AvroConversionHelper to convert Avro Records to Rows, instead of maintaining the conversion code in-house.

Member Author

Agree here. Our AvroConversionHelper is handling Row, which is not an extension of InternalRow. If we don't need Row specifically, I think we can adapt to the Spark internal serializer.

Contributor

  • Instead of looping over baseFileIterator and checking whether each key exists in logRecords, would it be more efficient to do it the other way round: loop over logRecords and perform the merge, and in the end append all the remaining base-file rows?

  • Also, is it good practice to perform the actual fetching in the hasNext function instead of next?

Contributor

Actually, on further thought, the first point may not be possible, since if we iterate over log records it will be difficult to find the corresponding base parquet record using the record key (for merging).

Member Author

  • The log scanner scans all the log files in one batch and handles the merging internally. The output is a hashmap we can use directly. This logRecordsIterator just loops through the hashmap and doesn't load rows one by one like the baseFileIterator.

  • This is a little bit tricky. If hasNext returns true but next() doesn't return a value, Spark will throw an exception. In our logic, we don't know whether hasNext will be true or false until we find a qualified record to read (see the sketch below). For example, with 100 records in the base file and 100 delete records in the log file, we will read 0 rows and hasNext should return false on the first call, but we have already iterated through the whole base file by then. There is a test case for this example.
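A minimal sketch of that hasNext/next contract, with illustrative names (not the PR's actual iterator): hasNext searches for the next surviving record and caches it, so next() can always return immediately.

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Sketch: hasNext advances through the base-file iterator until it finds a row
// that survives the merge (e.g. not deleted by a log record) and caches it, so
// Spark's hasNext/next contract holds even when every base row is deleted.
// The real iterator also hands out the remaining log-only records afterwards.
abstract class MergedRowIterator(baseFileIterator: Iterator[InternalRow]) extends Iterator[InternalRow] {
  private var nextRow: InternalRow = _
  private var nextReady = false

  protected def mergeWithLog(row: InternalRow): Option[InternalRow] // None => row was deleted

  override def hasNext: Boolean = {
    while (!nextReady && baseFileIterator.hasNext) {
      mergeWithLog(baseFileIterator.next()) match {
        case Some(merged) => nextRow = merged; nextReady = true
        case None         => // deleted record, keep scanning
      }
    }
    nextReady
  }

  override def next(): InternalRow = {
    if (!hasNext) throw new NoSuchElementException("next on empty iterator")
    nextReady = false
    nextRow
  }
}
```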

Member Author

@garyli1019 garyli1019 left a comment

@umehrot2 Thanks for reviewing! I clarified some questions in the comments.


Member Author

When I push down nothing and pass the full schema as the user-requested schema, simply changing from TableScan to PrunedFilteredScan changed the behavior of the parquet reader so that it no longer read the correct schema. I need to dig deeper here.
Let's focus on making the basic functionality work in this PR. I will figure this out in a follow-up PR.

Member Author

Missing the filter for the log file is probably OK because Spark will apply the filter again on top of the pushed-down filters. Description: https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L268

@garyli1019
Member Author

Added support for PrunedFilteredScan. Please review this PR again. Thank you!

@garyli1019 garyli1019 force-pushed the HUDI-69-RDD branch 2 times, most recently from d8ef9d3 to 4b05f0f Compare July 27, 2020 20:39
@vinothchandar vinothchandar self-assigned this Jul 29, 2020
@bvaradar
Contributor

@garyli1019 : I took this patch and ran it in EMR (Spark-2.4.5-amzn-0). I got the following exceptions when loading S3 dataset.

I am using hudi-spark-bundle in the spark session.

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/07/29 06:57:50 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://ip-172-31-33-232.us-east-2.compute.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1595775804042_9837).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5-amzn-0
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val dfh = spark.read.format("hudi").load("s3a://hudi.streaming.perf/orders_stream_hudi_mor_4//")
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
at org.apache.hudi.MergeOnReadSnapshotRelation$$anonfun$4.apply(MergeOnReadSnapshotRelation.scala:144)
at org.apache.hudi.MergeOnReadSnapshotRelation$$anonfun$4.apply(MergeOnReadSnapshotRelation.scala:141)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:141)
at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:75)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:70)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
... 49 elided

scala>

Have you seen this issue before ?

@garyli1019
Member Author

@bvaradar Thanks for trying this out. java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile looks strange. I will try it out on my production today to see if I can reproduce.

@garyli1019
Member Author

@bvaradar I tested on the Spark 2.4.0 CDH release with a small dataset and found a broadcast configuration issue. Pushed a new commit with the fix. Now this works fine on my cluster. I will test a larger dataset tomorrow.
I couldn't reproduce java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile. Is this class somehow missing in the AWS release? Are you able to import it from the spark-shell?

@garyli1019
Member Author

Tested on a 100GB MOR table. A few partitions have a log file of 100% duplicate upserts; the others have parquet files only.
For parquet-only partitions, the SNAPSHOT query is as efficient as the READ_OPTIMIZED query. A file split with log files is expensive, but that is expected.
For one 50MB parquet file, the log file was ~1GB. Each file split was loaded as one task.
Count performance for 50MB parquet + 1GB log:

  • merge: 40s
  • unmerge: 40s

Show performance (because data source V1 doesn't support limit(), it will just scan the whole file):

  • without column pruning: df_mor.show(10) took 40s
  • with column pruning: df_mor.select("_hoodie_commit_time").show(10) took 27s
@vinothchandar @umehrot2 @bvaradar
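For reference, a rough sketch of the kind of queries behind these numbers, assuming a spark-shell with spark and basePath in scope; the query-type option key and values are assumptions here and may differ by release:

```scala
// Snapshot query (merges base parquet files with their log files on read).
val df_mor = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")   // assumed key/value names
  .load(basePath)

df_mor.count()                                   // full scan + merge
df_mor.show(10)                                  // no limit push-down in DataSource V1
df_mor.select("_hoodie_commit_time").show(10)    // column pruning reduces scan cost
```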

@vinothchandar
Member

@garyli1019 thanks for testing this! Will review again.

Can we resolve the addressed comments in this PR, and also do a small summary of what the follow-up work here is?
Custom payload support? (Sorry if I am re-asking this, just switching back to this again.)

@garyli1019
Member Author

small summary on what the follow-up work here is?

@vinothchandar For 0.6.0 release, the only one left is incremental pulling. I am currently working on it and will probably get it done by tomorrow. Maybe we can wait until tomorrow to review the whole thing.
The vectorized reader and pruning are supported in the current version.

custom payload support?

Yes, the HoodieMergedLogRecordScanner will handle the payload loading, and we only need to specify merge or skip-merge when running the query. Included a unit test for deletes with OverwriteWithLatestAvroPayload to verify the custom payload mechanism.

@vinothchandar
Member

@garyli1019 mind rebasing again please? I think you had some conflicts with #1752

@vinothchandar
Member

@bvaradar @garyli1019 I laid out the data from the test output here https://gist.github.com/vinothchandar/28cee665c36f806b562efc88821c9061

if you notice, the compaction here succeeded

 LocatedFileStatus{path=file:/tmp/junit4111271187227693299/dataset/.hoodie/20200805162410.commit; isDirectory=false; length=7224; replication=1; blocksize=33554432; modification_time=1596644652000; access_time=0; owner=travis; group=travis; permission=rw-r--r--; isSymlink=false}, 

I can see marker files (but why, though? Shouldn't they be deleted if the commit is successful? I guess that's a question for me), but no new base files.

LocatedFileStatus{path=file:/tmp/junit4111271187227693299/dataset/.hoodie/.temp/20200805162410/datestr=2020%252F04%252F03/4024aaba-eb81-4c3d-8571-993637dbc16d_5-794-5016_20200805162410.parquet.marker.MERGE; isDirectory=false; length=0; replication=1; blocksize=33554432; modification_time=1596644651000; access_time=0; owner=travis; group=travis; permission=rw-r--r--; isSymlink=false}, 
   LocatedFileStatus{path=file:/tmp/junit4111271187227693299/dataset/.hoodie/.temp/20200805162410/datestr=2020%252F04%252F03/afdb1be6-9dd3-4606-9441-e1e4eb771998_0-794-5011_20200805162410.parquet.marker.MERGE; isDirectory=false; length=0; replication=1; blocksize=33554432; modification_time=1596644651000; access_time=0; owner=travis; group=travis; permission=rw-r--r--; isSymlink=false}, 
   LocatedFileStatus{path=file:/tmp/junit4111271187227693299/dataset/.hoodie/.temp/20200805162410/datestr=2020%252F04%252F01/01fb19fe-d0cc-43d1-806d-b5e48bb806b3_3-794-5014_20200805162410.parquet.marker.MERGE; isDirectory=false; length=0; replication=1; blocksize=33554432; modification_time=1596644651000; access_time=0; owner=travis; group=travis; permission=rw-r--r--; isSymlink=false}, 
   LocatedFileStatus{path=file:/tmp/junit4111271187227693299/dataset/.hoodie/.temp/20200805162410/datestr=2020%252F04%252F01/28ce1ffe-d00a-45fc-88f1-ec6705544c08_1-794-5012_20200805162410.parquet.marker.MERGE; isDirectory=false; length=0; replication=1; blocksize=33554432; modification_time=1596644651000; access_time=0; owner=travis; group=travis; permission=rw-r--r--; isSymlink=false}, 
   LocatedFileStatus{path=file:/tmp/junit4111271187227693299/dataset/.hoodie/.temp/20200805162410/datestr=2020%252F04%252F02/765cbc30-12b7-4e81-afd3-c0a0f8d7a41d_2-794-5013_20200805162410.parquet.marker.MERGE; isDirectory=false; length=0; replication=1; blocksize=33554432; modification_time=1596644651000; access_time=0; owner=travis; group=travis; permission=rw-r--r--; isSymlink=false}, 
   LocatedFileStatus{path=file:/tmp/junit4111271187227693299/dataset/.hoodie/.temp/20200805162410/datestr=2020%252F04%252F02/b93eb05a-36af-40a5-86f4-d7c3159991f4_4-794-5015_20200805162410.parquet.marker.MERGE; isDirectory=false; length=0; replication=1; blocksize=33554432; modification_time=1596644651000; access_time=0; owner=travis; group=travis; permission=rw-r--r--; isSymlink=false}, 

Seems like a legit issue to me. Cannot imagine how it can happen, though.

@vinothchandar
Member

rerun tests

Member Author

@vinothchandar Looks like some unrelated change was added during the rebase. Maybe this is related to the issue.

Member Author

Never mind, this was intended based on previous comments, to keep consistency with the others.

Member

This was something that I should have caught in the earlier PR. I renamed it here, thinking we would be landing this quickly. Oh well, let's just get the test to pass.

@vinothchandar
Member

@bvaradar locally, I can see all 12 files generated as expected. I cannot imagine why this can happen in CI alone.

@vinothchandar vinothchandar force-pushed the HUDI-69-RDD branch 3 times, most recently from a9533a0 to f70258d Compare August 6, 2020 07:48
@vinothchandar
Member

@garyli1019 I am afraid this has something to do with the changes we made for InMemoryFileIndex or something else in this PR.

>>>> TestBootstrap : 
	 files:[file:/tmp/junit3878890598586882351/data/datestr=2020%252F04%252F03/part-00000-5b566025-a845-494b-830e-e203c2ab142f.c000.snappy.parquet, file:/tmp/junit3878890598586882351/data/datestr=2020%252F04%252F03/part-00001-5b566025-a845-494b-830e-e203c2ab142f.c000.snappy.parquet, file:/tmp/junit3878890598586882351/data/datestr=2020%252F04%252F01/part-00000-5b566025-a845-494b-830e-e203c2ab142f.c000.snappy.parquet, file:/tmp/junit3878890598586882351/data/datestr=2020%252F04%252F01/part-00001-5b566025-a845-494b-830e-e203c2ab142f.c000.snappy.parquet, file:/tmp/junit3878890598586882351/data/datestr=2020%252F04%252F02/part-00000-5b566025-a845-494b-830e-e203c2ab142f.c000.snappy.parquet, file:/tmp/junit3878890598586882351/data/datestr=2020%252F04%252F02/part-00001-5b566025-a845-494b-830e-e203c2ab142f.c000.snappy.parquet]
	 numVersions:2
	 numFiles:6
	 bootstrapBasePath:file:/tmp/junit3878890598586882351/data/_SUCCESS
		file:/tmp/junit3878890598586882351/data/datestr=2020%252F04%252F03/part-00000-5b566025-a845-494b-830e-e203c2ab142f.c000.snappy.parquet
		file:/tmp/junit3878890598586882351/data/datestr=2020%252F04%252F03/part-00001-5b566025-a845-494b-830e-e203c2ab142f.c000.snappy.parquet
		file:/tmp/junit3878890598586882351/data/datestr=2020%252F04%252F01/part-00000-5b566025-a845-494b-830e-e203c2ab142f.c000.snappy.parquet
		file:/tmp/junit3878890598586882351/data/datestr=2020%252F04%252F01/part-00001-5b566025-a845-494b-830e-e203c2ab142f.c000.snappy.parquet
		file:/tmp/junit3878890598586882351/data/datestr=2020%252F04%252F02/part-00000-5b566025-a845-494b-830e-e203c2ab142f.c000.snappy.parquet
		file:/tmp/junit3878890598586882351/data/datestr=2020%252F04%252F02/part-00001-5b566025-a845-494b-830e-e203c2ab142f.c000.snappy.parquet
	 basePath:file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F03/.44f3a4f0-ec41-4a43-889e-b133bccaaf40_00000000000001.log.1_0-937-6957
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F03/830616aa-e406-4155-aa63-c9c80d15212d_0-895-6505_00000000000001.parquet
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F03/.hoodie_partition_metadata
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F03/44f3a4f0-ec41-4a43-889e-b133bccaaf40_0-943-6973_20200806081528.parquet
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F03/44f3a4f0-ec41-4a43-889e-b133bccaaf40_0-895-6505_00000000000001.parquet
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F03/.830616aa-e406-4155-aa63-c9c80d15212d_00000000000001.log.1_1-937-6958
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F03/830616aa-e406-4155-aa63-c9c80d15212d_5-943-6978_20200806081528.parquet
		file:/tmp/junit3878890598586882351/dataset/.hoodie/00000000000001.deltacommit.requested
		file:/tmp/junit3878890598586882351/dataset/.hoodie/20200806081528.compaction.inflight
		file:/tmp/junit3878890598586882351/dataset/.hoodie/20200806081528.compaction.requested
		file:/tmp/junit3878890598586882351/dataset/.hoodie/20200806081524.deltacommit.inflight
		file:/tmp/junit3878890598586882351/dataset/.hoodie/00000000000001.deltacommit.inflight
		file:/tmp/junit3878890598586882351/dataset/.hoodie/20200806081524.deltacommit
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.aux/20200806081528.compaction.requested
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.aux/.bootstrap/.fileids/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile
		file:/tmp/junit3878890598586882351/dataset/.hoodie/20200806081517.restore.inflight
		file:/tmp/junit3878890598586882351/dataset/.hoodie/20200806081524.deltacommit.requested
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.temp/20200806081528/datestr=2020%252F04%252F03/830616aa-e406-4155-aa63-c9c80d15212d_5-943-6978_20200806081528.parquet.marker.MERGE
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.temp/20200806081528/datestr=2020%252F04%252F03/44f3a4f0-ec41-4a43-889e-b133bccaaf40_0-943-6973_20200806081528.parquet.marker.MERGE
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.temp/20200806081528/datestr=2020%252F04%252F01/cc465d37-07a6-419b-86a0-7756d1dd8ca4_3-943-6976_20200806081528.parquet.marker.MERGE
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.temp/20200806081528/datestr=2020%252F04%252F01/562f9e3c-29a4-4f3b-9405-0796eaea0717_1-943-6974_20200806081528.parquet.marker.MERGE
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.temp/20200806081528/datestr=2020%252F04%252F02/6b045efb-f008-44e9-9e44-25f7ed7c7e36_4-943-6977_20200806081528.parquet.marker.MERGE
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.temp/20200806081528/datestr=2020%252F04%252F02/ae7740f7-b986-48a5-9d3c-2a784f0ee262_2-943-6975_20200806081528.parquet.marker.MERGE
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.temp/00000000000001/datestr=2020%252F04%252F03/830616aa-e406-4155-aa63-c9c80d15212d_0-895-6505_00000000000001.parquet.marker.CREATE
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.temp/00000000000001/datestr=2020%252F04%252F03/44f3a4f0-ec41-4a43-889e-b133bccaaf40_0-895-6505_00000000000001.parquet.marker.CREATE
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.temp/00000000000001/datestr=2020%252F04%252F01/cc465d37-07a6-419b-86a0-7756d1dd8ca4_1-895-6506_00000000000001.parquet.marker.CREATE
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.temp/00000000000001/datestr=2020%252F04%252F01/562f9e3c-29a4-4f3b-9405-0796eaea0717_1-895-6506_00000000000001.parquet.marker.CREATE
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.temp/00000000000001/datestr=2020%252F04%252F02/ae7740f7-b986-48a5-9d3c-2a784f0ee262_2-895-6507_00000000000001.parquet.marker.CREATE
		file:/tmp/junit3878890598586882351/dataset/.hoodie/.temp/00000000000001/datestr=2020%252F04%252F02/6b045efb-f008-44e9-9e44-25f7ed7c7e36_2-895-6507_00000000000001.parquet.marker.CREATE
		file:/tmp/junit3878890598586882351/dataset/.hoodie/hoodie.properties
		file:/tmp/junit3878890598586882351/dataset/.hoodie/20200806081528.commit
		file:/tmp/junit3878890598586882351/dataset/.hoodie/00000000000001.deltacommit
		file:/tmp/junit3878890598586882351/dataset/.hoodie/20200806081517.restore
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F01/cc465d37-07a6-419b-86a0-7756d1dd8ca4_1-895-6506_00000000000001.parquet
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F01/.562f9e3c-29a4-4f3b-9405-0796eaea0717_00000000000001.log.1_4-937-6961
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F01/562f9e3c-29a4-4f3b-9405-0796eaea0717_1-895-6506_00000000000001.parquet
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F01/562f9e3c-29a4-4f3b-9405-0796eaea0717_1-943-6974_20200806081528.parquet
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F01/.hoodie_partition_metadata
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F01/.cc465d37-07a6-419b-86a0-7756d1dd8ca4_00000000000001.log.1_5-937-6962
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F01/cc465d37-07a6-419b-86a0-7756d1dd8ca4_3-943-6976_20200806081528.parquet
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F02/ae7740f7-b986-48a5-9d3c-2a784f0ee262_2-895-6507_00000000000001.parquet
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F02/.6b045efb-f008-44e9-9e44-25f7ed7c7e36_00000000000001.log.1_3-937-6960
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F02/.ae7740f7-b986-48a5-9d3c-2a784f0ee262_00000000000001.log.1_2-937-6959
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F02/.hoodie_partition_metadata
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F02/6b045efb-f008-44e9-9e44-25f7ed7c7e36_2-895-6507_00000000000001.parquet
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F02/6b045efb-f008-44e9-9e44-25f7ed7c7e36_4-943-6977_20200806081528.parquet
		file:/tmp/junit3878890598586882351/dataset/datestr=2020%252F04%252F02/ae7740f7-b986-48a5-9d3c-2a784f0ee262_2-943-6975_20200806081528.parquet

Maybe the path filter is getting set but not being removed, or something? I can now see all 12 files present, as expected, but spark.read.parquet() picks up just 6, the latest parquet files:

+----------------------------------------------------------------------+
|_hoodie_file_name                                                     |
+----------------------------------------------------------------------+
|44f3a4f0-ec41-4a43-889e-b133bccaaf40_0-943-6973_20200806081528.parquet|
|ae7740f7-b986-48a5-9d3c-2a784f0ee262_2-943-6975_20200806081528.parquet|
|6b045efb-f008-44e9-9e44-25f7ed7c7e36_4-943-6977_20200806081528.parquet|
|cc465d37-07a6-419b-86a0-7756d1dd8ca4_3-943-6976_20200806081528.parquet|
|830616aa-e406-4155-aa63-c9c80d15212d_5-943-6978_20200806081528.parquet|
|562f9e3c-29a4-4f3b-9405-0796eaea0717_1-943-6974_20200806081528.parquet|
+----------------------------------------------------------------------+

@vinothchandar
Member

[ERROR] Tests run: 4, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 66.614 s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource
[ERROR] org.apache.hudi.functional.TestCOWDataSource.testStructuredStreaming  Time elapsed: 25.766 s  <<< ERROR!
java.util.concurrent.ExecutionException: Boxed Error
Caused by: org.opentest4j.AssertionFailedError: expected: <2> but was: <3>

@garyli1019 this is failing now. See my last fix; just unsetting the path filter after resolving the relation seems to help get over the issue. So the root issue there was the path filter still kicking in for spark.read.format('parquet') (which is kind of expected, even).

@garyli1019
Member Author

[quotes the TestCOWDataSource.testStructuredStreaming failure and the path-filter fix note above]

@vinothchandar agree. I can see the InMemoryFileIndex warning on every test following the SparkStreaming test, but I don't see it if I run a COW test alone. I will try to change the test setup to clean up the Spark context properly.

options = optParams)
.resolveRelation()

sqlContext.sparkContext.hadoopConfiguration.unset("mapreduce.input.pathFilter.class")
Member

@bvaradar keeping this code, actually. Given that we are now adding support for the MOR snapshot query, if we don't unset this, then a RO query followed by a snapshot query would filter out everything except the latest base files, which will be problematic.
cc @garyli1019 does this make sense? Let me see if I can add a test case for this.

Member Author

Yes, makes sense. We unset this before the incremental query. When we only had two datasource query types it was fine to unset it before running the incremental query, but now we have to unset it here. Still wondering why the local build is able to pass, though...

 - We can now revert the change made in DefaultSource
@vinothchandar
Member

@garyli1019 I wrote a test for this. Seems like this is actually not a problem, so I reverted the unset for now. Please check my last commit.

Now, if CI passes this time we can land (fingers crossed).

@vinothchandar vinothchandar merged commit 4f74a84 into apache:master Aug 7, 2020
@vinothchandar
Member

@garyli1019 Thanks for being so patient, amazing, and persevering! This is now landed! Congrats!

@luffyd

luffyd commented Sep 4, 2020

[quotes @bvaradar's earlier report above: the EMR (Spark-2.4.5-amzn-0) session and the PartitionedFile NoSuchMethodError stack trace when loading the S3 dataset]

I am seeing this exception with the latest Hudi. How do we get it resolved?

@luffyd

luffyd commented Sep 4, 2020

[quotes the same report and question above]

I see Hudi uses spark-sql_2.11-2.4.4.jar while EMR uses /usr/lib/spark/jars/spark-sql_2.11-2.4.5-amzn-0.jar.
Not sure if that makes a difference for the PartitionedFile class.

@garyli1019
Member Author

@luffyd Thanks for reporting. I created a ticket to track this: https://issues.apache.org/jira/browse/HUDI-1270
