[HUDI-1040] Update apis for spark3 compatibility #1760
Conversation
how expensive would this be in practice? any thoughts?
wondering if we can use SparkContext.version or something to figure it out upfront..
I think checking the spark version would be a good idea to prevent the failed call every time; let me look into it. I don't believe the reflection performance is significant as a whole, especially if we can figure out the spark version upfront. We do something similar here:
9f12843
I was concerned about the exception thrown each time. In the case you mentioned, it's obtaining the description once; this would be in the fast path. Anyways, let's hope the version thing is feasible.
I think org.apache.spark.SPARK_VERSION could help us out here. I've implemented it in this commit: sbernauer@a4f1866. But I noticed that it had a huge performance impact (I assume due to the reflection).
I would suggest changing the method to something like this:
import org.apache.spark.SPARK_VERSION

private def deserializeRow(encoder: ExpressionEncoder[Row], internalRow: InternalRow): Row = {
  // TODO remove reflection if Spark 2.x support is dropped
  if (SPARK_VERSION.startsWith("2.")) {
    val spark2method = encoder.getClass.getMethod("fromRow", classOf[InternalRow])
    spark2method.invoke(encoder, internalRow).asInstanceOf[Row]
  } else {
    val deserializer = encoder.getClass.getMethod("createDeserializer").invoke(encoder)
    val aboveSpark2method = deserializer.getClass.getMethod("apply", classOf[InternalRow])
    aboveSpark2method.invoke(deserializer, internalRow).asInstanceOf[Row]
  }
}
Thanks for the suggestion! I will try making the same change as the commit you linked, but I am going to see if I can test the performance impact first.
If you have any more details on how it affected performance, please let me know.
The commit with all the patches (including better reflection) is here: sbernauer@0dd7172
@nsivabalan are you able to take a swing at this for 0.6.0? this would be good to have
@sbernauer's check is better, I think.
On the reflection itself, @bschell would it help if we cached the method objects instead of recreating them every time? Have you attempted anything like that?
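(For illustration only: a minimal sketch of what caching the reflected Method objects could look like. The object name is made up, and it assumes the fromRow / createDeserializer methods discussed above; resolving each Method once per JVM in lazy vals avoids the per-row getMethod lookup, though the Spark 3 path still pays one reflective apply per row.)

import java.lang.reflect.Method
import org.apache.spark.SPARK_VERSION
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Hypothetical helper: resolve the reflected methods once and reuse them.
object CachedRowDeserializer {
  private lazy val isSpark2: Boolean = SPARK_VERSION.startsWith("2.")

  // Spark 2.x only: ExpressionEncoder.fromRow(InternalRow). Never touched on Spark 3.
  private lazy val fromRowMethod: Method =
    classOf[ExpressionEncoder[Row]].getMethod("fromRow", classOf[InternalRow])

  // Spark 3.x only: ExpressionEncoder.createDeserializer().
  private lazy val createDeserializerMethod: Method =
    classOf[ExpressionEncoder[Row]].getMethod("createDeserializer")

  def deserializeRow(encoder: ExpressionEncoder[Row], internalRow: InternalRow): Row = {
    if (isSpark2) {
      fromRowMethod.invoke(encoder, internalRow).asInstanceOf[Row]
    } else {
      // The deserializer (an InternalRow => Row) could also be cached per encoder if needed.
      val deserializer = createDeserializerMethod.invoke(encoder)
      deserializer.getClass.getMethod("apply", classOf[InternalRow])
        .invoke(deserializer, internalRow).asInstanceOf[Row]
    }
  }
}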
@bschell is this tested and ready to go? Would like to get it into the RC if possible.
@vinothchandar While this works, the reflection does hurt performance as this is a frequently used path. I was looking into any better options to work around the performance hit.
@bschell @vinothchandar: I gave it a shot on this. I don't have permission to push to your branch to update this PR.
@nsivabalan Looks like a recent commit made this class private: apache/spark@ce2cdc3. However, I think we can work around this by porting the code for those methods over to Hudi (it's not too long).
@bschell: cool, I will let you drive the patch then. I gave it a shot hoping we could get it into 0.6.0, but we couldn't make it.
@bschell this is a heavily requested feature. Are you still working on this? :)
vinothchandar left a comment
@bschell would something like the below help, where we just fetch the methods upfront? (I think we need to lazily initialize the methods on the executors instead of serializing them from the driver for this to actually work.)
def createRdd(df: DataFrame, avroSchema: Schema, structName: String, recordNamespace: String)
  : RDD[GenericRecord] = {
  // Use the Avro schema to derive the StructType which has the correct nullability information
  val dataType = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
  val encoder = RowEncoder.apply(dataType).resolveAndBind()
  val spark2method = if (SPARK_VERSION.startsWith("2.")) {
    encoder.getClass.getMethods.filter(method => method.getName.equals("fromRow")).last
  } else {
    null
  }
  val spark3method = encoder.getClass.getMethods.filter(method => method.getName.equals("createDeserializer")).last
  val deserializer = spark3method.invoke(encoder)
  val applyMethod = deserializer.getClass.getMethods.filter(method => method.getName.equals("apply")).last
  df.queryExecution
    .toRdd.map[Row](internalRow => {
      if (spark2method != null) {
        spark2method.invoke(encoder, internalRow).asInstanceOf[Row]
      } else {
        applyMethod.invoke(deserializer, internalRow).asInstanceOf[Row]
      }
    }).mapPartitions { records =>
      if (records.isEmpty) Iterator.empty
      else {
        val convertor = AvroConversionHelper.createConverterToAvro(dataType, structName, recordNamespace)
        records.map { x => convertor(x).asInstanceOf[GenericRecord] }
      }
    }
}
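(A rough sketch of the "lazily initialize instead of serializing from the driver" idea mentioned above; the class name is made up and this is not the code that was merged. java.lang.reflect.Method is not serializable, so rather than resolving the methods on the driver and capturing them in the map closure, the lookups sit behind @transient lazy vals in a small serializable holder, and each executor resolves them once on first use. It assumes the ExpressionEncoder itself can be captured in the closure.)

import org.apache.spark.SPARK_VERSION
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Hypothetical holder: @transient lazy vals are not shipped from the driver, so the
// reflective lookups happen lazily on whichever JVM first calls apply().
class LazyRowDeserializer(encoder: ExpressionEncoder[Row]) extends Serializable {

  @transient private lazy val spark2FromRow =
    if (SPARK_VERSION.startsWith("2.")) {
      Some(encoder.getClass.getMethod("fromRow", classOf[InternalRow]))
    } else {
      None
    }

  @transient private lazy val spark3Deserializer =
    encoder.getClass.getMethod("createDeserializer").invoke(encoder)

  @transient private lazy val spark3Apply =
    spark3Deserializer.getClass.getMethod("apply", classOf[InternalRow])

  def apply(internalRow: InternalRow): Row = spark2FromRow match {
    case Some(fromRow) => fromRow.invoke(encoder, internalRow).asInstanceOf[Row]
    case None => spark3Apply.invoke(spark3Deserializer, internalRow).asInstanceOf[Row]
  }
}

The map in the snippet above would then become something like df.queryExecution.toRdd.map(rowDeserializer(_)), with val rowDeserializer = new LazyRowDeserializer(encoder) created on the driver.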
May not work, actually; it gives the following results. Let's abandon the reflection approach altogether: either we munge the APIs somehow or just make a separate module for spark3.
@bschell here's a path forward.
@vinothchandar Thank you for the suggestions! I agree with this approach and will try to update this PR to match.
Add the new separate spark3 module to handle the reflection required to support both spark2 & spark3 together.
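(A rough sketch of how such a module split could look; the trait, object, and class names here are hypothetical and not taken from this PR. The idea is a version-agnostic interface in a shared module, one implementation per Spark version compiled against that Spark, and a single Class.forName at startup instead of per-row reflection.)

import org.apache.spark.SPARK_VERSION
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Shared module: the interface both the spark2 and spark3 builds implement.
trait SparkRowDeserializer extends Serializable {
  def deserializeRow(encoder: ExpressionEncoder[Row], internalRow: InternalRow): Row
}

object SparkRowDeserializer {
  // The implementation classes would live in version-specific modules, e.g. a
  // Spark2RowDeserializer calling encoder.fromRow(row) directly and a Spark3RowDeserializer
  // calling encoder.createDeserializer()(row), so neither needs reflection internally.
  lazy val instance: SparkRowDeserializer = {
    val className =
      if (SPARK_VERSION.startsWith("2.")) "org.apache.hudi.Spark2RowDeserializer"
      else "org.apache.hudi.Spark3RowDeserializer"
    Class.forName(className).getDeclaredConstructor().newInstance()
      .asInstanceOf[SparkRowDeserializer]
  }
}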
Seems like travis doesn't have the hudi_spark_2.12 bundle available, only the 2.11. Any suggestions?
@bschell : For running integration tests with hudi packages built with scala 2.12, we just need to change scripts/run_travis_tests.sh. The docker container should automatically load those jars for running integration tests.
index 63fb959c..b77b4f64 100755
--- a/scripts/run_travis_tests.sh
+++ b/scripts/run_travis_tests.sh
@@ -35,7 +35,7 @@ elif [ "$mode" = "integration" ]; then
 export SPARK_HOME=$PWD/spark-${sparkVersion}-bin-hadoop${hadoopVersion}
 mkdir /tmp/spark-events/
 echo "Running Integration Tests"
- mvn verify -Pintegration-tests -B
+ mvn verify -Pintegration-tests -Dscala-2.12 -B
 else
 echo "Unknown mode $mode"
 exit 1
@bschell: Did you try the above change I mentioned? Let me know.
Looks like only integ tests are failing, but I'm not sure of the error. Will take a closer look.
In our testing we've found some larger-scale issues with this approach and conflicts with the hudi-0.6.0 refactors, mainly with the spark datasource API interface changes; we need to reevaluate this.
@bschell if you could expand on them, we can hash out a solution.
@vinothchandar We are compiling the issues right now. I will update here once we are finished later today/tomorrow.
@vinothchandar Has Hudi been tested for compatibility with spark v3.0.0? I did a mvn build, but pyspark throws a dependency error: org.apache.hudi#hudi-spark-bundle_2.12;0.6.1: not found
@vinothchandar @bschell appreciate any guidance regarding this error. I am using COW hudi options on Spark 3.0 with df.write.format("hudi"). On writing the df as hudi parquet, it throws the below error:
to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, 6049cd42243a, executor driver): java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps; Driver stacktrace:
Got hive class error.
@bschell @vinothchandar to make clear, just wondering what is the exact goal for this PR? Do we want to make Hudi support both compile & run with spark 3, or do we want to make Hudi compile with spark 2 and then run with spark 3? Ideally we should make Hudi both compile and run with Spark 3, but the current code change cannot compile with spark 3. Run returns
this was the intention. But as @bschell pointed out, some classes have changed and we need to make parts of
I suspect that this is because Spark 3.0.0 uses Hive 2.3.7 but Spark 2.x uses Hive 1.2.1.spark2, and this causes some API conflict. Looks like this signature: If we compile with Spark 2 and then run with Spark 3, we might run into this kind of issue. But I didn't see any Hudi class in the error stack; can you provide more information about this error?
Is this a permanent change, or do we just try to run the test here?
@bschell @vinothchandar @bvaradar I ran into a scala class not found error when running docker integ testing for Hudi: I suspect that although Hudi is able to pick the
Found another runtime error when updating a MOR table: This is because the signature of
@bschell or others: I was just trying this patch out locally. My local compilation is failing w/ unapproved licenses. Do you also hit the same, or am I missing something? I thought the patch in itself should compile successfully. And it's strange that I get unapproved licenses for iml files, which I usually don't in my repo branches.
My bad, didn't notice the other patch. I will check out the other one.
The work for spark 3 has been moved over to #2208. Hence closing this.
Modifies use of spark apis for compatibility with both spark2 and spark3
What is the purpose of the pull request
Updates spark apis and allows compatibility with both spark2 and spark3
Verify this pull request
Existing tests verify the functionality for spark2.
Manual tests for spark3 have been performed on EMR clusters.
You can verify by building Hudi against spark v3.0.0 and confirming the build succeeds.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.