[SPARK-30784] Use ORC nohive #27536
Conversation
This reverts commit 678cf5a.
Test build #118203 has finished for PR 27536 at commit
Test build #118207 has finished for PR 27536 at commit
cc @wangyum
Hi, @yhuai. Thank you for making a PR. Could you fix the UT failures?
Oh, hive-storage-api still gets pulled in. Let me check.
Test build #118265 has finished for PR 27536 at commit
Hmm. We need to keep hive-storage-api, but I will need to check why we hit the runtime exception. Somehow we used hive-storage-api's VectorizedRowBatch instead of ORC's VectorizedRowBatch for the ORC code path.
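For reference, a hedged way to see the two different VectorizedRowBatch classes side by side (the jar names below are assumptions; adjust to whatever is actually on the classpath):
# regular hive-storage-api ships the class under org.apache.hadoop.hive
$ unzip -Z1 hive-storage-api-2.6.0.jar | grep 'VectorizedRowBatch\.class'
# the nohive ORC jar ships a shaded copy under org.apache.orc.storage
$ unzip -Z1 orc-core-1.5.9-nohive.jar | grep 'VectorizedRowBatch\.class'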
This reverts commit 9e8791e.
Also, regarding the error cause: it seems org.apache.hadoop.hive.ql.io.orc.WriterImpl was Hive's ORC writer.
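As a quick sanity check (the jar name is an assumption), one can confirm that this class is shipped by hive-exec rather than by Apache ORC:
# Hive's own ORC writer lives in the hive-exec artifact
$ unzip -Z1 hive-exec-2.3.6.jar | grep 'hive/ql/io/orc/WriterImpl\.class'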
Taking https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118207/testReport/org.apache.spark.sql.hive/CompressionCodecSuite/both_table_level_and_session_level_compression_are_set/ as an example, I don't understand why the table was turned into a Hive ORC table.
Test build #118279 has finished for PR 27536 at commit
retest this please
Test build #118307 has finished for PR 27536 at commit
@dongjoon-hyun @wangyum, do you happen to know what happened with #27536 (comment)? It seems that in the hive module, we are sending a VectorizedRowBatch created by the ORC project to Hive's ORC data source instead of to the data source inside the ORC project.
Not yet, @yhuai. Let me check that tonight and over the weekend. I haven't dug into it deeply until now. I'll ping here if I find something.
Thank you, @dongjoon-hyun!
Hi @omalley. Is the nohive variant compatible with Hive 2.3? https://issues.apache.org/jira/browse/ORC-174
I personally think it is incompatible; I have tried it many times before.
@wangyum .
dongjoon-hyun left a comment:
Hi, @yhuai. I took a look at the failure and the related code in Spark/ORC/Hive. The current sql/hive module failures are mainly due to the following difference.
- Hive 1.2 ~ 2.x use the embedded ORC like org.apache.hive.orc.TypeDescription.
- Hive 2.3 uses Apache ORC like org.apache.orc.TypeDescription. (Note that the package name is different.)
$ javap -cp orc-core-1.5.9.jar org.apache.orc.TypeDescription | grep "createRowBatch(int)"
public org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch createRowBatch(int);
- The ORC nohive library has the following. (Note that the return type is different.)
$ javap -cp orc-core-1.5.9-nohive.jar org.apache.orc.TypeDescription | grep "createRowBatch(int)"
public org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch createRowBatch(int);
As we know, orc:nohive is for a no-Hive environment like sql/core; as a result, all ORC tests in sql/core always pass. However, orc:nohive does not aim to support the Hive code itself: Hive code needs the original ORC library. I believe this was the original technical difficulty that @wangyum and @gatorsmile tried to fix in #23788.
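A minimal sketch of that shading (jar name assumed): listing the nohive jar shows hive-storage-api relocated under org.apache.orc.storage, which is why its classes never collide with Hive's own.
$ unzip -Z1 orc-core-1.5.9-nohive.jar | grep '^org/apache/orc/storage' | head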
@dongjoon-hyun, thank you for looking into it. How did sql/hive work with Hive 1.2 and ORC nohive? Did sql/hive also use Hive's ORC when Hive 1.2 was used?
Yes. Right.
@dongjoon-hyun So, Hive 2.3 depends on Apache ORC instead of using the ORC embedded in Hive, which means we will need to pull in the regular ORC instead of orc-nohive. Is my understanding correct?
Yes, the current failure analysis comes down to that. For pulling in the regular ORC here, cc @wangyum and @gatorsmile for the reasoning behind the original PR.
Sorry for being late to the thread. Yes, now that Spark depends on Hive >= 2.3, we should move away from the nohive variant and share the same ORC release.
Thank you, @omalley and @dongjoon-hyun! By the way, are we concerned that the hive-common shipped with Hive 2.3.6 and the hive-storage-api 2.6.0 used by ORC 1.5.9 share duplicate classes with different versions? I am worried that we may not consistently pick up the right version due to class loading order, which can cause confusing runtime exceptions.
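A hedged sketch for enumerating that overlap (local jar paths assumed): this prints every class file name shipped by both artifacts.
$ comm -12 <(unzip -Z1 hive-common-2.3.6.jar '*.class' | sort) \
           <(unzip -Z1 hive-storage-api-2.6.0.jar '*.class' | sort)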
Closing this, as we need to use the regular ORC.
What changes were proposed in this pull request?
This PR sets ORC's classifier to nohive, which has hive-storage-api shaded.
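As a sketch of verifying the result (module chosen for illustration), Maven's dependency tree shows which ORC artifact and classifier are actually resolved:
$ ./build/mvn -pl sql/core dependency:tree -Dincludes=org.apache.orc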
Why are the changes needed?
Right now, the Hive 2.3 profile pulls in the regular ORC, which depends on hive-storage-api. However, hive-storage-api and hive-common ship some of the same class files in different versions.
For example, https://github.com/apache/hive/blob/rel/storage-release-2.6.0/storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java (pulled in by orc 1.5.8) and https://github.com/apache/hive/blob/rel/release-2.3.6/common/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java (from hive-common 2.3.6) are both on the classpath, and they are different. Having both versions on the classpath can cause unexpected behavior due to classloading order. We should keep using orc-nohive, which has hive-storage-api shaded.
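To illustrate the classpath-order hazard (jar paths assumed): javap, like the JVM, resolves the first matching class on the classpath, so swapping the jar order can surface a different version of the same class.
$ javap -cp hive-common-2.3.6.jar:hive-storage-api-2.6.0.jar org.apache.hadoop.hive.common.ValidReadTxnList
$ javap -cp hive-storage-api-2.6.0.jar:hive-common-2.3.6.jar org.apache.hadoop.hive.common.ValidReadTxnList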
Does this PR introduce any user-facing change?
How was this patch tested?