[SPARK-30784] Use ORC nohive #27536
Conversation
This reverts commit 678cf5a.
Test build #118203 has finished for PR 27536 at commit
Test build #118207 has finished for PR 27536 at commit
cc @wangyum
Hi, @yhuai. Thank you for making a PR. Could you fix the UT failures?
Oh, hive-storage-api still gets pulled in. Let me check.
Test build #118265 has finished for PR 27536 at commit
Hmm. We need to keep hive-storage-api, but I will need to check why we hit the runtime exception. Somehow we used hive-storage-api's VectorizedRowBatch instead of ORC's VectorizedRowBatch for the ORC code path.
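For reference, a hedged way to see the two different VectorizedRowBatch classes side by side (the jar names below are assumptions; adjust to whatever is actually on the classpath):
# regular hive-storage-api ships the class under org.apache.hadoop.hive
$ unzip -Z1 hive-storage-api-2.6.0.jar | grep 'VectorizedRowBatch\.class'
# the nohive ORC jar ships a shaded copy under org.apache.orc.storage
$ unzip -Z1 orc-core-1.5.9-nohive.jar | grep 'VectorizedRowBatch\.class'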
This reverts commit 9e8791e.
Also, regarding the error cause: it seems org.apache.hadoop.hive.ql.io.orc.WriterImpl was Hive's ORC writer.
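As a quick sanity check (the jar name is an assumption), one can confirm that this class is shipped by hive-exec rather than by Apache ORC:
# Hive's own ORC writer lives in the hive-exec artifact
$ unzip -Z1 hive-exec-2.3.6.jar | grep 'hive/ql/io/orc/WriterImpl\.class'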
Taking https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118207/testReport/org.apache.spark.sql.hive/CompressionCodecSuite/both_table_level_and_session_level_compression_are_set/ as an example, I don't understand why the table was turned into a Hive ORC table.
Test build #118279 has finished for PR 27536 at commit
retest this please
Test build #118307 has finished for PR 27536 at commit
@dongjoon-hyun @wangyum, do you happen to know what happened with #27536 (comment)? It seems that in the hive module, we are sending a VectorizedRowBatch created by the ORC project to Hive's ORC data source instead of to the data source inside the ORC project.
Not yet, @yhuai. Let me check that tonight and over the weekend. I haven't dug into it deeply until now. I'll ping here if I find something.
Thank you, @dongjoon-hyun!
Hi @omalley. Is the nohive variant compatible with Hive 2.3? https://issues.apache.org/jira/browse/ORC-174
I personally think it is incompatible; I have tried it many times before.
@wangyum .
dongjoon-hyun left a comment:
Hi, @yhuai. I took a look at the failure and the related code in Spark/ORC/Hive. The current sql/hive module failures are mainly due to the following difference.
- Hive 1.2 ~ 2.x use the embedded ORC like org.apache.hive.orc.TypeDescription.
- Hive 2.3 uses Apache ORC like org.apache.orc.TypeDescription. (Note that the package name is different.)
$ javap -cp orc-core-1.5.9.jar org.apache.orc.TypeDescription | grep "createRowBatch(int)"
public org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch createRowBatch(int);
- The ORC nohive library has the following. (Note that the return type is different.)
$ javap -cp orc-core-1.5.9-nohive.jar org.apache.orc.TypeDescription | grep "createRowBatch(int)"
public org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch createRowBatch(int);
As we know, orc:nohive is for a no-Hive environment like sql/core; as a result, all ORC tests in sql/core always pass. However, orc:nohive does not aim to support the Hive code itself: Hive code needs the original ORC library. I believe this was the original technical difficulty that @wangyum and @gatorsmile tried to fix in #23788.
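A minimal sketch of that shading (jar name assumed): listing the nohive jar shows hive-storage-api relocated under org.apache.orc.storage, which is why its classes never collide with Hive's own.
$ unzip -Z1 orc-core-1.5.9-nohive.jar | grep '^org/apache/orc/storage' | head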
@dongjoon-hyun, thank you for looking into it. How did sql/hive work with Hive 1.2 and ORC nohive? Did sql/hive also use Hive's ORC when Hive 1.2 was used?
Yes. Right.
@dongjoon-hyun So, Hive 2.3 depends on Apache ORC instead of using the ORC embedded in Hive, which means we will need to pull in the regular ORC instead of orc-nohive. Is my understanding correct?
Yes, the current failure analysis comes down to that. For pulling in the regular ORC here, cc @wangyum and @gatorsmile for the reasoning behind the original PR.
Sorry for being late to the thread. Yes, now that Spark depends on Hive >= 2.3, we should move away from the nohive variant and share the same ORC release.
Thank you, @omalley and @dongjoon-hyun! By the way, are we concerned that the hive-common shipped with Hive 2.3.6 and the hive-storage-api 2.6.0 used by ORC 1.5.9 share duplicate classes with different versions? I am worried that we may not consistently pick up the right version due to class loading order, which can cause confusing runtime exceptions.
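A hedged sketch for enumerating that overlap (local jar paths assumed): this prints every class file name shipped by both artifacts.
$ comm -12 <(unzip -Z1 hive-common-2.3.6.jar '*.class' | sort) \
           <(unzip -Z1 hive-storage-api-2.6.0.jar '*.class' | sort)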
Closing this, as we need to use the regular ORC.
What changes were proposed in this pull request?
This PR sets ORC's classifier to nohive, which has hive-storage-api shaded.
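As a sketch of verifying the result (module chosen for illustration), Maven's dependency tree shows which ORC artifact and classifier are actually resolved:
$ ./build/mvn -pl sql/core dependency:tree -Dincludes=org.apache.orc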
Why are the changes needed?
Right now, the Hive 2.3 profile pulls in the regular ORC, which depends on hive-storage-api. However, hive-storage-api and hive-common ship some of the same class files in different versions.
For example, https://github.com/apache/hive/blob/rel/storage-release-2.6.0/storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java (pulled in by orc 1.5.8) and https://github.com/apache/hive/blob/rel/release-2.3.6/common/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java (from hive-common 2.3.6) are both on the classpath, and they are different. Having both versions on the classpath can cause unexpected behavior due to classloading order. We should keep using orc-nohive, which has hive-storage-api shaded.
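To illustrate the classpath-order hazard (jar paths assumed): javap, like the JVM, resolves the first matching class on the classpath, so swapping the jar order can surface a different version of the same class.
$ javap -cp hive-common-2.3.6.jar:hive-storage-api-2.6.0.jar org.apache.hadoop.hive.common.ValidReadTxnList
$ javap -cp hive-storage-api-2.6.0.jar:hive-common-2.3.6.jar org.apache.hadoop.hive.common.ValidReadTxnList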
Does this PR introduce any user-facing change?
How was this patch tested?