
Conversation

@umehrot2

What changes were proposed in this pull request?

Reading a Hive ORC table that contains char/varchar columns fails in Spark SQL. The cause is that Spark SQL internally replaces char/varchar columns with the String data type, so when reading a table created in Hive with varchar/char columns it ends up using the wrong reader and throws a ClassCastException.

This patch lets Spark SQL interpret varchar/char columns correctly and store them as varchar/char types instead of internally converting them to string columns.
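For context on why collapsing these types to a plain string loses information, the following is a minimal, hypothetical sketch (plain Python, not Spark or Hive code) of the differing CHAR/VARCHAR value semantics that the reader has to honor: CHAR(n) values are space-padded to the declared length, while VARCHAR(n) values are bounded by it.

```python
# Hypothetical illustration only: CHAR(n) vs VARCHAR(n) value semantics.
# A schema that maps both to a bare "string" type drops the length and
# padding information needed to read the column correctly.

def hive_char_value(s: str, length: int) -> str:
    """CHAR(n): values are right-padded with spaces to the declared length."""
    return s.ljust(length)

def hive_varchar_value(s: str, length: int) -> str:
    """VARCHAR(n): values are limited to at most the declared length."""
    return s[:length]

assert hive_char_value("A", 10) == "A" + " " * 9      # padded to 10 chars
assert hive_varchar_value("abc1", 10) == "abc1"        # stored as-is
```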

How was this patch tested?

- Added unit tests
- Manually tested on an AWS EMR cluster

Step 1:
Created a table using Hive (with varchar/char columns) and inserted some data:

CREATE EXTERNAL TABLE IF NOT EXISTS hive_orc_test (
a VARCHAR(10),
b CHAR(10),
c BIGINT)
STORED AS ORC
LOCATION 's3://xxxx';

INSERT INTO TABLE hive_orc_test VALUES ('abc', 'A', 101), ('abc1', 'B', 102), ('abc3', 'C', 103);

Step 2:
Created an external table in Spark SQL over the same source location, and ran a select query on it.

CREATE EXTERNAL TABLE IF NOT EXISTS spark_orc_test (
a VARCHAR(10),
b CHAR(10),
c BIGINT)
STORED AS ORC
LOCATION 's3://xxxx';

SELECT * FROM spark_orc_test;

Result:
17/02/24 23:22:57 INFO DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 2.673360 s
abc A 101
abc1 B 102
abc3 C 103
Time taken: 4.327 seconds, Fetched 3 row(s)
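As an additional sanity check (not part of the original test run; an illustrative sketch), the point of the patch can be verified by describing the table, which should now report the declared varchar/char types rather than plain string:

```sql
-- Hypothetical verification step: with this fix, Spark SQL should report
-- columns a and b as varchar(10) and char(10) instead of string.
DESCRIBE spark_orc_test;
```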

@umehrot2 umehrot2 changed the title Fix reading of HIVE ORC table with varchar/char columns in Spark SQL should not fail [SPARK-20515][SQL] Fix reading of HIVE ORC table with varchar/char columns in Spark SQL should not fail Apr 27, 2017
@AmplabJenkins

Can one of the admins verify this patch?

@mridulm
Contributor

mridulm commented Apr 27, 2017

+CC @dongjoon-hyun - since you were looking at ORC.

@hvanhovell
Contributor

hvanhovell commented Apr 27, 2017

This is very similar to #16804; however, that approach, like this one, is slightly broken (it does not support nested char/varchar columns). Can you instead backport #17030, which is an improved version?
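For context, the nested case mentioned above is one where a char/varchar column sits inside a complex type rather than at the top level of the schema, e.g. (an illustrative sketch, not taken from the PR):

```sql
-- Hypothetical example of nested char/varchar columns: the varchar/char
-- fields live inside a struct, which a top-level-only schema rewrite misses.
CREATE EXTERNAL TABLE IF NOT EXISTS nested_char_test (
  s STRUCT<a: VARCHAR(10), b: CHAR(10)>,
  c BIGINT)
STORED AS ORC
LOCATION 's3://xxxx';
```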

@dongjoon-hyun
Member

Thank you for pinging me, @mridulm . :)

@gatorsmile
Member

BTW, please add [BACKPORT-2.0] in your PR title.

@HyukjinKwon
Member

ping @umehrot2

@asfgit asfgit closed this in b771fed Jun 8, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
# What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017

Closes apache#11459
Closes apache#13833
Closes apache#13720
Closes apache#12506
Closes apache#12456
Closes apache#12252
Closes apache#17689
Closes apache#17791
Closes apache#18163
Closes apache#17640
Closes apache#17926
Closes apache#18163
Closes apache#12506
Closes apache#18044
Closes apache#14036
Closes apache#15831
Closes apache#14461
Closes apache#17638
Closes apache#18222

Added:
Closes apache#18045
Closes apache#18061
Closes apache#18010
Closes apache#18041
Closes apache#18124
Closes apache#18130
Closes apache#12217

Added:
Closes apache#16291
Closes apache#17480
Closes apache#14995

Added:
Closes apache#12835
Closes apache#17141

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18223 from HyukjinKwon/close-stale-prs.