[SPARK-14387][SQL] Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc #14471
Conversation
Test build #63147 has finished for PR 14471 at commit
Fixed scalastyle issues.
Test build #63150 has finished for PR 14471 at commit
Can you add a test case?
Also, can you update the title? The current title is very generic, while this ticket seems to solve a specific problem.
…ive.convertMetastoreOrc
Thanks @rxin. Changes:
Test build #63169 has finished for PR 14471 at commit
Jenkins, retest this please.
Test build #65687 has finished for PR 14471 at commit
Hi, @rajeshbalamohan. I'll refer to your commit for SPARK-19459. You'll be the main author if it is merged.
Hi, @rajeshbalamohan.
Closes apache#11494
Closes apache#14158
Closes apache#16803
Closes apache#16864
Closes apache#17455
Closes apache#17936
Closes apache#19377
Added:
Closes apache#19380
Closes apache#18642
Closes apache#18377
Closes apache#19632
Added:
Closes apache#14471
Closes apache#17402
Closes apache#17953
Closes apache#18607
Also cc srowen vanzin HyukjinKwon gatorsmile cloud-fan to see if you have other PRs to close.
Author: Xingbo Jiang <[email protected]>
Closes apache#19669 from jiangxb1987/stale-prs.
What changes were proposed in this pull request?
This PR improves OrcFileFormat to handle cases where the schema stored in the ORC file does not match the schema stored in the metastore.
ORC data written by Hive-1.x uses virtual column names such as _col0 and _col1 (HIVE-4243). This is fixed in Hive-2.x, but Spark would throw exceptions when reading data written by Hive-1.x. To mitigate this, "spark.sql.hive.convertMetastoreOrc" was disabled via SPARK-15705. However, that incurs performance penalties because reads then go through HiveTableScan and HadoopRDD. This PR fixes this issue.
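The mismatch described above can be illustrated with a small, self-contained sketch. It is not the code from this PR: the function name `resolve_orc_columns` and the fallback rule (map by position when every field in the file has a Hive-1.x placeholder name, otherwise map by name) are assumptions made for illustration, and `metastore_columns` is assumed to hold data columns only, without partition columns.

```python
import re

# Hive 1.x wrote placeholder field names of the form _col0, _col1, ...
_PLACEHOLDER = re.compile(r"^_col(\d+)$")


def resolve_orc_columns(file_columns, metastore_columns):
    """Map each metastore column name to the field name to read from the ORC file.

    Hypothetical helper: illustrates the idea only, not Spark's implementation.
    Columns that cannot be found in the file are left out of the result.
    """
    all_placeholders = bool(file_columns) and all(
        _PLACEHOLDER.match(name) for name in file_columns
    )
    if all_placeholders:
        # The file carries no real names, so fall back to ordinal position.
        return {
            name: file_columns[i]
            for i, name in enumerate(metastore_columns)
            if i < len(file_columns)
        }
    # The file carries real names (Hive 2.x style), so match by name.
    present = set(file_columns)
    return {name: name for name in metastore_columns if name in present}


# A file written by Hive 1.x versus one written with real column names.
hive1_mapping = resolve_orc_columns(["_col0", "_col1"], ["id", "name"])
hive2_mapping = resolve_orc_columns(["id", "name"], ["id", "name"])
```

With a mapping like this, a reader can request the physical field from the file while exposing the metastore name to the query.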
Related tickets:
SPARK-15705 : Change the default value of spark.sql.hive.convertMetastoreOrc to false.
SPARK-15705 : Spark won't read ORC schema from metastore for partitioned tables
SPARK-16628 : OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files
How was this patch tested?
Manual testing by setting "spark.sql.hive.convertMetastoreOrc=true" and querying data stored via Hive-1.x in ORC format. Also ran the SQL-related unit tests.
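A manual check along these lines could look roughly like the following PySpark sketch. The function name, application name, and table name are hypothetical, and it assumes a Spark build with Hive support and a metastore containing an ORC table written by Hive-1.x.

```python
CONVERT_ORC_KEY = "spark.sql.hive.convertMetastoreOrc"


def query_with_native_orc_reader(table_name, limit=10):
    """Read a metastore ORC table with conversion to the native ORC reader enabled.

    Hypothetical helper for a manual check; needs PySpark and a Hive metastore.
    """
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("orc-hive1-compat-check")  # hypothetical application name
        .config(CONVERT_ORC_KEY, "true")
        .enableHiveSupport()
        .getOrCreate()
    )
    # Before the fix, a read like this could fail for Hive-1.x files.
    return spark.table(table_name).limit(limit).collect()
```

For example, `query_with_native_orc_reader("default.orc_written_by_hive1")` would return the first few rows, where the table name is a placeholder for a table in your own metastore.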