@gatorsmile
Member

What changes were proposed in this pull request?

In HiveSessionState, the session state of a SparkSession backed by Hive, analysis should not be case sensitive, because the underlying Hive metastore is case insensitive.

For example,

CREATE TABLE tab1 (C1 int);
SELECT C1 FROM tab1

In the current implementation, we will get the following error because the column name is always stored in lower case.

cannot resolve '`C1`' given input columns: [c1]; line 1 pos 7
org.apache.spark.sql.AnalysisException: cannot resolve '`C1`' given input columns: [c1]; line 1 pos 7

This PR makes HiveSessionState always use case-insensitive analysis, regardless of whether users set spark.sql.caseSensitive to true or false.
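
To illustrate the difference between the two modes, here is a minimal sketch in plain Python. It is not Spark's analyzer code: the function `resolve` and its parameters are hypothetical names used only for illustration, and the error text merely imitates the message quoted above.

```python
def resolve(name, input_columns, case_sensitive):
    """Return the stored column that matches `name`, or raise if none does."""
    for col in input_columns:
        if case_sensitive:
            matches = col == name
        else:
            # Case-insensitive mode compares the names after lower-casing both.
            matches = col.lower() == name.lower()
        if matches:
            return col
    raise LookupError(
        "cannot resolve '`%s`' given input columns: [%s]"
        % (name, ", ".join(input_columns))
    )


# The metastore stores the column name in lower case (assumption taken from the text above).
stored = ["c1"]

print(resolve("C1", stored, case_sensitive=False))

try:
    resolve("C1", stored, case_sensitive=True)
except LookupError as err:
    print(err)
```

With case-insensitive matching the query column `C1` resolves to the stored `c1`; with case-sensitive matching the lookup fails in the same way as the error shown above.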

How was this patch tested?

Added the related test cases.

@SparkQA

SparkQA commented May 9, 2016

Test build #58114 has finished for PR 12993 at commit d7d96c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

cc @cloud-fan @rxin @yhuai @andrewor14

@cloud-fan
Contributor

I think we need to discuss it more:

  1. Should we allow case sensitivity to be configurable? It is sometimes out of our control, as with the Hive catalog, which is always case insensitive.
  2. Besides case sensitivity, should we also include the concept of case preservation for external catalogs?

@gatorsmile
Member Author

gatorsmile commented May 9, 2016

Agreed. We need to be careful when deciding on the design. This PR only restores our previous behavior in HiveContext.

Case sensitivity is complicated and platform/vendor-specific. The summary below is based on my own search and might not be 100% correct.

  • For unquoted identifiers, SQL:2003 and DB2 are not case sensitive. Oracle and SQL Server are configurable, but are not case sensitive by default.
  • For quoted/delimited identifiers, most traditional RDBMSs are case sensitive. Hive is special. Starting from Hive 0.13, Hive supports quoted identifiers in column names (https://issues.apache.org/jira/browse/HIVE-6013). However, this does not apply to table, database, or function names in Hive.
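
The quoted versus unquoted distinction can be sketched as follows. This is a simplified illustration in plain Python, not the behavior of any specific database: `normalize_identifier` and `same_identifier` are hypothetical helpers, double quotes stand in for whatever delimiter a given system uses, and folding unquoted names to lower case is an assumption (some systems fold to upper case instead).

```python
def normalize_identifier(raw):
    """Fold an unquoted identifier to lower case; keep a quoted one verbatim."""
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        # Quoted/delimited identifier: strip the quotes and preserve the case.
        return raw[1:-1]
    # Unquoted identifier: case is not significant, so fold it.
    return raw.lower()


def same_identifier(a, b):
    """Two identifiers refer to the same object if their normalized forms match."""
    return normalize_identifier(a) == normalize_identifier(b)


print(same_identifier("C1", "c1"))      # unquoted vs unquoted
print(same_identifier('"C1"', '"c1"'))  # quoted vs quoted
print(same_identifier('"c1"', "C1"))    # quoted vs unquoted
```

Under these assumptions the first and third comparisons match, while the second does not, because quoting makes the case significant.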

@rxin
Contributor

rxin commented May 9, 2016

We want to eliminate HiveSessionState, so this is a step backward. It is also a step backward in that it makes the behavior of the Hive and non-Hive session states diverge.

I don't think we should support this. For now, we should just make case sensitivity an internal config that is not exposed to users. Our case sensitivity support is somewhat broken and does not follow the SQL standard (e.g. in Postgres, quoting an identifier makes it case sensitive), so the simplest solution is to not support it for now and

See https://issues.apache.org/jira/browse/SPARK-15229

@gatorsmile
Member Author

Agree. Let me close this now. Thanks!


4 participants