
Commit e3d4349

dongjoon-hyun authored and cloud-fan committed

[SPARK-22279][SQL] Enable convertMetastoreOrc by default

## What changes were proposed in this pull request?

We reverted `spark.sql.hive.convertMetastoreOrc` at #20536 because we should not ignore the table-specific compression conf. Now, it's resolved via [SPARK-23355](8aa1d7b).

## How was this patch tested?

Pass the Jenkins.

Author: Dongjoon Hyun <[email protected]>

Closes #21186 from dongjoon-hyun/SPARK-24112.

1 parent 62d0139 commit e3d4349
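
In practice, the flip of this default is visible directly on a session. A minimal sketch, assuming a Hive-enabled build (the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// A Hive-enabled session; the app name is illustrative.
val spark = SparkSession.builder()
  .appName("convert-metastore-orc-demo")
  .enableHiveSupport()
  .getOrCreate()

// After this commit the conf defaults to true: ORC Hive tables are processed
// with Spark's built-in ORC reader and writer instead of Hive SerDe.
println(spark.conf.get("spark.sql.hive.convertMetastoreOrc")) // "true"

// Setting it back to false restores the pre-2.4 Hive SerDe behavior.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
```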

File tree

2 files changed (+3 lines, -3 lines)


docs/sql-programming-guide.md

Lines changed: 2 additions & 1 deletion
@@ -1017,7 +1017,7 @@ the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also
   <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
   <tr>
     <td><code>spark.sql.orc.impl</code></td>
-    <td><code>hive</code></td>
+    <td><code>native</code></td>
     <td>The name of the ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4. <code>hive</code> means the ORC library in Hive 1.2.1.</td>
   </tr>
   <tr>
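
As a side note on the table entry above, a hedged sketch of switching between the two ORC implementations (the session and input path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-impl-demo").getOrCreate()

// "native" (the new default) uses the reader built on Apache ORC 1.4;
// "hive" falls back to the ORC library from Hive 1.2.1.
spark.conf.set("spark.sql.orc.impl", "native")

// Reading a plain ORC file; the path is hypothetical.
spark.read.orc("/tmp/example_orc_data").printSchema()
```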
@@ -1813,6 +1813,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 - Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, regardless of the order of the input arguments. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
 - In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these 2 functions can return unexpected results. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains a timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. For people who don't care about this problem and want to retain the previous behavior to keep their queries unchanged, you can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.
 - In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. This happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, too. Since Spark 2.4, Spark respects Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, the result would be uncompressed parquet files.
+- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. This means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior (a concrete sketch follows this hunk).

 ## Upgrading From Spark SQL 2.2 to 2.3
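
To make the new migration note concrete, a speculative end-to-end example (the table name `t` mirrors the note; everything else is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-conversion-demo")
  .enableHiveSupport()
  .getOrCreate()

// A plain ORC Hive table, with no Spark-specific options involved.
spark.sql("CREATE TABLE t(id INT) STORED AS ORC")
spark.sql("INSERT INTO t VALUES (1), (2), (3)")

// Spark 2.3: this scan goes through Hive SerDe.
// Spark 2.4: the table is converted to Spark's ORC data source, so the
// vectorized reader can apply (spark.sql.orc.impl defaults to "native").
spark.table("t").show()

// Restoring the Spark 2.3 behavior for ORC Hive tables:
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")
```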

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala

Lines changed: 1 addition & 2 deletions
@@ -105,11 +105,10 @@ private[spark] object HiveUtils extends Logging {
     .createWithDefault(false)

   val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
-    .internal()
     .doc("When set to true, the built-in ORC reader and writer are used to process " +
       "ORC tables created by using the HiveQL syntax, instead of Hive serde.")
     .booleanConf
-    .createWithDefault(false)
+    .createWithDefault(true)

   val HIVE_METASTORE_SHARED_PREFIXES = buildConf("spark.sql.hive.metastore.sharedPrefixes")
     .doc("A comma separated list of class prefixes that should be loaded using the classloader " +
