# [SPARK-11412][SQL] Support merge schema for ORC #24043
## Conversation
Force-pushed 390639c to f7532a4.
@dongjoon-hyun Could you please review this PR?

Retest this please.
What happens for `OrcDataSourceV2`? It's used, isn't it?
Test build #105233 has finished for PR 24043 at commit …
Please move this test case to `OrcSuite`. Then, both `OrcSourceSuite` and `HiveOrcSourceSuite` will test the `native` and `hive` ORC implementations respectively.
And, remove this `OrcUtilsSuite` test suite.
Remove this test suite, please.
dongjoon-hyun left a comment:
Hi, @WangGuangxin. Sorry for the late response. I left a few comments.
@dongjoon-hyun Thanks for your review. I've updated it according to your comments.

Retest this please.

Test build #105665 has finished for PR 24043 at commit …
The build failure was because the type of …

Retest this please.

Test build #105689 has finished for PR 24043 at commit …
Kindly ping @dongjoon-hyun.

Sorry for the delay, @WangGuangxin.

Retest this please.
```scala
}

/**
 * Read single ORC file schema using native version of ORC
```
Please add the following line because this is only used in this file and `OrcSourceSuite`:

`* This is visible for testing.`
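Applied to the declaration quoted above, the suggested doc line would read roughly like this (a sketch only; the body is elided):

```scala
/**
 * Read single ORC file schema using native version of ORC.
 * This is visible for testing.
 */
def singleFileSchemaReader(file: String, conf: Configuration, ignoreCorruptFiles: Boolean)
  : Option[StructType] = {
  // ... existing body unchanged ...
}
```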
```scala
}

def inferSchema(sparkSession: SparkSession, files: Seq[FileStatus], options: Map[String, String])
    : Option[StructType] = {
```
So, this function is used in `native` ORC readers (`OrcFileFormat`/`OrcTable`), and `hive` `OrcFileFormat` has its own implementation of `inferSchema`, right?
> So, this function is used in `native` ORC readers (`OrcFileFormat`/`OrcTable`), and `hive` `OrcFileFormat` has its own implementation of `inferSchema`, right?

Yes. Do you think it's necessary to refactor this function?
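For context, here is a minimal sketch of that dispatch, assuming the signatures shown in the quoted diff context (the bodies are simplified stubs, and it leans on Spark-internal helpers such as `SerializableConfiguration` and `StructType#merge`, so treat it as an illustration rather than the PR's actual code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileStatus
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.util.SerializableConfiguration

object OrcSchemaInferenceSketch extends Serializable {

  // Reads one ORC file's schema with the native reader; the real body is elided here.
  def singleFileSchemaReader(
      file: String,
      conf: Configuration,
      ignoreCorruptFiles: Boolean): Option[StructType] = None

  def inferSchema(
      sparkSession: SparkSession,
      files: Seq[FileStatus],
      options: Map[String, String]): Option[StructType] = {
    // The per-read "mergeSchema" option wins over the session conf, per the PR description.
    val shouldMerge = options.get("mergeSchema").map(_.toBoolean)
      .getOrElse(sparkSession.conf.get("spark.sql.orc.mergeSchema", "false").toBoolean)

    if (shouldMerge) {
      // Read every file's schema in a Spark job, then fold the results together.
      val serConf = new SerializableConfiguration(sparkSession.sessionState.newHadoopConf())
      val schemas = sparkSession.sparkContext
        .parallelize(files.map(_.getPath.toString))
        .flatMap(path => singleFileSchemaReader(path, serConf.value, ignoreCorruptFiles = true))
        .collect()
      // StructType#merge is private[sql], so this only compiles inside Spark's sql packages.
      schemas.reduceOption(_ merge _)
    } else {
      // Previous behavior: pick a single file and use its schema as-is.
      files.headOption.flatMap { f =>
        singleFileSchemaReader(
          f.getPath.toString, sparkSession.sessionState.newHadoopConf(), ignoreCorruptFiles = true)
      }
    }
  }
}
```

The `(String, Configuration, Boolean) => Option[StructType]` parameter that appears later in the diff suggests the parallel-merge part is factored into a shared helper, so the `hive` implementation can plug in its own single-file reader.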
```scala
 * Read single ORC file schema using native version of ORC
 */
def singleFileSchemaReader(file: String, conf: Configuration, ignoreCorruptFiles: Boolean)
    : Option[StructType] = {
```
Unfortunately, the existing code around here follows a wrong indentation rule. Let's use correct indentation, at least in new code. `: Option[StructType]` should have 2-space indentation instead of 4-space.

```diff
  def singleFileSchemaReader(file: String, conf: Configuration, ignoreCorruptFiles: Boolean)
-      : Option[StructType] = {
+    : Option[StructType] = {
```

```scala
    sparkSession: SparkSession,
    files: Seq[FileStatus],
    singleFileSchemaReader: (String, Configuration, Boolean) => Option[StructType])
    : Option[StructType] = {
```
ditto. 2-space.
```scala
 * Read single ORC file schema using Hive ORC library
 */
def singleFileSchemaReader(file: String, conf: Configuration, ignoreCorruptFiles: Boolean)
    : Option[StructType] = {
```
ditto. 2-space.
dongjoon-hyun left a comment:
Also, please add the following at `ReadSchemaSuite`.

```diff
@@ -32,6 +32,7 @@ import org.apache.spark.sql.internal.SQLConf
  *
  *   -> OrcReadSchemaSuite
  *   -> VectorizedOrcReadSchemaSuite
+ *   -> MergedOrcReadSchemaSuite
@@ -134,6 +135,31 @@ class VectorizedOrcReadSchemaSuite
   }
 }
+
+class MergedOrcReadSchemaSuite
+  extends ReadSchemaSuite
+  with AddColumnIntoTheMiddleTest
+  with HideColumnInTheMiddleTest
+  with AddNestedColumnTest
+  with HideNestedColumnTest
+  with ChangePositionTest
+  with BooleanTypeTest
+  with IntegralTypeTest
+  with ToDoubleTypeTest {
+
+  override val format: String = "orc"
+
+  override def beforeAll() {
+    super.beforeAll()
+    originalConf = spark.conf.get(SQLConf.ORC_SCHEMA_MERGING_ENABLED)
+    spark.conf.set(SQLConf.ORC_SCHEMA_MERGING_ENABLED.key, "true")
+  }
+
+  override def afterAll() {
+    spark.conf.set(SQLConf.ORC_SCHEMA_MERGING_ENABLED.key, originalConf)
+    super.afterAll()
+  }
+}
+
 class ParquetReadSchemaSuite
```
Lastly, it would be great if you could add some performance comparisons between Parquet and ORC schema merging in the PR description. This PR aims to add a new feature for ORC/Parquet feature parity, so a big slowdown in the new code would not be desirable.
Test build #105853 has finished for PR 24043 at commit …

Test build #106916 has finished for PR 24043 at commit …
```scala
}

test("SPARK-11412 test enabling/disabling schema merging with data type conflicts") {
  def testSchemaMergingWithDataTypeConflicts(expectedColumnNumber: Int): Unit = {
```
Nit: I don't think we need to make this a function. We can do it like this:

```scala
withTempDir { dir =>
  spark.range(0, 10).toDF("a").write...
  withSQLConf(SQLConf.ORC_SCHEMA_MERGING_ENABLED.key -> "true") {
    spark.read...
  }
  withSQLConf(SQLConf.ORC_SCHEMA_MERGING_ENABLED.key -> "false") {
    spark.read...
  }
}
```

So the test case doesn't need to write the duplicated files twice.
yes, agree.
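For reference, a filled-in version of that shape could look like the following sketch (assumed to live inside an ORC test suite with `testImplicits`, `withTempDir`, and `withSQLConf` in scope; the writes and the expected failure mode are illustrative, not the PR's final test):

```scala
withTempDir { dir =>
  import testImplicits._
  val basePath = dir.getCanonicalPath

  // Two partition directories whose column "a" has conflicting types (bigint vs. string).
  spark.range(0, 10).toDF("a").write.orc(s"$basePath/foo=1")
  spark.range(0, 10).map(_.toString).toDF("a").write.orc(s"$basePath/foo=2")

  withSQLConf(SQLConf.ORC_SCHEMA_MERGING_ENABLED.key -> "true") {
    // With merging enabled, the type conflict should surface during schema inference.
    intercept[org.apache.spark.SparkException] {
      spark.read.orc(basePath)
    }
  }
  withSQLConf(SQLConf.ORC_SCHEMA_MERGING_ENABLED.key -> "false") {
    // With merging disabled, one file's schema is picked and the read succeeds.
    spark.read.orc(basePath)
  }
}
```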
Test build #106921 has finished for PR 24043 at commit …

Test build #106918 has finished for PR 24043 at commit …
Retest it please.

@gengliangwang Could you please make Jenkins retest this? Thanks.
Retest this please.

Test build #106987 has finished for PR 24043 at commit …

Retest this please.

Jenkins always restarts at this time...

Test build #106988 has finished for PR 24043 at commit …
@WangGuangxin Could you submit a follow-up PR to update the documentation? See the example: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging

Thanks! Merged to master.

Thanks for your work, @WangGuangxin! What is your JIRA account? We need to set the assignee field to your JIRA account: https://issues.apache.org/jira/browse/SPARK-11412

Thanks. My JIRA account is EdisonWang.

OK.
## What changes were proposed in this pull request?
Currently, ORC's `inferSchema` is implemented as randomly choosing one ORC file and reading its schema.
This PR follows the behavior of Parquet: it implements schema-merging logic by reading the schemas of all ORC files in parallel through a Spark job.
Users can enable schema merging via `spark.read.orc("xxx").option("mergeSchema", "true")` or by setting `spark.sql.orc.mergeSchema` to `true`; the former has higher priority.
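For example (a minimal usage sketch; the path is a placeholder):

```scala
// Per-read option -- takes precedence over the session configuration:
val merged = spark.read.option("mergeSchema", "true").orc("/path/to/orc")

// Session-wide default:
spark.conf.set("spark.sql.orc.mergeSchema", "true")
val mergedToo = spark.read.orc("/path/to/orc")
```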
## How was this patch tested?
Tested by the unit test suite `OrcUtilsSuite.scala`.
Closes apache#24043 from WangGuangxin/SPARK-11412.
Lead-authored-by: wangguangxin.cn <[email protected]>
Co-authored-by: wangguangxin.cn <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
HyukjinKwon left a comment.
@HyukjinKwon - Sure, will do.
… APIs

### What changes were proposed in this pull request?
This PR is a follow-up to #24043 and cousin of #26730. It exposes the `mergeSchema` option directly in the ORC APIs.

### Why are the changes needed?
So the Python API matches the Scala API.

### Does this PR introduce any user-facing change?
Yes, it adds a new option directly in the ORC reader method signatures.

### How was this patch tested?
I tested this manually as follows:

```
>>> spark.range(3).write.orc('test-orc')
>>> spark.range(3).withColumnRenamed('id', 'name').write.orc('test-orc/nested')
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
>>> spark.conf.set('spark.sql.orc.mergeSchema', True)
>>> spark.read.orc('test-orc', recursiveFileLookup=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
```

Closes #26755 from nchammas/SPARK-30113-ORC-mergeSchema.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…p' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC

### What changes were proposed in this pull request?
This PR adds and exposes the options 'recursiveFileLookup' and 'pathGlobFilter' in file sources, and 'mergeSchema' in ORC, into the documentation.

- `recursiveFileLookup` at file sources: #24830 ([SPARK-27627](https://issues.apache.org/jira/browse/SPARK-27627))
- `pathGlobFilter` at file sources: #24518 ([SPARK-27990](https://issues.apache.org/jira/browse/SPARK-27990))
- `mergeSchema` at ORC: #24043 ([SPARK-11412](https://issues.apache.org/jira/browse/SPARK-11412))

**Note that** the `timeZone` option was not moved from `DataFrameReader.options`, as I assume it will likely affect other datasources as well once DSv2 is complete.

### Why are the changes needed?
To document the available options in sources properly.

### Does this PR introduce any user-facing change?
In PySpark, `pathGlobFilter` can be set via `DataFrameReader.(text|orc|parquet|json|csv)` and `DataStreamReader.(text|orc|parquet|json|csv)`.

### How was this patch tested?
Manually built the doc and checked the output. Option setting in PySpark is rather a logical change. I manually tested one only:

```bash
$ ls -al tmp
...
-rw-r--r--  1 hyukjin.kwon  staff  3 Dec 20 12:19 aa
-rw-r--r--  1 hyukjin.kwon  staff  3 Dec 20 12:19 ab
-rw-r--r--  1 hyukjin.kwon  staff  3 Dec 20 12:19 ac
-rw-r--r--  1 hyukjin.kwon  staff  3 Dec 20 12:19 cc
```

```python
>>> spark.read.text("tmp", pathGlobFilter="*c").show()
```

```
+-----+
|value|
+-----+
|   ac|
|   cc|
+-----+
```

Closes #26958 from HyukjinKwon/doc-followup.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>