[SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables #29045
Conversation
dongjoon-hyun left a comment:
Thank you for your contribution, @SaurabhChawla100. To prevent a future regression, could you add a UT with your example, please?
ok to test
Test build #125398 has finished for PR 29045 at commit
Retest this please.
Test build #125411 has finished for PR 29045 at commit
Sure, I will add the unit test.
Test build #125502 has finished for PR 29045 at commit
Retest this please.
Test build #125520 has finished for PR 29045 at commit
retest this please
```
assert(error == null)
spark.sql(s"DROP TABLE IF EXISTS test_date_spark_orc")
}
```
nit: remove this blank.
done
```
  }
}

test("orc data created by the hive tables having _col fields name") {
```
Please add the prefix: test("SPARK-32234: orc data....
done
```
  spark.sql(s"DROP TABLE IF EXISTS test_date_hive_orc")
}

test("orc data created by the spark having proper fields name") {
```
ditto
done
```
  error = e
}
assert(error == null)
spark.sql(s"DROP TABLE IF EXISTS test_date_hive_orc")
```
Please check the other tests carefully, then follow how tests are written there. How about refactoring it like this?
```
withTable("test_date_hive_orc") {
  spark.sql(
    s"""
      |CREATE TABLE test_date_hive_orc
      | (col1 INT, col2 STRING, col3 INT)
      | USING orc
    """.stripMargin)
  spark.sql(
    s"""
      |INSERT INTO test_date_hive_orc VALUES
      | (9, '12', 2020)
    """.stripMargin)
  val df = spark.sql("SELECT col2 FROM test_date_hive_orc")
  checkAnswer(df, Row(...))
}
```
Refactored the unit test
Test build #125555 has finished for PR 29045 at commit
Test build #125593 has started for PR 29045 at commit
```
assert(error == null)
spark.sql(
  s"""
     |DROP TABLE IF
```
You don't need to drop the table. Please see the implementation of withTable.
Yes, it's handled in withTable; removed this drop table.
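For reference, withTable in Spark's SQLTestUtils is essentially a try/finally wrapper that drops the tables afterwards. A simplified sketch (assumed shape, not the exact Spark source):

```
trait SQLTestUtilsSketch {
  def spark: org.apache.spark.sql.SparkSession

  // Run the test body, then always drop the given tables, even if the body throws.
  def withTable(tableNames: String*)(f: => Unit): Unit = {
    try f finally {
      tableNames.foreach { name =>
        spark.sql(s"DROP TABLE IF EXISTS $name")
      }
    }
  }
}
```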
```
case e: Throwable =>
  error = e
}
assert(error == null)
```
I think you don't need this check.
Yes, it's not required; this is already handled by checkAnswer in the test framework. I have removed it.
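For context, checkAnswer fails the test itself on any exception or result mismatch, so no manual error tracking is needed. A minimal usage sketch (the expected value follows the '12' inserted in the suggestion above):

```
// checkAnswer executes the query and fails the test on an exception or a
// result mismatch, replacing the try/catch + assert(error == null) pattern.
val df = spark.sql("SELECT col2 FROM test_date_hive_orc")
checkAnswer(df, Row("12"))
```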
Test build #125668 has finished for PR 29045 at commit
```
}

test("SPARK-32234: orc data created by the hive tables having _col fields name") {
  var error: Throwable = null
```
please remove this.
done
```
withTable("test_date_hive_orc") {
  spark.sql(
    """
      |CREATE TABLE `test_date_hive_orc`
```
nit: do we need the backquotes?
not required
```
      | values(9, '12', 2020)
    """.stripMargin)

val df = spark.sql("select d_date_id from test_date_spark_orc")
```
nit: please use uppercase for SQL keywords (e.g., SELECT) where possible.
Done
```
// in the physical schema, there is a need to send the
// entire dataSchema instead of required schema
val orcFieldNames = reader.getSchema.getFieldNames.asScala
if (orcFieldNames.forall(_.startsWith("_col"))) {
```
What does this code mean? Does _col need to be hard-coded?
This is for an ORC file written by Hive, which has no field names in the physical schema; in that case the fields get names like _col1, _col2, etc. Check this code for reference:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala, line 133 at 84db660:
```
if (orcFieldNames.forall(_.startsWith("_col"))) {
```
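For illustration, a standalone sketch of the check (the field names below are typical of Hive-written files, not read from a real file):

```
// Hive-written ORC files carry only positional field names in the physical
// schema, e.g. struct<_col0:int,_col1:string,_col2:int>, instead of the
// catalog column names. The check detects that case:
val orcFieldNames = Seq("_col0", "_col1", "_col2")
val hiveWritten = orcFieldNames.forall(_.startsWith("_col")) // true
// When true, columns must be resolved by position against the full data
// schema rather than by name, so name-based column pruning is unsafe.
```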
Test build #125675 has finished for PR 29045 at commit
Thank you for updating, but could you update the PR description with a reproducible example? If someone follows the current example, they cannot reproduce it because they don't have the data in /Users/test/tpcds_scale5data/date_dim. Also, please remove irrelevant stuff like TBLPROPERTIES.
```
  }
}

test("SPARK-32234: orc data created by the spark having proper fields name") {
```
Shall we remove this test case, since it passes without your patch? We already have test coverage for this.
done
Test build #125850 has finished for PR 29045 at commit
Test build #125851 has finished for PR 29045 at commit
Retest this please
```
  isCaseSensitive, dataSchema, requiredSchema, reader, conf)
}

if (!canPruneCols) {
```
nit: we can simplify the code a bit:
```
val resultSchemaString = if (canPruneCols) {
  OrcUtils.orcTypeDescriptionString(resultSchema)
} else {
  OrcUtils.orcTypeDescriptionString(StructType(dataSchema.fields ++ partitionSchema.fields))
}
```
Then we don't need to keep the `val actualSchema = ...` and `var resultSchemaString = ...` at the beginning.
done
```
   reader: Reader,
-  conf: Configuration): Option[Array[Int]] = {
+  conf: Configuration): (Option[Array[Int]], Boolean) = {
+  var canPruneCols = true
```
Do we really need it? We can just use a boolean literal in the places that return the value.
Removed it
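A sketch of what the suggestion amounts to (the method name and the id computation here are stand-ins, not the real OrcUtils logic):

```
// Return the boolean literal at each exit point instead of threading a
// mutable `var canPruneCols` through the method.
def requestedColumnIdsShape(orcFieldNames: Seq[String]): (Seq[Int], Boolean) = {
  val ids = orcFieldNames.indices // stand-in for the real id resolution
  if (orcFieldNames.forall(_.startsWith("_col"))) {
    (ids, false) // positional names: pruning is unsafe
  } else {
    (ids, true)  // real field names: pruning is safe
  }
}
```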
```
withSQLConf(
  SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
  SQLConf.ORC_VECTORIZED_READER_ENABLED.key -> vectorized) {
  withTempPath { dir =>
```
nit: we don't need to provide a custom location; CREATE TABLE without a LOCATION clause can also reproduce it.
Removed it
```
/**
 * Returns the requested column ids from the given ORC file. Column id can be -1, which means the
 * requested column doesn't exist in the ORC file. Returns None if the given ORC file is empty.
 * @return Returns the requested column ids from the given ORC file and Boolean flag to use actual
```
can we update the comment a little bit?
updated the comment
```
} else {
  OrcUtils.orcTypeDescriptionString(StructType(dataSchema.fields ++ partitionSchema.fields))
}
OrcConf.MAPRED_INPUT_SCHEMA.setString(conf, resultSchemaString)
```
To avoid duplicated code, we can move this code into the requestedColumnIds method:
```
def requestedColumnIds(..., partitionSchema: StructType): Option[Array[Int]] = {
  ...
  if (orcFieldNames.isEmpty) {
    None
  } else if (orcFieldNames.forall(_.startsWith("_col"))) {
    OrcConf.MAPRED_INPUT_SCHEMA.setString(conf,
      OrcUtils.orcTypeDescriptionString(StructType(dataSchema.fields ++ partitionSchema.fields)))
    ...
  } else {
    OrcConf.MAPRED_INPUT_SCHEMA.setString(conf,
      OrcUtils.orcTypeDescriptionString(StructType(requiredSchema.fields ++ partitionSchema.fields)))
    ...
  }
}
```
@cloud-fan - In that case we need to return the resultSchemaString from this method as Option[(Array[Int], String)], i.e.:
```
} else if (orcFieldNames.forall(_.startsWith("_col"))) {
  val resultSchemaString = OrcUtils.orcTypeDescriptionString(StructType(dataSchema.fields ++ partitionSchema.fields))
} else {
  val resultSchemaString = OrcUtils.orcTypeDescriptionString(StructType(requiredSchema.fields ++ partitionSchema.fields))
}
```
since we are using this resultSchemaString in:
```
batchReader.initBatch(
  TypeDescription.fromString(resultSchemaString),
  resultSchema.fields,
```
Shall we make this change, or create a helper method in OrcUtils from this code?
```
val resultSchemaString = someMethod()

def someMethod(): String = {
  val resultSchemaString = if (canPruneCols) {
    OrcUtils.orcTypeDescriptionString(resultSchema)
  } else {
    OrcUtils.orcTypeDescriptionString(StructType(dataSchema.fields ++ partitionSchema.fields))
  }
  OrcConf.MAPRED_INPUT_SCHEMA.setString(conf, resultSchemaString)
  resultSchemaString
}
```
adding a new helper method is also good.
Added the helper method
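For reference, a sketch of the resulting helper (the signature is inferred from the call site quoted later in this review; the merged implementation may differ in detail):

```
import org.apache.hadoop.conf.Configuration
import org.apache.orc.OrcConf
import org.apache.spark.sql.types.StructType

// Pick the result schema string based on canPruneCols, publish it to the
// ORC reader via the Hadoop conf, and return it for initBatch.
def orcResultSchemaString(
    canPruneCols: Boolean,
    dataSchema: StructType,
    resultSchema: StructType,
    partitionSchema: StructType,
    conf: Configuration): String = {
  val resultSchemaString = if (canPruneCols) {
    OrcUtils.orcTypeDescriptionString(resultSchema)
  } else {
    OrcUtils.orcTypeDescriptionString(StructType(dataSchema.fields ++ partitionSchema.fields))
  }
  OrcConf.MAPRED_INPUT_SCHEMA.setString(conf, resultSchemaString)
  resultSchemaString
}
```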
```
  }
}

test("SPARK-32234: orc data created by the hive tables having _col fields name" +
```
we can shorten the test name: SPARK-32234: read ORC table with column names all starting with '_col'
done
cloud-fan left a comment:
LGTM except some code style issues. Thanks for the fix!
```
val (requestedColIds, canPruneCols) = resultedColPruneInfo.get
val resultSchemaString = OrcUtils.orcResultSchemaString(canPruneCols,
  dataSchema, resultSchema, partitionSchema, conf)
val requestedDataColIds = requestedColIds ++ Array.fill(partitionSchema.length)(-1)
```
I think we should switch the names: here we add the partition column IDs, so it's better to call this one requestedColIds. The former can be called requestedDataColIds, as it doesn't contain partition columns.
Switched the names.
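After the rename, the diff context quoted above would read roughly:

```
// Ids for data columns only, as returned by requestedColumnIds:
val (requestedDataColIds, canPruneCols) = resultedColPruneInfo.get
val resultSchemaString = OrcUtils.orcResultSchemaString(canPruneCols,
  dataSchema, resultSchema, partitionSchema, conf)
// Partition columns are appended with id -1, so this is the full id list:
val requestedColIds = requestedDataColIds ++ Array.fill(partitionSchema.length)(-1)
```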
Test build #125892 has finished for PR 29045 at commit
Test build #125907 has finished for PR 29045 at commit
Test build #125949 has finished for PR 29045 at commit
@cloud-fan / @dongjoon-hyun - The test build is failing with the following error; I am not sure if it's related to the change in this PR.
All the GitHub Actions checks passed; I think it's good to go. Merging to master/3.0!
[SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables
### What changes were proposed in this pull request?
Spark SQL commands are failing when selecting from ORC tables.
Steps to reproduce:
Example 1 -
Prerequisite: the ORC data at /Users/test/tpcds_scale5data/date_dim was generated by Hive.
```
val table = """CREATE TABLE `date_dim` (
`d_date_sk` INT,
`d_date_id` STRING,
`d_date` TIMESTAMP,
`d_month_seq` INT,
`d_week_seq` INT,
`d_quarter_seq` INT,
`d_year` INT,
`d_dow` INT,
`d_moy` INT,
`d_dom` INT,
`d_qoy` INT,
`d_fy_year` INT,
`d_fy_quarter_seq` INT,
`d_fy_week_seq` INT,
`d_day_name` STRING,
`d_quarter_name` STRING,
`d_holiday` STRING,
`d_weekend` STRING,
`d_following_holiday` STRING,
`d_first_dom` INT,
`d_last_dom` INT,
`d_same_day_ly` INT,
`d_same_day_lq` INT,
`d_current_day` STRING,
`d_current_week` STRING,
`d_current_month` STRING,
`d_current_quarter` STRING,
`d_current_year` STRING)
USING orc
LOCATION '/Users/test/tpcds_scale5data/date_dim'"""
spark.sql(table).collect
val u = """select date_dim.d_date_id from date_dim limit 5"""
spark.sql(u).collect
```
Example 2
```
val table = """CREATE TABLE `test_orc_data` (
`_col1` INT,
`_col2` STRING,
`_col3` INT)
USING orc"""
spark.sql(table).collect
spark.sql("insert into test_orc_data values(13, '155', 2020)").collect
val df = """select _col2 from test_orc_data limit 5"""
spark.sql(df).collect
```
It fails with the error below:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, 192.168.0.103, executor driver): java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:133)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
The reason is that initBatch does not receive the schema it needs to resolve the column values in OrcFileFormat.scala:
```
batchReader.initBatch(
TypeDescription.fromString(resultSchemaString)
```
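To make the failure mode concrete, here is a hypothetical illustration of the index mismatch (simplified, with invented variable names; not Spark internals):

```
// The column id for _col2 was resolved positionally against the full
// 3-column file schema, but the batch was sized from the pruned 1-column
// read schema, so the id points past the end of the vector array.
val requestedColIds = Array(1)    // _col2 is the 2nd physical column
val prunedFields = Array("_col2") // pruned read schema has 1 field
val vectors = new Array[AnyRef](prunedFields.length)
vectors(requestedColIds(0))       // throws ArrayIndexOutOfBoundsException: 1
```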
### Why are the changes needed?
Spark SQL queries on ORC tables are failing.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
A unit test is added for this. Also tested the failing queries through spark-shell and spark-submit.
Closes #29045 from SaurabhChawla100/SPARK-32234.
Lead-authored-by: SaurabhChawla <[email protected]>
Co-authored-by: SaurabhChawla <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6be8b93)
Signed-off-by: Wenchen Fan <[email protected]>
Thank you, @SaurabhChawla100 and @cloud-fan.
```
 * @return Returns the result schema string based on the canPruneCols flag.
 *         resultSchemaString will be created using resultsSchema in case of
 *         canPruneCols is true and for canPruneCols as false value
 *         resultSchemaString will be created using the actual dataSchema.
```
This description is not clear enough. This utility function also changes the value of conf; we need to document that.
@SaurabhChawla100 Could you submit a follow-up PR to improve the description?
@gatorsmile - This is the new helper method that we added as part of this PR.
Sure, I will update the description in a follow-up PR. Shall I raise the PR against a new JIRA or this same JIRA, since this JIRA is already resolved?
> Shall I raise the PR against a new JIRA or this same JIRA?

It's okay to refer to this JIRA ticket. Then, please add [FOLLOWUP] in the PR title.
### What changes were proposed in this pull request?
As part of PR #29045, a helper method was added. This PR is the follow-up to update the description of that helper method.
### Why are the changes needed?
For better readability and understanding of the code.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Since the only change is updating a description, ran the Spark shell.
Closes #29232 from SaurabhChawla100/SPARK-32234-Desc.
Authored-by: SaurabhChawla <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>