
Conversation

@windpiger (Contributor) commented Feb 13, 2017

What changes were proposed in this pull request?

This PR is follow-up work for SPARK-19329 (PR #16672), which unified the behavior when reading from or writing to a datasource table whose location does not pre-exist; here we unify Hive serde tables in the same way.

That is:

  • reading from a Hive serde table whose location does not exist returns 0 rows
  • writing to a Hive serde table whose location does not exist writes the data successfully (see the sketch below)
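For illustration, a minimal sketch of the unified behavior in the style of this PR's tests, assuming a temp directory dir that backs the table (table name and data are illustrative):

spark.sql(
  s"""
    |CREATE TABLE t(a string, b int)
    |USING hive
    |LOCATION '$dir'
  """.stripMargin)
dir.delete()                                       // the location no longer exists
checkAnswer(spark.table("t"), Nil)                 // read: 0 rows, no exception
spark.sql("INSERT INTO TABLE t SELECT 'c', 1")     // write: recreates the dir and succeeds
checkAnswer(spark.table("t"), Row("c", 1) :: Nil)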

Currently, selecting from a Hive serde table whose location does not exist throws an exception:

Input path does not exist: file:/tmp/spark-37caa4e6-5a6a-4361-a905-06cc56afb274
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/spark-37caa4e6-5a6a-4361-a905-06cc56afb274
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
... (the MapPartitionsRDD.getPartitions / RDD.partitions frames above repeat 5 more times)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2080)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:258)

How was this patch tested?

Unit tests added.

@SparkQA commented Feb 13, 2017

Test build #72808 has started for PR 16910 at commit cb98375.

@windpiger (Contributor Author):

retest this please

@SparkQA commented Feb 13, 2017

Test build #72812 has finished for PR 16910 at commit cb98375.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger (Contributor Author):

cc @gatorsmile @cloud-fan

@SparkQA commented Feb 13, 2017

Test build #72815 has finished for PR 16910 at commit 401e86d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger (Contributor Author) commented Feb 17, 2017

@gatorsmile could you help review this? thanks :)

dir.delete()
checkAnswer(spark.table("t"), Nil)

val newDir = dir.getAbsolutePath.stripSuffix("/") + "/x"
Member:

-> new File(dir, "x")

}
}

test("read data from a hive serde table which has a not existed location should succeed") {
Member:

This is the only test case that fails without this fix, right?

Contributor Author:

Yes, it is~

s"""
|CREATE TABLE t(a string, b int)
|USING hive
|OPTIONS(path "file:${dir.getAbsolutePath}")
@gatorsmile (Member) commented Feb 20, 2017:

Indent issues.


// if the table location is not exists, return an empty RDD
if (!fs.exists(locationPath)) {
return new EmptyRDD[InternalRow](sparkSession.sparkContext)
Member:

Can we do it in makeRDDForTable?

Contributor Author:

I do it here so it covers both non-partitioned and partitioned tables. A partitioned table already works when the location does not exist if verifyPartitionPath is set to true; if we also want it to work when verifyPartitionPath is set to false, we would have to add the same check in makeRDDForPartitionTable as well, so doing it here handles both cases.

Contributor Author:

On second thought, it does not seem reasonable to do this for a partitioned table, because the actual partition paths may not be under the table's location. I moved this logic to makeRDDForTable.
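For context, a rough sketch of the guard as it reads after this change, assembled from the snippets quoted in this review (inputPathStr is the table's location string; this is a sketch, not the exact patch):

val locationPath = new Path(inputPathStr)
val fs = locationPath.getFileSystem(sparkSession.sessionState.newHadoopConf())
// if the table location does not exist, return an empty RDD instead of
// letting FileInputFormat.getSplits throw InvalidInputException
if (!fs.exists(locationPath)) {
  return new EmptyRDD[InternalRow](sparkSession.sparkContext)
}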

@SparkQA commented Feb 21, 2017

Test build #73191 has finished for PR 16910 at commit 4493a8f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

retest this please

val locationPath = new Path(inputPathStr)
val fs = locationPath.getFileSystem(sparkSession.sessionState.newHadoopConf())

// if the table location is not exists, return an empty RDD
Member:

is not exists -> does not exist

@SparkQA commented Feb 21, 2017

Test build #73193 has finished for PR 16910 at commit 4493a8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 21, 2017

Test build #73194 has finished for PR 16910 at commit 6fb2b57.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 21, 2017

Test build #73196 has finished for PR 16910 at commit b4caca7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger (Contributor Author):

@gatorsmile I have addressed the review comments above~

HadoopTableReader.fillObject(iter, deserializer, attrsWithIndex, mutableRow, deserializer)
}
val locationPath = new Path(inputPathStr)
val fs = locationPath.getFileSystem(sparkSession.sessionState.newHadoopConf())
Member:

How about replacing sparkSession.sessionState.newHadoopConf() with broadcastedHadoopConf.value.value?

Contributor Author:

ok~
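For reference, a sketch of the suggested replacement, assuming a broadcast variable of the kind HadoopTableReader already keeps (SerializableConfiguration is Spark's serializable wrapper around a Hadoop Configuration; the field name here is illustrative):

// created once on the driver
private val broadcastedHadoopConf =
  sparkSession.sparkContext.broadcast(
    new SerializableConfiguration(sparkSession.sessionState.newHadoopConf()))

// reused at each call site instead of building a fresh conf every time
val fs = locationPath.getFileSystem(broadcastedHadoopConf.value.value)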

|PARTITIONED BY(a, b)
|LOCATION "file:${dir.getCanonicalPath}"
""".stripMargin)
val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
Member:

useless?


val newDirFile = new File(dir, "x")
spark.sql(s"ALTER TABLE t PARTITION(a=1, b=2) SET LOCATION " +
s"'${newDirFile.getAbsolutePath}'")
Member:

shorten it to a single line?

Contributor Author:

101 characters...
let me modify some code.

Member:

Actually, 101 is still ok

Contributor Author:

e... isn't it 100? let me test it...
I have modified some code to make it clearer.



@gatorsmile (Member):

LGTM except a few minor comments.

@SparkQA commented Feb 22, 2017

Test build #73266 has started for PR 16910 at commit 119fa64.

@windpiger (Contributor Author):

retest this please

@windpiger (Contributor Author):

retest this please

@SparkQA commented Mar 1, 2017

Test build #73657 has finished for PR 16910 at commit f83d81d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 1, 2017

Test build #73661 has finished for PR 16910 at commit 2456a94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 1, 2017

Test build #73663 has finished for PR 16910 at commit 2456a94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 1, 2017

Test build #73666 has finished for PR 16910 at commit f4b4d29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val fs = dirPath.getFileSystem(spark.sessionState.newHadoopConf())
assert(new Path(table.location) == fs.makeQualified(dirPath))

val tableLocFile = new File(table.location.stripPrefix("file:"))
Contributor:

Use new File(new URI(table.location))? Please avoid .stripPrefix("file:"), which looks very hacky.
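A minimal sketch of the suggested change (assuming table.location is a URI-formatted string, as in the test above):

import java.net.URI
// instead of new File(table.location.stripPrefix("file:")):
val tableLocFile = new File(new URI(table.location))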

checkAnswer(spark.table("t"), Row("c", 1) :: Nil)

val newDirFile = new File(dir, "x")
val newDirPath = newDirFile.getAbsolutePath.stripSuffix("/")
Contributor:

.stripSuffix("/") is it needed?

@SparkQA commented Mar 1, 2017

Test build #73677 has finished for PR 16910 at commit 3dcd6c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

can you resolve the conflict?

@windpiger (Contributor Author):

OK, doing it now ~ it was fine yesterday...

@SparkQA commented Mar 2, 2017

Test build #73735 has started for PR 16910 at commit a4f771a.

s"""
|CREATE TABLE t(a string, b int)
|USING hive
|LOCATION '$dir'
Contributor:

can we just call dir.delete before creating this table?

Contributor Author:

ok~

@windpiger (Contributor Author) commented Mar 2, 2017:

@cloud-fan I found that the dir will be created during CREATE TABLE, so we should keep the current logic.

Contributor:

does hive have the same behavior?

Contributor Author:

Yes, I tested it in Hive:

create table test(a string) location 'hdfs:/xx';

then hdfs:/xx will be created

Contributor:

It seems the InMemoryCatalog doesn't do this; you can send a new PR to fix it.

Contributor Author:

ok thanks~
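For that follow-up, a hypothetical sketch of what the fix could look like inside InMemoryCatalog.createTable (the surrounding method shape and the hadoopConfig field are assumptions; the point is to mirror Hive and create a missing location directory):

// inside InMemoryCatalog.createTable, when the table has an explicit location
tableDefinition.storage.locationUri.foreach { loc =>
  val path = new Path(loc.toString)
  val fs = path.getFileSystem(hadoopConfig)
  if (!fs.exists(path)) {
    fs.mkdirs(path)  // mirror Hive: CREATE TABLE creates the location dir if absent
  }
}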

@windpiger (Contributor Author):

retest this please

@SparkQA commented Mar 2, 2017

Test build #73747 has finished for PR 16910 at commit a4f771a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 3, 2017

Test build #73829 has started for PR 16910 at commit 15c0a77.

@windpiger (Contributor Author):

retest this please

@SparkQA commented Mar 3, 2017

Test build #73831 has finished for PR 16910 at commit 15c0a77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

spark.sql(s"ALTER TABLE t PARTITION(a=1, b=2) SET LOCATION '$newDirPath'")
assert(!newDirFile.exists())
// select from a partition whose location has changed to a nonexistent location
withSQLConf(SQLConf.HIVE_VERIFY_PARTITION_PATH.key -> "true") {
Contributor:

Why set this conf?

Contributor Author:

If we don't set it, it throws an exception; if we set it, it checks whether the partition path exists and, even when the path does not exist, returns an empty RDD instead of throwing.

Contributor:

Is this expected? I think Hive will always return an empty result, right?

Contributor:

BTW this conf will be removed soon, as it has bugs.

Contributor Author:

OK~ thanks~ Then do we also need to modify something here?

Contributor Author:

Yes, Hive returns an empty result. If there is a bug here (could you describe what the bug is?), can we remove the conf and always return the result?

@gatorsmile (Member):

Should we just close it now?

@SparkQA commented May 31, 2018

Test build #91357 has finished for PR 16910 at commit 15c0a77.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 1a4fda8 Jul 19, 2018
The batch-close commit that referenced this pull request carried the following message:
Closes apache#17422
Closes apache#17619
Closes apache#18034
Closes apache#18229
Closes apache#18268
Closes apache#17973
Closes apache#18125
Closes apache#18918
Closes apache#19274
Closes apache#19456
Closes apache#19510
Closes apache#19420
Closes apache#20090
Closes apache#20177
Closes apache#20304
Closes apache#20319
Closes apache#20543
Closes apache#20437
Closes apache#21261
Closes apache#21726
Closes apache#14653
Closes apache#13143
Closes apache#17894
Closes apache#19758
Closes apache#12951
Closes apache#17092
Closes apache#21240
Closes apache#16910
Closes apache#12904
Closes apache#21731
Closes apache#21095

Added:
Closes apache#19233
Closes apache#20100
Closes apache#21453
Closes apache#21455
Closes apache#18477

Added:
Closes apache#21812
Closes apache#21787

Author: hyukjinkwon <[email protected]>

Closes apache#21781 from HyukjinKwon/closing-prs.