Conversation

@gatorsmile (Member)

What changes were proposed in this pull request?

In Spark 2.1 (where spark.sql.hive.manageFilesourcePartitions defaults to true), if we create a partitioned data source table over a specified path, querying it returns nothing. To get the data, we have to manually issue a DDL to repair the table.

In Spark 2.0, it returns the data stored in the specified path without repairing the table. In Spark 2.1, setting spark.sql.hive.manageFilesourcePartitions to false restores the Spark 2.0 behavior.
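
For context, the flag can also be turned off when launching the shell (a sketch; any of the usual conf mechanisms works):

$ spark-shell --conf spark.sql.hive.manageFilesourcePartitions=false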

Below is the output of Spark 2.1.

scala> spark.range(5).selectExpr("id as fieldOne", "id as partCol").write.partitionBy("partCol").mode("overwrite").saveAsTable("test")
                                                                                
scala> spark.sql("desc formatted test").show(50, false)
+----------------------------+----------------------------------------------------------------------+-------+
|col_name                    |data_type                                                             |comment|
+----------------------------+----------------------------------------------------------------------+-------+
...
|Location:                   |file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse/test|       |
|Table Type:                 |MANAGED                                                               |       |
...
|Partition Provider:         |Catalog                                                               |       |
+----------------------------+----------------------------------------------------------------------+-------+


scala> spark.sql(s"create table newTab (fieldOne long, partCol int) using parquet options (path 'file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse/test') partitioned by (partCol)")
res3: org.apache.spark.sql.DataFrame = []

scala> spark.table("newTab").show()
+--------+-------+
|fieldOne|partCol|
+--------+-------+
+--------+-------+
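
As a manual workaround, the table can be repaired after creation (a sketch; both forms should be equivalent in Spark 2.1):

scala> spark.sql("MSCK REPAIR TABLE newTab")   // or: ALTER TABLE newTab RECOVER PARTITIONS
scala> spark.table("newTab").show()            // the 5 rows show up after the repair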

This PR makes the behavior consistent with Spark 2.0, no matter whether spark.sql.hive.manageFilesourcePartitions is true or false: the table is repaired when it is created. After the change, the behavior is consistent with what we already do for CTAS of partitioned data source tables.
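
For comparison, CTAS on a partitioned data source table already registers its partitions at creation time, so a sketch like the following (ctasTab is a hypothetical table name) returns data right away:

scala> spark.sql("create table ctasTab using parquet partitioned by (partCol) as select id as fieldOne, id as partCol from range(5)")
scala> spark.table("ctasTab").show()   // all 5 rows are visible without a manual repair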

How was this patch tested?

Modified the existing test case.

}

-  test("when partition management is disabled, we preserve the old behavior even for new tables") {
+  test("When partition management is disabled, we preserve the old behavior even for new tables") {
@gatorsmile (Member, Author)

The old behavior is returning 5 rows.
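
A minimal sketch of the shape of that assertion (hypothetical; the real test uses this suite's own setup, and the usual Spark test helpers withSQLConf/withTable plus imports are assumed):

test("When partition management is disabled, we preserve the old behavior even for new tables") {
  withSQLConf("spark.sql.hive.manageFilesourcePartitions" -> "false") {
    withTable("test", "newTab") {
      spark.range(5).selectExpr("id as fieldOne", "id as partCol")
        .write.partitionBy("partCol").mode("overwrite").saveAsTable("test")
      // Point newTab at the files written for `test`.
      val loc = spark.sessionState.catalog
        .getTableMetadata(TableIdentifier("test")).storage.locationUri.get
      sql(s"create table newTab (fieldOne long, partCol int) using parquet " +
          s"options (path '$loc') partitioned by (partCol)")
      // Old behavior: the files under the path are visible without a repair.
      assert(spark.table("newTab").count() === 5)
    }
  }
}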

@gatorsmile (Member, Author)

cc @ericl @cloud-fan @mallman

@SparkQA commented Dec 18, 2016

Test build #70315 has finished for PR 16326 at commit 3942c4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member, Author) commented Dec 18, 2016

If we want to make it consistent with managed partitioned Hive serde tables, the existing behavior still does not match.

scala> spark.sql(s"create table newTab (fieldOne long, partCol int) using parquet options (path 'file:/Users/xiaoli/sparkBin/spark-2.1.1-SNAPSHOT-bin-hadoop2.7/bin/spark-warehouse/test') partitioned by (partCol)")
res3: org.apache.spark.sql.DataFrame = []

scala> spark.table("newTab").show()
+--------+-------+
|fieldOne|partCol|
+--------+-------+
+--------+-------+

scala> spark.sql("insert into newTab values (213, 0)")
16/12/17 23:39:18 WARN log: Updating partition stats fast for: newtab
16/12/17 23:39:18 WARN log: Updated size to 766
res8: org.apache.spark.sql.DataFrame = []

scala> spark.table("newTab").show()
+--------+-------+
|fieldOne|partCol|
+--------+-------+
|     213|      0|
|       0|      0|
+--------+-------+

For a managed partitioned Hive serde table, the output should not contain the previous value. That is, it should output something like:

+--------+-------+
|fieldOne|partCol|
+--------+-------+
|     213|      0|
+--------+-------+

@gatorsmile (Member, Author) commented Dec 18, 2016

Instead of appending new rows, Hive will overwrite the previous files in the specified location, even when we use INSERT INTO. See the output:

hive> create table test(c1 string) partitioned by (p1 string);
hive> desc formatted test;
...
Location:           	file:/user/hive/warehouse/test	 
Table Type:         	MANAGED_TABLE       	 
...
hive> create table newTab (c1 string) partitioned by (p1 string) location '/user/hive/warehouse/test';
hive> insert overwrite table test partition (p1='a') select 'bla';
hive> select * from test;
OK
bla	a
Time taken: 0.097 seconds, Fetched: 1 row(s)
hive> select * from newTab;
OK
Time taken: 0.076 seconds
hive> insert into table newTab partition (p1='a') select 'c';
Moving data to directory file:/user/hive/warehouse/test/p1=a/.hive-staging_hive_2016-12-18_07-47-16_045_3942279594280072795-1/-ext-10000
Loading data to table default.newtab partition (p1=a)
hive> select * from newTab;
OK
c	a
Time taken: 0.077 seconds, Fetched: 1 row(s)
hive> select * from test;
OK
c	a
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive> insert into table newTab partition (p1='a') select 'ddd';
Moving data to directory file:/user/hive/warehouse/test/p1=a/.hive-staging_hive_2016-12-18_07-52-34_906_169097802553229329-1/-ext-10000
Loading data to table default.newtab partition (p1=a)
hive> select * from newTab;
OK
ddd	a
c	a
Time taken: 0.052 seconds, Fetched: 2 row(s)
hive> select * from test;
OK
ddd	a
c	a

@ericl (Contributor) commented Dec 19, 2016

hive> select * from test;
OK
ddd a
c a

Isn't this showing that Hive is appending to the table (ddd, a), as expected with INSERT INTO?

scala> spark.sql("insert into newTab values (213, 0)")

For the (213, 0) example, is that just a bug?

@ericl (Contributor) commented Dec 19, 2016

Oh I see, you're saying if there are old files for the partition, the INSERT INTO will cause those to become visible. I agree this is confusing.
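
For reference, a compact restatement of the mechanism (a sketch, assuming it is the partition registration performed by the insert that exposes the old files):

scala> spark.table("newTab").show()                     // empty: partition partCol=0 is not registered yet
scala> spark.sql("insert into newTab values (213, 0)")  // registers partition partCol=0 in the catalog
scala> spark.table("newTab").show()                     // returns the new row plus the pre-existing rows under partCol=0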

@gatorsmile (Member, Author)

Based on the discussion in #15983, we do not plan to add automatic table repairing. Let me close this for now.

gatorsmile closed this on Dec 20, 2016
@gatorsmile (Member, Author)

We really need to improve the documentation, I think.
