Conversation

@gatorsmile (Member)

What changes were proposed in this pull request?

In Spark 2.1 (where spark.sql.hive.manageFilesourcePartitions defaults to true), if we create a partitioned data source table over a specified path, querying it returns nothing. To get the data, we have to manually issue a DDL to repair the table.

In Spark 2.0, it returns the data stored in the specified path without repairing the table. In Spark 2.1, setting spark.sql.hive.manageFilesourcePartitions to false restores the Spark 2.0 behavior.
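
For context, the flag can also be turned off when launching the shell (a sketch; any of the usual conf mechanisms works):

$ spark-shell --conf spark.sql.hive.manageFilesourcePartitions=false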

Below is the output of Spark 2.1.

scala> spark.range(5).selectExpr("id as fieldOne", "id as partCol").write.partitionBy("partCol").mode("overwrite").saveAsTable("test")
                                                                                
scala> spark.sql("desc formatted test").show(50, false)
+----------------------------+----------------------------------------------------------------------+-------+
|col_name                    |data_type                                                             |comment|
+----------------------------+----------------------------------------------------------------------+-------+
...
|Location:                   |file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse/test|       |
|Table Type:                 |MANAGED                                                               |       |
...
|Partition Provider:         |Catalog                                                               |       |
+----------------------------+----------------------------------------------------------------------+-------+


scala> spark.sql(s"create table newTab (fieldOne long, partCol int) using parquet options (path 'file:/Users/xiaoli/IdeaProjects/sparkDelivery/bin/spark-warehouse/test') partitioned by (partCol)")
res3: org.apache.spark.sql.DataFrame = []

scala> spark.table("newTab").show()
+--------+-------+
|fieldOne|partCol|
+--------+-------+
+--------+-------+
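
As a manual workaround, the table can be repaired after creation (a sketch; both forms should be equivalent in Spark 2.1):

scala> spark.sql("MSCK REPAIR TABLE newTab")   // or: ALTER TABLE newTab RECOVER PARTITIONS
scala> spark.table("newTab").show()            // the 5 rows show up after the repair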

This PR makes the behavior consistent with Spark 2.0, no matter whether spark.sql.hive.manageFilesourcePartitions is true or false: the table is repaired when it is created. After the change, the behavior is consistent with what we already do for CTAS of partitioned data source tables.
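
For comparison, CTAS on a partitioned data source table already registers its partitions at creation time, so a sketch like the following (ctasTab is a hypothetical table name) returns data right away:

scala> spark.sql("create table ctasTab using parquet partitioned by (partCol) as select id as fieldOne, id as partCol from range(5)")
scala> spark.table("ctasTab").show()   // all 5 rows are visible without a manual repair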

How was this patch tested?

Modified the existing test case.

}

-  test("when partition management is disabled, we preserve the old behavior even for new tables") {
+  test("When partition management is disabled, we preserve the old behavior even for new tables") {
@gatorsmile (Member, Author)

The old behavior is returning 5 rows.
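
A minimal sketch of the shape of that assertion (hypothetical; the real test uses this suite's own setup, and the usual Spark test helpers withSQLConf/withTable plus imports are assumed):

test("When partition management is disabled, we preserve the old behavior even for new tables") {
  withSQLConf("spark.sql.hive.manageFilesourcePartitions" -> "false") {
    withTable("test", "newTab") {
      spark.range(5).selectExpr("id as fieldOne", "id as partCol")
        .write.partitionBy("partCol").mode("overwrite").saveAsTable("test")
      // Point newTab at the files written for `test`.
      val loc = spark.sessionState.catalog
        .getTableMetadata(TableIdentifier("test")).storage.locationUri.get
      sql(s"create table newTab (fieldOne long, partCol int) using parquet " +
          s"options (path '$loc') partitioned by (partCol)")
      // Old behavior: the files under the path are visible without a repair.
      assert(spark.table("newTab").count() === 5)
    }
  }
}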

@gatorsmile (Member, Author)

cc @ericl @cloud-fan @mallman

@SparkQA commented Dec 18, 2016

Test build #70315 has finished for PR 16326 at commit 3942c4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member, Author) commented Dec 18, 2016

If we want to make it consistent with managed partitioned Hive serde tables, the existing behavior still does not match.

scala> spark.sql(s"create table newTab (fieldOne long, partCol int) using parquet options (path 'file:/Users/xiaoli/sparkBin/spark-2.1.1-SNAPSHOT-bin-hadoop2.7/bin/spark-warehouse/test') partitioned by (partCol)")
res3: org.apache.spark.sql.DataFrame = []

scala> spark.table("newTab").show()
+--------+-------+
|fieldOne|partCol|
+--------+-------+
+--------+-------+

scala> spark.sql("insert into newTab values (213, 0)")
16/12/17 23:39:18 WARN log: Updating partition stats fast for: newtab
16/12/17 23:39:18 WARN log: Updated size to 766
res8: org.apache.spark.sql.DataFrame = []

scala> spark.table("newTab").show()
+--------+-------+
|fieldOne|partCol|
+--------+-------+
|     213|      0|
|       0|      0|
+--------+-------+

For a managed partitioned Hive serde table, the output should not contain the previous value. That is, it should output something like:

+--------+-------+
|fieldOne|partCol|
+--------+-------+
|     213|      0|
+--------+-------+

@gatorsmile (Member, Author) commented Dec 18, 2016

Instead of appending new rows, Hive will overwrite the previous files in the specified location, even when we use INSERT INTO. See the output:

hive> create table test(c1 string) partitioned by (p1 string);
hive> desc formatted test;
...
Location:           	file:/user/hive/warehouse/test	 
Table Type:         	MANAGED_TABLE       	 
...
hive> create table newTab (c1 string) partitioned by (p1 string) location '/user/hive/warehouse/test';
hive> insert overwrite table test partition (p1='a') select 'bla';
hive> select * from test;
OK
bla	a
Time taken: 0.097 seconds, Fetched: 1 row(s)
hive> select * from newTab;
OK
Time taken: 0.076 seconds
hive> insert into table newTab partition (p1='a') select 'c';
Moving data to directory file:/user/hive/warehouse/test/p1=a/.hive-staging_hive_2016-12-18_07-47-16_045_3942279594280072795-1/-ext-10000
Loading data to table default.newtab partition (p1=a)
hive> select * from newTab;
OK
c	a
Time taken: 0.077 seconds, Fetched: 1 row(s)
hive> select * from test;
OK
c	a
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive> insert into table newTab partition (p1='a') select 'ddd';
Moving data to directory file:/user/hive/warehouse/test/p1=a/.hive-staging_hive_2016-12-18_07-52-34_906_169097802553229329-1/-ext-10000
Loading data to table default.newtab partition (p1=a)
hive> select * from newTab;
OK
ddd	a
c	a
Time taken: 0.052 seconds, Fetched: 2 row(s)
hive> select * from test;
OK
ddd	a
c	a

@ericl (Contributor) commented Dec 19, 2016

hive> select * from test;
OK
ddd a
c a

Isn't this showing that Hive is appending to the table (ddd, a), as expected with INSERT INTO?

scala> spark.sql("insert into newTab values (213, 0)")

For the (213, 0) example, is that just a bug?

@ericl (Contributor) commented Dec 19, 2016

Oh I see, you're saying if there are old files for the partition, the INSERT INTO will cause those to become visible. I agree this is confusing.
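
For reference, a compact restatement of the mechanism (a sketch, assuming it is the partition registration performed by the insert that exposes the old files):

scala> spark.table("newTab").show()                     // empty: partition partCol=0 is not registered yet
scala> spark.sql("insert into newTab values (213, 0)")  // registers partition partCol=0 in the catalog
scala> spark.table("newTab").show()                     // returns the new row plus the pre-existing rows under partCol=0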

@gatorsmile (Member, Author)

Based on the discussion in #15983, we do not plan to add automatic table repairing. Let me close this for now.

gatorsmile closed this on Dec 20, 2016
@gatorsmile (Member, Author)

We really need to improve the documentation, I think.
