[SPARK-18915] [SQL] Automatic Table Repair when Creating a Partitioned Data Source Table with a Specified Path #16326
Conversation
```diff
  }

- test("when partition management is disabled, we preserve the old behavior even for new tables") {
+ test("When partition management is disabled, we preserve the old behavior even for new tables") {
```
The old behavior returns 5 rows.
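For context, a minimal sketch of what a test like this might assert, assuming a hypothetical helper `setupPartitionedDatasourceTable` that creates the table over a directory already holding files for five rows (the helper name and row count are assumptions based on the comment above, not the actual suite code):

```scala
// Sketch only: assumes a suite mixing in SQLTestUtils (for withSQLConf,
// withTempDir, withTable) and importing org.apache.spark.sql.internal.SQLConf.
test("when partition management is disabled, we preserve the old behavior even for new tables") {
  withSQLConf(SQLConf.HIVE_MANAGE_FILESOURCE_PARTITIONS.key -> "false") {
    withTempDir { dir =>
      withTable("test") {
        // Hypothetical helper: creates a partitioned data source table over
        // `dir`, which already contains files for 5 rows.
        setupPartitionedDatasourceTable("test", dir)
        // Old behavior: files at the location are visible immediately,
        // without any explicit table repair.
        assert(spark.sql("SELECT * FROM test").count() == 5)
      }
    }
  }
}
```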
Test build #70315 has finished for PR 16326 at commit
If we want to make it consistent with the managed partitioned Hive serde table, the existing behavior is still not the same. For a managed partitioned Hive serde table, the output should not contain the previous value; that is, it should output something like
Instead of appending the new rows, Hive will overwrite the previous files in the specified location, even if we are using `INSERT INTO`.
Isn't this showing that Hive is appending to the table (ddd, a) as expected with INSERT INTO?
For the (213, 0) example, is that just a bug?
Oh I see, you're saying if there are old files for the partition, the INSERT INTO will cause those to become visible. I agree this is confusing.
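A hedged sketch of that case, assuming a `SparkSession` named `spark` with Hive support; the table name `t` and location `/tmp/t_loc` are illustrative, not from the original discussion:

```scala
// Leave an "old" file behind for partition p=0 before the table exists.
spark.range(1).selectExpr("id", "0 AS p")
  .write.partitionBy("p").parquet("/tmp/t_loc")

// Create a partitioned data source table over that location.
spark.sql(
  """CREATE TABLE t (id BIGINT, p INT)
    |USING parquet
    |OPTIONS (path '/tmp/t_loc')
    |PARTITIONED BY (p)""".stripMargin)

// With partition management on, no partitions are registered yet.
spark.sql("SELECT * FROM t").show()  // 0 rows

// INSERT INTO registers partition p=0 in the metastore; as a side effect,
// the pre-existing file under p=0 also becomes visible alongside the new row,
// which is the confusing part.
spark.sql("INSERT INTO t VALUES (100, 0)")
spark.sql("SELECT * FROM t").show()  // the old row and the new (100, 0) row
```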
Based on the discussion in #15983, we do not plan to add automatic table repairing. Let me close it first.
We really need to improve the documentation, I think.
What changes were proposed in this pull request?
In Spark 2.1 (where `spark.sql.hive.manageFilesourcePartitions` defaults to `true`), if we create a partitioned data source table with a specified path, querying it returns nothing. To get the data, we have to manually issue a DDL command to repair the table. In Spark 2.0, the query returns the data stored in the specified path without repairing the table. In Spark 2.1, setting `spark.sql.hive.manageFilesourcePartitions` to `false` restores the Spark 2.0 behavior. The scenario is sketched below.
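A minimal sketch of the reproduction; the table name `part_table` and location `/tmp/part_table` are illustrative, and the `MSCK REPAIR TABLE` step at the end is the manual workaround this PR aims to make unnecessary:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-18915 repro")
  .enableHiveSupport()
  .getOrCreate()

// Write partitioned data directly to a path, with no table on top of it yet.
spark.range(5).selectExpr("id", "id % 2 AS p")
  .write.partitionBy("p").parquet("/tmp/part_table")

// Create a partitioned data source table pointing at that path.
spark.sql(
  """CREATE TABLE part_table (id BIGINT, p BIGINT)
    |USING parquet
    |OPTIONS (path '/tmp/part_table')
    |PARTITIONED BY (p)""".stripMargin)

// Spark 2.1 with spark.sql.hive.manageFilesourcePartitions=true (the default):
// returns 0 rows, because no partitions are registered in the metastore.
spark.sql("SELECT * FROM part_table").show()

// Manual workaround: repair the table, after which all 5 rows are returned.
spark.sql("MSCK REPAIR TABLE part_table")
spark.sql("SELECT * FROM part_table").show()
```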
This PR makes the behavior consistent with Spark 2.0, no matter whether `spark.sql.hive.manageFilesourcePartitions` is `true` or `false`: it repairs the table when creating such a table. After the change, the behavior becomes consistent with what we do for CTAS of partitioned data source tables.
How was this patch tested?
Modified the existing test case.