[SPARK-18183] [SPARK-18184] Fix INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables #15705
cc @cloud-fan @yhuai
Test build #67846 has finished for PR 15705 at commit
Test build #67848 has finished for PR 15705 at commit
Test build #67849 has finished for PR 15705 at commit
    val path = new Path(p.storage.locationUri.get)
    val fs = path.getFileSystem(hadoopConf)
    PartitionPath(
      p.toRow(partitionSchema), path.makeQualified(fs.getUri, fs.getWorkingDirectory))
Why this change? Doesn't `new Path` qualify the path string?
Apparently not. The unit test actually fails if you do that, since the path seems to be missing the file: prefix and we fail to find the files in the partition.
But we qualify it before writing to it here. Doesn't that work?
The issue is that the user can store arbitrary string paths with `ALTER TABLE ... PARTITION ... SET LOCATION`. Therefore, we must manually qualify the locations that come from the catalog, or else they might not match the paths read from the filesystem.
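To illustrate why a location read from the catalog needs qualifying, here is a minimal sketch that uses only `java.net.URI`. The `qualify` helper and its parameters are hypothetical; it is a simplified stand-in for the kind of work Hadoop's `Path.makeQualified` does, not its actual implementation.

```scala
import java.net.URI

object QualifyPathSketch {
  // Hypothetical helper (not part of Spark or Hadoop): fills in whatever
  // a user-supplied location string is missing so that it can be compared
  // with fully qualified paths listed from the filesystem.
  def qualify(raw: String, defaultFs: URI, workingDir: String): URI = {
    val uri = new URI(raw)
    // Resolve a relative path against the working directory.
    val absPath =
      if (uri.getPath.startsWith("/")) uri.getPath
      else workingDir.stripSuffix("/") + "/" + uri.getPath
    // Take scheme and authority from the default filesystem only when
    // the stored string has no scheme of its own.
    val scheme = Option(uri.getScheme).getOrElse(defaultFs.getScheme)
    val authority =
      if (uri.getScheme == null) defaultFs.getAuthority else uri.getAuthority
    new URI(scheme, authority, absPath, null, null)
  }
}
```

With a default filesystem of `file:///`, a stored location of `/tmp/t/p=1` becomes `file:/tmp/t/p=1`, which is the `file:` prefix mentioned earlier in this thread. A location that already carries a scheme is left unchanged.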
      "Cannot overwrite a path that is also being read from.")
    }

    val overwritePartitionPath = if (overwrite.specificPartition.isDefined &&
Can we just pass the partition path as `outputPath` to `InsertIntoHadoopFsRelationCommand` and set the partition columns to `Nil`? Then we don't need to add an extra parameter to `InsertIntoHadoopFsRelationCommand`.
Yeah, that seems a little cleaner.
Test build #67864 has finished for PR 15705 at commit
Test build #67911 has finished for PR 15705 at commit
    OverwriteOptions(
      overwrite,
      if (overwrite && partition.nonEmpty) {
        Some(partition.map(kv => (kv._1, kv._2.get)))
Do we need to consider dynamic partitions here?
I think you don't have to, since this is just the test suite.
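The `.get` in the snippet under review suggests that the partition spec maps each column name to an optional value, with a missing value standing for a dynamic partition column. Under that assumption, the sketch below shows why the question about dynamic partitions comes up. The object, the `Spec` alias, and both function names are hypothetical and are not taken from the PR.

```scala
object PartitionSpecSketch {
  // Assumption: Some(value) marks a static partition column,
  // None marks a dynamic one.
  type Spec = Map[String, Option[String]]

  // Same shape as the reviewed snippet: calling .get on a dynamic
  // column (None) throws NoSuchElementException.
  def staticOnlyUnsafe(spec: Spec): Map[String, String] =
    spec.map(kv => (kv._1, kv._2.get))

  // Hypothetical alternative: yield the static spec only when the
  // spec is non-empty and every column has a value.
  def staticOnly(spec: Spec): Option[Map[String, String]] =
    if (spec.nonEmpty && spec.values.forall(_.isDefined)) {
      Some(spec.map { case (k, v) => k -> v.get })
    } else {
      None
    }
}
```

In a test suite that only builds static specs, the unsafe form never sees a `None`, which is consistent with the reply above.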
LGTM, merging to master!
…ITION for Datasource tables

There are a couple issues with the current 2.1 behavior when inserting into Datasource tables with partitions managed by Hive.

(1) OVERWRITE TABLE ... PARTITION will actually overwrite the entire table instead of just the specified partition.
(2) INSERT|OVERWRITE does not work with partitions that have custom locations.

This PR fixes both of these issues for Datasource tables managed by Hive. The behavior for legacy tables or when `manageFilesourcePartitions = false` is unchanged.

There is one other issue in that INSERT OVERWRITE with dynamic partitions will overwrite the entire table instead of just the updated partitions, but this behavior is pretty complicated to implement for Datasource tables. We should address that in a future release.

Unit tests.

Author: Eric Liang <[email protected]>

Closes #15705 from ericl/sc-4942.

(cherry picked from commit abefe2e)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

`OverwriteOptions` was introduced in #15705, to carry the information of static partitions. However, after further refactor, this information becomes duplicated and we can remove `OverwriteOptions`.

## How was this patch tested?

N/A

Author: Wenchen Fan <[email protected]>

Closes #15995 from cloud-fan/overwrite.
What changes were proposed in this pull request?
There are a couple issues with the current 2.1 behavior when inserting into Datasource tables with partitions managed by Hive.
(1) OVERWRITE TABLE ... PARTITION will actually overwrite the entire table instead of just the specified partition.
(2) INSERT|OVERWRITE does not work with partitions that have custom locations.
This PR fixes both of these issues for Datasource tables managed by Hive. The behavior for legacy tables or when `manageFilesourcePartitions = false` is unchanged.

There is one other issue in that INSERT OVERWRITE with dynamic partitions will overwrite the entire table instead of just the updated partitions, but this behavior is pretty complicated to implement for Datasource tables. We should address that in a future release.
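As a rough illustration of issue (1), the toy in-memory model below contrasts the reported behavior with the intended one. It is only a sketch of the semantics: the table is modelled as a map from partition spec to rows, and every name in it is hypothetical. It does not reflect how Spark implements the write.

```scala
object OverwriteSemanticsSketch {
  // A partition spec such as Map("p" -> "1") identifies one partition.
  type PartitionSpec = Map[String, String]
  // Toy table: each partition holds a sequence of rows.
  type Table = Map[PartitionSpec, Seq[String]]

  // Reported behavior: the overwrite discards every existing partition
  // and leaves only the one that was written.
  def overwriteWholeTable(
      table: Table, target: PartitionSpec, rows: Seq[String]): Table =
    Map(target -> rows)

  // Intended behavior: only the named partition is replaced and all
  // other partitions are kept as they were.
  def overwriteStaticPartition(
      table: Table, target: PartitionSpec, rows: Seq[String]): Table =
    table.updated(target, rows)
}
```

For a table with partitions `p=1` and `p=2`, overwriting `p=1` with the second function leaves the rows in `p=2` untouched, while the first function drops them.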
How was this patch tested?
Unit tests.