-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelectCommand #15996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The first commit is from another PR and you can ignore it. Do you think we should target this ticket to 2.1? It's kind of a refactor but do fix some problems. |
|
Test build #69084 has finished for PR 15996 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good change, I like it! Now it is identical to the interface CreateHiveTableAsSelectCommand. Maybe we can copy the params here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This todo is still valid?
|
A general issue: after we moving the execution of |
|
@gatorsmile , can you point out which verification we need to add back from |
|
Test build #70026 has finished for PR 15996 at commit
|
|
Test build #70061 has started for PR 15996 at commit |
|
retest this please |
|
Test build #70074 has finished for PR 15996 at commit
|
| throw new AnalysisException( | ||
| s"The column number of the existing schema[$existingSchema] " + | ||
| s"doesn't match the data schema[${df.logicalPlan.schema}]") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uh, this fixes a bug. Before this PR, we only check the size when the target is the LogicalRelation.
| // Because we are inserting into an existing table, we should respect the existing | ||
| // schema and adjust columns order of the given dataframe according to it. | ||
| df.select(existingSchema.map(f => Column(f.name)): _*) | ||
| .write.insertInto(tableIdentWithDB) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it's ok to analyze twice, but not analyze an optimized plan, let me look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I made the mistake here. I deleted the comment after I realized it.
assertNotBucketed("insertInto") is missing here. This is an existing bug, right?
1efb892 to
6c64007
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 2 checkings are newly added, previously we ignore the user specified partition columns and bucket silently, now we will log a warning message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted #15983 here because it's not needed anymore after this refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which part of that pr is reverted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of it, except the test
|
Test build #70187 has finished for PR 15996 at commit
|
|
Test build #70188 has finished for PR 15996 at commit
|
|
Test build #70305 has finished for PR 15996 at commit
|
|
Could we update the PR description and add the test case in |
28f88ef to
97dc307
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this change, we always go to createRelation, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently only 2 data sources accept saveAsTable with append mode: CreatableRelationProvider and FileFormat. For CreatableRelationProvider, we always go to createRelation, for FileFormat, we go to InsertIntoHadoopFsRelation, which is same with InsertIntoTable. That's why I add the if-else here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why we always use SaveMode.Overwrite at here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are creating a new table, the data dir is empty, ideally we can use whatever mode. Maybe use ErrorIfExists is safer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we do not need this anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, we are checking the number of rows before the msck, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also explain why we only see newly written partitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be consistent with the behavior of InsertItoTable. I'll add that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be good to also explain the reason that we use (3, 13) in the comment.
|
Test build #70532 has finished for PR 15996 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last comment. Let's explicitly say that we want to test the case that a data source is a CreatableRelationProvider but its relation does not implement InsertableRelation.
|
LGTM pending jenkins. Can you update the comment to address my last comment (#15996 (comment))? |
|
ah 9a1ad71 failed. |
|
Test build #70633 has finished for PR 15996 at commit
|
|
Test build #70634 has finished for PR 15996 at commit
|
|
Test build #70637 has finished for PR 15996 at commit
|
|
Test build #70654 has finished for PR 15996 at commit
|
| tableLocation: Option[String], | ||
| data: LogicalPlan, | ||
| mode: SaveMode): BaseRelation = { | ||
| // Create the relation based on the data of df. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Need an update in the comment.
| EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match { | ||
| // Only do the check if the table is a data source table (the relation is a BaseRelation). | ||
| case LogicalRelation(dest: BaseRelation, _, _) => | ||
| if (srcRelations.contains(dest)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
case LogicalRelation(dest: BaseRelation, _, _) if srcRelations.contains(dest) =>
throw new AnalysisException(...| case SaveMode.Append => | ||
| val existingTable = sessionState.catalog.getTableMetadata(tableIdentWithDB) | ||
| val result = if (sessionState.catalog.tableExists(tableIdentWithDB)) { | ||
| assert(mode != SaveMode.Overwrite, "analyzer will drop the table to overwrite it.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about s"Expect the table $tableName has been dropped when the save mode is Overwrite"?
|
LGTM except three minor comments |
|
Test build #70672 has finished for PR 15996 at commit
|
|
LGTM |
|
LGTM. Merging to master. |
## What changes were proposed in this pull request? The `CreateDataSourceTableAsSelectCommand` is quite complex now, as it has a lot of work to do if the table already exists: 1. throw exception if we don't want to ignore it. 2. do some check and adjust the schema if we want to append data. 3. drop the table and create it again if we want to overwrite. The work 2 and 3 should be done by analyzer, so that we can also apply it to hive tables. ## How was this patch tested? existing tests. Author: Wenchen Fan <[email protected]> Closes apache#15996 from cloud-fan/append.
## What changes were proposed in this pull request? The `CreateDataSourceTableAsSelectCommand` is quite complex now, as it has a lot of work to do if the table already exists: 1. throw exception if we don't want to ignore it. 2. do some check and adjust the schema if we want to append data. 3. drop the table and create it again if we want to overwrite. The work 2 and 3 should be done by analyzer, so that we can also apply it to hive tables. ## How was this patch tested? existing tests. Author: Wenchen Fan <[email protected]> Closes apache#15996 from cloud-fan/append.
What changes were proposed in this pull request?
The
CreateDataSourceTableAsSelectCommandis quite complex now, as it has a lot of work to do if the table already exists:The work 2 and 3 should be done by analyzer, so that we can also apply it to hive tables.
How was this patch tested?
existing tests.