-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-14922][SPARK-17732][SQL]ALTER TABLE DROP PARTITION should support comparators #19691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for pinging me, @gatorsmile . |
|
ok to test |
|
Test build #83828 has finished for PR 19691 at commit
|
|
Test build #83831 has finished for PR 19691 at commit
|
|
Test build #83832 has finished for PR 19691 at commit
|
|
Jenkins, retest this please |
|
Test build #83838 has finished for PR 19691 at commit
|
|
Test build #83839 has finished for PR 19691 at commit
|
|
@gatorsmile @dongjoon-hyun |
|
ok to test |
| expression(pVal) match { | ||
| case EqualNullSafe(_, _) => | ||
| throw new ParseException("'<=>' operator is not allowed in partition specification.", ctx) | ||
| case cmp @ BinaryComparison(UnresolvedAttribute(name :: Nil), constant: Literal) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still the same question here. Constant has to be in the right side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hive supports them only on the right side. So it makes sense to have the same here I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we support the right-side only, it seems be useful to print explicit error messages like left-side literal not supported ....?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can also enforce this is the syntax, like here: https://github.com/apache/spark/pull/20999/files#diff-8c1cb2af4aa1109e08481dae79052cc3R269
|
@dongjoon-hyun @maropu @mgaido91 Could you review this PR? I think this command is a pretty useful to end users. |
|
Test build #89023 has finished for PR 19691 at commit
|
|
retest this please |
|
ok |
| throw new ParseException("Invalid partition filter specification", ctx) | ||
| } | ||
| } | ||
| if(parts.isEmpty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't be better to return the Seq[Expression] as it is? Later we need it like that (in listPartitionsByFilter ) and in this way we can avoid using null which is a good thing too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why aren't we returning parts? this if seems pretty useless
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right. I will change this.
| } | ||
| }.distinct | ||
|
|
||
| if (normalizedSpecs.isEmpty && partitionSet.isEmpty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can,t we just return partitionSet ++ normalizedSpecs ? I think it is wrong to use intersect, we should drop all of them, shouldn't we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 I tried this command in hive. And hive only dropped the intersection of two partition filter.
| case EqualNullSafe(_, _) => | ||
| throw new ParseException("'<=>' operator is not allowed in partition specification.", ctx) | ||
| case cmp @ BinaryComparison(UnresolvedAttribute(name :: Nil), constant: Literal) => | ||
| cmp.withNewChildren(Seq(AttributeReference(name, StringType)(), constant)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok to pass all the type of literals here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way, we might need tests for non int-literal cases.
| case EqualNullSafe(_, _) => | ||
| throw new ParseException("'<=>' operator is not allowed in partition specification.", ctx) | ||
| case cmp @ BinaryComparison(UnresolvedAttribute(name :: Nil), constant: Literal) => | ||
| cmp.withNewChildren(Seq(AttributeReference(name, StringType)(), constant)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the partition column is not of String type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll work on this these days.
|
Test build #89029 has finished for PR 19691 at commit
|
|
@DazhuangSu are you still working on this? |
|
@mgaido91 Sorry, a little busy recently. |
|
thanks @DazhuangSu |
|
Test build #91308 has finished for PR 19691 at commit
|
|
Test build #91352 has finished for PR 19691 at commit
|
|
Test build #91393 has finished for PR 19691 at commit
|
|
Test build #91473 has finished for PR 19691 at commit
|
| } | ||
| val dataType = table.partitionSchema.apply(attrName).dataType | ||
| expr.withNewChildren(Seq(AttributeReference(attrName, dataType)(), | ||
| Cast(constant, dataType))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we add the cast only when needed, ie. dataType != constant.dataType?
| extractFromPartitionSpec(partition._1, table, resolver) | ||
| } else if (!partition._1.isEmpty && !partition._2.isEmpty) { | ||
| // This drop condition has both partitionSpecs and expressions. | ||
| extractFromPartitionFilter(partition._2, catalog, table, resolver).intersect( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may be quite inefficient if we have a lot if partitions. What about converting the partitionSpec is EqualsTo expressions and add them as conditions? It would be great IMO if we can achieve this by enforcing in the syntax that we have either all partitionSpecs or all expressions. So if we have all partition = value, we have a partitionSpec, while if at least one is a comparison different from =, we have all expressions (including the =s). What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree. And the hard part may be how to convert a partitionSpec to an EqualsTo.
I think it's better to let the AstBuilder to handle this. If so, we may have to have two AlterTableDropPartitionCommand instances in ddl.scala, one for all partitionSpec and one for all expression.
But it maybe a bit weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why? Isn't it enough something like:
((partitionVal (',' partitionVal)*) | (expression (',' expression)*))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean how to define AlterTableDropPartitionCommand better in ddl.scala. need to handle both
AlterTableDropPartitionCommand( tableName: TableIdentifier, partitions: Seq[Seq[Expression]], ifExists: Boolean, purge: Boolean, retainData: Boolean)
and
AlterTableDropPartitionCommand( tableName: TableIdentifier, partitions: Seq[TablePartitionSpec], ifExists: Boolean, purge: Boolean, retainData: Boolean)
Maybe telling the different cases inside the method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can (must) just have a single: AlterTableDropPartitionCommand( tableName: TableIdentifier, partitionSpecs: Seq[TablePartitionSpec], partitionExprs: Seq[Seq[Expression]], ifExists: Boolean, purge: Boolean, retainData: Boolean). Indeed, we might have something like:
alter table foo drop partition (year=2017, month=12), partition(year=2018, month < 3);
where we have both a partition spec and an expression specification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi, @mgaido91 there is one problem after I changed the syntax,
when i run sql DROP PARTITION (p >=2) it throws
org.apache.spark.sql.AnalysisException: cannot resolve 'p' given input columns: []
I'm trying to find a way to figure it out.
By the way, is a syntax like ((partitionVal (',' partitionVal)*) | (expression (',' expression)*)) legal? Because I wrote a antlr4 syntax test, but it didn't work as I supposed.
Besides, I was wrong that day. I think the if conditions won't be inefficient if there is a lot of partitions. it maybe inefficient if there are a lot of dropPartitionSpec which I don't think can happen easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DazhuangSu sorry I missed your last comment somehow.
Why do you say it would not be inefficient if you have a lot of partitions?I think it would be! Imagine that you partition per year and day. And you want to get the first 6 months of this year. The spec would be something like (year = 2018, day < 2018-07-01). Imagine we have a 10 years history. With the current implementation, we would get back basically all the the partitions from the filter, ie. roughly 3.650 and then it will intersect those. Anyway, my understanding is that such a case would not even work properly, as it would try drop the intersect of:
Seq(Seq("year"-> "2018", "day" -> "2018-01-01", ...)).intersect(Seq(Map("year"->"2018")))
which would result in an empty Seq, so we would drop nothing. Moreover, I saw no test for this case in the tests. Can we add tests for this use case and can we add support for it if my understanding that it is not working is right? Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 I understand your point, yes it would be inefficient. I will work on this soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you @DazhuangSu
|
ok to test |
|
Test build #93052 has finished for PR 19691 at commit
|
|
Could someone merge this please ? :) |
|
@DazhuangSu Can you resolve the conflict? |
|
@maropu ok |
|
@HyukjinKwon can you trigger again? |
|
@DazhuangSu are you still working on this? There is this comment and also another nit which need to be addressed from the last review... Meanwhile I am not sure if someone else has other comments on this. |
|
ok to test |
|
Could anyone take over this then? |
|
@DazhuangSu Are u there? |
|
Test build #95451 has finished for PR 19691 at commit
|
|
if @DazhuangSu is not active anymore on this I can take it over, but let's wait for his answer. |
|
@mgaido91 |
|
@DazhuangSu still busy? |
|
@maropu |
|
ok @mgaido91 can u take this over? |
|
@DazhuangSu @maropu sure, thanks, I'll submit a PR for this soon. Thanks. |
Closes apache#21766 Closes apache#21679 Closes apache#21161 Closes apache#20846 Closes apache#19434 Closes apache#18080 Closes apache#17648 Closes apache#17169 Add: Closes apache#22813 Closes apache#21994 Closes apache#22005 Closes apache#22463 Add: Closes apache#15899 Add: Closes apache#22539 Closes apache#21868 Closes apache#21514 Closes apache#21402 Closes apache#21322 Closes apache#21257 Closes apache#20163 Closes apache#19691 Closes apache#18697 Closes apache#18636 Closes apache#17176 Closes apache#23001 from wangyum/CloseStalePRs. Authored-by: Yuming Wang <[email protected]> Signed-off-by: hyukjinkwon <[email protected]>
What changes were proposed in this pull request?
This pr is inspired by @dongjoon-hyun.
quote from #15704 :
#16036 points out that if we use int literal in DROP PARTITION will fail after patching #15704.
The reason of this failing in #15704 is that AlterTableDropPartitionCommand tells BinayComparison and EqualTo with following code:
private def isRangeComparison(expr: Expression): Boolean = {expr.find(e => e.isInstanceOf[BinaryComparison] && !e.isInstanceOf[EqualTo]).isDefined }This PR resolve this problem by telling a drop condition when parsing sqls.
How was this patch tested?
New testcase introduced from #15704