Conversation

@aokolnychyi (Contributor) commented Dec 1, 2020

What changes were proposed in this pull request?

This PR provides us with a way to check if a data source is going to reject the delete via deleteWhere at planning time.

Why are the changes needed?

The only way to support delete statements right now is to implement SupportsDelete. According to its Javadoc, that interface is meant for cases where we can delete data without much effort (e.g., deleting a complete partition in a Hive table).

This PR gives us a way to check, at planning time, whether a data source is going to reject a delete via deleteWhere, instead of only finding out through an exception during execution. In the future, we can use this to decide whether Spark should rewrite the delete and execute a distributed query, or simply pass a set of filters down to the source.

Consider an example of a partitioned Hive table. If we have a delete predicate like part_col = '2020', we can just drop the matching partition to satisfy the delete. In this case, the data source should return true from canDeleteWhere and use the filters it accepts in deleteWhere to drop the partition. I consider this a delete without significant effort. At the same time, if we have a delete predicate like id = 10, a Hive table would not be able to execute the delete as a metadata-only operation without rewriting files. In that case, the data source should return false from canDeleteWhere and we should use a more sophisticated row-level API to find out which records should be removed (that API is yet to be discussed, but we need this PR as a basis).

If we decide to support subqueries and all delete use cases by simply extending the existing API, all data sources will have to implement a lot of Spark logic to determine which records changed. I don't think we want to go that way, as the Spark logic to determine which records should be deleted is independent of the underlying data source. So the assumption is that, for data sources that return false from canDeleteWhere, Spark will execute a plan to find which records must be deleted.
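
To make the distinction concrete, here is a minimal sketch (not part of this PR) of how a partitioned source could implement the proposed API. PartitionedTable and dropPartition are hypothetical names, and a real connector would also implement Table and report the usual capabilities; only SupportsDelete, Filter, and EqualTo are existing Spark classes.

import org.apache.spark.sql.connector.catalog.SupportsDelete
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Hypothetical connector class for illustration only.
class PartitionedTable(partitionColumn: String) extends SupportsDelete {

  // Cheap planning-time check: the delete is "easy" only if every filter is an
  // equality predicate on the partition column.
  override def canDeleteWhere(filters: Array[Filter]): Boolean =
    filters.forall {
      case EqualTo(attr, _) => attr == partitionColumn
      case _                => false
    }

  // Metadata-only delete: drop the partitions matched by the accepted filters.
  override def deleteWhere(filters: Array[Filter]): Unit =
    filters.foreach {
      case EqualTo(_, value) => dropPartition(value)
      case other => throw new IllegalArgumentException(s"Cannot delete by filter: $other")
    }

  // Hypothetical helper standing in for the source's partition-drop logic.
  private def dropPartition(value: Any): Unit = ()
}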

Does this PR introduce any user-facing change?

Yes, but it is backward compatible.

How was this patch tested?

This PR comes with a new test.

@aokolnychyi (Contributor Author)

* Rows should be deleted from the data source iff all of the filter expressions match.
* That is, the expressions must be interpreted as a set of filters that are ANDed together.
* <p>
* Spark will call this method to check if the delete is possible without significant effort.
@dongjoon-hyun (Member), Dec 1, 2020

Although there is an explanation about this, "without significant effort" looks like a misleading assumption here. Some data sources may implement this themselves with significant effort for their own reasons. Can we omit this?

Contributor

This comes from the documentation for deleteWhere:

Implementations may reject a delete operation if the delete isn't possible without significant effort. For example, . . .

I think that phrasing is a bit more clear because it uses "implementations may reject", so what constitutes "significant effort" is determined by the implementation.

I think a clearer way to say it here is to refer to that standard: "Spark will call this method to check whether deleteWhere would reject the delete operation because it requires significant effort."

It would also help to have more context: this is for some sources to determine whether or not a metadata delete can be performed.

Member

I think a clearer way to say it here is to refer to that standard: "Spark will call this method to check whether deleteWhere would reject the delete operation because it requires significant effort."

This sounds better, as there is a shared standard between canDeleteWhere and deleteWhere.

So canDeleteWhere is a much more lightweight way to know whether deleteWhere will reject a delete operation, without actually calling deleteWhere.

@aokolnychyi (Contributor Author), Dec 2, 2020

I agree with you, @dongjoon-hyun @rdblue. I'll update the comment.

@aokolnychyi (Contributor Author), Dec 2, 2020

So canDeleteWhere is a much more lightweight way to know whether deleteWhere will reject a delete operation, without actually calling deleteWhere.

Yes, @viirya. It gives us a way to check, at planning time, whether a data source is going to reject a delete via deleteWhere, instead of only finding out through an exception during execution. In the future, we can use this to decide whether Spark should rewrite the delete and execute a distributed query, or simply pass a set of filters down to the source.

Consider an example of a partitioned Hive table. If we have a delete predicate like part_col = '2020', we can just drop the matching partition to satisfy the delete. In this case, the data source should return true from canDeleteWhere and use the filters it accepts in deleteWhere to drop the partition. I consider this a delete without significant effort. At the same time, if we have a delete predicate like id = 10, a Hive table would not be able to execute the delete as a metadata-only operation without rewriting files. In that case, the data source should return false from canDeleteWhere and we should use a more sophisticated row-level API to find out which records should be removed (that API is yet to be discussed, but we need this PR as a basis).

If we decide to support subqueries and all delete use cases by simply extending the existing API, all data sources will have to implement a lot of Spark logic to determine which records changed. I don't think we want to go that way, as the Spark logic to determine which records should be deleted is independent of the underlying data source. So the assumption is that, for data sources that return false from canDeleteWhere, Spark will execute a plan to find which records must be deleted.

Contributor

My worry is that we may change the API back and forth if we don't have a clear big picture. Right now this patch is not useful as it only changes where we throw the exception, but I can see that this will be useful when we have the row-level delete API and we can use the canDeleteWhere to decide if we want to use the row-level API or not.

This is exactly the reason for adding this API. It is a step toward rewriting plans for row-level DELETE and MERGE operations. The current deleteWhere exception approach happens while running the physical plan, when it is too late to rewrite the plan for row-level changes. Adding the canDeleteWhere check fixes that problem.

Since you can easily see how it will be used, what is the concern about adding this?

Contributor

My concern is not about this PR itself, but about the order of commits. It seems more natural to have the row-level delete API first, and then do this change. At that time we can have tests to verify if we can switch correctly.

If you are working with @aokolnychyi on this feature, and you two think this is better for your development, please go ahead and merge it.

Contributor

Thanks, we do think that it is helpful to have it in this order. This was an early problem that we ran into and the solution is clear.

Member

Can you guys just file related JIRAs for that before merging? That's all I asked.

Contributor Author

Just to update this thread, JIRAs were created. Here is a summary.

* @return true if the delete operation can be performed
*/
default boolean canDeleteWhere(Filter[] filters) {
  return true;
}
Member

Shall we have false as a safer default?

Member

I'm wondering if there is a breaking change, which is why we should have true here?

Contributor

Unfortunately, this would change the assumptions for existing implementations. Right now, if this interface is implemented, Spark will call deleteWhere for the delete. Returning false by default would cause Spark to skip deleteWhere wherever the return value of this method is used.

The original idea was to try to delete using deleteWhere and, if that failed, to run a more expensive delete. But when we started implementing the more expensive delete, we needed to know during job planning, not job execution, whether the metadata-only delete can be done. This method solves that problem.
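
As a hedged sketch of that backward-compatibility argument: an existing implementation that predates canDeleteWhere keeps its old behavior because the default returns true. The LegacyTable class below is hypothetical.

import org.apache.spark.sql.connector.catalog.SupportsDelete
import org.apache.spark.sql.sources.Filter

// Hypothetical pre-existing connector that only implements deleteWhere.
class LegacyTable extends SupportsDelete {
  override def deleteWhere(filters: Array[Filter]): Unit = {
    // existing metadata-only delete logic goes here
  }
  // canDeleteWhere is not overridden, so the new default returns true and Spark
  // keeps calling deleteWhere exactly as it did before this change.
}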

Member

Got it, @rdblue .

Contributor Author

That's correct, and the method returns true by default to keep the old behavior.

@rdblue (Contributor) commented Dec 1, 2020

Looks good overall.

@github-actions bot added the SQL label Dec 1, 2020
default boolean canDeleteWhere(Filter[] filters) {
  return true;
}

Member

deleteWhere only possibly rejects a delete operation, even when the delete isn't possible without significant effort.

If canDeleteWhere returns false, does it mean deleteWhere will definitely reject the delete operation?

Contributor

I think that would be the case. In our use case, we can tell whether a delete is aligned with partitioning for this check. But, we can also scan through data to determine whether files themselves are fully matched (or not matched) by the filter. We would do the partitioning check here and the more expensive stats-based check in deleteWhere.
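
A hedged sketch of that two-level split follows. StatsAwareTable, isAlignedWithPartitioning, and filesFullyMatched are hypothetical; that deleteWhere rejects by throwing IllegalArgumentException is an assumption based on its "implementations may reject" wording.

import org.apache.spark.sql.connector.catalog.SupportsDelete
import org.apache.spark.sql.sources.Filter

// Hypothetical connector: a cheap check in canDeleteWhere, a more expensive
// stats-based check in deleteWhere.
class StatsAwareTable extends SupportsDelete {

  // Planning-time: only answer whether the delete is aligned with partitioning.
  override def canDeleteWhere(filters: Array[Filter]): Boolean =
    isAlignedWithPartitioning(filters)

  // Execution-time: consult file-level stats; reject if files are only partially matched.
  override def deleteWhere(filters: Array[Filter]): Unit = {
    if (!filesFullyMatched(filters)) {
      throw new IllegalArgumentException("Delete requires rewriting files")
    }
    // drop the fully matched files/partitions here
  }

  // Placeholder helpers; real logic would consult partition specs and file statistics.
  private def isAlignedWithPartitioning(filters: Array[Filter]): Boolean = true
  private def filesFullyMatched(filters: Array[Filter]): Boolean = true
}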

}).toArray

if (!table.asDeletable.canDeleteWhere(filters)) {
  throw new AnalysisException(
Member

Is this exception handled later? Is the rewrite part for row deletion a TBD?

Contributor

The rewrite would happen earlier. This just throws a good error message if deleteWhere will fail.

@aokolnychyi (Contributor Author), Dec 2, 2020

The rewrite part is yet to be done. This PR just adds a way to have more info at planning time. Specifically, we will know if the rewrite is needed.

Member

I see, thanks. So once the rewrite part is ready, this method will be called earlier, before the rewrite, is that right?

Contributor Author

It is going to be called at planning time to check if we should apply the rewrite or just pass filters down.
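
For illustration, a hedged sketch of what that planning-time decision could look like once the rewrite exists. DeleteStrategy, planDelete, and the rewrite path are hypothetical and not part of this PR; only canDeleteWhere is from the proposed API.

import org.apache.spark.sql.connector.catalog.SupportsDelete
import org.apache.spark.sql.sources.Filter

// Hypothetical planner-side decision, not an actual Spark rule.
object DeletePlanning {
  sealed trait DeleteStrategy
  case object MetadataOnlyDelete extends DeleteStrategy // pass the filters down to deleteWhere
  case object RowLevelRewrite extends DeleteStrategy    // rewrite into a distributed plan (future row-level API)

  def planDelete(table: SupportsDelete, filters: Array[Filter]): DeleteStrategy =
    if (table.canDeleteWhere(filters)) MetadataOnlyDelete else RowLevelRewrite
}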

@SparkQA commented Dec 2, 2020

Test build #132039 has finished for PR 30562 at commit 85ea2c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM. With @rdblue's and @aokolnychyi's explanation, the updated patch looks clear to me now. Thank you, @aokolnychyi and @rdblue.

Could you review once more, @holdenk @dbtsai @cloud-fan @viirya @sunchao ?

@dongjoon-hyun (Member)

Also, cc @gatorsmile. Please let us know if you have any comments.

@aokolnychyi (Contributor Author)

I've updated the doc per @sunchao's suggestion. Let me know if there are any other open questions.

@SparkQA commented Dec 2, 2020

Test build #132070 has finished for PR 30562 at commit 8b909f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao (Member) left a comment

Thanks @aokolnychyi! The updated doc looks great to me. I'm LGTM (non-binding) on this PR.

@aokolnychyi (Contributor Author)

The recent test failures do not seem related to this PR.

@HyukjinKwon (Member)

Hey, please don't ignore my opinion, and file related JIRAs (#30562 (comment)).

@dongjoon-hyun (Member)

Of course, he will, @HyukjinKwon. We are still under review. In addition to that, is this (#30562 (comment)) resolved?

@HyukjinKwon (Member)

Thanks @dongjoon-hyun. Yes. It's okay to edit the JIRAs later, and I understand the design doc is in progress. I just wanted to make sure we all know what to do after this.

@aokolnychyi (Contributor Author)

Filing JIRAs is something I will do for sure before this one is merged, @HyukjinKwon. I was referring to any open points related to this PR and whether we have enough consensus and everybody is ok with the change. If anybody still has unresolved concerns, let's discuss until we all agree.

@HyukjinKwon (Member)

I am okay with this PR. I have no unresolved concerns, except that I prefer to know and make sure we have a plan. The easiest way should be to file the relevant JIRAs.

@aokolnychyi (Contributor Author)

I've created SPARK-33642 as a parent JIRA and 3 subtasks for DELETE/UPDATE/MERGE. The parent one is where the design doc should be and where the discussion should happen.

I don't see any open points on this PR now but I propose we wait a bit more for additional feedback to make sure everybody is on the same page.

@dongjoon-hyun (Member)

Thank you all! I'll merge this for Apache Spark 3.1.
