
Support metadata DELETE in JDBC connectors #6287

Merged
kokosing merged 2 commits into trinodb:master from kokosing:origin/master/269_jdbc_delete
Jul 2, 2021

Conversation

@kokosing
Member

@kokosing kokosing commented Dec 10, 2020

Support metadata DELETE in JDBC connectors

Fixes #5275

@cla-bot cla-bot bot added the cla-signed label Dec 10, 2020
@kokosing kokosing added the WIP label Dec 10, 2020
@kokosing kokosing marked this pull request as draft December 10, 2020 14:54

@findepi
Member

findepi commented Dec 11, 2020

How does this interact with predicate pushdown?

@kokosing
Member Author

How does this interact with predicate pushdown?

It is based on predicate pushdown. If the predicate is fully pushed down, then we can issue the DELETE directly. Otherwise it fails to run the DELETE. This is called a metadata delete.
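The decision described here can be sketched in plain Java. This is a minimal, illustrative sketch only: the class name `MetadataDeleteSketch` and the `buildDeleteSql` signature are hypothetical, not the actual Trino JDBC connector code.

```java
import java.util.Optional;

// Illustrative sketch only: class and method names are hypothetical,
// not the actual Trino JDBC connector code.
class MetadataDeleteSketch
{
    // A metadata delete is only possible when the engine pushed the entire
    // WHERE clause down into the connector; otherwise the DELETE fails.
    static String buildDeleteSql(String table, Optional<String> pushedDownPredicate)
    {
        return pushedDownPredicate
                .map(predicate -> "DELETE FROM " + table + " WHERE " + predicate)
                .orElseThrow(() -> new IllegalStateException("Not supported delete"));
    }
}
```

With a fully pushed-down predicate this collapses the whole delete into one remote statement; with no pushed-down predicate it fails, mirroring the behavior the tests below assert.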

@kokosing kokosing force-pushed the origin/master/269_jdbc_delete branch from 86c4765 to 44da5b6 on December 11, 2020 14:05
@kokosing kokosing marked this pull request as ready for review December 11, 2020 14:05
@kokosing kokosing removed the WIP label Dec 12, 2020
@kokosing kokosing force-pushed the origin/master/269_jdbc_delete branch from 44da5b6 to e5fa7d1 on December 13, 2020 09:30
Contributor

What is additionalPredicate? Is it somehow connected with the limit? If not, why isn't it needed in delete?

Member Author

It is not. When the connector creates splits, it can use additionalPredicate to run JDBC queries in parallel: one split where key_column % 2 = 0 and another where key_column % 2 = 1, giving you two splits.
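The splitting idea can be sketched like this. This is a toy model: `SplitSketch` and `splitQueries` are hypothetical names, not the real split-generation code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: not the actual split generation code.
class SplitSketch
{
    // Each split runs the base query plus its own additionalPredicate,
    // so the splits can query the remote database in parallel.
    static List<String> splitQueries(String baseQuery, List<String> additionalPredicates)
    {
        List<String> queries = new ArrayList<>();
        for (String predicate : additionalPredicates) {
            queries.add(baseQuery + " AND " + predicate);
        }
        return queries;
    }
}
```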

Contributor

In this case you cannot delete something that does not exist (or can you, because it could somehow exist?), so it is safe not to provide additionalPredicate for delete. However, consider the situation where you cannot match indexes in the underlying system that were created with such additionalPredicate conditions.

Member Author

This code is not executed for delete. buildSql is called only for SELECT; DELETE uses buildDeleteSql, where there is no additionalPredicate.

Contributor

@ssheikin ssheikin Dec 15, 2020

Yeah, I meant exactly that place (I started this discussion here because there is no additionalPredicate in the code you added for DELETE).

Let me clarify my comment.
Imagine a situation where you have some data (key_column, year, name) in table A,
and there are two splits: one where key_column % 2 = 0 and another where key_column % 2 = 1.
Usually the user executes selects on A: select name from A where year = <year>, which translates into two queries, one per split:

select name from A where key_column % 2 = 0 and year = <year>
select name from A where key_column % 2 = 1 and year = <year>

To optimize these queries, the downstream system has CREATE INDEX a_idx ON a (key_column, year).
If you don't propagate additionalPredicate to DELETE, this could end up with a_idx not being used for the DELETE.
Could you please check whether such a scenario is covered?

Member Author

If you don't propagate additionalPredicate to DELETE this could end up with a_idx is not used for DELETE.

I am not concerned about that. It does not have to be highly performant.

Notice that for this metadata delete there are no splits generated, hence there is no additional predicate nor even actual query execution in Presto. It all happens during the "planning".

Member

additionalPredicate is not about performance, but about correctness.
Anyway, this is a split's attribute, and no splits are involved at this point yet, AFAICT.

Contributor

Can it be non-null when beginDelete throws an exception?

Member Author

beginDelete is not called if the delete can be done as a metadata delete (applyDelete).

Contributor

If supportsDelete=true, why does this test never call assertQuerySucceeds("DELETE FROM "?
Could you please add such a scenario? The same applies to other connectors.

Member Author

Added a comment below, which should answer your question

Contributor

// PostgreSql connector currently does not support row-by-row delete

Thanks for the comment.
So if supportsDelete=true and the connector currently does not support row-by-row delete, I'd expect that this connector supports some other type of delete. I see you added testDeleteWithSimplePredicate and testDeleteEntireTable. Maybe it is worth renaming this method to something like testDeleteRowByRow?

@kokosing kokosing force-pushed the origin/master/269_jdbc_delete branch from 44da5b6 to e5fa7d1 on December 15, 2020 11:27
Contributor

Suggested change:
-assertQueryFails("DELETE FROM " + tableName + " WHERE orderkey % 2 = 0", "Not supported delete");
+// SqlServer connector currently does not support row-by-row delete
+assertQueryFails("DELETE FROM " + tableName + " WHERE orderkey % 2 = 0", "Not supported delete");


@kokosing kokosing force-pushed the origin/master/269_jdbc_delete branch from c63ddb6 to 8542efc on December 19, 2020 09:02
Member

@findepi findepi left a comment

Looks nice, but there are some parts I do not understand.

Comment on lines 138 to 147
Member

This looks like common code which could be unified. Thoughts?

Member Author

I thought I had already unified this by extracting the applyParameters method. Do you have any idea what else could be done here?

Member

The only difference is that the commit is missing. Why?

@findepi findepi added the enhancement New feature or request label Dec 19, 2020
@kokosing kokosing force-pushed the origin/master/269_jdbc_delete branch from 8542efc to 6a8e391 on January 18, 2021 13:25
Member Author

@kokosing kokosing left a comment

@findepi comments addressed


Member Author

I added io.trino.testing.AbstractTestDistributedQueries#testDeleteWithSimplePredicate, which is executed for MySQL and covers a very simple predicate pushdown.

Notice that I added a predicate with random here. I don't want to hardcode what is supported and to what extent. Delete is supported if predicate pushdown is supported, and predicate pushdown support changes over time, so such a test would need to be updated frequently.

Member Author

It would have to be overridden for all JDBC connectors; the testDelete test requires row-by-row delete, which JDBC connectors won't have for a while.

Maybe I could add supportsRowByRowDelete()? (I don't like it.)

Member Author

It was failing because one cannot call commit when autocommit=true. Now I set autocommit explicitly, so I can remove the explicit commit.

Member Author

The tuple domain is not used to delete row-by-row, but rather to remove by predicate. By row-by-row I mean a delete query execution in Trino where we collect the ids of the rows to be deleted and then remove them (the beginDelete and finishDelete methods).
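The two-phase flow described above can be modeled in plain Java. This is a conceptual sketch only: `RowByRowDeleteSketch` is a hypothetical name and does not reflect the real SPI signatures, which operate on handles and row-id channels.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative sketch only: models the two phases of row-by-row delete in
// plain Java, not the actual beginDelete/finishDelete SPI.
class RowByRowDeleteSketch
{
    // Phase 1 ("beginDelete"): scan the table and collect ids of matching rows.
    // Phase 2 ("finishDelete"): remove the collected rows.
    static int delete(Map<Long, String> rows, Predicate<String> matches)
    {
        List<Long> rowIds = new ArrayList<>();
        for (Map.Entry<Long, String> row : rows.entrySet()) {
            if (matches.test(row.getValue())) {
                rowIds.add(row.getKey());
            }
        }
        rowIds.forEach(rows::remove);
        return rowIds.size();
    }
}
```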

Member

Isn't it the default for JDBC?

Member Author

It looks like it is not.

Contributor

I think we have to split openConnection into two methods: openReadOnlyConnection(readonly=true, autocommit=false) and openConnectionForWrite(readonly=false, autocommit=true). But not in this PR.

Member

Isn't it the default for JDBC?

It looks like it is not.

It actually is, but Phoenix and ClickHouse have it wrong. Addressed in #7875.

Member

Can this test set up an ACID table?
I guess the file metastore is not supported for ACID, right? cc @djsstarburst

Member

Sadly, that is correct @findepi

@djsagain
Member

There are two APIs for deletion in ConnectorMetadata:

  • applyDelete() and executeDelete(). In connectors like Hive and Iceberg, this pair is used for "metadata delete", which has been interpreted as deletion of entire partitions. The dogma is that applyDelete() should only return a non-empty result if the WHERE clause matches entire partitions.
  • beginDelete() and finishDelete(). This is the API intended for row-by-row delete. As of today, it is only supported by the Hive and Raptor connectors.

Don't blame me for the goofiness of the two APIs -- both pairs existed before I touched the code :)

AFAICT what you've implemented is row-by-row deletion. So perhaps you should use the beginDelete()/finishDelete() APIs.

@findepi
Member

findepi commented Jan 20, 2021

AFAICT what you've implemented is row-by-row deletion. So perhaps you should use the beginDelete()/finishDelete() APIs.

In Hive the distinction is needed because dropping partitions is a very different operation from "deleting rows from files".

The distinction is not as sound in the JDBC world. Conceptually, you could consider every row to be its own partition, since there is "an API call" with which you can remove the data without reading it
(the API call being a DELETE issued to the remote system).

@kokosing
Member Author

@djsstarburst Thanks for taking a look!

applyDelete() and executeDelete(). In connectors like the Hive and Iceberg, this pair is used for "metadata delete", which has been interpreted as deletion of entire partitions.

I implemented here "metadata delete".

The dogma is that applyDelete() should only return a non-empty result if the WHERE clause matches entire partitions.

That is my case here. Notice that there is no concept of partitions in JDBC connectors. Here applyDelete() should only return a non-empty result if the entire predicate is pushed down. PushDeleteIntoConnector requires that there is nothing between the table scan and the delete.
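The planner's choice between the two API pairs can be modeled as a toy sketch. Everything here is hypothetical: `DeletePlanningSketch` and its String-based handles stand in for the real SPI, which operates on ConnectorTableHandle and sessions.

```java
import java.util.Optional;

// Illustrative sketch only: a toy model of how the engine chooses between
// the two deletion APIs, not the actual Trino planner code.
class DeletePlanningSketch
{
    // In this model applyDelete returns a handle only when the entire
    // predicate was pushed down into the table handle.
    static Optional<String> applyDelete(String tableHandle, boolean predicateFullyPushedDown)
    {
        return predicateFullyPushedDown ? Optional.of(tableHandle) : Optional.empty();
    }

    // The engine tries the metadata-delete path first and falls back to the
    // row-by-row path (beginDelete/finishDelete) only when applyDelete declines.
    static String planDelete(String tableHandle, boolean predicateFullyPushedDown)
    {
        return applyDelete(tableHandle, predicateFullyPushedDown)
                .map(handle -> "metadata delete (executeDelete)")
                .orElse("row-by-row delete (beginDelete/finishDelete)");
    }
}
```

For a JDBC connector without row-by-row support, the fallback branch would instead fail with "Not supported delete".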

@djsagain
Member

In Hive the distinction is needed because dropping partitions is a very different operation from "deleting rows from files".

That is my case here. Notice that there is no concept of partitions in JDBC connectors.

I see your point, @findepi and @kokosing. Objection withdrawn.

@kokosing
Member Author

Automation hit #6723

@kokosing
Member Author

@ksobolew can you please take a look?

@kokosing kokosing requested a review from ksobolew May 10, 2021 10:09
Member

Please bring this back. This check was verifying whether !hasBehavior(SUPPORTS_DELETE) is declared correctly.

Member

I'd suggest reverting this change.
hasBehavior should allow us to drive the reusable test without the need to split it up.
The need to split up the test indicates that the testing behaviors are not defined/plugged ideally yet.

Contributor

@ksobolew ksobolew left a comment

I don't think I have much to add to what others said. Looks good in general.

@kokosing kokosing force-pushed the origin/master/269_jdbc_delete branch from 0ffb179 to e4d26b4 on May 31, 2021 13:14
Member Author

@kokosing kokosing left a comment

@findepi AC

I removed TestingConnectorBehavior.SUPPORTS_METADATA_DELETE. It is currently not possible to test whether the delete was pushed down or not.

Member Author

I cannot, as the other method uses SchemaTableName and this one uses JdbcTableHandle.

Always using the version with SchemaTableName is suboptimal when the data change event does not affect the entire table.

I think that onDataChanged is too ambitious. It does not fit well with a JdbcTableHandle parameter, as it does not invalidate data for the entire table, as one could expect when data is changed. Unless the "data" term also refers to "metadata"? I think the easiest fix is to rename it to invalidateTableStatistics and not pretend it is anything more than that.

WDYT?

Member Author

This was originally written the way you suggest, and it was failing for certain connectors; I don't remember the exact problem now. I will do it the suggested way and try to capture the differences between connectors.

@kokosing kokosing force-pushed the origin/master/269_jdbc_delete branch 2 times, most recently from 2758d03 to 3409fa0 on June 24, 2021 10:58
Member

Redundant, because we did not disable autocommit.

Since this is the only difference between this and super, remove the override.

Member Author

I will add a comment; it is required. Otherwise PSQL does not commit the delete.

Member

Is this only a reorg of test cases?

Move it to a separate commit.

@kokosing kokosing force-pushed the origin/master/269_jdbc_delete branch from 3409fa0 to 2cd8429 on June 25, 2021 11:17
Member Author

@kokosing kokosing left a comment

@ssheikin @ksobolew @Praveen2112 Can you please take a look again? A lot has changed since your last review.


@kokosing kokosing force-pushed the origin/master/269_jdbc_delete branch 3 times, most recently from a8e8533 to 3583b10 on June 27, 2021 17:34
Contributor

Outdated comments? Maybe just remove them?

Member Author

No. It is copied from another not-supported test for deletes. Delete in Iceberg is special.

Contributor

I see only testMetadataDelete (without star) in codebase and in this pr.

Contributor

I know we already discussed it before, but I'm still not convinced by a check where the result is compared with 0.
Up to you.

Some connectors might have partial support for the DELETE statement.
Member

Can we add support for a correlated subquery ... so that we could ensure it works for correlated joins too?

Member Author

It is out of the scope of this PR. Surely it won't be supported for JDBC connectors.

@kokosing kokosing force-pushed the origin/master/269_jdbc_delete branch from 3583b10 to 4f22747 on June 29, 2021 08:46
Contributor

Where does testMetadataDeleteWithVarcharEquality live?

Member Author

It should be testDeleteWithVarcharEqualityPredicate


@kokosing
Member Author

I see only testMetadataDelete (without star) in codebase and in this pr.

Yes, and there is testMetadataDeleteSimple. The star is in the wrong place.

@kokosing kokosing force-pushed the origin/master/269_jdbc_delete branch from 4f22747 to 2d9a113 on June 29, 2021 09:44
@lhofhansl
Member

Tried it with Phoenix in a real setup. Works perfectly.

@kokosing
Member Author

kokosing commented Jul 2, 2021

CI hit: #8457

@kokosing kokosing merged commit 825e6d8 into trinodb:master Jul 2, 2021
@kokosing kokosing deleted the origin/master/269_jdbc_delete branch July 2, 2021 07:28
@kokosing kokosing mentioned this pull request Jul 2, 2021
11 tasks
public void testDeleteWithBigintEqualityPredicate()
{
    if (!hasBehavior(SUPPORTS_DELETE)) {
        assertQueryFails("DELETE FROM region WHERE regionkey = 1", "This connector does not support deletes");
Member

@kokosing In the case of connectors where the test infra is shared, this can lead to test data being deleted.

Or is the assumption that, if hasBehavior is implemented correctly, it shouldn't matter, since the actual test performs deletions against copies of tables?

Member

I'm submitting a PR to change this to follow the approach in testDeleteWithVarcharEqualityPredicate where an explicit TestTable is created for the negative test case.

Member Author

Thanks!

Member

@hashhar if you're concerned about that, there are other cases like that in tests, where negative test cases are "dangerous".

(I was not as concerned as you are ...)

Member

e.g.

assertQueryFails("ALTER TABLE nation RENAME TO other_schema.yyyy", "This connector does not support renaming tables across schemas");


Labels

cla-signed, enhancement (New feature or request)

Development

Successfully merging this pull request may close these issues.

Support DELETE in JDBC connectors

8 participants