Implement timestamp predicate pushdown in Druid connector #8474

Closed

jerryleooo wants to merge 6 commits into trinodb:master from jerryleooo:fix_8404

Conversation

@jerryleooo
Member

@jerryleooo jerryleooo commented Jul 6, 2021

Fixes #8404

@cla-bot cla-bot bot added the cla-signed label Jul 6, 2021
@jerryleooo jerryleooo marked this pull request as draft July 6, 2021 09:08
@jerryleooo jerryleooo marked this pull request as ready for review July 7, 2021 01:21
@jerryleooo jerryleooo requested review from hashhar and sumannewton July 7, 2021 01:21
Member

@hashhar hashhar left a comment

Please add tests. It's difficult to prevent regressions or verify that things work as expected without them.

Take a look at TestPostgreSqlTypeMapping.
You'll need to modify the tests to perform insertions using the JDBC driver instead of using Trino (since the Druid connector doesn't support writes yet).

@sumannewton
Contributor

@jerryleooo Edit the PR title to: Fix #8474: Pushdown timestamp filters in druid connector

@jerryleooo
Member Author

jerryleooo commented Jul 7, 2021

Fix #8474: Pushdown timestamp filters in druid connector

OK, sure, but I think it should be Fix #8404: Pushdown timestamp filters in druid connector, since #8474 is this PR itself.

@jerryleooo jerryleooo changed the title from "try fixing #8404" to "Fix #8474: Pushdown timestamp filters in druid connector" Jul 7, 2021
@jerryleooo jerryleooo changed the title from "Fix #8474: Pushdown timestamp filters in druid connector" to "Fix #8404: Pushdown timestamp filters in druid connector" Jul 7, 2021
@hashhar hashhar changed the title from "Fix #8404: Pushdown timestamp filters in druid connector" to "Implement timestamp predicate pushdown in Druid connector" Jul 7, 2021
@jerryleooo
Member Author

You'll need to modify the tests to perform insertions using the JDBC driver instead of using Trino (since Druid connector doesn't support writes yet).

The JDBC driver may not work either -- in Druid, data is ingested with:

void ingestData(String datasource, String indexTaskFile, String dataFilePath)

@jerryleooo
Member Author

jerryleooo commented Jul 7, 2021

Hi @hashhar, some progress updates:

  1. It seems Druid is OK with the withRounding version, and I certainly need to add tests.
  2. I found it is difficult to add tests the way other JDBC connectors do, since Druid doesn't support creating tables and inserting data on the fly (which TestDruidTypeMapping needs); we can only "ingest" data. I am trying to see if there is an easier way. Your advice is most welcome.

Thanks

@hashhar
Member

hashhar commented Jul 8, 2021

Seems like we cannot "write" data to Druid at all. Creating ingest jobs for each possible test table doesn't scale either.

@wendigo @Parth-Brahmbhatt @dheerajkulakarni Any ideas how we can work around the Druid limitations to ingest some test data without having to create TSV files and batch ingestion jobs from those?

@dheerajkulakarni
Member

Hey @hashhar @jerryleooo, I think I have done something similar in this pull request. I created a method that reads from one datasource and pushes the data to another datasource under a different name.
Please check the overridden copyAndIngestTpchData method in this file in my pull request if it helps.

@jerryleooo
Member Author

Thanks @dheerajkulakarni for the idea; however, I think it has a prerequisite -- the target datasource must have the same schema as the source datasource, so that we can just change the ioConfig part and reuse the index task file. I don't think it works for randomly generated data. (Correct me if I'm missing something.)

I guess the final solution would be generating the index task file and the data file, copying the data file into the container, and then doing the ingestion.
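
A rough sketch of that flow, assuming a Testcontainers-based Druid setup; buildIndexTaskSpec and submitIndexTask are hypothetical helpers that would render the JSON task spec and POST it to the overlord's /druid/indexer/v1/task endpoint:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.MountableFile;

// Sketch: write the generated rows to a temp file, copy it into the
// container, then submit a dynamically generated ingestion task for it.
void createAndIngest(GenericContainer<?> druid, String datasource, List<String> tsvRows)
        throws Exception
{
    Path dataFile = Files.createTempFile(datasource, ".tsv");
    Files.write(dataFile, tsvRows);

    String pathInContainer = "/tmp/" + dataFile.getFileName();
    druid.copyFileToContainer(MountableFile.forHostPath(dataFile), pathInContainer);

    // hypothetical helpers, see note above
    String taskSpec = buildIndexTaskSpec(datasource, pathInContainer);
    submitIndexTask(druid, taskSpec);
}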

@jerryleooo
Member Author

jerryleooo commented Jul 14, 2021

Hi @hashhar @findepi @dheerajkulakarni
As I have mentioned, since Druid doesn't support CREATE TABLE or INSERT, it is not easy to add the type mapping test.

I tried to add a Druid version of CreateAndInsertDataSetup and enhanced the TestTable constructor. I am still facing an index task failure -- in the /druid/indexer/v1/task/{taskId}/status API, the runnerStatusCode is always WAITING although the status is SUCCESS. Please share your ideas if you know the cause.

UPDATE: temporarily solved

@jerryleooo jerryleooo marked this pull request as draft July 14, 2021 09:34
Member

@hashhar hashhar left a comment

@jerryleooo Can we extract the testing-related changes to a separate PR and get that merged first?

Thanks a lot for coming up with the approach of generating both the data and the indexing tasks dynamically. That will make it possible to test, and hence implement, more features in the Druid connector.

@jerryleooo jerryleooo force-pushed the fix_8404 branch 2 times, most recently from ac96a1c to 83d7bed Compare July 15, 2021 15:11
@jerryleooo jerryleooo marked this pull request as ready for review July 15, 2021 15:12
@jerryleooo
Member Author

Thanks @hashhar for the review.
I am not sure another PR is needed, since the type mapping test also helps prove my changes are correct.
That said, if it makes things easier I am happy to do it.
For now, I think all the functionality and basic tests are done.

@hashhar
Member

hashhar commented Jul 15, 2021

@jerryleooo That's a fair point. I agree now. But once we think everything looks good we should at least move that into a separate commit.

I'll review this in a while.

@hashhar hashhar self-requested a review July 15, 2021 15:32
Member

@hashhar hashhar left a comment

Some initial comments.

Let's break this into commits as follows:

  1. The public-to-private change of DRUID_SCHEMA.
  2. Introduction of the Druid test setup (TestTable, CreateAndInsertDataSetup, etc.).
  3. Adding the explicit timestamp mapping, tests for timestamp predicate pushdown, and type mapping tests for timestamp. (Let's handle other types as follow-ups to keep the scope limited and easier to reason about and review.)

@Test
public void testPredicatePushdown()
{
    assertThat(query("SELECT * FROM orders WHERE __time > DATE '1970-01-01'")).isFullyPushedDown();
}
Member

Copy the tests from TestPostgreSqlConnectorTest#testPredicatePushdown and remove the ones not relevant for Druid. Keeping tests consistent is useful and allows extracting them to base test classes in the future.

For __time pushdown add a separate test method since that's something specific to Druid.
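
For illustration, such a Druid-specific test might look like this (the table name and timestamp literals are assumptions):

@Test
public void testTimePredicatePushdown()
{
    // __time is Druid's built-in ingestion timestamp column
    assertThat(query("SELECT * FROM orders WHERE __time > TIMESTAMP '1992-01-04 00:00:00.000'"))
            .isFullyPushedDown();
    assertThat(query("SELECT count(*) FROM orders WHERE __time < TIMESTAMP '1992-01-04 00:00:00.000'"))
            .isFullyPushedDown();
}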

@@ -35,35 +35,49 @@ public class TestTable

Member

It might make more sense to introduce a new TestTable within the trino-druid module. After all, other modules shouldn't be affected by Druid's limitations.

Please extract it to a separate class.

Member Author

The consideration was that there might be other databases that don't support creating tables and inserting data -- but you are right, this might be a premature abstraction. Will modify.

import static java.util.stream.Collectors.joining;
import static org.assertj.core.api.Assertions.assertThat;

public final class DruidSqlDataTypeTest
Member

What would it take to avoid copying SqlDataTypeTest?

Member Author

SqlDataTypeTest relies on TestTable, whose constructor runs CREATE TABLE and INSERT -- this is not suitable for Druid.
#8404 (comment)

Member

There are three things in play: SqlDataTypeTest, DataSetup, and TestTable.

  • SqlDataTypeTest delegates all the setup to DataSetup, so only that needs to be Druid-specific.
  • TestTable is used in the DataSetup interface.
  • SqlDataTypeTest uses TestTable for two things:
    • knowing the table name
    • removing the table with TestTable#close (cleanup)

We should address this by introducing a new interface:

  • call it TemporaryRelation
  • with methods getRelationName() and close() (extending AutoCloseable)
  • make TestTable implement TemporaryRelation
  • use it in the DataSetup interface definition
  • all current DataSetup implementations will continue to use TestTable
  • you will write a new implementation for Druid (a sketch follows below).

Would that work?
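
A minimal sketch of the proposed interface, following the naming above (package placement and the exact close() contract are assumptions):

// TestTable would implement this, and DataSetup would refer to
// TemporaryRelation instead of TestTable directly.
public interface TemporaryRelation
        extends AutoCloseable
{
    String getRelationName();

    @Override
    void close(); // narrowed so callers need not handle checked exceptions
}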

cc @hashhar

@findepi
Member

findepi commented Jul 22, 2021

https://github.com/trinodb/trino/pull/8474/files#r673634986 seems to suggest all Druid timestamps have millisecond precision (for storage).
#8474 (comment) seems to suggest that Druid supports nanosecond precision (for queries at least).

If Druid's storage is always millisecond precision, then our life is simpler. We can easily support millisecond precision, without any rounding, with full pushdown.
We just need a column mapping for this (sketched below) that would:

  • map to the Trino timestamp(3) type
  • not round, but ensure the read value fits in millisecond precision (has zero nanoseconds-of-millisecond)
  • do full pushdown
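
For illustration, such a mapping might look roughly like this in base-jdbc terms (the method name is hypothetical, the write path is a placeholder since Druid has no write support, and time zone subtleties are ignored):

import io.trino.plugin.jdbc.ColumnMapping;

import static io.trino.plugin.jdbc.PredicatePushdownController.FULL_PUSHDOWN;
import static io.trino.spi.type.TimestampType.TIMESTAMP_MILLIS;
import static io.trino.spi.type.Timestamps.MICROSECONDS_PER_MILLISECOND;

// Sketch: map Druid's millisecond-precision timestamps to Trino's
// timestamp(3) with no rounding and full predicate pushdown.
private static ColumnMapping druidTimestampColumnMapping()
{
    return ColumnMapping.longMapping(
            TIMESTAMP_MILLIS,
            // timestamp(3) values are epoch microseconds in Trino; Druid
            // stores only milliseconds, so the conversion is exact
            (resultSet, columnIndex) -> resultSet.getTimestamp(columnIndex).getTime() * MICROSECONDS_PER_MILLISECOND,
            (statement, index, value) -> {
                // placeholder: the Druid connector does not support writes
                throw new UnsupportedOperationException("Druid connector does not support writes");
            },
            FULL_PUSHDOWN);
}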

@lkm
Contributor

lkm commented Oct 21, 2021

Seems like this has run aground; is there anything I can help with to get it moving again? The current state of the Druid connector makes it quite limited for real-life purposes, and this seems like the first thing to tackle, even before aggregation pushdown, which sadly is also stranded.

@jerryleooo
Member Author

jerryleooo commented Oct 22, 2021

@lkm I will make time to finish this.

@Praveen2112
Member

@jerryleooo Any updates on this PR? If you are okay with it, can I rework this PR and fix the issue?

@Praveen2112
Member

Thanks @jerryleooo for working on this PR. I have applied your changes as part of this PR: #13335.

@jerryleooo
Member Author

@Praveen2112 cool bro and sorry for the late reply

Development

Successfully merging this pull request may close these issues.

Druid __time filter is not pushed down
