Add support for external engines #84

ryannedolan · 2025-01-15T00:04:33Z

Summary

This adds support for installing remote query engines, e.g. Trino, DuckDB, or Flink SQL Gateway.

Added Engine CRD.
Added k8s.engines metadata table.
Added RemoteTableScan, RemoteJoin, associated optimizer rules.

Details

The Hoptimator JDBC Driver can talk to remote Databases, but until now it relied on Calcite's Enumerable engine to process queries locally. For example, joining tables in two different Databases meant first fetching the rows from each table and then joining them locally in the driver itself.

With Engines, we can outsource these operations to fast, distributed query engines like Trino. Queries are sent off to the remote engine, and the Driver simply collects the results.

We don't expect remote engines to have the same "catalog" of databases and tables that Hoptimator knows about -- Hoptimator's idea of foo.bar may or may not match Trino's, for example. For this reason, we fully specify the tables used by each query by sending DDL statements ahead of it. To do this, we leverage the same PipelineRel mechanism that we use for deploying a pipeline, which means the remote query uses the same connectors and configuration as an equivalent pipeline.
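To make the "DDL ahead of each query" idea concrete, here is a minimal sketch in plain Java. It is not Hoptimator's PipelineRel API: the class `RemoteScriptSketch`, its methods, the `MEMBER_URN` column, and the connector options are all hypothetical and only show the rough shape of what a remote engine might receive.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from Hoptimator itself.
final class RemoteScriptSketch {

  /** Renders a CREATE TABLE statement whose WITH clause carries the connector options. */
  static String ddl(String table, String columns, Map<String, String> options) {
    List<String> rendered = new ArrayList<>();
    for (Map.Entry<String, String> option : options.entrySet()) {
      rendered.add("'" + option.getKey() + "'='" + option.getValue() + "'");
    }
    return "CREATE TABLE IF NOT EXISTS " + table + " (" + columns + ") WITH ("
        + String.join(", ", rendered) + ")";
  }

  /** Puts every DDL statement ahead of the query, so the engine needs no prior catalog. */
  static List<String> script(List<String> ddls, String query) {
    List<String> statements = new ArrayList<>(ddls);
    statements.add(query);
    return statements;
  }

  public static void main(String[] args) {
    Map<String, String> datagen = new LinkedHashMap<>();
    datagen.put("connector", "datagen"); // assumed option; real connector config will differ
    List<String> ddls = new ArrayList<>();
    ddls.add(ddl("AD_CLICKS", "CAMPAIGN_URN VARCHAR", datagen));
    ddls.add(ddl("MEMBERS", "MEMBER_URN VARCHAR", datagen)); // MEMBER_URN is a made-up column
    for (String statement : script(ddls, "SELECT * FROM AD_CLICKS, MEMBERS")) {
      System.out.println(statement + ";");
    }
  }
}
```

The point of the ordering is that the engine never has to resolve a table name it was not just told about, which is why its own catalog does not need to match Hoptimator's.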

Testing

Without an Engine installed, a query must be processed locally via the Enumerable convention:

0: Hoptimator> explain plan for select * from ads.ad_clicks, profile.members;
PLAN  EnumerableNestedLoopJoin(condition=[true], joinType=[inner])
  JdbcToEnumerableConverter
    JdbcTableScan(table=[[ADS, AD_CLICKS]])
  JdbcToEnumerableConverter
    JdbcTableScan(table=[[PROFILE, MEMBERS]])


1 row selected (0.062 seconds)

The EnumerableNestedLoopJoin runs in the driver after fetching the rows from both tables, so it would be very slow for large datasets.

After installing an engine, we see that the query plan now involves a RemoteJoin instead:

0: Hoptimator> explain plan for select * from ads.ad_clicks, profile.members;
PLAN  RemoteToEnumerableConverter
  RemoteJoin(condition=[true], joinType=[inner])
    PipelineTableScan(table=[[ADS, AD_CLICKS]])
    PipelineTableScan(table=[[PROFILE, MEMBERS]])


1 row selected (0.041 seconds)

The RemoteJoin can leverage Trino or similar distributed query engines. For example, we can use our Flink session cluster via the Flink SQL Gateway:

0: Hoptimator> select * from ads.ad_clicks, profile.members;
+------------------------------------------------------------------------------+
|                                             CAMPAIGN_URN                     |
+------------------------------------------------------------------------------+
| dad91062e84d5979ec74b5b0d0bf8e838158bed8de7a5f1b7aa0e90ffc45ea47767945bbe577 |
| f2dddb0778af2fbd2d79cd6950d44e26112413443dce63c2673c30949a99a469201388c84485 |
+------------------------------------------------------------------------------+

In the above results, the data is generated via Flink's datagen connector.

jogrogan

Is the planner able to distinguish when it should and shouldn't use these remote rules?

hoptimator-k8s/src/main/java/com/linkedin/hoptimator/k8s/K8sEngine.java

hoptimator-k8s/src/main/java/com/linkedin/hoptimator/k8s/K8sEngineTable.java

hoptimator-util/src/main/java/com/linkedin/hoptimator/util/planner/EngineRules.java

ryannedolan · 2025-01-15T15:28:20Z

> Is the planner able to distinguish when it should and shouldn't use these remote rules?

Yes, via four mechanisms:

Whether an engine is installed or not. If there are no engines in the current namespace, the planner will just run queries locally as before.
The database(s) involved. Engines can target specific databases or all databases. If a query involves a database that isn't supported by any engine, that part of the query will fall back to local execution.
The rules only match certain types of queries and sub-queries. Queries that don't match fall back to local execution.
Cost models. Right now I've hardcoded zero cost for remote queries, but we should be able to be smarter here.
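As a rough illustration of the cost-model point, the toy sketch below picks the cheapest of several candidate plans. It is not Calcite's planner or its cost API: the class `PlanCostSketch` and the local cost figure are made up. It only shows why a hardcoded zero cost means the remote plan wins whenever it is available.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; this is not how Calcite represents plans or costs.
final class PlanCostSketch {

  /** A candidate plan and its estimated cost in arbitrary, made-up units. */
  static final class Candidate {
    final String name;
    final double cost;

    Candidate(String name, double cost) {
      this.name = name;
      this.cost = cost;
    }
  }

  /** Returns the name of the cheapest candidate; the first one wins ties. */
  static String cheapest(List<Candidate> candidates) {
    if (candidates.isEmpty()) {
      throw new IllegalArgumentException("no candidate plans");
    }
    Candidate best = candidates.get(0);
    for (Candidate candidate : candidates) {
      if (candidate.cost < best.cost) {
        best = candidate;
      }
    }
    return best.name;
  }

  public static void main(String[] args) {
    List<Candidate> candidates = new ArrayList<>();
    candidates.add(new Candidate("EnumerableNestedLoopJoin", 1_000_000.0)); // made-up local cost
    candidates.add(new Candidate("RemoteJoin", 0.0)); // zero cost, as hardcoded for remote queries
    System.out.println(cheapest(candidates));
  }
}
```

A smarter model would presumably give the remote plan a non-zero cost, so that small queries could still run locally.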

Eventually we may want to add more details to the Engine CRD, e.g. to specify the engine's capabilities. That metadata could theoretically inform the planner better.

Internally, we can install different engines that target different databases, e.g. Trino can target offline while Flink targets nearline.
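The database-targeting behavior can be sketched as a simple lookup with a local fallback. Everything here is hypothetical: the class `EngineTargetingSketch`, the engine and database names, and the convention that an empty target set means "all databases" are assumptions for illustration, not the Engine CRD's actual schema.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; not the real K8sEngine or planner logic.
final class EngineTargetingSketch {

  /**
   * Returns the first engine that targets the database, or empty to signal local execution.
   * Assumption: an empty target set means the engine targets all databases.
   */
  static Optional<String> engineFor(Map<String, Set<String>> engines, String database) {
    for (Map.Entry<String, Set<String>> engine : engines.entrySet()) {
      Set<String> targets = engine.getValue();
      if (targets.isEmpty() || targets.contains(database)) {
        return Optional.of(engine.getKey());
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    Map<String, Set<String>> engines = new LinkedHashMap<>();
    engines.put("trino", Collections.singleton("OFFLINE"));  // made-up database name
    engines.put("flink", Collections.singleton("NEARLINE")); // made-up database name
    for (String database : new String[] {"OFFLINE", "NEARLINE", "ONLINE"}) {
      System.out.println(database + " -> " + engineFor(engines, database).orElse("local"));
    }
  }
}
```

With no engines installed the map is empty, every lookup returns empty, and all queries run locally as before.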

jogrogan

Super cool functionality

...tor-util/src/main/java/com/linkedin/hoptimator/util/planner/RemoteToEnumerableConverter.java

jogrogan reviewed Jan 15, 2025

ryannedolan force-pushed the engines branch 6 times, most recently from ad0b0c1 to eba9397 Compare January 16, 2025 19:08

ryannedolan marked this pull request as ready for review January 16, 2025 19:14

jogrogan reviewed Jan 16, 2025

jogrogan approved these changes Jan 16, 2025

ryannedolan changed the title ~~WIP: Add support for external engines~~ Add support for external engines Jan 16, 2025

ryannedolan added 6 commits January 17, 2025 15:37

Add Engine CRD

2110cfd

WIP: Add support for external engines

d86a8cc

Add support for external engines

e4aa029

Use datagen and blackhole connectors for demodb

ed5bad3

Add fat driver for integration tests

ebec4a6

Cache remote engine conventions

0075a5e

ryannedolan force-pushed the engines branch from b869263 to 0075a5e Compare January 20, 2025 18:59

ryannedolan added 2 commits January 20, 2025 13:54

Fix integration tests

9f2f20a

Drop dead code

e9afadc

ryannedolan enabled auto-merge (squash) January 20, 2025 19:57

ryannedolan merged commit 0e19953 into main Jan 20, 2025
1 check passed

ryannedolan deleted the engines branch January 20, 2025 20:05
