Add support for external engines #84
Is the planner able to distinguish when it should and shouldn't use these remote rules?
hoptimator-k8s/src/main/java/com/linkedin/hoptimator/k8s/K8sEngine.java
hoptimator-k8s/src/main/java/com/linkedin/hoptimator/k8s/K8sEngineTable.java
hoptimator-util/src/main/java/com/linkedin/hoptimator/util/planner/EngineRules.java
Yes, via four mechanisms:
Eventually we may want to add more details to the Internally, we can install different engines that target different databases, e.g. Trino can target offline while Flink targets nearline.
Super cool functionality
Summary
This adds support for installing remote query engines, e.g. Trino, DuckDB, or Flink SQL Gateway.
- k8s.engines metadata table.
- RemoteTableScan, RemoteJoin, and associated optimizer rules.

Details
The Hoptimator JDBC Driver is able to talk to remote Databases, but it previously relied on Calcite's Enumerable engine to process queries locally. For example, joining tables in two different Databases would involve first fetching the rows from each table and then joining them locally in the driver itself.

With Engines, we can outsource these operations to fast, distributed query engines like Trino. Queries are sent off to the remote engine, and the Driver simply collects the results.

We don't expect remote engines to have the same "catalog" of databases and tables that Hoptimator knows about. Hoptimator's idea of foo.bar may or may not match Trino's, for example. For this reason, we fully specify the tables used by each query, sending DDL statements ahead of each query. To do this, we leverage the same PipelineRel mechanism that we use for deploying a pipeline. This means that the remote query will use the same connectors and configuration as an equivalent pipeline.

Testing
Without an Engine installed, a query must be processed locally via the Enumerable convention:

The EnumerableNestedLoopJoin would be very slow for large datasets.

After installing an engine, we see that the query plan now involves a RemoteJoin instead:

The RemoteJoin is able to leverage Trino or similar distributed query engines. For example, we are able to leverage our Flink session cluster via the Flink SQL gateway:

In the above results, the data is generated via Flink's datagen connector.
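
To illustrate the "DDL ahead of each query" approach described under Details, here is a minimal sketch. It is not the code in this PR: the class RemoteQuerySketch, its methods, and the table name bar are hypothetical, and it assumes the remote engine is reachable through a plain java.sql.Connection. The DDL string uses Flink's datagen connector as an example of a fully specified table.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Hoptimator): sends table DDL ahead of a query. */
public class RemoteQuerySketch {

  /** Orders the statements as described above: all DDL first, the query last. */
  static List<String> script(List<String> ddl, String query) {
    List<String> statements = new ArrayList<>(ddl);
    statements.add(query);
    return statements;
  }

  /**
   * Runs the DDL and then the query on a single connection, so the engine
   * sees the table definitions before it plans the query. Collects the
   * first column of each result row as a string.
   */
  static List<String> run(Connection connection, List<String> ddl, String query)
      throws SQLException {
    List<String> rows = new ArrayList<>();
    try (Statement statement = connection.createStatement()) {
      for (String sql : ddl) {
        statement.execute(sql);
      }
      try (ResultSet results = statement.executeQuery(query)) {
        while (results.next()) {
          rows.add(results.getString(1));
        }
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    // Assumed example table; the columns and options are illustrative only.
    String ddl = "CREATE TABLE `bar` (`id` INT, `name` STRING) "
        + "WITH ('connector' = 'datagen', 'number-of-rows' = '10')";
    for (String sql : script(List.of(ddl), "SELECT * FROM `bar`")) {
      System.out.println(sql);
    }
  }
}
```

Running main only prints the statements in the order they would be sent. Calling run requires a JDBC driver for the target engine on the classpath, which this sketch does not provide.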