Skip to content

Support query pass-through for JDBC-based connectors#12325

Merged
findepi merged 6 commits intotrinodb:masterfrom
kasiafi:342JDBCRemoteQuery
Jun 3, 2022
Merged

Support query pass-through for JDBC-based connectors#12325
findepi merged 6 commits intotrinodb:masterfrom
kasiafi:342JDBCRemoteQuery

Conversation

@kasiafi
Copy link
Member

@kasiafi kasiafi commented May 10, 2022

This PR introduces query Polymorphic Table Function for JDBC-based connectors:

  • Druid
  • MemSql
  • MySql
  • Oracle
  • PostgreSql
  • Redshift
  • SqlServer
  • MariaDB
  • SingleStore

not yet implemented:

  • ClickHouse
  • Phoenix

@Override
public JdbcTableHandle getTableHandle(ConnectorSession session, PreparedQuery preparedQuery)
{
return delegate.getTableHandle(session, preparedQuery); // TODO add caching?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should. @kokosing thoughts?

(note: PreparedQuery defines value-based equals, so ok from this perspective)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to cache it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added caching + test in TestCachingJdbcClient.

}
}
catch (SQLException e) {
throw new TrinoException(JDBC_ERROR, e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some message about what operation failed

Optional.empty(),
OptionalLong.empty(),
Optional.of(columns.build()),
ImmutableSet.of(), // TODO return other tables referenced by the query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't be able to do this, since query is opaque.

Instead of TODO add a // Note that we're not doing that.

AFAIR, this field is used for cache eviction. I think we should make io.trino.plugin.jdbc.JdbcTableHandle#otherReferencedTables and Optional field (so that we can mark "unknown" state) so cache eviction remains correct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼

Copy link
Member Author

@kasiafi kasiafi May 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed an issue #12526, and added a link in the code along with the note.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fine to do it as a follow-up.

columns -> groupKey -> verify(columns.contains(groupKey),
"applyAggregation called with a grouping column %s which was not included in the table columns: %s",
groupKey,
tableColumns))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated change, separate commit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted. The formatter changed its mind.

@Override
public JdbcTableHandle getTableHandle(ConnectorSession session, PreparedQuery preparedQuery)
{
return delegate().getTableHandle(session, preparedQuery); // TODO add stats?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, do add stats

public class RemoteQuery
implements Provider<ConnectorTableFunction>
{
public static final String NAME = "remote_query";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think native_query would be a better name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think remote_query is better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i am not convinced. every query is "remote" in some case.
"native" is sometimes used to mean "uninterpreted" by current layer, like java.sql.Connection#nativeSQL or "native query" in Hibernate, so users can be familiar with the term

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kasiafi @kokosing @hashhar @wendigo @findepi
did we reach a conclusion here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martint suggested query for consideration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like this name. Let's roll with it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hashhar @wendigo @kokosing are you ok with just query?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed the function to "query", and the related classes to Query*. Left "NativeQuery" in test method names.

Comment on lines +66 to +67
// TODO wrap in ClassLoaderSafeConnectorTableFunction? (see also TestClassLoaderSafeWrappers)
return new RemoteQueryFunction(transactionManager);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apply the TODO; i think base-jdbc -based connectors use ClassLoaderSafe* wrappers

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do after ConnectorTableFunction is refactored into an interface (#12531).

@findepi
Copy link
Member

findepi commented May 11, 2022

cc @hashhar @kokosing @wendigo

@hashhar hashhar self-requested a review May 11, 2022 11:17
Copy link
Member

@kokosing kokosing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skimmed

public class RemoteQuery
implements Provider<ConnectorTableFunction>
{
public static final String NAME = "remote_query";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think remote_query is better.

@Override
public JdbcTableHandle getTableHandle(ConnectorSession session, PreparedQuery preparedQuery)
{
return delegate.getTableHandle(session, preparedQuery); // TODO add caching?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to cache it.

{
binder.bind(JdbcClient.class).annotatedWith(ForBaseJdbc.class).to(MariaDbClient.class).in(Scopes.SINGLETON);
binder.install(new DecimalModule());
newSetBinder(binder, ConnectorTableFunction.class).addBinding().toProvider(RemoteQuery.class).in(Scopes.SINGLETON);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to do it in trino-base-jdbc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not applicable to some connectors (clickhouse)

Copy link
Contributor

@wendigo wendigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to ignore my comments if they don't make sense :)

@Override
public JdbcTableHandle getTableHandle(ConnectorSession session, PreparedQuery preparedQuery)
{
ImmutableList.Builder<JdbcColumnHandle> columns = ImmutableList.builder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move columns builder inside of the try and use builderWithExpectedSize(metaData.getColumnCount()) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require moving the return statement inside the try block.

ImmutableList.Builder<JdbcColumnHandle> columns = ImmutableList.builder();
try (Connection connection = connectionFactory.openConnection(session);
PreparedStatement preparedStatement = queryBuilder.prepareStatement(this, session, connection, preparedQuery)) {
ResultSetMetaData metaData = preparedStatement.getMetaData();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just metadata

try (Connection connection = connectionFactory.openConnection(session);
PreparedStatement preparedStatement = queryBuilder.prepareStatement(this, session, connection, preparedQuery)) {
ResultSetMetaData metaData = preparedStatement.getMetaData();
if (metaData == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requireNonNull(preparedStatement.getMetaData(), "ResultSetMetaData not provided for query")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer UnsupportedOperationException and a more verbose message.

Optional.empty(),
OptionalLong.empty(),
Optional.of(columns.build()),
ImmutableSet.of(), // TODO return other tables referenced by the query
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼

public void configure(Binder binder)
{
binder.bind(JdbcClient.class).annotatedWith(ForBaseJdbc.class).to(DruidJdbcClient.class).in(Scopes.SINGLETON);
newSetBinder(binder, ConnectorTableFunction.class).addBinding().toProvider(RemoteQuery.class).in(Scopes.SINGLETON);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that we should expose this PTF only on some config toggle (disabled by default)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is always registered, but it is disabled by default by access control.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks for clarification

@hashhar hashhar removed their request for review May 13, 2022 05:54
@lhofhansl
Copy link
Member

What's the specific aim for this? I see multiple variation:

  1. Allow using specific features of the connector databases.
  2. Have the ability to push more work into the connector.
  3. Enable connectors earlier (and add more general capabilities later)
  4. More?

For (2) and (3) I'd suggest we generally improve what/how Trino can push down - as has happened in the past few months.
If (1), can we have some examples?

Not trying to question this (wouldn't be my place anyway), just curious.

@findepi
Copy link
Member

findepi commented May 16, 2022

This can be viewed as an escape hatch for features we don't have pushdown support yet, including features and syntaxes that exists in a remote database but do not exist in Trino.

I'd suggest we generally improve what/how Trino can push down

Agreed. This is not the goal for this PR, but it remains a goal for the project.

@lhofhansl
Copy link
Member

This can be viewed as an escape hatch for features we don't have pushdown support yet,

Big fan. We have a very fast internal analytical database that we sometimes access via a custom connector based on base-jdbc. Pushing down vs. not is often a 100x performance difference, so have having full control over this is great!

@wendigo
Copy link
Contributor

wendigo commented May 17, 2022

@lhofhansl same here!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declare the param in uppercase, so that it can be specified without quotes (...( query => '....')) upon the invocation.

@kasiafi kasiafi force-pushed the 342JDBCRemoteQuery branch 2 times, most recently from 11d9f57 to af5417e Compare May 20, 2022 15:36
@lhofhansl
Copy link
Member

Tested this with our internal database connector, and it works great.

The use of PreparedStatement.getMetaData() is interesting (I had to add this to the internal JDBC driver). In general not all databases do that well/fast. The PreparedStatement interface even warns about that.

Is there a way to do this after the statement is executed so that we already have a ResultSet?

@kasiafi kasiafi force-pushed the 342JDBCRemoteQuery branch from af5417e to fabdb91 Compare May 24, 2022 08:49
@findepi
Copy link
Member

findepi commented May 24, 2022

Is there a way to do this after the statement is executed so that we already have a ResultSet?

The metadata needs to be available on the coordinator during query planning, while the final execution should happen on the worker.

We could have a mode where we simply execute the query and leverage ResultSet's metadata, but this would require

  • pinning the execution on the coordinator (what about node-scheduler.include-coordinator=false?)
  • special connection lifecycle so that planner and execution have same Connection with ResultSet already opened. Perhaps this would need to go in ConnectorTableHandle, but we still would want to ensure C and RS are closed in case of query being aborted, or TableScan pruned away. Engine currently doesn't provide any lifecycle for ConnectorTableHandle objects.

In general not all databases do that well/fast. The PreparedStatement interface even warns about that.

@lhofhansl Do you know for which databases it's fast and for which it isn't?

@lhofhansl
Copy link
Member

@lhofhansl Do you know for which databases it's fast and for which it isn't?

I do not, sorry. :(
But I can see that some DBs do not pre-compile the statement before it is executed (does not always involve a PREPARE).

As an example for the internal DB I'm "playing" with, I'm wrapping the query with select * from (<original query>) where false (in that DBs JDBC Driver's PreparedStatement.getMetaData()) to get the metadata. That happens to work well in this particular case (the planner is smart enough not to do any work even if the inner query has a GROUP BY, ORDER BY etc).
In any case it does involve an extra roundtrip to the DB.

Not a big deal. I think JDBC is "funny" here.

@kasiafi kasiafi force-pushed the 342JDBCRemoteQuery branch 2 times, most recently from 6c1a665 to b110e47 Compare May 26, 2022 14:30
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think preparedQuery.getQuery() should go into exception message.
it can be super-long.

you can append firstNonNull(e.getMessage(), e) to the message

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. In this case, I'll limit the expected message in tests to the common part: Failed to get table handle for prepared query to avoid overriding.

@kasiafi kasiafi force-pushed the 342JDBCRemoteQuery branch 3 times, most recently from bf7a6ef to 533dc83 Compare May 26, 2022 20:49
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Majority of overrides include the schema name. If I didn't add it here, I'd have to override the method that uses this test table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be getsession().getschema().orthrow()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other implementations also put tpch explicitly.

@kasiafi kasiafi force-pushed the 342JDBCRemoteQuery branch from 533dc83 to 5d8e141 Compare May 27, 2022 13:10
@lhofhansl
Copy link
Member

Is this merge-ready?

@kasiafi kasiafi force-pushed the 342JDBCRemoteQuery branch from 5d8e141 to 5cf2960 Compare June 3, 2022 06:06
@findepi findepi merged commit 117b2e7 into trinodb:master Jun 3, 2022
@github-actions github-actions bot added this to the 384 milestone Jun 3, 2022
@kasiafi kasiafi mentioned this pull request Jun 3, 2022
@colebow
Copy link
Member

colebow commented Jun 3, 2022

Do we need docs for this?

edit: Manfred informed me we're already on this

// TODO https://github.com/trinodb/trino/issues/12526: invalidate tableHandlesByNameCache for handles derived from opaque queries
invalidateColumnsCache(schemaTableName);
invalidateCache(tableHandlesByNameCache, key -> key.tableName.equals(schemaTableName));
tableHandlesByQueryCache.invalidateAll();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks too coarse. For every table invalidation (with any cause) all caches for handles by query are invalidated.

Apart from that I've noticed that the only usage of
BaseJdbcClient.getTableHandle(ConnectorSession, PreparedQuery) (io.trino.plugin.jdbc)
is for
visitTableFunctionInvocation(TableFunctionInvocation, Optional<Scope>) in Visitor in StatementAnalyzer (io.trino.sql.analyzer)

image

So this test (evolved since initial PR) is not really precise as it tests real tables (but should not), but not the virtual ones.

        dropTable(phantomTable); // not via CachingJdbcClient

        assertThatThrownBy(() -> jdbcClient.getTableHandle(SESSION, query))
                .hasMessageContaining("Failed to get table handle for prepared query");

        assertCacheStats(cachingJdbcClient, TABLE_HANDLES_BY_QUERY_CACHE)
                .hits(1)
                .afterRunning(() -> {
                    assertThat(cachingJdbcClient.getTableHandle(SESSION, query))
                            .isEqualTo(cachedTable);
                    assertThat(cachingJdbcClient.getColumns(SESSION, cachedTable))
                            .hasSize(0); // phantom_table has no columns
                });

// ...

        cachingJdbcClient.createTable(SESSION, new ConnectorTableMetadata(phantomTable, emptyList()));

        assertCacheStats(cachingJdbcClient, TABLE_HANDLES_BY_QUERY_CACHE)
                .misses(1)
                .loads(1)  //                          <<<------------------ this has to be 0, and not tested on real table
                .afterRunning(() -> {
                    assertThat(cachingJdbcClient.getTableHandle(SESSION, query))
                            .isEqualTo(cachedTable);
                    assertThat(cachingJdbcClient.getColumns(SESSION, cachedTable))
                            .hasSize(0); // phantom_table has no columns
                });

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks too coarse. For every table invalidation (with any cause) all caches for handles by query are invalidated.

Can we do better?

So this test (evolved since initial PR) is not really precise as it tests real tables (but should not), but not the virtual ones.

i am not sure i follow. Do you want to make a PR with a fix?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying figuring out what's going on and this is my findings. Actually my thoughts.
Looking for any comments which would allow me to shape that knowledge.

alee-x added a commit to SwanseaUniversityMedical/trino-db2 that referenced this pull request Jan 6, 2023
Updated Trino version to 405.

Added support for query pass-through from Trino to the connected DB2 instance. See trinodb/trino#12325 for more details about the implementation of the ``query`` Polymorphic Table Function for JDBC-based connectors in Trino.

The functionality is useful when dealing with very large tables on DB2 as Trino does not need to pull entire tables across. It also allows the use of DB2-native functions that are not supported in Trino.
bearrito added a commit to bearrito/trino-mcap that referenced this pull request Jan 8, 2026
…ation

Problem:
- Table functions fail to register when S3 config is present
- Root cause: non-standard connector initialization (direct instantiation vs Guice)
- No configuration validation framework

Solution - Phase 1: Guice Integration:
- Add McapModule with Guice bindings for table functions
- Update McapConnectorFactory to use Bootstrap/Injector pattern
- Update McapConnector to receive table functions via constructor injection
- Add @Inject annotation to McapMetadataFunction constructor
- Create McapConfig class (placeholder for future MCAP-specific config)
- Bind FileAccessor instance to injector for dependency injection

Changes:
- McapConnectorFactory: Uses Bootstrap with JsonModule and McapModule
- McapConnector: Now accepts Set<ConnectorTableFunction> in constructor
- McapModule: Binds table functions using Multibinder pattern
- McapMetadataFunction: Uses @Inject for Guice instantiation
- Tests: Updated to pass table functions to connector constructor

Benefits:
- Configuration validation happens during injector initialization
- Table functions properly bound via Guice Multibinder
- Follows standard Trino connector pattern (Hive, Iceberg, Delta Lake)
- Clear error messages for configuration issues
- Foundation for FileSystemModule integration (phase 2)

Test Results: All unit tests pass (16/16)
- TestMcapPlugin: ✅ Connector creation with Guice
- TestMcapConnector: ✅ Table function registration
- TestMcapMetadataFunction: ✅ Function analysis and schema

Next Phase: FileSystemModule integration and TrinoFileSystem migration

Refs: openspec/changes/fix-s3-table-function-registration
See: trinodb/trino#12325 (JDBC table function example)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Support pass-through queries for the jdbc based connectors

7 participants