Skip to content

Conversation

@ebyhr
Copy link
Member

@ebyhr ebyhr commented Sep 14, 2024

Description

Add DuckDB connector. This connector will be helpful for small local use case like #23344

Fixes #18031

Release notes

# General
* Add DuckDB connector. ({issue}`18031`)

@cla-bot cla-bot bot added the cla-signed label Sep 14, 2024
@github-actions github-actions bot added the docs label Sep 14, 2024
@ebyhr ebyhr marked this pull request as ready for review September 28, 2024 06:54
@ebyhr ebyhr requested review from hashhar and wendigo October 2, 2024 00:26
@ebyhr ebyhr force-pushed the ebi/duckdb branch 3 times, most recently from 8ed5d45 to 914954c Compare November 1, 2024 03:43
@github-actions

This comment was marked as outdated.

@github-actions github-actions bot added the stale label Nov 22, 2024
@ebyhr ebyhr removed the stale label Nov 22, 2024
@ebyhr ebyhr force-pushed the ebi/duckdb branch 3 times, most recently from 68b7316 to 13afd4d Compare November 29, 2024 03:28
@github-actions github-actions bot added the stale label Dec 20, 2024
@trinodb trinodb deleted a comment from github-actions bot Dec 20, 2024
@github-actions github-actions bot removed the stale label Dec 23, 2024
@jalr4ever
Copy link

Thanks for contribution, sir. When can this PR merged and release? 🧐

@ebyhr ebyhr force-pushed the ebi/duckdb branch 3 times, most recently from 41efaa5 to 00387f2 Compare January 29, 2025 04:00
@ebyhr ebyhr requested a review from Praveen2112 January 29, 2025 04:37
Copy link
Contributor

@wendigo wendigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Since this is a new connector we can ship it in the initial form and improve it from here.

@ebyhr ebyhr merged commit 5320db4 into trinodb:master Jan 30, 2025
98 of 99 checks passed
@ebyhr ebyhr deleted the ebi/duckdb branch January 30, 2025 04:50
@github-actions github-actions bot added this to the 470 milestone Jan 30, 2025
@mosabua
Copy link
Member

mosabua commented Jan 30, 2025

Whoa! Amazing .. two new connectors in 470 at this stage ..

@StephenOTT
Copy link

@ebyhr how does the connector isolate duckdb/sandbox from the env? This page lays out several scenarios where you can use duckdb to access the underlying os: https://duckdb.org/docs/operations_manual/securing_duckdb/overview.html

@mosabua
Copy link
Member

mosabua commented Feb 18, 2025

@ebyhr how does the connector isolate duckdb/sandbox from the env? This page lays out several scenarios where you can use duckdb to access the underlying os: https://duckdb.org/docs/operations_manual/securing_duckdb/overview.html

@StephenOTT the Trino DuckDB connector just connects to an external DuckDB instance via a JDBC driver. How that DuckDB is deployed, managed, secured, and sandboxed is completely outside of the scope of the connector.

@StephenOTT
Copy link

@mosabua based on that position then it would see that the connector would expose passwords stored locally on the trino server https://duckdb.org/docs/operations_manual/securing_duckdb/overview.html#disabling-file-access

Theoretically someone would then be able to access file based catalog credentials.

The issue is not the duckdb instance storage, but rather duckDb's ability to query locations such as the local file system the duckDb instance is running on (per the securing duckdb url above)

@mosabua
Copy link
Member

mosabua commented Feb 18, 2025

No .. why would it expose passwords stored on the Trino server. DuckDB should be isolated and run separately from Trino on an entirely different server. And Trino connect to it via JDBC (network..). Data sources in Trino are never run on the same server. And yes, you probably need to secure that data source properly and how that is done varies a lot for each data source.

@ghost
Copy link

ghost commented Feb 19, 2025

DuckDB is an embedded in-process database engine similar to sqlite. There is no such thing as a server to connect to. https://duckdb.org/why_duckdb#simple

@wendigo
Copy link
Contributor

wendigo commented Feb 19, 2025

@StephenOTT you need to explicitly configure this connector in order to do anything with it, by default there is no such catalog created. This applies to all connectors.

@StephenOTT
Copy link

Understood and agreed that it is an explicit configuration. However it would seem without sandboxing it would warrant a very explicit warning in the connector docs for duckdb.
This is a similar type of scenario where you have the new python functions which are sandboxed. But imagine if they were not sandboxed.

Are there other existing connectors that provide the ability to read any file based credential on the server just by activating the catalog? (Remember that in this Scenario it is not about where the db file is storage, it is merely just the activation of the db instance that allows you to then load a local file in the OS)

@wendigo
Copy link
Contributor

wendigo commented Feb 19, 2025

@StephenOTT python functions are on by default and cannot be disabled. That's the major difference here. I'm not aware of any such connector like DuckDB.

@StephenOTT
Copy link

The easy mistake made here is someone is going to activate duckdb (through file catalog or dynamic catalogs) and assume it is "safe" in the same sense of safe as the other connectors: it provides a corridor to other systems/sources. The configurations make it appear that way as well with the duckdb db file path. But with duckdb we end up with a very specific vector that supports local OS file reading and thus by creating a duckdb instance we open the ability to expose the credentials on the os.

I want to use duckdb connector. It is a great addition. I am only raising a very real security vector that admins would be very uncomfortable with.

@mosabua
Copy link
Member

mosabua commented Feb 19, 2025

Fair enough .. the fact remains that you should NOT run DuckDB on the same instance / server as Trino. We consider all data sources as external systems and Trino does not attempt to manage them. From what I understand you cant just "activate DuckDB" by creating a catalog that uses the connector .. you still got to install and run DuckDB on your own somewhere .. and that should not be the Trino cluster nodes.

I agree with your suggestions to add some sort of warning to the documentation though. Can you send a PR @StephenOTT ? We could add that info to the Requirements section of the connector docs. Maybe we need to detail info about the path for the jdbc url being to a shared storage and add a link to the security info and concerns.

@StephenOTT
Copy link

Can you clarify? Because the jdbc driver for duckdb is the "engine":

https://duckdb.org/docs/clients/java.html

When you provide a jdbc path, you are just providing a location for the file(s)/storage of the "database". But the processing power is still at the driver level.

In DuckDbClientModule.class the DuckDbDriver is instantiated.

There is no "duckdb compute database" you are connecting to.

@mosabua
Copy link
Member

mosabua commented Feb 19, 2025

Oh .. I did not realize the JDBC driver for DuckDB is actually not just a driver.. sigh. So we definitely need to update the docs .. also because I wonder how this even works with a multi cluster node .. the path probably needs to be to a shared storage that is mounted on the same location .. but then I wonder if that works with multiple workers accessing DuckDB..

Please send a PR

@mosabua
Copy link
Member

mosabua commented Feb 25, 2025

Sent PR now #25146 .. lets discuss more there @StephenOTT

@ebyhr ebyhr mentioned this pull request Sep 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Support/connector for motherduck duckdb

6 participants