docs: add docs for DuckDB extension #5578
Merged
2 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,41 +1,216 @@ | ||
| # DuckDB | ||
|
|
||
| In Python, Lance datasets can also be queried with [DuckDB](https://duckdb.org/), | ||
| an in-process SQL OLAP database. This means you can write complex SQL queries to analyze your data in Lance. | ||
|
|
||
| This integration is done via [DuckDB SQL on Apache Arrow](https://duckdb.org/docs/guides/python/sql_on_arrow), | ||
| which provides zero-copy data sharing between LanceDB and DuckDB. | ||
| DuckDB is capable of passing down column selections and basic filters to Lance, | ||
| reducing the amount of data that needs to be scanned to perform your query. | ||
| Finally, the integration allows streaming data from Lance tables, | ||
| allowing you to aggregate tables that won't fit into memory. | ||
| All of this uses the same mechanism described in DuckDB's | ||
| blog post *[DuckDB quacks Arrow](https://duckdb.org/2021/12/03/duck-arrow.html)*. | ||
|
|
||
| A `LanceDataset` is accessible to DuckDB through the Arrow compatibility layer directly. | ||
| To query the resulting Lance dataset in DuckDB, | ||
| all you need to do is reference the dataset by the same name in your SQL query. | ||
| Lance datasets can be queried in SQL with [DuckDB](https://duckdb.org/), | ||
| an in-process OLAP relational database. With DuckDB, you can write complex SQL queries, including ones that Lance may not yet support, without moving your data out of Lance. | ||
|
|
||
| !!! note | ||
| This integration is done via a DuckDB extension, whose source code is available | ||
| [here](https://github.com/lance-format/lance-duckdb). | ||
| To ensure you see the latest examples and syntax, check out the | ||
| [DuckDB extension](https://duckdb.org/community_extensions/extensions/lance) | ||
| documentation page. | ||
|
|
||
| ## Usage: Python | ||
|
|
||
| ### Install dependencies | ||
|
|
||
| Install Lance, DuckDB, and PyArrow, then follow the examples below. | ||
|
|
||
| ```bash | ||
| pip install pylance duckdb pyarrow | ||
| ``` | ||
|
|
||
| ### Add data to a Lance dataset | ||
|
|
||
| Let's add some data to a Lance dataset. | ||
|
|
||
| ```python | ||
| import duckdb # pip install duckdb | ||
| import lance | ||
| import pyarrow as pa | ||
|
|
||
| data = [ | ||
| {"animal": "duck", "noise": "quack", "vector": [0.9, 0.7, 0.1]}, | ||
| {"animal": "horse", "noise": "neigh", "vector": [0.3, 0.1, 0.5]}, | ||
| {"animal": "dragon", "noise": "roar", "vector": [0.5, 0.2, 0.7]}, | ||
| ] | ||
| pa_table = pa.Table.from_pylist(data) | ||
|
|
||
| lance_path = "./lance_duck.lance" | ||
| ds = lance.write_dataset(pa_table, lance_path, mode="overwrite") | ||
| ``` | ||
|
|
||
| This will store the Lance dataset to the specified local path. | ||
|
|
||
| ### Install Lance extension in DuckDB | ||
|
|
||
| Install the Lance extension in DuckDB as follows. | ||
|
|
||
| ds = lance.dataset("./my_lance_dataset.lance") | ||
|
|
||
| duckdb.query("SELECT * FROM ds") | ||
| # ┌─────────────┬─────────┬────────┐ | ||
| # │ vector │ item │ price │ | ||
| # │ float[] │ varchar │ double │ | ||
| # ├─────────────┼─────────┼────────┤ | ||
| # │ [3.1, 4.1] │ foo │ 10.0 │ | ||
| # │ [5.9, 26.5] │ bar │ 20.0 │ | ||
| # └─────────────┴─────────┴────────┘ | ||
|
|
||
| duckdb.query("SELECT mean(price) FROM ds") | ||
| # ┌─────────────┐ | ||
| # │ mean(price) │ | ||
| # │ double │ | ||
| # ├─────────────┤ | ||
| # │ 15.0 │ | ||
| # └─────────────┘ | ||
| ```python | ||
| import duckdb | ||
|
|
||
| duckdb.sql( | ||
| """ | ||
| INSTALL lance FROM community; | ||
| LOAD lance; | ||
| """ | ||
| ) | ||
| ``` | ||
|
|
||
| ### Query a `*.lance` path | ||
|
|
||
| You're now ready to query the Lance dataset using SQL! | ||
|
|
||
| ```python | ||
| # Get results from Lance in DuckDB! | ||
| r1 = duckdb.sql( | ||
| """ | ||
| SELECT * | ||
| FROM './lance_duck.lance' | ||
| LIMIT 5; | ||
| """ | ||
| ) | ||
| print(r1) | ||
| ``` | ||
| This returns: | ||
| ``` | ||
| ┌─────────┬─────────┬─────────────────┐ | ||
| │ animal │ noise │ vector │ | ||
| │ varchar │ varchar │ double[] │ | ||
| ├─────────┼─────────┼─────────────────┤ | ||
| │ duck │ quack │ [0.9, 0.7, 0.1] │ | ||
| │ horse │ neigh │ [0.3, 0.1, 0.5] │ | ||
| │ dragon │ roar │ [0.5, 0.2, 0.7] │ | ||
| └─────────┴─────────┴─────────────────┘ | ||
| ``` | ||
|
|
||
| ???+ info "Query S3 paths directly" | ||
| You can also query `s3://` paths directly. To do this, you can use DuckDB's native secrets mechanism to provide credentials. | ||
|
|
||
| ```python | ||
| r1 = duckdb.sql( | ||
| """ | ||
| CREATE SECRET (TYPE S3, provider credential_chain); | ||
|
|
||
| SELECT * | ||
| FROM 's3://bucket/path/to/dataset.lance' | ||
| LIMIT 5; | ||
| """ | ||
| ) | ||
| ``` | ||
|
|
||
| ### Search | ||
|
|
||
| The extension exposes search table functions (the examples below use `lance_vector_search`, `lance_fts`, and `lance_hybrid_search`) for: | ||
|
|
||
| - Vector search (KNN / ANN) | ||
| - Full-text search (FTS) | ||
| - Hybrid search (vector + FTS) | ||
|
|
||
| !!! warning | ||
| DuckDB treats `column` as a keyword in some contexts. It's recommended to | ||
| use `text_column` / `vector_column` as column names for the Lance extension. | ||
|
|
||
| #### Vector search | ||
|
|
||
| You can perform vector search on a column. Results include a `_distance` column | ||
| (smaller is closer, so sort in ascending order for nearest neighbors). | ||
|
|
||
| ```python | ||
| # Show results similar to "the duck goes quack" | ||
| q2 = [0.8, 0.7, 0.2] | ||
|
|
||
| # The f-string renders the Python list q2 as a DuckDB list literal | ||
| r2 = duckdb.sql( | ||
| f""" | ||
| SELECT animal, noise, vector | ||
| FROM lance_vector_search( | ||
| './lance_duck.lance', | ||
| 'vector', | ||
| {q2}::FLOAT[], | ||
| k = 1, | ||
| prefilter = true | ||
| ) | ||
| ORDER BY _distance ASC; | ||
| """ | ||
| ) | ||
| print(r2) | ||
| ``` | ||
| This returns: | ||
| ``` | ||
| ┌─────────┬─────────┬─────────────────┐ | ||
| │ animal │ noise │ vector │ | ||
| │ varchar │ varchar │ double[] │ | ||
| ├─────────┼─────────┼─────────────────┤ | ||
| │ duck │ quack │ [0.9, 0.7, 0.1] │ | ||
| └─────────┴─────────┴─────────────────┘ | ||
| ``` | ||
|
|
||
| #### Full-text search (FTS) | ||
|
|
||
| Run keyword-based BM25 search as shown below. Results include a `_score` column; | ||
| sort by it in descending order to get the most relevant results first. | ||
|
|
||
| ```python | ||
| # Show results for the query "the brave knight faced the dragon" | ||
| r3 = duckdb.sql( | ||
| """ | ||
| SELECT animal, noise, vector | ||
| FROM lance_fts( | ||
| './lance_duck.lance', | ||
| 'animal', | ||
| 'the brave knight faced the dragon', | ||
| k = 1, | ||
| prefilter = true) | ||
| ORDER BY _score DESC; | ||
| """ | ||
| ) | ||
| print(r3) | ||
| ``` | ||
| This returns: | ||
| ``` | ||
| ┌─────────┬─────────┬─────────────────┐ | ||
| │ animal │ noise │ vector │ | ||
| │ varchar │ varchar │ double[] │ | ||
| ├─────────┼─────────┼─────────────────┤ | ||
| │ dragon │ roar │ [0.5, 0.2, 0.7] │ | ||
| └─────────┴─────────┴─────────────────┘ | ||
| ``` | ||
|
|
||
| #### Hybrid search | ||
|
|
||
| Hybrid search combines vector and FTS scores, returning a `_hybrid_score` in addition | ||
| to `_distance` / `_score`. To get the most relevant results, sort in descending order. | ||
|
|
||
| ```python | ||
| # Show results similar to "the duck surprised the dragon" | ||
| q4 = [0.8, 0.7, 0.2] | ||
|
|
||
| # The f-string renders the Python list q4 as a DuckDB list literal | ||
| r4 = duckdb.sql( | ||
| f""" | ||
| SELECT animal, noise, vector | ||
| FROM lance_hybrid_search( | ||
| './lance_duck.lance', | ||
| 'vector', {q4}::FLOAT[], | ||
| 'animal', 'the duck surprised the dragon', | ||
| k = 2, | ||
| prefilter = true | ||
| ) | ||
| ORDER BY _hybrid_score DESC; | ||
| """ | ||
| ) | ||
| print(r4) | ||
| ``` | ||
| This should give: | ||
| ``` | ||
| ┌─────────┬─────────┬─────────────────┐ | ||
| │ animal │ noise │ vector │ | ||
| │ varchar │ varchar │ double[] │ | ||
| ├─────────┼─────────┼─────────────────┤ | ||
| │ duck │ quack │ [0.9, 0.7, 0.1] │ | ||
| │ dragon │ roar │ [0.5, 0.2, 0.7] │ | ||
| └─────────┴─────────┴─────────────────┘ | ||
| ``` | ||
|
|
||
| ## Usage: DuckDB CLI | ||
|
|
||
| DuckDB comes with a CLI that makes it easy to run SQL queries in the terminal. | ||
| Check out the [DuckDB extension](https://duckdb.org/community_extensions/extensions/lance) documentation page for examples using the DuckDB CLI. | ||
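
The vector and hybrid search examples in the diff pass a query vector that lives in a Python list into a SQL string. One way to do that is to render the list as a DuckDB list literal before building the query text. The following is a minimal sketch, not part of the PR: the helper names `to_float_list_literal` and `build_vector_search_sql` are hypothetical, and the `lance_vector_search(...)` argument order is copied from the example in the diff, so it should be checked against the current extension documentation.

```python
import math
from typing import Sequence


def to_float_list_literal(vec: Sequence[float]) -> str:
    """Render a Python sequence as a DuckDB list literal cast to FLOAT[].

    Hypothetical helper for illustration; not part of DuckDB or Lance.
    """
    values = [float(v) for v in vec]
    if not values:
        raise ValueError("query vector must not be empty")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("query vector must contain only finite numbers")
    return "[" + ", ".join(repr(v) for v in values) + "]::FLOAT[]"


def quote_sql_string(text: str) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def build_vector_search_sql(path: str, column: str, vec: Sequence[float], k: int) -> str:
    """Build the query text for a vector search.

    Assumption: the call shape mirrors the lance_vector_search example
    shown in the diff (path, column, query vector, k, prefilter).
    """
    return (
        "SELECT *\n"
        "FROM lance_vector_search(\n"
        f"    {quote_sql_string(path)},\n"
        f"    {quote_sql_string(column)},\n"
        f"    {to_float_list_literal(vec)},\n"
        f"    k = {int(k)},\n"
        "    prefilter = true\n"
        ")\n"
        "ORDER BY _distance ASC;"
    )


sql = build_vector_search_sql("./lance_duck.lance", "vector", [0.8, 0.7, 0.2], k=1)
print(sql)
```

The resulting string can be passed to `duckdb.sql(...)` once the Lance extension is loaded. Building the literal in one place keeps the validation (non-empty, finite values) separate from the query text.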
The search APIs have changed a lot; this needs an update.