243 changes: 209 additions & 34 deletions docs/src/integrations/duckdb.md
@@ -1,41 +1,216 @@
# DuckDB

In Python, Lance datasets can also be queried with [DuckDB](https://duckdb.org/),
an in-process SQL OLAP database. This means you can write complex SQL queries to analyze your data in Lance.

This integration is done via [DuckDB SQL on Apache Arrow](https://duckdb.org/docs/guides/python/sql_on_arrow),
which provides zero-copy data sharing between LanceDB and DuckDB.
DuckDB is capable of passing down column selections and basic filters to Lance,
reducing the amount of data that needs to be scanned to perform your query.
Finally, the integration allows streaming data from Lance tables,
allowing you to aggregate tables that won't fit into memory.
All of this uses the same mechanism described in DuckDB's
blog post *[DuckDB quacks Arrow](https://duckdb.org/2021/12/03/duck-arrow.html)*.

A `LanceDataset` is accessible to DuckDB through the Arrow compatibility layer directly.
To query the resulting Lance dataset in DuckDB,
all you need to do is reference the dataset by the same name in your SQL query.
Lance datasets can be queried in SQL with [DuckDB](https://duckdb.org/),
an in-process OLAP relational database. With DuckDB, you can write complex SQL queries (including ones that Lance may not yet support) without moving your data out of Lance.

!!! note
    This integration is done via a DuckDB extension, whose source code is available
    [here](https://github.com/lance-format/lance-duckdb).
    To ensure you see the latest examples and syntax, check out the
    [DuckDB extension](https://duckdb.org/community_extensions/extensions/lance)
    documentation page.

## Usage: Python

### Install dependencies

Install Lance, DuckDB, and PyArrow, then follow the examples below.

```bash
pip install pylance duckdb pyarrow
```

### Add data to a Lance dataset

Let's add some data to a Lance dataset.

```python
import duckdb # pip install duckdb
import lance
import pyarrow as pa

data = [
{"animal": "duck", "noise": "quack", "vector": [0.9, 0.7, 0.1]},
{"animal": "horse", "noise": "neigh", "vector": [0.3, 0.1, 0.5]},
{"animal": "dragon", "noise": "roar", "vector": [0.5, 0.2, 0.7]},
]
pa_table = pa.Table.from_pylist(data)

lance_path = "./lance_duck.lance"
ds = lance.write_dataset(pa_table, lance_path, mode="overwrite")
```

This writes the Lance dataset to the specified local path.

### Install Lance extension in DuckDB

Install the Lance extension in DuckDB as follows.

ds = lance.dataset("./my_lance_dataset.lance")

duckdb.query("SELECT * FROM ds")
# ┌─────────────┬─────────┬────────┐
# │ vector │ item │ price │
# │ float[] │ varchar │ double │
# ├─────────────┼─────────┼────────┤
# │ [3.1, 4.1] │ foo │ 10.0 │
# │ [5.9, 26.5] │ bar │ 20.0 │
# └─────────────┴─────────┴────────┘

duckdb.query("SELECT mean(price) FROM ds")
# ┌─────────────┐
# │ mean(price) │
# │ double │
# ├─────────────┤
# │ 15.0 │
# └─────────────┘
```python
import duckdb

duckdb.sql(
"""
INSTALL lance FROM community;
LOAD lance;
"""
)
```

### Query a `*.lance` path

You're now ready to query the Lance dataset using SQL!

```python
# Get results from Lance in DuckDB!
r1 = duckdb.sql(
"""
SELECT *
FROM './lance_duck.lance'
LIMIT 5;
"""
)
print(r1)
```
This returns:
```
┌─────────┬─────────┬─────────────────┐
│ animal │ noise │ vector │
│ varchar │ varchar │ double[] │
├─────────┼─────────┼─────────────────┤
│ duck │ quack │ [0.9, 0.7, 0.1] │
│ horse │ neigh │ [0.3, 0.1, 0.5] │
│ dragon │ roar │ [0.5, 0.2, 0.7] │
└─────────┴─────────┴─────────────────┘
```

???+ info "Query S3 paths directly"
    You can also query `s3://` paths directly. To do so, use DuckDB's native secrets mechanism to provide credentials.

    ```python
    r1 = duckdb.sql(
        """
        CREATE SECRET (TYPE S3, provider credential_chain);

        SELECT *
        FROM 's3://bucket/path/to/dataset.lance'
        LIMIT 5;
        """
    )
    ```

### Search
Review comment (Collaborator): The search APIs have changed a lot; this section needs an update.


The extension exposes `lance_search(...)` as a unified entry point for:

- Vector search (KNN / ANN)
- Full-text search (FTS)
- Hybrid search (vector + FTS)

!!! warning
    DuckDB treats `column` as a keyword in some contexts. It's recommended to
    use `text_column` / `vector_column` as column names for the Lance extension.

#### Vector search

You can perform vector search on a column. The result includes a `_distance`
column (smaller is closer, so sort in ascending order for nearest neighbors).

```python
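# Show results similar to "the duck goes quack"
q2 = [0.8, 0.7, 0.2]

# DuckDB resolves Python variables only as table names (replacement scans),
# not inside expressions, so the f-string inlines the query vector as a
# DuckDB list literal before casting it to FLOAT[].
r2 = duckdb.sql(
    f"""
    SELECT animal, noise, vector
    FROM lance_vector_search(
        './lance_duck.lance',
        'vector',
        {q2}::FLOAT[],
        k = 1,
        prefilter = true
    )
    ORDER BY _distance ASC;
    """
)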
print(r2)
```
This returns:
```
┌─────────┬─────────┬─────────────────┐
│ animal │ noise │ vector │
│ varchar │ varchar │ double[] │
├─────────┼─────────┼─────────────────┤
│ duck │ quack │ [0.9, 0.7, 0.1] │
└─────────┴─────────┴─────────────────┘
```
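
To see why `duck` is the nearest neighbor, the distances can be reproduced by hand. The snippet below is a minimal sketch in plain Python that assumes Euclidean (L2) distance; the metric is an assumption here, and the `_distance` values Lance reports may be scaled differently (for example, squared), but the ordering under L2 is the same.

```python
import math

# The three vectors stored in the example dataset.
rows = {
    "duck": [0.9, 0.7, 0.1],
    "horse": [0.3, 0.1, 0.5],
    "dragon": [0.5, 0.2, 0.7],
}
query = [0.8, 0.7, 0.2]


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distances = {name: l2_distance(vec, query) for name, vec in rows.items()}

# Smaller is closer, so sort ascending.
ranked = sorted(distances, key=distances.get)
print(ranked)  # → ['duck', 'dragon', 'horse']
```

With `k = 1`, only the first entry of this ranking (`duck`) is returned.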

#### Full-text search (FTS)

Run keyword-based BM25 search as shown below. The result includes a `_score`
column; sort by it in descending order to get the most relevant results first.

```python
# Show results for the query "the brave knight faced the dragon"
r3 = duckdb.sql(
"""
SELECT animal, noise, vector
FROM lance_fts(
'./lance_duck.lance',
'animal',
'the brave knight faced the dragon',
k = 1,
prefilter = true)
ORDER BY _score DESC;
"""
)
print(r3)
```
This returns:
```
┌─────────┬─────────┬─────────────────┐
│ animal │ noise │ vector │
│ varchar │ varchar │ double[] │
├─────────┼─────────┼─────────────────┤
│ dragon │ roar │ [0.5, 0.2, 0.7] │
└─────────┴─────────┴─────────────────┘
```
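
The ranking above can be illustrated with a small, self-contained BM25 calculation. This is only a sketch: it assumes whitespace tokenization and the common defaults `k1 = 1.2` and `b = 0.75`, none of which are confirmed to match what the Lance extension uses internally. It shows that only the row whose `animal` value appears in the query receives a non-zero score.

```python
import math

# Each "document" is the tokenized value of the `animal` column.
docs = {"duck": ["duck"], "horse": ["horse"], "dragon": ["dragon"]}
query_terms = "the brave knight faced the dragon".split()


def bm25_scores(docs, query_terms, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    scores = {}
    for name, tokens in docs.items():
        score = 0.0
        for term in set(query_terms):
            tf = tokens.count(term)
            if tf == 0:
                continue  # term absent from this document
            df = sum(1 for other in docs.values() if term in other)
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[name] = score
    return scores


scores = bm25_scores(docs, query_terms)
best = max(scores, key=scores.get)
print(best)  # → dragon
```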

#### Hybrid search

Hybrid search combines vector and FTS scores, returning a `_hybrid_score` in addition
to `_distance` / `_score`. To get the most relevant results, sort by `_hybrid_score` in descending order.

```python
# Show results similar to "the duck surprised the dragon"
q4 = [0.8, 0.7, 0.2]

# As with vector search, the f-string inlines the Python list as a
# DuckDB list literal, because SQL expressions cannot read Python variables.
r4 = duckdb.sql(
    f"""
    SELECT animal, noise, vector
    FROM lance_hybrid_search(
        './lance_duck.lance',
        'vector', {q4}::FLOAT[],
        'animal', 'the duck surprised the dragon',
        k = 2,
        prefilter = true
    )
    ORDER BY _hybrid_score DESC;
    """
)
print(r4)
```
This should give:
```
┌─────────┬─────────┬─────────────────┐
│ animal │ noise │ vector │
│ varchar │ varchar │ double[] │
├─────────┼─────────┼─────────────────┤
│ duck │ quack │ [0.9, 0.7, 0.1] │
│ dragon │ roar │ [0.5, 0.2, 0.7] │
└─────────┴─────────┴─────────────────┘
```
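
How the two signals are fused is not described here, so the following is only an illustration of the general idea, not the extension's actual method. It uses reciprocal rank fusion (RRF), a common technique swapped in for demonstration, with hypothetical per-signal rankings for the three example rows and the conventional constant `k = 60`.

```python
# Hypothetical rankings, best first (assumptions for illustration only).
vector_ranking = ["duck", "dragon", "horse"]  # ascending L2 distance to [0.8, 0.7, 0.2]
text_ranking = ["duck", "dragon"]             # rows matching a query term; "horse" has no match


def reciprocal_rank_fusion(rankings, k=60):
    """Sum 1 / (k + rank) over every ranking in which an item appears."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return scores


fused = reciprocal_rank_fusion([vector_ranking, text_ranking])

# Higher fused score is better, so sort descending and keep the top 2.
top2 = sorted(fused, key=fused.get, reverse=True)[:2]
print(top2)  # → ['duck', 'dragon']
```

Rows that rank well on both signals rise to the top, which is why `duck` and `dragon` are returned ahead of `horse` in this sketch.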

## Usage: DuckDB CLI

DuckDB comes with a CLI that makes it easy to run SQL queries in the terminal.
Check out the [DuckDB extension](https://duckdb.org/community_extensions/extensions/lance) documentation page for examples using the DuckDB CLI.