[WIP] Presto Lance Connector #22749

Open
beinan wants to merge 1 commit into prestodb:master from beinan:lance_connector

Conversation

@beinan
Member

@beinan beinan commented May 14, 2024

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add LanceDB connector :pr:`22749`

@beinan beinan requested a review from a team as a code owner May 14, 2024 23:31
@beinan beinan requested a review from presto-oss May 14, 2024 23:31
@tdcmeehan tdcmeehan self-assigned this May 15, 2024
@steveburnett
Contributor

I suggest adding the PR number to the release note entry, as follows:

== RELEASE NOTES ==

General Changes
* Add LanceDB connector :pr:`22749`

@tdcmeehan
Contributor

This might be orthogonal, but does it also make sense to add integration into the Lance file format? I am just thinking of ML pipeline observability/debugging.

@beinan beinan force-pushed the lance_connector branch 2 times, most recently from c0fc68b to 25d0b0d Compare May 19, 2024 03:06
@beinan
Member Author

beinan commented May 19, 2024

> This might be orthogonal, but does it also make sense to add integration into the Lance file format? I am just thinking of ML pipeline observability/debugging.

Good point! I think integrating Lance's format with Prestissimo could be more natural, since Lance already provides an Arrow-based interface.
This PR is more focused on the metadata and query planning parts.

@steveburnett
Contributor

I understand this is currently WIP. Will documentation be added for this new connector?

@tdcmeehan
Contributor

@beinan how will the metadata work? Does it integrate with an existing catalog like HMS, or does it require a custom catalog?

@beinan beinan force-pushed the lance_connector branch 2 times, most recently from 182d954 to 9a70421 Compare June 9, 2024 07:20
@majetideepak
Collaborator

> Does it integrate with an existing catalog like HMS, or does it require a custom catalog?

I am curious about this as well. It will help to add an RFC with some of these details as LanceDB is getting popular for ML/AI use cases. It will also help implement Prestissimo support.

@beinan beinan force-pushed the lance_connector branch from 9a70421 to acb52af Compare June 10, 2024 22:08
@jackye1995

> Does it integrate with an existing catalog like HMS, or does it require a custom catalog?

Just an update: we are developing a catalog standard, Lance Catalog, to better integrate with systems like HMS. We would appreciate feedback from the Presto community on how we can best integrate!

@steveburnett
Contributor

New update for the release note entry: we've automated the link to the PR and updated the Release Notes Guidelines so the manual addition of the PR link is no longer needed.

== RELEASE NOTES ==

General Changes
* Add LanceDB connector.

Also, I suggest adding documentation for the new connector in https://github.com/prestodb/presto/tree/master/presto-docs/src/main/sphinx/connector.

beinan added a commit that referenced this pull request Mar 12, 2026
## Description
Add a new Presto connector for
[LanceDB](https://lancedb.github.io/lance/), a columnar data format
optimized for ML/AI workloads built on Apache Arrow. This connector
enables Presto to read from and write to Lance datasets using the
lance-core Java SDK.

The implementation is similar to the
[lance-trino](https://github.com/lance-format/lance-trino) connector,
adapted to Presto's SPI conventions.

Key components:
- **Read path**: Split-per-fragment model with Arrow-to-Presto page
conversion via `LanceArrowToPageScanner`
- **Write path**: Presto page-to-Arrow conversion with fragment-based
commit protocol via `LancePageSink`
- **Metadata**: Directory-based namespace with schema discovery from
Lance datasets
- **Type support**: Boolean, Tinyint, Smallint, Integer, Bigint, Real,
Double, Varchar, Varbinary, Date, Timestamp, Array
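
The split-per-fragment read model listed above can be illustrated with a minimal sketch. It assumes that a dataset exposes a list of integer fragment IDs and that each ID becomes one unit of parallel work. `LanceSplitSketch`, `Split`, and `planSplits` are hypothetical names for illustration only; they are not the connector's classes or the Presto SPI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types are the connector's real classes.
final class LanceSplitSketch {

    /** One unit of parallel work: a single fragment of one dataset. */
    static final class Split {
        final String datasetPath;
        final int fragmentId;

        Split(String datasetPath, int fragmentId) {
            this.datasetPath = datasetPath;
            this.fragmentId = fragmentId;
        }

        @Override
        public String toString() {
            return datasetPath + "#fragment-" + fragmentId;
        }
    }

    /** Split-per-fragment planning: emit exactly one split for every fragment ID. */
    static List<Split> planSplits(String datasetPath, List<Integer> fragmentIds) {
        List<Split> splits = new ArrayList<>(fragmentIds.size());
        for (int fragmentId : fragmentIds) {
            splits.add(new Split(datasetPath, fragmentId));
        }
        return Collections.unmodifiableList(splits);
    }

    public static void main(String[] args) {
        // In a real connector the IDs would come from dataset metadata (assumption);
        // they are hard-coded here, as is the example path.
        List<Split> splits = planSplits("/data/lance/example.lance", Arrays.asList(0, 1, 2));
        for (Split split : splits) {
            System.out.println(split);
        }
    }
}
```

With this shape, read parallelism follows the number of fragments in the dataset, so a dataset written as a single fragment would be scanned by a single split.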

## Motivation and Context
LanceDB is an emerging columnar format designed for ML/AI vector data
workloads. Adding a Presto connector allows users to query Lance
datasets using standard SQL, enabling integration with existing data
infrastructure.

Continues the work from #22749.

## Impact
New connector plugin; no changes to existing Presto code. Adds the
`presto-lance` module to the build.

**New configuration properties:**

| Property | Default | Description |
|---|---|---|
| `lance.root-url` | (required) | Lance root storage path |
| `lance.impl` | `dir` | Namespace implementation: `dir` or full class name |
| `lance.single-level-ns` | `true` | Access 1st level namespace with virtual `default` schema |
| `lance.read-batch-size` | `8192` | Number of rows per batch during reads |
| `lance.write-batch-size` | `10000` | Number of rows to batch before writing to Arrow |
| `lance.max-rows-per-file` | `1000000` | Maximum number of rows per Lance file |
| `lance.max-rows-per-group` | `100000` | Maximum number of rows per row group |
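
As a rough illustration of how these properties and their defaults fit together, the sketch below parses catalog-style `key=value` text with `java.util.Properties` and falls back to the defaults from the table. `LanceConfigSketch` is a hypothetical class written for this example, not the connector's configuration class, and the `/data/lance` path is made up. A real catalog file would also need a `connector.name` entry, whose value is not stated here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: not the connector's configuration class.
final class LanceConfigSketch {
    final String rootUrl;
    final String impl;
    final boolean singleLevelNs;
    final int readBatchSize;
    final int writeBatchSize;
    final int maxRowsPerFile;
    final int maxRowsPerGroup;

    LanceConfigSketch(Properties props) {
        // lance.root-url has no default, so reject a missing or empty value.
        rootUrl = props.getProperty("lance.root-url");
        if (rootUrl == null || rootUrl.isEmpty()) {
            throw new IllegalArgumentException("lance.root-url is required");
        }
        // Every other property falls back to the default listed in the table.
        impl = props.getProperty("lance.impl", "dir");
        singleLevelNs = Boolean.parseBoolean(props.getProperty("lance.single-level-ns", "true"));
        readBatchSize = Integer.parseInt(props.getProperty("lance.read-batch-size", "8192"));
        writeBatchSize = Integer.parseInt(props.getProperty("lance.write-batch-size", "10000"));
        maxRowsPerFile = Integer.parseInt(props.getProperty("lance.max-rows-per-file", "1000000"));
        maxRowsPerGroup = Integer.parseInt(props.getProperty("lance.max-rows-per-group", "100000"));
    }

    /** Parses properties-file text such as the body of a catalog file. */
    static LanceConfigSketch parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new LanceConfigSketch(props);
    }

    public static void main(String[] args) {
        // Example catalog content: only the required property plus one override.
        String catalogText =
                "lance.root-url=/data/lance\n" +
                "lance.read-batch-size=4096\n";
        LanceConfigSketch config = parse(catalogText);
        System.out.println("root=" + config.rootUrl
                + " readBatch=" + config.readBatchSize
                + " writeBatch=" + config.writeBatchSize);
    }
}
```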

## Test Plan
All unit tests pass: `./mvnw test -pl presto-lance`

## Contributor checklist

- [x] Please make sure your submission complies with our [contributing
guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md),
in particular [code
style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style)
and [commit
standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [x] PR description addresses the issue accurately and concisely. If
the change is non-trivial, a GitHub Issue is referenced.
- [x] Documented new properties (with its default value), SQL syntax,
functions, or other functionality.
- [x] If release notes are required, they follow the [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [x] Adequate tests were added if applicable.
- [x] CI passed.
- [x] If adding new dependencies, verified they have an [OpenSSF
Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or
higher (or obtained explicit TSC approval for lower scores).

## Release Notes

```
== RELEASE NOTES ==

Lance Connector Changes
* Add :doc:`/connector/lance` for reading and writing LanceDB datasets.
```

---------

Co-authored-by: Beinan Wang <beinan_wang@apple.com>