Suggest adding the PR number to the release note entry, as follows:
This might be orthogonal, but does it also make sense to add integration with the Lance file format? I am thinking of ML pipeline observability and debugging.
Good point! I think integrating Lance's format with Prestissimo could be more natural, considering Lance already provides an Arrow-based interface.
I understand this is currently WIP, but will documentation be added for this new connector?
@beinan how will the metadata work? Does it integrate with an existing catalog like HMS, or does it require a custom catalog?
I am curious about this as well. It would help to add an RFC with some of these details, as LanceDB is getting popular for ML/AI use cases. It would also help with implementing Prestissimo support.
Just an update: we are developing a catalog standard, Lance Catalog, to better integrate with systems like HMS. We would appreciate feedback from the Presto community on how we can best integrate!
New update for the release note entry: we've automated the link to the PR and updated the Release Notes Guidelines, so manually adding the PR link is no longer needed. Also, I suggest adding documentation for the new connector in https://github.com/prestodb/presto/tree/master/presto-docs/src/main/sphinx/connector.
## Description

Add a new Presto connector for [LanceDB](https://lancedb.github.io/lance/), a columnar data format optimized for ML/AI workloads built on Apache Arrow. This connector enables Presto to read from and write to Lance datasets using the lance-core Java SDK.

The implementation is similar to the [lance-trino](https://github.com/lance-format/lance-trino) connector, adapted to Presto's SPI conventions.

Key components:

- **Read path**: Split-per-fragment model with Arrow-to-Presto page conversion via `LanceArrowToPageScanner`
- **Write path**: Presto page-to-Arrow conversion with fragment-based commit protocol via `LancePageSink`
- **Metadata**: Directory-based namespace with schema discovery from Lance datasets
- **Type support**: Boolean, Tinyint, Smallint, Integer, Bigint, Real, Double, Varchar, Varbinary, Date, Timestamp, Array

## Motivation and Context

LanceDB is an emerging columnar format designed for ML/AI vector data workloads. Adding a Presto connector allows users to query Lance datasets using standard SQL, enabling integration with existing data infrastructure.

Continues the work from #22749.

## Impact

New connector plugin; no changes to existing Presto code. Adds the `presto-lance` module to the build.

**New configuration properties:**

| Property | Default | Description |
|---|---|---|
| `lance.root-url` | (required) | Lance root storage path |
| `lance.impl` | `dir` | Namespace implementation: `dir` or full class name |
| `lance.single-level-ns` | `true` | Access 1st level namespace with virtual `default` schema |
| `lance.read-batch-size` | `8192` | Number of rows per batch during reads |
| `lance.write-batch-size` | `10000` | Number of rows to batch before writing to Arrow |
| `lance.max-rows-per-file` | `1000000` | Maximum number of rows per Lance file |
| `lance.max-rows-per-group` | `100000` | Maximum number of rows per row group |

## Test Plan

All unit tests pass: `./mvnw test -pl presto-lance`

## Contributor checklist

- [x] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [x] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [x] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [x] Adequate tests were added if applicable.
- [x] CI passed.
- [x] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

## Release Notes

```
== RELEASE NOTES ==

Lance Connector Changes
* Add :doc:`/connector/lance` for reading and writing LanceDB datasets.
```

---------

Co-authored-by: Beinan Wang <beinan_wang@apple.com>
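
For readers trying the connector, the configuration properties listed under Impact might be combined in a catalog file roughly as follows. This is a sketch only: the file name `etc/catalog/lance.properties`, the `connector.name` value `lance`, and the `/data/lance` path are assumptions not stated in this PR, while the `lance.*` keys and their defaults are taken from the table above.

```properties
# etc/catalog/lance.properties (hypothetical file name)

# Assumed connector name; the PR does not state the registered name.
connector.name=lance

# Required: Lance root storage path (example value, not from the PR).
lance.root-url=/data/lance

# The remaining values repeat the documented defaults and could be omitted.
lance.impl=dir
lance.single-level-ns=true
lance.read-batch-size=8192
lance.write-batch-size=10000
lance.max-rows-per-file=1000000
lance.max-rows-per-group=100000
```

With `lance.single-level-ns=true`, the table above indicates that datasets under the root path are exposed through a virtual `default` schema.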