Skip to content

feat(connector): Presto Lance Connector #27185

Merged
beinan merged 11 commits intoprestodb:masterfrom
jja725:lance-connector
Mar 12, 2026
Merged

feat(connector): Presto Lance Connector #27185
beinan merged 11 commits intoprestodb:masterfrom
jja725:lance-connector

Conversation

@jja725
Copy link
Copy Markdown
Contributor

@jja725 jja725 commented Feb 21, 2026

Description

Add a new Presto connector for LanceDB, a columnar data format optimized for ML/AI workloads built on Apache Arrow. This connector enables Presto to read from and write to Lance datasets using the lance-core Java SDK.

The implementation is similar to the lance-trino connector, adapted to Presto's SPI conventions.

Key components:

  • Read path: Split-per-fragment model with Arrow-to-Presto page conversion via LanceArrowToPageScanner
  • Write path: Presto page-to-Arrow conversion with fragment-based commit protocol via LancePageSink
  • Metadata: Directory-based namespace with schema discovery from Lance datasets
  • Type support: Boolean, Tinyint, Smallint, Integer, Bigint, Real, Double, Varchar, Varbinary, Date, Timestamp, Array

Motivation and Context

LanceDB is an emerging columnar format designed for ML/AI vector data workloads. Adding a Presto connector allows users to query Lance datasets using standard SQL, enabling integration with existing data infrastructure.

Continues the work from #22749.

Impact

New connector plugin — no changes to existing Presto code. Adds the `presto-lance` module to the build.

New configuration properties:

Property Default Description
`lance.root-url` (required) Lance root storage path
`lance.impl` `dir` Namespace implementation: `dir` or full class name
`lance.single-level-ns` `true` Access 1st level namespace with virtual `default` schema
`lance.read-batch-size` `8192` Number of rows per batch during reads
`lance.write-batch-size` `10000` Number of rows to batch before writing to Arrow
`lance.max-rows-per-file` `1000000` Maximum number of rows per Lance file
`lance.max-rows-per-group` `100000` Maximum number of rows per row group

Test Plan

All unit tests pass: `./mvnw test -pl presto-lance`

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== RELEASE NOTES ==

Lance Connector Changes
* Add :doc:`/connector/lance` for reading and writing LanceDB datasets.

@jja725 jja725 requested a review from a team as a code owner February 21, 2026 05:22
@sourcery-ai
Copy link
Copy Markdown
Contributor

sourcery-ai bot commented Feb 21, 2026

Reviewer's Guide

Introduces a new presto-lance connector module that integrates Presto with LanceDB, implementing metadata, splits, page source/sink, ingestion, and configuration wiring, along with a basic query runner for tests and necessary Maven dependencies.

Sequence diagram for reading data via the presto-lance connector

sequenceDiagram
    participant QE as Presto_engine
    participant LCn as LanceConnector
    participant LSM as LanceSplitManager
    participant LSS as LanceSplitSource
    participant LPSrc as LancePageSourceProvider
    participant LFPS as LanceFragmentPageSource
    participant LCl as LanceClient
    participant LDS as LanceDB_Dataset

    QE->>LCn: beginTransaction(isolationLevel, readOnly)
    LCn-->>QE: LanceTransactionHandle.INSTANCE

    QE->>LCn: getSplitManager()
    LCn-->>QE: LanceSplitManager

    QE->>LSM: getSplits(transactionHandle, session, layout)
    LSM->>LSS: new LanceSplitSource(lanceClient, table)
    loop enumerate fragments
        LSS->>LCl: getFragments(tableName)
        LCl->>LDS: open and list DatasetFragment
        LDS-->>LCl: fragments
        LCl-->>LSS: List DatasetFragment
        LSS-->>QE: LanceSplit for each FragmentInfo
    end

    QE->>LCn: getPageSourceProvider()
    LCn-->>QE: LancePageSourceProvider

    QE->>LPSrc: createPageSource(transactionHandle, session, split, layout, columns)
    LPSrc->>LCl: open(tableName)
    LCl-->>LPSrc: Dataset
    LPSrc-->>QE: LanceFragmentPageSource

    loop read pages
        QE->>LFPS: getNextPage()
        alt first batch of fragment
            LFPS->>LDS: newScan() on DatasetFragment
            LDS-->>LFPS: LanceScanner
        end
        LFPS->>LDS: scanBatches()
        LDS-->>LFPS: ArrowReader
        LFPS->>LDS: loadNextBatch()
        LDS-->>LFPS: VectorSchemaRoot
        LFPS->>LFPS: ArrowVectorPageBuilder.build() per column
        LFPS-->>QE: Page
    end

    QE->>LFPS: close()
    LFPS->>LDS: dataset.close()
Loading

Sequence diagram for writing data via the presto-lance connector

sequenceDiagram
    participant QE as Presto_engine
    participant LCn as LanceConnector
    participant LMd as LanceMetadata
    participant LPSinkProv as LancePageSinkProvider
    participant LPSink as LancePageSink
    participant LPW as LancePageWriter
    participant LCl as LanceClient
    participant LDS as LanceDB_Dataset

    QE->>LCn: beginTransaction(isolationLevel, readOnly)
    LCn-->>QE: LanceTransactionHandle.INSTANCE

    Note over QE,LMd: Table creation
    QE->>LCn: getMetadata(transactionHandle)
    LCn-->>QE: LanceMetadata
    QE->>LMd: beginCreateTable(session, tableMetadata, layout)
    LMd->>LCl: createTable(tableName, schema)
    LCl->>LDS: Dataset.create(rootAllocator, tablePath, schema)
    LDS-->>LCl: Dataset
    LCl-->>LMd: table created
    LMd-->>QE: LanceIngestionTableHandle

    Note over QE,LPSink: Data ingestion
    QE->>LCn: getPageSinkProvider()
    LCn-->>QE: LancePageSinkProvider
    QE->>LPSinkProv: createPageSink(transactionHandle, session, outputTableHandle, context)
    LPSinkProv->>LPW: use LancePageWriter
    LPSinkProv-->>QE: LancePageSink

    loop for each Page
        QE->>LPSink: appendPage(page)
        LPSink->>LPW: append(page, tableHandle, arrowSchema)
        LPW->>LCl: getTablePath(tableName)
        LPW->>LCl: getArrowRootAllocator()
        LPW->>LDS: Fragment.create(tablePath, allocator, VectorSchemaRoot)
        LDS-->>LPW: FragmentMetadata
        LPW-->>LPSink: FragmentMetadata
    end

    QE->>LPSink: finish()
    LPSink-->>QE: Collection of Slice(fragmentMetadataJson)

    Note over QE,LMd: Commit table
    QE->>LMd: finishCreateTable(session, tableHandle, fragments, stats)
    LMd->>LCl: getTableVersion(tableName)
    LCl-->>LMd: tableReadVersion
    LMd->>LCl: appendAndCommit(tableName, fragmentMetadataList, tableReadVersion)
    LCl->>LDS: Dataset.commit(appendOp, tableReadVersion)
    LDS-->>LCl: new version
    LCl-->>LMd: committed
    LMd-->>QE: Optional.empty()
Loading

Class diagram for the main presto-lance connector types

classDiagram
    class LancePlugin {
        +Iterable~ConnectorFactory~ getConnectorFactories()
    }

    class LanceConnectorFactory {
        +String getName()
        +ConnectorHandleResolver getHandleResolver()
        +Connector create(catalogName, config, context)
    }

    class LanceModule {
        +configure(binder)
    }

    class LanceConnector {
        -LanceMetadata metadata
        -LanceSplitManager splitManager
        -LancePageSourceProvider pageSourceProvider
        -LancePageSinkProvider pageSinkProvider
        +beginTransaction(isolationLevel, readOnly)
        +getMetadata(transactionHandle)
        +getSplitManager()
        +getPageSourceProvider()
        +getPageSinkProvider()
    }

    class LanceConfig {
        -String rootUrl
        +String getRootUrl()
        +LanceConfig setRootUrl(rootUrl)
    }

    class LanceClient {
        -LanceConfig config
        -Connection conn
        -RootAllocator arrowRootAllocator
        +Connection getConn()
        +List~DatasetFragment~ getFragments(tableName)
        +RootAllocator getArrowRootAllocator()
        +void createTable(tableName, schema)
        +Schema getSchema(tableName)
        +String getTablePath(tableName)
        +long appendAndCommit(tableName, fragmentMetadataList, tableReadVersion)
        +long getTableVersion(tableName)
        +Dataset open(tableName)
    }

    class LanceMetadata {
        +schemaExists(session, schemaName) bool
        +listSchemaNames(session) List~String~
        +getTableHandle(session, tableName, tableVersion) ConnectorTableHandle
        +getTableLayout(session, handle) ConnectorTableLayout
        +getTableMetadata(session, table) ConnectorTableMetadata
        +listTables(session, schemaName) List~SchemaTableName~
        +getColumnHandles(session, tableHandle) Map~String, ColumnHandle~
        +getColumnMetadata(session, tableHandle, columnHandle) ColumnMetadata
        +listTableColumns(session, prefix) Map~SchemaTableName, List~ColumnMetadata~~
        +beginCreateTable(session, tableMetadata, layout) ConnectorOutputTableHandle
        +finishCreateTable(session, tableHandle, fragments, stats) Optional~ConnectorOutputMetadata~
        +beginInsert(session, tableHandle) ConnectorInsertTableHandle
        +finishInsert(session, tableHandle, fragments, stats) Optional~ConnectorOutputMetadata~
        +getTableLayoutForConstraint(session, table, constraint, desiredColumns) ConnectorTableLayoutResult
    }

    class LanceHandleResolver {
        +Class getTableHandleClass()
        +Class getTableLayoutHandleClass()
        +Class getColumnHandleClass()
        +Class getSplitClass()
        +Class getOutputTableHandleClass()
        +Class getInsertTableHandleClass()
        +Class getTransactionHandleClass()
    }

    class LanceTransactionHandle {
        <<enum>>
        +INSTANCE
    }

    class LanceTableHandle {
        -String schemaName
        -String tableName
        +String getSchemaName()
        +String getTableName()
    }

    class LanceTableLayoutHandle {
        -LanceTableHandle table
        -TupleDomain~ColumnHandle~ tupleDomain
        +LanceTableHandle getTable()
        +TupleDomain~ColumnHandle~ getTupleDomain()
    }

    class LanceColumnType {
        <<enum>>
        BIGINT
        INTEGER
        DOUBLE
        FLOAT
        VARCHAR
        TIMESTAMP
        BOOLEAN
        OTHER
        +ArrowType getArrowType()
        +Type getPrestoType()
        +LanceColumnType fromPrestoType(type)
        +LanceColumnType fromArrowType(type)
    }

    class LanceColumnHandle {
        -String columnName
        -Type columnType
        -LanceColumnHandleType type
        +String getColumnName()
        +Type getColumnType()
        +LanceColumnHandleType getType()
        +ColumnMetadata getColumnMetadata()
    }

    class LanceColumnHandleType {
        <<enum>>
        REGULAR
        DERIVED
    }

    class LanceColumnInfo {
        -String columnName
        -LanceColumnType dataType
        +String getColumnName()
        +LanceColumnType getDataType()
    }

    class LanceIngestionTableHandle {
        -String schemaName
        -String tableName
        -List~LanceColumnInfo~ columns
        +String getSchemaName()
        +String getTableName()
        +List~LanceColumnInfo~ getColumns()
    }

    class LancePageWriter {
        -LanceConfig lanceConfig
        -LanceClient lanceClient
        +FragmentMetadata append(page, tableHandle, arrowSchema)
    }

    class LancePageSinkProvider {
        -LanceConfig lanceConfig
        -LancePageWriter lancePageWriter
        +ConnectorPageSink createPageSink(transactionHandle, session, outputTableHandle, pageSinkContext)
        +ConnectorPageSink createPageSink(transactionHandle, session, insertTableHandle, pageSinkContext)
    }

    class LancePageSink {
        -LanceConfig lanceConfig
        -LanceIngestionTableHandle tableHandle
        -LancePageWriter lancePageWriter
        -Schema arrowSchema
        +CompletableFuture appendPage(page)
        +CompletableFuture finish()
        +void abort()
    }

    class FragmentInfo {
        -int fragmentId
        +int getFragmentId()
    }

    class LanceSplit {
        -SplitType splitType
        -Optional~List~FragmentInfo~~ fragments
        +SplitType getSplitType()
        +Optional~List~FragmentInfo~~ getFragments()
    }

    class SplitType {
        <<enum>>
        FRAGMENT
        BROKER
    }

    class LanceSplitSource {
        -LanceClient lanceClient
        -LanceTableHandle lanceTable
        -boolean isFinished
        +CompletableFuture getNextBatch(partitionHandle, maxSize)
        +void close()
        +boolean isFinished()
    }

    class LanceSplitManager {
        -LanceClient lanceClient
        +ConnectorSplitSource getSplits(transactionHandle, session, layout, splitSchedulingContext)
    }

    class LancePageSourceProvider {
        -LanceClient lanceClient
        +ConnectorPageSource createPageSource(transactionHandle, session, split, layout, columns, splitContext)
    }

    class LanceFragmentPageSource {
        -LanceClient lanceClient
        -List~FragmentInfo~ fragmentInfos
        -List~ColumnHandle~ columns
        -Dataset dataset
        -List~DatasetFragment~ fragments
        -PageBuilder pageBuilder
        +long getCompletedBytes()
        +long getCompletedPositions()
        +long getReadTimeNanos()
        +boolean isFinished()
        +Page getNextPage()
        +long getSystemMemoryUsage()
        +void close()
    }

    class ArrowVectorPageBuilder {
        -Type columnType
        -BlockBuilder blockBuilder
        -FieldVector arrowVector
        +build()
        +create(columnType, blockBuilder, arrowVector) ArrowVectorPageBuilder
    }

    %% Relationships
    LancePlugin --> LanceConnectorFactory
    LanceConnectorFactory --> LanceModule
    LanceModule --> LanceConnector
    LanceModule --> LanceClient
    LanceModule --> LanceMetadata
    LanceModule --> LanceSplitManager
    LanceModule --> LancePageSourceProvider
    LanceModule --> LancePageSinkProvider
    LanceModule --> LancePageWriter
    LanceModule --> LanceConfig

    LanceConnector --> LanceMetadata
    LanceConnector --> LanceSplitManager
    LanceConnector --> LancePageSourceProvider
    LanceConnector --> LancePageSinkProvider
    LanceConnector --> LanceTransactionHandle

    LanceMetadata --> LanceClient
    LanceMetadata --> LanceTableHandle
    LanceMetadata --> LanceTableLayoutHandle
    LanceMetadata --> LanceColumnHandle
    LanceMetadata --> LanceColumnInfo
    LanceMetadata --> LanceIngestionTableHandle
    LanceMetadata --> LanceColumnType

    LanceHandleResolver --> LanceTableHandle
    LanceHandleResolver --> LanceTableLayoutHandle
    LanceHandleResolver --> LanceColumnHandle
    LanceHandleResolver --> LanceSplit
    LanceHandleResolver --> LanceIngestionTableHandle
    LanceHandleResolver --> LanceTransactionHandle

    LanceColumnHandle --> LanceColumnType
    LanceColumnHandle --> LanceColumnHandleType
    LanceColumnInfo --> LanceColumnType

    LanceIngestionTableHandle --> LanceColumnInfo

    LancePageSinkProvider --> LancePageSink
    LancePageSinkProvider --> LancePageWriter
    LancePageSink --> LanceIngestionTableHandle
    LancePageSink --> LancePageWriter

    LancePageWriter --> LanceClient

    LanceSplitManager --> LanceSplitSource
    LanceSplitManager --> LanceTableLayoutHandle
    LanceSplitSource --> LanceClient
    LanceSplitSource --> LanceTableHandle
    LanceSplitSource --> LanceSplit
    LanceSplit --> FragmentInfo

    LancePageSourceProvider --> LanceFragmentPageSource
    LancePageSourceProvider --> LanceClient
    LancePageSourceProvider --> LanceSplit
    LancePageSourceProvider --> LanceTableLayoutHandle

    LanceFragmentPageSource --> LanceClient
    LanceFragmentPageSource --> FragmentInfo
    LanceFragmentPageSource --> LanceTableHandle
    LanceFragmentPageSource --> ArrowVectorPageBuilder

    ArrowVectorPageBuilder --> LanceColumnType
Loading

File-Level Changes

Change Details Files
Add new presto-lance Maven module and dependencies for LanceDB and Arrow
  • Register the new presto-lance module in the root pom.xml modules list
  • Create presto-lance/pom.xml with packaging type presto-plugin
  • Declare dependencies on Presto SPI/common, Airlift, Guice/Jackson, lancedb-core/lance-core, and Arrow memory/vector modules
  • Configure duplicate-finder plugin exclusions for Arrow-related resources
pom.xml
presto-lance/pom.xml
Implement Lance connector plugin, factory, module, and configuration
  • Define LancePlugin exposing a LanceConnectorFactory that builds connectors using Bootstrap and Guice
  • Implement LanceConnectorFactory to wire Presto ConnectorContext (NodeManager, TypeManager, etc.) into LanceModule
  • Create LanceModule to bind LanceConfig, LanceClient, LanceConnector, metadata, split manager, page source/sink providers, and page writer as singletons
  • Add LanceConfig with configurable lance.root-url pointing to the Lance dataset root
presto-lance/src/main/java/com/facebook/presto/lance/LancePlugin.java
presto-lance/src/main/java/com/facebook/presto/lance/LanceConnectorFactory.java
presto-lance/src/main/java/com/facebook/presto/lance/LanceModule.java
presto-lance/src/main/java/com/facebook/presto/lance/LanceConfig.java
Provide core connector implementation and handle resolution
  • Implement LanceConnector as a Presto Connector delegating to LanceMetadata, LanceSplitManager, LancePageSourceProvider, and LancePageSinkProvider, returning a singleton transaction handle
  • Create LanceHandleResolver mapping connector handle types to Lance-specific implementations for tables, layouts, columns, splits, and ingestion table handles
  • Add LanceTransactionHandle enum as a simple transaction handle singleton
presto-lance/src/main/java/com/facebook/presto/lance/LanceConnector.java
presto-lance/src/main/java/com/facebook/presto/lance/LanceHandleResolver.java
presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceTransactionHandle.java
Implement metadata layer over Lance datasets
  • Create LanceMetadata implementing ConnectorMetadata with a single logical schema and mapping Lance datasets to Presto tables
  • Implement schema and table listing using LanceClient connection tableNames API
  • Build table and column metadata by reading Arrow Schema from LanceClient and mapping to Presto types via LanceColumnType
  • Support table creation, insert, and commit by creating Lance schemas/tables and appending fragments via LanceIngestionTableHandle and LanceClient
  • Provide basic table layout and constraint-based layout construction using LanceTableLayoutHandle
presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceMetadata.java
presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceTableHandle.java
presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceTableLayoutHandle.java
presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceColumnHandle.java
presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceColumnInfo.java
presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceColumnType.java
presto-lance/src/main/java/com/facebook/presto/lance/ingestion/LanceIngestionTableHandle.java
Add Lance client wrapper over LanceDB Java APIs
  • Introduce LanceClient that manages a lancedb Connection and Arrow RootAllocator based on LanceConfig root-url
  • Implement helper methods to create tables, fetch Arrow schemas, open datasets, list DatasetFragments, and compute table paths
  • Provide appendAndCommit and getTableVersion helpers to manage FragmentMetadata-based commits and versioning
presto-lance/src/main/java/com/facebook/presto/lance/client/LanceClient.java
Implement split management and fragment abstraction for scanning
  • Add FragmentInfo as a simple serializable descriptor for a fragment ID
  • Create LanceSplit with split types (BROKER/FRAGMENT) and an optional list of FragmentInfo, implementing ConnectorSplit
  • Implement LanceSplitSource to pull Lance DatasetFragments via LanceClient and emit fragment splits, marking source as finished after one batch
  • Add LanceSplitManager wiring layout handle to LanceSplitSource for query planning
presto-lance/src/main/java/com/facebook/presto/lance/fragments/FragmentInfo.java
presto-lance/src/main/java/com/facebook/presto/lance/splits/LanceSplit.java
presto-lance/src/main/java/com/facebook/presto/lance/splits/LanceSplitSource.java
presto-lance/src/main/java/com/facebook/presto/lance/splits/LanceSplitManager.java
Implement page source for reading from Lance fragments using Arrow
  • Create LanceFragmentPageSource that opens a Lance Dataset from LanceClient, restricts to specific DatasetFragments, and iterates Arrow batches via LanceScanner and ArrowReader
  • Use ArrowVectorPageBuilder to convert Arrow FieldVectors to Presto Blocks for requested columns, tracking completed bytes and positions
  • Handle fragment-by-fragment iteration, closing scanner/reader resources between fragments and exposing ConnectorPageSource metrics
presto-lance/src/main/java/com/facebook/presto/lance/scan/LanceFragmentPageSource.java
presto-lance/src/main/java/com/facebook/presto/lance/scan/LancePageSourceProvider.java
presto-lance/src/main/java/com/facebook/presto/lance/scan/ArrowVectorPageBuilder.java
Implement page sink and writer for appending to Lance tables
  • Add LancePageWriter to materialize a Presto Page into an Arrow VectorSchemaRoot using LanceColumnInfo metadata and write it as a Lance Fragment
  • Implement LancePageSink to accumulate FragmentMetadata for appended pages and return them as JSON-encoded Slice collection in finish()
  • Provide LancePageSinkProvider to build appropriate Arrow Schema from LanceIngestionTableHandle and construct page sinks for both create-table and insert paths
presto-lance/src/main/java/com/facebook/presto/lance/ingestion/LancePageWriter.java
presto-lance/src/main/java/com/facebook/presto/lance/ingestion/LancePageSink.java
presto-lance/src/main/java/com/facebook/presto/lance/ingestion/LancePageSinkProvider.java
Add basic integration test runner for the Lance connector
  • Introduce LanceQueryRunner to stand up a DistributedQueryRunner with the LancePlugin and a default lance.root-url, and to serve as a main entry-point for manual testing
  • Configure session defaults for catalog/schema and Presto HTTP port in the test runner
presto-lance/src/test/java/com/facebook/presto/lance/LanceQueryRunner.java

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@jja725 jja725 changed the title Presto Lance Connector [WIP]Presto Lance Connector Feb 21, 2026
Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 6 issues, and left some high level feedback:

  • LanceColumnHandle.hashCode() only uses columnName while equals() also compares columnType, which can violate the hashCode/equals contract and cause subtle map/set issues; consider including columnType in hashCode or aligning equals with the current hash implementation.
  • ArrowVectorPageBuilder maps both DOUBLE and FLOAT to Float8Vector, but FLOAT columns should be read via the appropriate Arrow float vector (e.g., Float4Vector), otherwise REAL/FLOAT values may be misread or corrupted.
  • FragmentInfo exposes getFragmentId(), but LanceFragmentPageSource uses fragmentInfo.getId(), which will not compile; align the accessor usage (or rename the method/JsonProperty) so the split and page source code refer to the same API.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- LanceColumnHandle.hashCode() only uses columnName while equals() also compares columnType, which can violate the hashCode/equals contract and cause subtle map/set issues; consider including columnType in hashCode or aligning equals with the current hash implementation.
- ArrowVectorPageBuilder maps both DOUBLE and FLOAT to Float8Vector, but FLOAT columns should be read via the appropriate Arrow float vector (e.g., Float4Vector), otherwise REAL/FLOAT values may be misread or corrupted.
- FragmentInfo exposes getFragmentId(), but LanceFragmentPageSource uses fragmentInfo.getId(), which will not compile; align the accessor usage (or rename the method/JsonProperty) so the split and page source code refer to the same API.

## Individual Comments

### Comment 1
<location> `presto-lance/src/main/java/com/facebook/presto/lance/scan/LanceFragmentPageSource.java:64` </location>
<code_context>
+        this.columns = requireNonNull(columns, "columns is null");
+        dataset = lanceClient.open(table.getTableName());
+        List<DatasetFragment> allFragments = dataset.getFragments();
+        this.fragments = allFragments.stream().map(fragmentInfo -> allFragments.get(fragmentInfo.getId())).collect(Collectors.toList());
+        this.pageBuilder = new PageBuilder(columns.stream()
+                .map(column -> ((LanceColumnHandle) column).getColumnType())
</code_context>

<issue_to_address>
**issue (bug_risk):** FragmentInfo API usage is incorrect and will not compile

`FragmentInfo` provides `getFragmentId()`, not `getId()`, so this line will not compile. Also, you’re ignoring the `fragmentInfos` passed into the constructor and always using all dataset fragments. Consider instead:

```java
this.fragments = fragmentInfos.stream()
        .map(info -> allFragments.get(info.getFragmentId()))
        .collect(Collectors.toList());
```

This uses the correct accessor and respects the requested fragment subset.
</issue_to_address>

### Comment 2
<location> `presto-lance/src/main/java/com/facebook/presto/lance/scan/LanceFragmentPageSource.java:185-188` </location>
<code_context>
+    }
+
+    @Override
+    public void close()
+            throws IOException
+    {
+        dataset.close();
+    }
+}
</code_context>

<issue_to_address>
**issue (bug_risk):** PageSource close() only closes the dataset and can leak Arrow resources

If `close()` is called while a scan is in progress, the `ArrowReader` and `LanceScanner` remain open and their resources are leaked. Consider delegating to `closeResources()` (or equivalent) so all Arrow/Lance resources are released before closing the dataset:

```java
@Override
public void close() throws IOException
{
    try {
        if (arrowReader != null || scanner != null) {
            closeResources();
        }
    }
    finally {
        dataset.close();
    }
}
```
</issue_to_address>

### Comment 3
<location> `presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceMetadata.java:88-94` </location>
<code_context>
+        String tablePath = getTablePath(tableName);
+        //Create the directory for the table if it's on local file system
+        if (tablePath.startsWith("file:")) {
+            try {
+                new File(new URI(tablePath)).mkdir();
+            }
</code_context>

<issue_to_address>
**issue (bug_risk):** Swallowing all exceptions in getTableHandle makes debugging and error handling difficult

Catching a generic `Exception` and returning `null` means real failures (e.g., connectivity or Lance errors) are indistinguishable from “table not found,” making diagnosis hard. Please catch only the expected failure types, and for unexpected exceptions rethrow or wrap them in a `PrestoException` with an appropriate error code instead of treating them as a missing table.
</issue_to_address>

### Comment 4
<location> `presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceMetadata.java:163` </location>
<code_context>
+                singletonList(prefix.toSchemaTableName()) : listTables(session, Optional.ofNullable(prefix.getSchemaName()));
+        ImmutableMap.Builder<SchemaTableName, List<ColumnMetadata>> columns = ImmutableMap.builder();
+        for (SchemaTableName tableName : tables) {
+            ConnectorTableMetadata tableMetadata = getTableMetadata(session, getTableHandle(session, tableName));
+            if (tableMetadata != null) {
+                columns.put(tableName, tableMetadata.getColumns());
</code_context>

<issue_to_address>
**issue (bug_risk):** Possible NPE path when resolving table handles in listTableColumns

`getTableHandle(session, tableName)` may return `null`, but its result is passed directly into `getTableMetadata` and then cast to `LanceTableHandle`, which can cause an NPE. Consider assigning the handle to a local variable, checking for `null`, and skipping that table when the handle cannot be resolved.
</issue_to_address>

### Comment 5
<location> `presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceMetadata.java:194` </location>
<code_context>
+    {
+        LanceIngestionTableHandle lanceTableHandle = (LanceIngestionTableHandle) tableHandle;
+        List<FragmentMetadata> fragmentMetadataList = fragments.stream()
+                .map(fragmentSlice -> FragmentMetadata.fromJson(new String(fragmentSlice.getBytes())))
+                .collect(Collectors.toList());
+        long tableReadVersion = lanceClient.getTableVersion(lanceTableHandle.getTableName());
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Use an explicit charset when decoding JSON metadata from Slice

`new String(fragmentSlice.getBytes())` depends on the platform default charset and may behave differently across environments. Since the fragment metadata JSON is UTF-8, use `new String(fragmentSlice.getBytes(), StandardCharsets.UTF_8)` (and import `StandardCharsets`) for deterministic decoding.

Suggested implementation:

```java
        List<FragmentMetadata> fragmentMetadataList = fragments.stream()
                .map(fragmentSlice -> FragmentMetadata.fromJson(new String(fragmentSlice.getBytes(), StandardCharsets.UTF_8)))
                .collect(Collectors.toList());

```

Add the following import to the imports section at the top of `LanceMetadata.java`:

`import java.nio.charset.StandardCharsets;`

Place it alongside the other `java.*` imports to keep the import ordering consistent.
</issue_to_address>

### Comment 6
<location> `presto-lance/src/test/java/com/facebook/presto/lance/LanceQueryRunner.java:27` </location>
<code_context>
+import static com.facebook.presto.testing.TestingSession.testSessionBuilder;
+import static java.lang.String.format;
+
+public class LanceQueryRunner
+{
+    private static final Logger log = Logger.get(LanceQueryRunner.class);
</code_context>

<issue_to_address>
**issue (testing):** LanceQueryRunner is added but there are no actual tests using it to validate connector behavior end-to-end.

This provides the right foundation for integration tests, but nothing currently exercises the Lance connector end-to-end. Please add at least a minimal integration test suite (e.g., `TestLanceConnectorSmoke`) using `LanceQueryRunner` to:

- Create a Lance table via Presto, insert rows, and verify they can be read.
- Cover read/write round-trips for key types (BIGINT, INTEGER, DOUBLE/FLOAT, VARCHAR, BOOLEAN, TIMESTAMP).
- Include cases for empty tables and simple predicates (`WHERE` filters) to validate splits and page sources.

Concrete tests against an in-process Presto cluster with the Lance connector will validate that the connector actually works as intended.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@jja725 jja725 changed the title [WIP]Presto Lance Connector (feat) Presto Lance Connector Feb 21, 2026
@jja725 jja725 changed the title (feat) Presto Lance Connector feat(connector) Presto Lance Connector Feb 21, 2026
@jja725 jja725 changed the title feat(connector) Presto Lance Connector feat(connector): Presto Lance Connector Feb 21, 2026
@jja725 jja725 changed the title feat(connector): Presto Lance Connector feat(plugin-lance): Presto Lance Connector Feb 21, 2026
@jja725 jja725 changed the title feat(plugin-lance): Presto Lance Connector feat(plugin-lance): add Presto Lance Connector Feb 21, 2026
@jja725 jja725 changed the title feat(plugin-lance): add Presto Lance Connector feat(connector): Presto Lance Connector Feb 21, 2026
@tdcmeehan tdcmeehan self-assigned this Feb 21, 2026
Copy link
Copy Markdown
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checklist item
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
is checked in the PR description, but there is no documentation for this new connector in the PR.

Please add documentation of the new connector in a new file lance.rst in https://github.com/prestodb/presto/tree/master/presto-docs/src/main/sphinx/connector.

(You must add lance.rst to the file https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/connector.rst for the new doc page to show up in the Connectors index page.)

@beinan
Copy link
Copy Markdown
Member

beinan commented Feb 24, 2026

Thank you @jja725 for continuing the lance connector work! Great work! thanks a lot!

Copy link
Copy Markdown
Member

@beinan beinan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost good, I think it's a good starting point, pls add document and fix the ci.

Also let's work on the velox support of lance once this pr got merged. I think it would show the strength of prestissimo and velox!

LanceTableLayoutHandle layoutHandle = (LanceTableLayoutHandle) layout;
LanceTableHandle tableHandle = layoutHandle.getTable();

List<Fragment> fragments = namespaceHolder.getFragments(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be out of scope of this PR, can we use "partitioned namespace" to implement the partitioning as hive or iceberg table? here is the doc https://lance.org/format/namespace/partitioning-spec/#partition-namespace-naming

we can sync offline for the details

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lance-format/lance#5896 seems like the implementation is still in progress. We can probably add that later for sure.

@jja725 jja725 requested a review from elharo as a code owner February 24, 2026 06:08
Copy link
Copy Markdown
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the documentation! Looks great, just a nit.

steveburnett
steveburnett previously approved these changes Feb 26, 2026
Copy link
Copy Markdown
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs

tdcmeehan pushed a commit to tdcmeehan/velox that referenced this pull request Mar 4, 2026
Review covers code quality, efficiency, and reuse concerns across
the new Lance connector implementation. Identifies 5 high-severity
issues (hardcoded array type, ignored schema parameter, unmanaged
allocator, unsafe serialization, N+1 dataset opens) and 7 medium-
severity issues.

https://claude.ai/code/session_01JdR9Ba8gGhMK3ZRMsLTuhu
beinan
beinan previously approved these changes Mar 4, 2026
Copy link
Copy Markdown
Contributor

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jja725 , from an Arrow perspective it would be good to make use of existing conversion implementations. Since Arrow is a standard, all Presto <-> Arrow conversions should be basically the same and Presto would benefit from having common conversion routines for all modules rather than multiple overlapping implementations.

Although presto-base-arrow-flight is already a library, we could move conversions to a simplified utility library that doesn't depend on Flight, wdyt?

pageBuilder.declarePositions(rowCount);
}

private void writeVectorToBlock(FieldVector vector, BlockBuilder blockBuilder, Type type, int rowCount)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of overlap between ArrowBlockBuilder which is already in presto-base-arrow-flight https://github.com/prestodb/presto/blob/master/presto-base-arrow-flight/src/main/java/com/facebook/plugin/arrow/ArrowBlockBuilder.java#L97. Could you make use of this for Arrow to Presto conversions?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I can check this out to see how we can integrate it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactor the common piece into a separate module. Let's work together to make arrow module easy to use for all connector ;)

writeBlockToVectorAtOffset(block, vector, type, rowCount, 0);
}

public static void writeBlockToVectorAtOffset(Block block, FieldVector vector, Type type, int rowCount, int offset)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, we introduced Presto to Arrow conversion in this PR https://github.com/prestodb/presto/pull/26369/changes#diff-8d0a24e95c2db9de286d10ed8465a20a8cd8ac97d3d43230da11c602d8e0ecfdR106. There is maybe some minor implementation differences, but the underlying conversions look to be the same.

Copy link
Copy Markdown
Contributor Author

@jja725 jja725 Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's working in progress, let's merge the current implementation and have a separate refactor PR

@steveburnett
Copy link
Copy Markdown
Contributor

Thank you for the release note! Just a nit that the release process automates the link to the PR, you don't need to add it yourself anymore. As a suggestion, linking to the documentation makes it easier for the reader of the release note to get more information.

== RELEASE NOTES ==

Lance Connector Changes
* Add :doc:`/connector/lance` for reading and writing LanceDB datasets. 

@jja725
Copy link
Copy Markdown
Contributor Author

jja725 commented Mar 9, 2026

Thank you for the release note! Just a nit that the release process automates the link to the PR, you don't need to add it yourself anymore. As a suggestion, linking to the documentation makes it easier for the reader of the release note to get more information.

== RELEASE NOTES ==

Lance Connector Changes
* Add :doc:`/connector/lance` for reading and writing LanceDB datasets. 

done, thanks for the tips

Replace custom Arrow-to-Presto type conversion in LanceArrowToPageScanner
with the existing ArrowBlockBuilder, eliminating ~100 lines of duplicate
code. Add FixedSizeListVector and TimeStampMicroTZVector support to
ArrowBlockBuilder. Switch from PageBuilder to Block[]-based Page
construction for simpler data flow.
@jja725 jja725 dismissed stale reviews from beinan and steveburnett via f5cc9df March 10, 2026 04:24
Move ArrowBlockBuilder, ArrowErrorCode, and ArrowException from
presto-base-arrow-flight into a new presto-base-arrow library module.
This avoids connectors like presto-lance depending directly on the
arrow-flight plugin. Both presto-base-arrow-flight and presto-lance
now depend on this common module instead.
steveburnett
steveburnett previously approved these changes Mar 10, 2026
Copy link
Copy Markdown
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull updated branch, new local doc build, looks good. Thanks!

@beinan beinan merged commit 8e27030 into prestodb:master Mar 12, 2026
83 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants