feat(connector): Presto Lance Connector #27185
Conversation
Reviewer's Guide

Introduces a new Presto connector for Lance.

Sequence diagram for reading data via the presto-lance connector:

sequenceDiagram
participant QE as Presto_engine
participant LCn as LanceConnector
participant LSM as LanceSplitManager
participant LSS as LanceSplitSource
participant LPSrc as LancePageSourceProvider
participant LFPS as LanceFragmentPageSource
participant LCl as LanceClient
participant LDS as LanceDB_Dataset
QE->>LCn: beginTransaction(isolationLevel, readOnly)
LCn-->>QE: LanceTransactionHandle.INSTANCE
QE->>LCn: getSplitManager()
LCn-->>QE: LanceSplitManager
QE->>LSM: getSplits(transactionHandle, session, layout)
LSM->>LSS: new LanceSplitSource(lanceClient, table)
loop enumerate fragments
LSS->>LCl: getFragments(tableName)
LCl->>LDS: open and list DatasetFragment
LDS-->>LCl: fragments
LCl-->>LSS: List DatasetFragment
LSS-->>QE: LanceSplit for each FragmentInfo
end
QE->>LCn: getPageSourceProvider()
LCn-->>QE: LancePageSourceProvider
QE->>LPSrc: createPageSource(transactionHandle, session, split, layout, columns)
LPSrc->>LCl: open(tableName)
LCl-->>LPSrc: Dataset
LPSrc-->>QE: LanceFragmentPageSource
loop read pages
QE->>LFPS: getNextPage()
alt first batch of fragment
LFPS->>LDS: newScan() on DatasetFragment
LDS-->>LFPS: LanceScanner
end
LFPS->>LDS: scanBatches()
LDS-->>LFPS: ArrowReader
LFPS->>LDS: loadNextBatch()
LDS-->>LFPS: VectorSchemaRoot
LFPS->>LFPS: ArrowVectorPageBuilder.build() per column
LFPS-->>QE: Page
end
QE->>LFPS: close()
LFPS->>LDS: dataset.close()
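The per-fragment read loop in the diagram above can be sketched in plain Java. Everything here is a stdlib stand-in (fragments modeled as lists of strings instead of `DatasetFragment`/`ArrowReader`, pages as strings); it only illustrates the lazy "open a scanner on the first batch of each fragment" control flow, not the connector's actual code:

```java
import java.util.Iterator;
import java.util.List;

final class FragmentPageSourceSketch
{
    private final Iterator<List<String>> fragments; // each inner list = one fragment's batches
    private Iterator<String> currentScanner;        // opened lazily, one per fragment

    FragmentPageSourceSketch(List<List<String>> fragments)
    {
        this.fragments = fragments.iterator();
    }

    // Mirrors getNextPage(): returns null once all fragments are exhausted (isFinished)
    String getNextPage()
    {
        while (currentScanner == null || !currentScanner.hasNext()) {
            if (!fragments.hasNext()) {
                return null; // no more fragments
            }
            currentScanner = fragments.next().iterator(); // "newScan()" on the next fragment
        }
        return currentScanner.next(); // "loadNextBatch()" converted into a Page
    }
}
```

Note how empty fragments are skipped transparently by the while loop, which is the case a naive `if` would get wrong.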
Sequence diagram for writing data via the presto-lance connector:

sequenceDiagram
participant QE as Presto_engine
participant LCn as LanceConnector
participant LMd as LanceMetadata
participant LPSinkProv as LancePageSinkProvider
participant LPSink as LancePageSink
participant LPW as LancePageWriter
participant LCl as LanceClient
participant LDS as LanceDB_Dataset
QE->>LCn: beginTransaction(isolationLevel, readOnly)
LCn-->>QE: LanceTransactionHandle.INSTANCE
Note over QE,LMd: Table creation
QE->>LCn: getMetadata(transactionHandle)
LCn-->>QE: LanceMetadata
QE->>LMd: beginCreateTable(session, tableMetadata, layout)
LMd->>LCl: createTable(tableName, schema)
LCl->>LDS: Dataset.create(rootAllocator, tablePath, schema)
LDS-->>LCl: Dataset
LCl-->>LMd: table created
LMd-->>QE: LanceIngestionTableHandle
Note over QE,LPSink: Data ingestion
QE->>LCn: getPageSinkProvider()
LCn-->>QE: LancePageSinkProvider
QE->>LPSinkProv: createPageSink(transactionHandle, session, outputTableHandle, context)
LPSinkProv->>LPW: use LancePageWriter
LPSinkProv-->>QE: LancePageSink
loop for each Page
QE->>LPSink: appendPage(page)
LPSink->>LPW: append(page, tableHandle, arrowSchema)
LPW->>LCl: getTablePath(tableName)
LPW->>LCl: getArrowRootAllocator()
LPW->>LDS: Fragment.create(tablePath, allocator, VectorSchemaRoot)
LDS-->>LPW: FragmentMetadata
LPW-->>LPSink: FragmentMetadata
end
QE->>LPSink: finish()
LPSink-->>QE: Collection of Slice(fragmentMetadataJson)
Note over QE,LMd: Commit table
QE->>LMd: finishCreateTable(session, tableHandle, fragments, stats)
LMd->>LCl: getTableVersion(tableName)
LCl-->>LMd: tableReadVersion
LMd->>LCl: appendAndCommit(tableName, fragmentMetadataList, tableReadVersion)
LCl->>LDS: Dataset.commit(appendOp, tableReadVersion)
LDS-->>LCl: new version
LCl-->>LMd: committed
LMd-->>QE: Optional.empty()
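The commit step at the end of the write flow — read the table version in `finishCreateTable()`, then commit the staged fragments against that version — amounts to optimistic concurrency control. A stdlib-only sketch of that idea follows; all names are stand-ins for `LanceClient`/`Dataset.commit`, and the rejection of stale commits is an assumption about Lance's behavior, not something shown in this PR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class VersionedTableSketch
{
    private final AtomicLong version = new AtomicLong(0);
    private final List<String> fragments = new ArrayList<>();

    long getTableVersion()
    {
        return version.get();
    }

    // Commits new fragments; fails if anyone committed since readVersion was taken
    long appendAndCommit(List<String> newFragments, long readVersion)
    {
        if (version.get() != readVersion) {
            throw new IllegalStateException("conflicting commit since version " + readVersion);
        }
        fragments.addAll(newFragments);
        return version.incrementAndGet(); // the "new version" returned to LanceMetadata
    }
}
```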
Class diagram for the main presto-lance connector types:

classDiagram
class LancePlugin {
+Iterable~ConnectorFactory~ getConnectorFactories()
}
class LanceConnectorFactory {
+String getName()
+ConnectorHandleResolver getHandleResolver()
+Connector create(catalogName, config, context)
}
class LanceModule {
+configure(binder)
}
class LanceConnector {
-LanceMetadata metadata
-LanceSplitManager splitManager
-LancePageSourceProvider pageSourceProvider
-LancePageSinkProvider pageSinkProvider
+beginTransaction(isolationLevel, readOnly)
+getMetadata(transactionHandle)
+getSplitManager()
+getPageSourceProvider()
+getPageSinkProvider()
}
class LanceConfig {
-String rootUrl
+String getRootUrl()
+LanceConfig setRootUrl(rootUrl)
}
class LanceClient {
-LanceConfig config
-Connection conn
-RootAllocator arrowRootAllocator
+Connection getConn()
+List~DatasetFragment~ getFragments(tableName)
+RootAllocator getArrowRootAllocator()
+void createTable(tableName, schema)
+Schema getSchema(tableName)
+String getTablePath(tableName)
+long appendAndCommit(tableName, fragmentMetadataList, tableReadVersion)
+long getTableVersion(tableName)
+Dataset open(tableName)
}
class LanceMetadata {
+schemaExists(session, schemaName) bool
+listSchemaNames(session) List~String~
+getTableHandle(session, tableName, tableVersion) ConnectorTableHandle
+getTableLayout(session, handle) ConnectorTableLayout
+getTableMetadata(session, table) ConnectorTableMetadata
+listTables(session, schemaName) List~SchemaTableName~
+getColumnHandles(session, tableHandle) Map~String, ColumnHandle~
+getColumnMetadata(session, tableHandle, columnHandle) ColumnMetadata
+listTableColumns(session, prefix) Map~SchemaTableName, List~ColumnMetadata~~
+beginCreateTable(session, tableMetadata, layout) ConnectorOutputTableHandle
+finishCreateTable(session, tableHandle, fragments, stats) Optional~ConnectorOutputMetadata~
+beginInsert(session, tableHandle) ConnectorInsertTableHandle
+finishInsert(session, tableHandle, fragments, stats) Optional~ConnectorOutputMetadata~
+getTableLayoutForConstraint(session, table, constraint, desiredColumns) ConnectorTableLayoutResult
}
class LanceHandleResolver {
+Class getTableHandleClass()
+Class getTableLayoutHandleClass()
+Class getColumnHandleClass()
+Class getSplitClass()
+Class getOutputTableHandleClass()
+Class getInsertTableHandleClass()
+Class getTransactionHandleClass()
}
class LanceTransactionHandle {
<<enum>>
+INSTANCE
}
class LanceTableHandle {
-String schemaName
-String tableName
+String getSchemaName()
+String getTableName()
}
class LanceTableLayoutHandle {
-LanceTableHandle table
-TupleDomain~ColumnHandle~ tupleDomain
+LanceTableHandle getTable()
+TupleDomain~ColumnHandle~ getTupleDomain()
}
class LanceColumnType {
<<enum>>
BIGINT
INTEGER
DOUBLE
FLOAT
VARCHAR
TIMESTAMP
BOOLEAN
OTHER
+ArrowType getArrowType()
+Type getPrestoType()
+LanceColumnType fromPrestoType(type)
+LanceColumnType fromArrowType(type)
}
class LanceColumnHandle {
-String columnName
-Type columnType
-LanceColumnHandleType type
+String getColumnName()
+Type getColumnType()
+LanceColumnHandleType getType()
+ColumnMetadata getColumnMetadata()
}
class LanceColumnHandleType {
<<enum>>
REGULAR
DERIVED
}
class LanceColumnInfo {
-String columnName
-LanceColumnType dataType
+String getColumnName()
+LanceColumnType getDataType()
}
class LanceIngestionTableHandle {
-String schemaName
-String tableName
-List~LanceColumnInfo~ columns
+String getSchemaName()
+String getTableName()
+List~LanceColumnInfo~ getColumns()
}
class LancePageWriter {
-LanceConfig lanceConfig
-LanceClient lanceClient
+FragmentMetadata append(page, tableHandle, arrowSchema)
}
class LancePageSinkProvider {
-LanceConfig lanceConfig
-LancePageWriter lancePageWriter
+ConnectorPageSink createPageSink(transactionHandle, session, outputTableHandle, pageSinkContext)
+ConnectorPageSink createPageSink(transactionHandle, session, insertTableHandle, pageSinkContext)
}
class LancePageSink {
-LanceConfig lanceConfig
-LanceIngestionTableHandle tableHandle
-LancePageWriter lancePageWriter
-Schema arrowSchema
+CompletableFuture appendPage(page)
+CompletableFuture finish()
+void abort()
}
class FragmentInfo {
-int fragmentId
+int getFragmentId()
}
class LanceSplit {
-SplitType splitType
-Optional~List~FragmentInfo~~ fragments
+SplitType getSplitType()
+Optional~List~FragmentInfo~~ getFragments()
}
class SplitType {
<<enum>>
FRAGMENT
BROKER
}
class LanceSplitSource {
-LanceClient lanceClient
-LanceTableHandle lanceTable
-boolean isFinished
+CompletableFuture getNextBatch(partitionHandle, maxSize)
+void close()
+boolean isFinished()
}
class LanceSplitManager {
-LanceClient lanceClient
+ConnectorSplitSource getSplits(transactionHandle, session, layout, splitSchedulingContext)
}
class LancePageSourceProvider {
-LanceClient lanceClient
+ConnectorPageSource createPageSource(transactionHandle, session, split, layout, columns, splitContext)
}
class LanceFragmentPageSource {
-LanceClient lanceClient
-List~FragmentInfo~ fragmentInfos
-List~ColumnHandle~ columns
-Dataset dataset
-List~DatasetFragment~ fragments
-PageBuilder pageBuilder
+long getCompletedBytes()
+long getCompletedPositions()
+long getReadTimeNanos()
+boolean isFinished()
+Page getNextPage()
+long getSystemMemoryUsage()
+void close()
}
class ArrowVectorPageBuilder {
-Type columnType
-BlockBuilder blockBuilder
-FieldVector arrowVector
+build()
+create(columnType, blockBuilder, arrowVector) ArrowVectorPageBuilder
}
%% Relationships
LancePlugin --> LanceConnectorFactory
LanceConnectorFactory --> LanceModule
LanceModule --> LanceConnector
LanceModule --> LanceClient
LanceModule --> LanceMetadata
LanceModule --> LanceSplitManager
LanceModule --> LancePageSourceProvider
LanceModule --> LancePageSinkProvider
LanceModule --> LancePageWriter
LanceModule --> LanceConfig
LanceConnector --> LanceMetadata
LanceConnector --> LanceSplitManager
LanceConnector --> LancePageSourceProvider
LanceConnector --> LancePageSinkProvider
LanceConnector --> LanceTransactionHandle
LanceMetadata --> LanceClient
LanceMetadata --> LanceTableHandle
LanceMetadata --> LanceTableLayoutHandle
LanceMetadata --> LanceColumnHandle
LanceMetadata --> LanceColumnInfo
LanceMetadata --> LanceIngestionTableHandle
LanceMetadata --> LanceColumnType
LanceHandleResolver --> LanceTableHandle
LanceHandleResolver --> LanceTableLayoutHandle
LanceHandleResolver --> LanceColumnHandle
LanceHandleResolver --> LanceSplit
LanceHandleResolver --> LanceIngestionTableHandle
LanceHandleResolver --> LanceTransactionHandle
LanceColumnHandle --> LanceColumnType
LanceColumnHandle --> LanceColumnHandleType
LanceColumnInfo --> LanceColumnType
LanceIngestionTableHandle --> LanceColumnInfo
LancePageSinkProvider --> LancePageSink
LancePageSinkProvider --> LancePageWriter
LancePageSink --> LanceIngestionTableHandle
LancePageSink --> LancePageWriter
LancePageWriter --> LanceClient
LanceSplitManager --> LanceSplitSource
LanceSplitManager --> LanceTableLayoutHandle
LanceSplitSource --> LanceClient
LanceSplitSource --> LanceTableHandle
LanceSplitSource --> LanceSplit
LanceSplit --> FragmentInfo
LancePageSourceProvider --> LanceFragmentPageSource
LancePageSourceProvider --> LanceClient
LancePageSourceProvider --> LanceSplit
LancePageSourceProvider --> LanceTableLayoutHandle
LanceFragmentPageSource --> LanceClient
LanceFragmentPageSource --> FragmentInfo
LanceFragmentPageSource --> LanceTableHandle
LanceFragmentPageSource --> ArrowVectorPageBuilder
ArrowVectorPageBuilder --> LanceColumnType
File-Level Changes
Hey - I've found 6 issues, and left some high level feedback:
- LanceColumnHandle.hashCode() only uses columnName while equals() also compares columnType, which can violate the hashCode/equals contract and cause subtle map/set issues; consider including columnType in hashCode or aligning equals with the current hash implementation.
- ArrowVectorPageBuilder maps both DOUBLE and FLOAT to Float8Vector, but FLOAT columns should be read via the appropriate Arrow float vector (e.g., Float4Vector), otherwise REAL/FLOAT values may be misread or corrupted.
- FragmentInfo exposes getFragmentId(), but LanceFragmentPageSource uses fragmentInfo.getId(), which will not compile; align the accessor usage (or rename the method/JsonProperty) so the split and page source code refer to the same API.
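The first point — keep `hashCode()` consistent with `equals()` — can be illustrated with a small stand-in class. `ColumnKey` is hypothetical (Presto's `Type` is simplified to a `String`); it only shows the shape of the suggested fix, hashing every field that `equals()` compares:

```java
import java.util.Objects;

final class ColumnKey
{
    private final String columnName;
    private final String columnType; // stand-in for Presto's Type

    ColumnKey(String columnName, String columnType)
    {
        this.columnName = columnName;
        this.columnType = columnType;
    }

    @Override
    public boolean equals(Object o)
    {
        if (this == o) {
            return true;
        }
        if (!(o instanceof ColumnKey)) {
            return false;
        }
        ColumnKey that = (ColumnKey) o;
        return columnName.equals(that.columnName) && columnType.equals(that.columnType);
    }

    @Override
    public int hashCode()
    {
        // Include both fields so equal objects always hash equal
        return Objects.hash(columnName, columnType);
    }
}
```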
## Individual Comments
### Comment 1
<location> `presto-lance/src/main/java/com/facebook/presto/lance/scan/LanceFragmentPageSource.java:64` </location>
<code_context>
+ this.columns = requireNonNull(columns, "columns is null");
+ dataset = lanceClient.open(table.getTableName());
+ List<DatasetFragment> allFragments = dataset.getFragments();
+ this.fragments = allFragments.stream().map(fragmentInfo -> allFragments.get(fragmentInfo.getId())).collect(Collectors.toList());
+ this.pageBuilder = new PageBuilder(columns.stream()
+ .map(column -> ((LanceColumnHandle) column).getColumnType())
</code_context>
<issue_to_address>
**issue (bug_risk):** FragmentInfo API usage is incorrect and will not compile
`FragmentInfo` provides `getFragmentId()`, not `getId()`, so this line will not compile. Also, you’re ignoring the `fragmentInfos` passed into the constructor and always using all dataset fragments. Consider instead:
```java
this.fragments = fragmentInfos.stream()
.map(info -> allFragments.get(info.getFragmentId()))
.collect(Collectors.toList());
```
This uses the correct accessor and respects the requested fragment subset.
</issue_to_address>
### Comment 2
<location> `presto-lance/src/main/java/com/facebook/presto/lance/scan/LanceFragmentPageSource.java:185-188` </location>
<code_context>
+ }
+
+ @Override
+ public void close()
+ throws IOException
+ {
+ dataset.close();
+ }
+}
</code_context>
<issue_to_address>
**issue (bug_risk):** PageSource close() only closes the dataset and can leak Arrow resources
If `close()` is called while a scan is in progress, the `ArrowReader` and `LanceScanner` remain open and their resources are leaked. Consider delegating to `closeResources()` (or equivalent) so all Arrow/Lance resources are released before closing the dataset:
```java
@Override
public void close() throws IOException
{
try {
if (arrowReader != null || scanner != null) {
closeResources();
}
}
finally {
dataset.close();
}
}
```
</issue_to_address>
### Comment 3
<location> `presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceMetadata.java:88-94` </location>
<code_context>
+ String tablePath = getTablePath(tableName);
+ //Create the directory for the table if it's on local file system
+ if (tablePath.startsWith("file:")) {
+ try {
+ new File(new URI(tablePath)).mkdir();
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** Swallowing all exceptions in getTableHandle makes debugging and error handling difficult
Catching a generic `Exception` and returning `null` means real failures (e.g., connectivity or Lance errors) are indistinguishable from “table not found,” making diagnosis hard. Please catch only the expected failure types, and for unexpected exceptions rethrow or wrap them in a `PrestoException` with an appropriate error code instead of treating them as a missing table.
</issue_to_address>
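The suggested shape — catch only the expected not-found case and rethrow everything else — might look like this self-contained sketch. `TableNotFoundException`, `openDataset`, and the in-memory catalog are illustrative stand-ins, not the Lance SDK's API; a real fix would wrap unexpected failures in a `PrestoException` with a connector error code:

```java
import java.util.Map;

final class TableResolver
{
    // Hypothetical stand-in for the Lance catalog; a real connector consults the SDK
    static final Map<String, String> CATALOG = Map.of("orders", "file:/tmp/orders.lance");

    static class TableNotFoundException extends RuntimeException {}

    static String openDataset(String tableName)
    {
        String path = CATALOG.get(tableName);
        if (path == null) {
            throw new TableNotFoundException();
        }
        return path;
    }

    static String getTableHandle(String tableName)
    {
        try {
            return openDataset(tableName);
        }
        catch (TableNotFoundException e) {
            return null; // expected: the table genuinely does not exist
        }
        catch (RuntimeException e) {
            // unexpected failure: surface it instead of masking it as "not found"
            throw new IllegalStateException("Failed to resolve table " + tableName, e);
        }
    }
}
```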
### Comment 4
<location> `presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceMetadata.java:163` </location>
<code_context>
+ singletonList(prefix.toSchemaTableName()) : listTables(session, Optional.ofNullable(prefix.getSchemaName()));
+ ImmutableMap.Builder<SchemaTableName, List<ColumnMetadata>> columns = ImmutableMap.builder();
+ for (SchemaTableName tableName : tables) {
+ ConnectorTableMetadata tableMetadata = getTableMetadata(session, getTableHandle(session, tableName));
+ if (tableMetadata != null) {
+ columns.put(tableName, tableMetadata.getColumns());
</code_context>
<issue_to_address>
**issue (bug_risk):** Possible NPE path when resolving table handles in listTableColumns
`getTableHandle(session, tableName)` may return `null`, but its result is passed directly into `getTableMetadata` and then cast to `LanceTableHandle`, which can cause an NPE. Consider assigning the handle to a local variable, checking for `null`, and skipping that table when the handle cannot be resolved.
</issue_to_address>
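A shape-only sketch of that guard follows; the map lookup stands in for `getTableHandle()`/`getTableMetadata()`, either of which may yield null for a table that cannot be resolved:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ColumnLister
{
    static Map<String, List<String>> listTableColumns(
            Map<String, List<String>> metadataByTable,
            List<String> tables)
    {
        Map<String, List<String>> columns = new LinkedHashMap<>();
        for (String table : tables) {
            // Resolve into a local first: the handle may be null
            List<String> tableColumns = metadataByTable.get(table);
            if (tableColumns == null) {
                continue; // skip instead of passing null onward (NPE risk)
            }
            columns.put(table, new ArrayList<>(tableColumns));
        }
        return columns;
    }
}
```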
### Comment 5
<location> `presto-lance/src/main/java/com/facebook/presto/lance/metadata/LanceMetadata.java:194` </location>
<code_context>
+ {
+ LanceIngestionTableHandle lanceTableHandle = (LanceIngestionTableHandle) tableHandle;
+ List<FragmentMetadata> fragmentMetadataList = fragments.stream()
+ .map(fragmentSlice -> FragmentMetadata.fromJson(new String(fragmentSlice.getBytes())))
+ .collect(Collectors.toList());
+ long tableReadVersion = lanceClient.getTableVersion(lanceTableHandle.getTableName());
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Use an explicit charset when decoding JSON metadata from Slice
`new String(fragmentSlice.getBytes())` depends on the platform default charset and may behave differently across environments. Since the fragment metadata JSON is UTF-8, use `new String(fragmentSlice.getBytes(), StandardCharsets.UTF_8)` (and import `StandardCharsets`) for deterministic decoding.
Suggested implementation:
```java
List<FragmentMetadata> fragmentMetadataList = fragments.stream()
.map(fragmentSlice -> FragmentMetadata.fromJson(new String(fragmentSlice.getBytes(), StandardCharsets.UTF_8)))
.collect(Collectors.toList());
```
Add the following import to the imports section at the top of `LanceMetadata.java`:
`import java.nio.charset.StandardCharsets;`
Place it alongside the other `java.*` imports to keep the import ordering consistent.
</issue_to_address>
### Comment 6
<location> `presto-lance/src/test/java/com/facebook/presto/lance/LanceQueryRunner.java:27` </location>
<code_context>
+import static com.facebook.presto.testing.TestingSession.testSessionBuilder;
+import static java.lang.String.format;
+
+public class LanceQueryRunner
+{
+ private static final Logger log = Logger.get(LanceQueryRunner.class);
</code_context>
<issue_to_address>
**issue (testing):** LanceQueryRunner is added but there are no actual tests using it to validate connector behavior end-to-end.
This provides the right foundation for integration tests, but nothing currently exercises the Lance connector end-to-end. Please add at least a minimal integration test suite (e.g., `TestLanceConnectorSmoke`) using `LanceQueryRunner` to:
- Create a Lance table via Presto, insert rows, and verify they can be read.
- Cover read/write round-trips for key types (BIGINT, INTEGER, DOUBLE/FLOAT, VARCHAR, BOOLEAN, TIMESTAMP).
- Include cases for empty tables and simple predicates (`WHERE` filters) to validate splits and page sources.
Concrete tests against an in-process Presto cluster with the Lance connector will validate that the connector actually works as intended.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
steveburnett left a comment
The checklist item
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
is checked in the PR description, but there is no documentation for this new connector in the PR.
Please add documentation of the new connector in a new file lance.rst in https://github.com/prestodb/presto/tree/master/presto-docs/src/main/sphinx/connector.
(You must add lance.rst to the file https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/connector.rst for the new doc page to show up in the Connectors index page.)
Thank you @jja725 for continuing the lance connector work! Great work! thanks a lot!
beinan left a comment
almost good, I think it's a good starting point, pls add document and fix the ci.
Also let's work on the velox support of lance once this pr got merged. I think it would show the strength of prestissimo and velox!
LanceTableLayoutHandle layoutHandle = (LanceTableLayoutHandle) layout;
LanceTableHandle tableHandle = layoutHandle.getTable();

List<Fragment> fragments = namespaceHolder.getFragments(
might be out of scope of this PR, can we use "partitioned namespace" to implement the partitioning as hive or iceberg table? here is the doc https://lance.org/format/namespace/partitioning-spec/#partition-namespace-naming
we can sync offline for the details
lance-format/lance#5896 seems like the implementation is still in progress. We can probably add that later for sure.
steveburnett left a comment
Thank you for the documentation! Looks great, just a nit.
Review covers code quality, efficiency, and reuse concerns across the new Lance connector implementation. Identifies 5 high-severity issues (hardcoded array type, ignored schema parameter, unmanaged allocator, unsafe serialization, N+1 dataset opens) and 7 medium-severity issues. https://claude.ai/code/session_01JdR9Ba8gGhMK3ZRMsLTuhu
BryanCutler left a comment
Thanks @jja725 , from an Arrow perspective it would be good to make use of existing conversion implementations. Since Arrow is a standard, all Presto <-> Arrow conversions should be basically the same and Presto would benefit from having common conversion routines for all modules rather than multiple overlapping implementations.
Although presto-base-arrow-flight is already a library, we could move conversions to a simplified utility library that doesn't depend on Flight, wdyt?
pageBuilder.declarePositions(rowCount);
}

private void writeVectorToBlock(FieldVector vector, BlockBuilder blockBuilder, Type type, int rowCount)
There is a lot of overlap between ArrowBlockBuilder which is already in presto-base-arrow-flight https://github.com/prestodb/presto/blob/master/presto-base-arrow-flight/src/main/java/com/facebook/plugin/arrow/ArrowBlockBuilder.java#L97. Could you make use of this for Arrow to Presto conversions?
Sure I can check this out to see how we can integrate it.
I refactored the common piece into a separate module. Let's work together to make the arrow module easy to use for all connectors ;)
writeBlockToVectorAtOffset(block, vector, type, rowCount, 0);
}

public static void writeBlockToVectorAtOffset(Block block, FieldVector vector, Type type, int rowCount, int offset)
Similarly, we introduced Presto to Arrow conversion in this PR https://github.com/prestodb/presto/pull/26369/changes#diff-8d0a24e95c2db9de286d10ed8465a20a8cd8ac97d3d43230da11c602d8e0ecfdR106. There is maybe some minor implementation differences, but the underlying conversions look to be the same.
Since it's work in progress, let's merge the current implementation and have a separate refactor PR.
Thank you for the release note! Just a nit that the release process automates the link to the PR, you don't need to add it yourself anymore. As a suggestion, linking to the documentation makes it easier for the reader of the release note to get more information.
done, thanks for the tips
Replace custom Arrow-to-Presto type conversion in LanceArrowToPageScanner with the existing ArrowBlockBuilder, eliminating ~100 lines of duplicate code. Add FixedSizeListVector and TimeStampMicroTZVector support to ArrowBlockBuilder. Switch from PageBuilder to Block[]-based Page construction for simpler data flow.
Move ArrowBlockBuilder, ArrowErrorCode, and ArrowException from presto-base-arrow-flight into a new presto-base-arrow library module. This avoids connectors like presto-lance depending directly on the arrow-flight plugin. Both presto-base-arrow-flight and presto-lance now depend on this common module instead.
steveburnett left a comment
LGTM! (docs)
Pull updated branch, new local doc build, looks good. Thanks!
Description
Add a new Presto connector for LanceDB, a columnar data format optimized for ML/AI workloads built on Apache Arrow. This connector enables Presto to read from and write to Lance datasets using the lance-core Java SDK.
The implementation is similar to the lance-trino connector, adapted to Presto's SPI conventions.
Key components:
- LanceArrowToPageScanner
- LancePageSink

Motivation and Context
LanceDB is an emerging columnar format designed for ML/AI vector data workloads. Adding a Presto connector allows users to query Lance datasets using standard SQL, enabling integration with existing data infrastructure.
Continues the work from #22749.
Impact
New connector plugin — no changes to existing Presto code. Adds the `presto-lance` module to the build.
New configuration properties:
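The property list itself is not captured above, but given that `LanceConfig` exposes a single `rootUrl` setting, a catalog file might look like the following (the property key and value are assumptions for illustration, not confirmed by the PR):

```properties
# etc/catalog/lance.properties — hypothetical example
connector.name=lance
# Root location under which Lance tables are stored (maps to LanceConfig.rootUrl)
lance.root-url=file:///var/lib/lance
```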
Test Plan
All unit tests pass: `./mvnw test -pl presto-lance`
Contributor checklist
Release Notes