feat: Add syntax support for CREATE VECTOR INDEX#27027
feat: Add syntax support for CREATE VECTOR INDEX#27027skyelves wants to merge 1 commit intoprestodb:masterfrom
Conversation
Reviewer's GuideIntroduce full parser, AST, formatter, and visitor support for a new CREATE VECTOR INDEX statement, including grammar changes, AST node definition, traversal/formatting hooks, and unit tests in TestSqlParser. Class diagram for new CreateVectorIndex AST supportclassDiagram
class Statement
class Node {
}
class NodeLocation
class Identifier
class QualifiedName
class Expression
class Property
class AstVisitor {
+R visitCreateTable(CreateTable node, C context)
+R visitCreateVectorIndex(CreateVectorIndex node, C context)
+R visitCreateType(CreateType node, C context)
}
class AstBuilder {
+Node visitCreateVectorIndex(SqlBaseParser_CreateVectorIndexContext context)
}
class DefaultTraversalVisitor {
+Void visitCreateVectorIndex(CreateVectorIndex node, C context)
}
class SqlFormatter {
+Void visitCreateVectorIndex(CreateVectorIndex node, Integer indent)
}
class CreateVectorIndex {
-Identifier indexName
-QualifiedName tableName
-List~Identifier~ columns
-Optional~Expression~ where
-List~Property~ properties
+CreateVectorIndex(Identifier indexName, QualifiedName tableName, List~Identifier~ columns, Optional~Expression~ where, List~Property~ properties)
+CreateVectorIndex(NodeLocation location, Identifier indexName, QualifiedName tableName, List~Identifier~ columns, Optional~Expression~ where, List~Property~ properties)
-CreateVectorIndex(Optional~NodeLocation~ location, Identifier indexName, QualifiedName tableName, List~Identifier~ columns, Optional~Expression~ where, List~Property~ properties)
+Identifier getIndexName()
+QualifiedName getTableName()
+List~Identifier~ getColumns()
+Optional~Expression~ getWhere()
+List~Property~ getProperties()
+<R,C> R accept(AstVisitor visitor, Object context)
+List~Node~ getChildren()
+int hashCode()
+boolean equals(Object obj)
+String toString()
}
Statement <|-- CreateVectorIndex
Node <|-- Statement
CreateVectorIndex o--> Identifier
CreateVectorIndex o--> QualifiedName
CreateVectorIndex o--> Expression
CreateVectorIndex o--> Property
CreateVectorIndex o--> NodeLocation
AstVisitor <.. CreateVectorIndex : visited_by
CreateVectorIndex ..> AstVisitor : accept
AstBuilder ..> CreateVectorIndex : creates
DefaultTraversalVisitor ..> CreateVectorIndex : traverses
SqlFormatter ..> CreateVectorIndex : formats
AstVisitor <|-- DefaultTraversalVisitor
AstVisitor <|-- SqlFormatter
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff b38d89c...4467128.
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In
CreateVectorIndex, consider includingtableNamein bothgetChildren()andDefaultTraversalVisitor.visitCreateVectorIndexso that tree traversals and rewriters consistently see the referenced table alongside the index name, columns, filter, and properties. - You may want to enforce that the
columnslist inCreateVectorIndexis non-empty (similar to other DDL AST nodes) to avoid constructing semantically invalid index definitions at the AST level.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `CreateVectorIndex`, consider including `tableName` in both `getChildren()` and `DefaultTraversalVisitor.visitCreateVectorIndex` so that tree traversals and rewriters consistently see the referenced table alongside the index name, columns, filter, and properties.
- You may want to enforce that the `columns` list in `CreateVectorIndex` is non-empty (similar to other DDL AST nodes) to avoid constructing semantically invalid index definitions at the AST level.
## Individual Comments
### Comment 1
<location> `presto-parser/src/test/java/com/facebook/presto/sql/parser/TestSqlParser.java:1381-1379` </location>
<code_context>
+ new Property(identifier("index_type"), new StringLiteral("ivf_rabitq4")))));
+ }
+
@Test
public void testCreateTableAsSelect()
{
</code_context>
<issue_to_address>
**suggestion (testing):** Add negative parser tests for invalid CREATE VECTOR INDEX syntax
`testCreateVectorIndex` currently covers only successful parses. Please also add negative tests (e.g., missing column list `CREATE VECTOR INDEX idx ON t`, empty column list `ON t()`, malformed `WITH` clause such as `WITH index_type = 'ivf'` without parentheses or with trailing commas, `WHERE` after `WITH`) using `assertFails`/`assertInvalid` to ensure invalid statements are rejected and protect against future grammar regressions.
Suggested implementation:
```java
@Test
public void testCreateVectorIndex()
{
// Basic CREATE VECTOR INDEX
assertStatement("CREATE VECTOR INDEX idx ON t(a, b)",
new CreateVectorIndex(
identifier("idx"),
QualifiedName.of("t"),
ImmutableList.of(identifier("a"), identifier("b")),
Optional.empty(),
ImmutableList.of()));
// With qualified table name
// Negative tests for invalid CREATE VECTOR INDEX syntax
// Missing column list
assertFails("CREATE VECTOR INDEX idx ON t", "Missing column list for vector index");
// Empty column list
assertFails("CREATE VECTOR INDEX idx ON t()", "Empty column list for vector index");
// Malformed WITH clause: missing parentheses
assertFails("CREATE VECTOR INDEX idx ON t(a) WITH index_type = 'ivf'",
"Invalid WITH clause for vector index; expected parenthesized list of properties");
// Malformed WITH clause: trailing comma
assertFails("CREATE VECTOR INDEX idx ON t(a) WITH (index_type = 'ivf',)",
"Invalid WITH clause for vector index; trailing comma is not allowed");
// WHERE clause is not allowed after WITH
assertFails("CREATE VECTOR INDEX idx ON t(a) WITH (index_type = 'ivf') WHERE a > 1",
"WHERE clause is not allowed in CREATE VECTOR INDEX")
```
1. Ensure that `assertFails` is available in this test class with the `(String sql, String message)` signature; if not, adapt these calls to the existing assertion helpers (e.g., `assertInvalid`, or `assertFails(String sql, String expectedMessagePattern)`).
2. If the existing helpers require an exception type (e.g., `ParsingException.class`) or use regex/substring matching, adjust the second argument accordingly or split into multiple overload-appropriate calls.
3. Optionally refine the error messages in the second argument to match the actual parser error messages produced for these invalid statements, or switch to a looser match (substring or regex) if the helper supports it, to avoid brittleness.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
presto-parser/src/test/java/com/facebook/presto/sql/parser/TestSqlParser.java
Show resolved
Hide resolved
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
d0374fb to
39177d5
Compare
|
If you intend to eventually mark this as ready for review, could you please open an RFC? There's already some discussions related to vector indices and it would be good to align. |
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
39177d5 to
4d46681
Compare
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
4d46681 to
18ff840
Compare
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
18ff840 to
d7a17fa
Compare
steveburnett
left a comment
There was a problem hiding this comment.
Doc in .md looks okay overall. Commenting only until @tdcmeehan's request for an RFC is addressed.
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
d7a17fa to
84bc2a3
Compare
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
84bc2a3 to
742cb6f
Compare
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
6233813 to
c872381
Compare
c872381 to
47ad3d0
Compare
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
06070c8 to
a7fafbf
Compare
|
@skyelves Thanks for sharing the updated syntax changes. From a SQL grammar perspective, this appears to be aligned with the RFC direction (e.g., support for the UPDATING FOR clause). Since this PR focuses only on parser-level changes and the Analyzer / SPI integration is expected to follow in subsequent PRs, it would be helpful to ensure that the introduced syntax does not implicitly constrain execution ownership or lifecycle management semantics. In particular, the RFC review comments highlighted the need for connector-managed index lifecycle and incremental maintenance (e.g., snapshot-aware refresh in Iceberg), as well as SPI-based delegation of index build execution. Our expectation is that connectors should be able to internally leverage libraries such as jvector for ANN index materialization and maintenance. It would be important for the upcoming Analyzer / SPI PRs to clarify how index build and update operations will be delegated to the connector layer, so that execution can remain connector-driven rather than engine-managed. This would help ensure alignment with the RFC feedback and support connector-specific implementations going forward. |
| USING relation ON expression mergeCase+ #mergeInto | ||
| USING relation ON expression mergeCase+ #mergeInto | ||
| | CREATE VECTOR INDEX identifier ON TABLE? qualifiedName | ||
| '(' identifier (',' identifier)* ')' |
There was a problem hiding this comment.
As discussed in the RFC review, it may be beneficial to keep the identifier column optional rather than mandatory, particularly for table formats such as Iceberg that provide stable row-level identity via hidden columns (e.g., $row_id). This would allow connectors to automatically fall back to lineage-based identifiers when a user-defined id is not provided.
There was a problem hiding this comment.
Good points. Changed it to optional
| | MERGE INTO qualifiedName (AS? identifier)? | ||
| USING relation ON expression mergeCase+ #mergeInto | ||
| USING relation ON expression mergeCase+ #mergeInto | ||
| | CREATE VECTOR INDEX identifier ON TABLE? qualifiedName |
There was a problem hiding this comment.
Making the TABLE keyword optional may introduce ambiguity if indexing needs to be performed over derived relations or connector-managed index sources, which are required for certain connector-managed embedding and incremental index maintenance workflows discussed in the RFC.
For example, consider a scenario where a user is ingesting raw text data into a base table (e.g., candidates_table), and embeddings are generated separately for efficient vector indexing. In such cases, connectors may adopt a twin-table approach, where the base table stores the raw text data, and a separate materialized view incrementally maintains the generated embeddings based on data freshness in the base table. The vector index would then be created on this materialized view rather than directly on the base table. In this case, the MV acts as a logical relation derived from the base table, and connector-managed incremental refresh (e.g., snapshot-aware updates in Iceberg) ensures index consistency.
It would be helpful to clarify whether the current syntax is expected to support such connector-managed indexing workflows where the index source may be a derived logical relation rather than a physical base table.
There was a problem hiding this comment.
changed TABLE keyword to mandatory. The current syntax only aims to support physical base table.
There was a problem hiding this comment.
Thanks for the clarification. While making the TABLE keyword mandatory clarifies that the current syntax targets physical base tables, it also restricts vector index creation to those tables only. For embedding pipelines where vectors are maintained in derived relations (e.g., materialized views), it would be helpful to allow indexing on those relations as well. To support this, the syntax could use a generic relation reference such as:
CREATE VECTOR INDEX identifier ON qualifiedName
instead of restricting it to ON TABLE qualifiedName. This would allow indexing both base tables and derived relations.
| private final Identifier indexName; | ||
| private final QualifiedName tableName; | ||
| private final List<Identifier> columns; | ||
| private final Optional<Expression> updatingFor; |
There was a problem hiding this comment.
Since this predicate may influence incremental index maintenance behavior, it may be useful to clarify whether this expression is expected to be interpreted by the engine or delegated to the connector during index update planning. This would help ensure alignment with the RFC’s expectation of connector-managed lifecycle and refresh semantics.
There was a problem hiding this comment.
It will be interpreted by the engine to a where predicate.
| builder.append("CREATE VECTOR INDEX "); | ||
| builder.append(formatName(node.getIndexName())); | ||
| builder.append(" ON "); | ||
| builder.append(formatName(node.getTableName())); |
There was a problem hiding this comment.
The formatter currently omits the optional TABLE keyword even when specified in the original statement. It may be preferable to preserve the original syntax to avoid confusion when round-tripping statements via EXPLAIN or query logging.
There was a problem hiding this comment.
Good catch! Added TABLE keyword
|
When you have time, please resolve the file conflicts. |
315047a to
7e6a848
Compare
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
ec46e42 to
a05da76
Compare
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
Summary: Pull Request resolved: prestodb#27027 Differential Revision: D91385788
| USING relation ON expression mergeCase+ #mergeInto | ||
| USING relation ON expression mergeCase+ #mergeInto | ||
| | CREATE VECTOR INDEX identifier ON TABLE qualifiedName | ||
| '(' identifier (',' identifier)? ')' |
There was a problem hiding this comment.
The grammar currently restricts the column list to at most two identifiers (identifier (',' identifier)?). Is the intent to strictly enforce the (row_id, embedding_column) pattern? If so this makes sense for the current design, but it may limit future extensions where multiple embedding columns or additional metadata columns might need to be indexed.
|
|
||
| properties | ||
| : '(' property (',' property)* ')' | ||
| : '(' property (',' property)* ','? ')' |
| | MERGE INTO qualifiedName (AS? identifier)? | ||
| USING relation ON expression mergeCase+ #mergeInto | ||
| USING relation ON expression mergeCase+ #mergeInto | ||
| | CREATE VECTOR INDEX identifier ON TABLE qualifiedName |
There was a problem hiding this comment.
Should it be
CREATE VECTOR INDEX qualifiedName
instead of identifier. Like create view or create table, as index is internally stored as a table.
There was a problem hiding this comment.
Hi @gggrace14 , In our implementation the index is not stored as a catalog table but as a connector-managed index file (Stored in s3 based object storage). Because of this, the index name behaves more like a logical identifier rather than a fully qualified relation. Using identifier seemed appropriate in this case, whereas qualifiedName would imply the index itself is a catalog object similar to CREATE TABLE or CREATE VIEW.
| { | ||
| private final Identifier indexName; |
There was a problem hiding this comment.
indexName should also be QualifiedName?
a05da76 to
93c6c26
Compare
Summary: ## High level design The process for executing a CREATE VECTOR INDEX SQL statement is as follows: 1. SQL Input & Parsing:** **SQL: CREATE VECTOR INDEX my_index ON my_table(id, embedding) WITH (...) UPDATING FOR ...** **The Parser (SqlBase.g4) generates a CreateVectorIndex Abstract Syntax Tree (AST) node.** 2. Statement Analysis: StatementAnalyzer.visitCreateVectorIndex() validates the source/target tables and extracts index properties. This results in a structured CreateVectorIndexAnalysis object. 3. Logical Planning & Query Generation: • LogicalPlanner.createVectorIndexPlan() builds the core execution query: CREATE index_table AS SELECT create_vector_index(embedding, id) FROM my_table WHERE ds BETWEEN ... • The resulting plan tree includes: TableFinishNode(target = CreateVectorIndexReference) └── TableWriterNode(target = CreateVectorIndexReference) └── query plan 4. Connector Plan Optimization (Rewriting): PRISM: The CreateVectorIndexRewriteOptimizer detects the CreateVectorIndexReference and rewrites the plan for optimization. ICEBERG/OTHER: Other connector-specific optimizers may fire during this phase. 5. Execution and Metadata Handling (For connectors that don't rewrite): TableWriteInfo Routing: The CreateVectorIndexReference triggers metadata.beginCreateVectorIndex(). Local Execution & Commit: The finisher and committer use the CreateVectorIndexHandle to call metadata.finishCreateVectorIndex() and metadata.commitPageSinkAsync(). 6. ConnectorMetadata SPI: Default: The standard implementation throws NOT_SUPPORTED. Iceberg Override: The Iceberg connector implements this SPI to create the underlying table via the begin/finish calls. Differential Revision: D91385788
93c6c26 to
4467128
Compare
Summary: ## High level design The process for executing a CREATE VECTOR INDEX SQL statement is as follows: 1. SQL Input & Parsing:** **SQL: CREATE VECTOR INDEX my_index ON my_table(id, embedding) WITH (...) UPDATING FOR ...** **The Parser (SqlBase.g4) generates a CreateVectorIndex Abstract Syntax Tree (AST) node.** 2. Statement Analysis: StatementAnalyzer.visitCreateVectorIndex() validates the source/target tables and extracts index properties. This results in a structured CreateVectorIndexAnalysis object. 3. Logical Planning & Query Generation: • LogicalPlanner.createVectorIndexPlan() builds the core execution query: CREATE index_table AS SELECT create_vector_index(embedding, id) FROM my_table WHERE ds BETWEEN ... • The resulting plan tree includes: TableFinishNode(target = CreateVectorIndexReference) └── TableWriterNode(target = CreateVectorIndexReference) └── query plan 4. Connector Plan Optimization (Rewriting): PRISM: The CreateVectorIndexRewriteOptimizer detects the CreateVectorIndexReference and rewrites the plan for optimization. ICEBERG/OTHER: Other connector-specific optimizers may fire during this phase. 5. Execution and Metadata Handling (For connectors that don't rewrite): TableWriteInfo Routing: The CreateVectorIndexReference triggers metadata.beginCreateVectorIndex(). Local Execution & Commit: The finisher and committer use the CreateVectorIndexHandle to call metadata.finishCreateVectorIndex() and metadata.commitPageSinkAsync(). 6. ConnectorMetadata SPI: Default: The standard implementation throws NOT_SUPPORTED. Iceberg Override: The Iceberg connector implements this SPI to create the underlying table via the begin/finish calls. Differential Revision: D91385788
Summary: ## High level design The process for executing a CREATE VECTOR INDEX SQL statement is as follows: 1. SQL Input & Parsing:** **SQL: CREATE VECTOR INDEX my_index ON my_table(id, embedding) WITH (...) UPDATING FOR ...** **The Parser (SqlBase.g4) generates a CreateVectorIndex Abstract Syntax Tree (AST) node.** 2. Statement Analysis: StatementAnalyzer.visitCreateVectorIndex() validates the source/target tables and extracts index properties. This results in a structured CreateVectorIndexAnalysis object. 3. Logical Planning & Query Generation: • LogicalPlanner.createVectorIndexPlan() builds the core execution query: CREATE index_table AS SELECT create_vector_index(embedding, id) FROM my_table WHERE ds BETWEEN ... • The resulting plan tree includes: TableFinishNode(target = CreateVectorIndexReference) └── TableWriterNode(target = CreateVectorIndexReference) └── query plan 4. Connector Plan Optimization (Rewriting): PRISM: The CreateVectorIndexRewriteOptimizer detects the CreateVectorIndexReference and rewrites the plan for optimization. ICEBERG/OTHER: Other connector-specific optimizers may fire during this phase. 5. Execution and Metadata Handling (For connectors that don't rewrite): TableWriteInfo Routing: The CreateVectorIndexReference triggers metadata.beginCreateVectorIndex(). Local Execution & Commit: The finisher and committer use the CreateVectorIndexHandle to call metadata.finishCreateVectorIndex() and metadata.commitPageSinkAsync(). 6. ConnectorMetadata SPI: Default: The standard implementation throws NOT_SUPPORTED. Iceberg Override: The Iceberg connector implements this SPI to create the underlying table via the begin/finish calls. Differential Revision: D91385788
Summary: Pull Request resolved: prestodb#27307 Pull Request resolved: prestodb#27027 ## High level design The process for executing a CREATE VECTOR INDEX SQL statement is as follows: 1. SQL Input & Parsing:** **SQL: CREATE VECTOR INDEX my_index ON my_table(id, embedding) WITH (...) UPDATING FOR ...** **The Parser (SqlBase.g4) generates a CreateVectorIndex Abstract Syntax Tree (AST) node.** 2. Statement Analysis: StatementAnalyzer.visitCreateVectorIndex() validates the source/target tables and extracts index properties. This results in a structured CreateVectorIndexAnalysis object. 3. Logical Planning & Query Generation: • LogicalPlanner.createVectorIndexPlan() builds the core execution query: CREATE index_table AS SELECT create_vector_index(embedding, id) FROM my_table WHERE ds BETWEEN ... • The resulting plan tree includes: TableFinishNode(target = CreateVectorIndexReference) └── TableWriterNode(target = CreateVectorIndexReference) └── query plan 4. Connector Plan Optimization (Rewriting): PRISM: The CreateVectorIndexRewriteOptimizer detects the CreateVectorIndexReference and rewrites the plan for optimization. ICEBERG/OTHER: Other connector-specific optimizers may fire during this phase. 5. Execution and Metadata Handling (For connectors that don't rewrite): TableWriteInfo Routing: The CreateVectorIndexReference triggers metadata.beginCreateVectorIndex(). Local Execution & Commit: The finisher and committer use the CreateVectorIndexHandle to call metadata.finishCreateVectorIndex() and metadata.commitPageSinkAsync(). 6. ConnectorMetadata SPI: Default: The standard implementation throws NOT_SUPPORTED. Iceberg Override: The Iceberg connector implements this SPI to create the underlying table via the begin/finish calls. Differential Revision: D91385788
Summary: Pull Request resolved: prestodb#27307 Pull Request resolved: prestodb#27027 ## High level design The process for executing a CREATE VECTOR INDEX SQL statement is as follows: 1. SQL Input & Parsing:** **SQL: CREATE VECTOR INDEX my_index ON my_table(id, embedding) WITH (...) UPDATING FOR ...** **The Parser (SqlBase.g4) generates a CreateVectorIndex Abstract Syntax Tree (AST) node.** 2. Statement Analysis: StatementAnalyzer.visitCreateVectorIndex() validates the source/target tables and extracts index properties. This results in a structured CreateVectorIndexAnalysis object. 3. Logical Planning & Query Generation: • LogicalPlanner.createVectorIndexPlan() builds the core execution query: CREATE index_table AS SELECT create_vector_index(embedding, id) FROM my_table WHERE ds BETWEEN ... • The resulting plan tree includes: TableFinishNode(target = CreateVectorIndexReference) └── TableWriterNode(target = CreateVectorIndexReference) └── query plan 4. Connector Plan Optimization (Rewriting): PRISM: The CreateVectorIndexRewriteOptimizer detects the CreateVectorIndexReference and rewrites the plan for optimization. ICEBERG/OTHER: Other connector-specific optimizers may fire during this phase. 5. Execution and Metadata Handling (For connectors that don't rewrite): TableWriteInfo Routing: The CreateVectorIndexReference triggers metadata.beginCreateVectorIndex(). Local Execution & Commit: The finisher and committer use the CreateVectorIndexHandle to call metadata.finishCreateVectorIndex() and metadata.commitPageSinkAsync(). 6. ConnectorMetadata SPI: Default: The standard implementation throws NOT_SUPPORTED. Iceberg Override: The Iceberg connector implements this SPI to create the underlying table via the begin/finish calls. Differential Revision: D91385788
Differential Revision: D91385788
Summary by Sourcery
Add parser, AST, formatter, and traversal support for a new CREATE VECTOR INDEX SQL statement, including tests.
New Features:
Tests: