feat: Add analysis support for CREATE VECTOR INDEX (#27036)
skyelves wants to merge 1 commit into prestodb:master from
Conversation
Reviewer's Guide
Adds full parser, AST, formatter, analyzer, and query-type support for a new CREATE VECTOR INDEX statement, along with tests and a design doc describing a future plan-optimizer rewrite to a UDF-based SELECT.
Sequence diagram for CREATE VECTOR INDEX statement processing
sequenceDiagram
actor Client
participant SqlParser
participant AstBuilder
participant AstTree as AstVisitor_AstBuilder
participant StatementAnalyzer
participant Analysis
participant StatementUtils
participant QueryDispatcher
Client->>SqlParser: parse("CREATE VECTOR INDEX ...")
SqlParser->>AstBuilder: visitCreateVectorIndexContext
AstBuilder->>AstTree: visitCreateVectorIndex(context)
AstTree-->>AstBuilder: CreateVectorIndex AST node
AstBuilder-->>SqlParser: CreateVectorIndex
SqlParser-->>QueryDispatcher: Statement(CreateVectorIndex)
QueryDispatcher->>StatementUtils: getQueryType(CreateVectorIndex.class)
StatementUtils-->>QueryDispatcher: QueryType.SELECT
QueryDispatcher->>StatementAnalyzer: analyze(CreateVectorIndex)
StatementAnalyzer->>Analysis: setCreateVectorIndexTableName(tableName)
StatementAnalyzer-->>QueryDispatcher: Scope(result BOOLEAN)
QueryDispatcher-->>Client: planned SELECT-style execution path
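The dispatch step at the end of the diagram, where StatementUtils maps the statement class to QueryType.SELECT, can be sketched as follows. This is a minimal stand-in, not Presto's real implementation: the Statement classes here are placeholders for the real AST types in com.facebook.presto.sql.tree.

```java
import java.util.Map;

public class QueryTypeDispatchSketch {
    // Simplified stand-ins for Presto's statement classes.
    static class Statement {}
    static class CreateVectorIndex extends Statement {}
    static class CreateTable extends Statement {}

    enum QueryType { SELECT, DATA_DEFINITION, CONTROL }

    // Per the sequence diagram, CreateVectorIndex is registered under SELECT so the
    // dispatcher routes it down the query execution path rather than plain DDL.
    static final Map<Class<? extends Statement>, QueryType> QUERY_TYPES = Map.of(
            CreateVectorIndex.class, QueryType.SELECT,
            CreateTable.class, QueryType.DATA_DEFINITION);

    static QueryType getQueryType(Class<? extends Statement> statementClass) {
        return QUERY_TYPES.getOrDefault(statementClass, QueryType.CONTROL);
    }

    public static void main(String[] args) {
        System.out.println(getQueryType(CreateVectorIndex.class)); // SELECT
    }
}
```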
Class diagram for the new CreateVectorIndex AST and related components
classDiagram
class Statement {
}
class CreateVectorIndex {
+Identifier indexName
+QualifiedName tableName
+List~Identifier~ columns
+Optional~Expression~ where
+List~Property~ properties
+CreateVectorIndex(Identifier indexName, QualifiedName tableName, List~Identifier~ columns, Optional~Expression~ where, List~Property~ properties)
+CreateVectorIndex(NodeLocation location, Identifier indexName, QualifiedName tableName, List~Identifier~ columns, Optional~Expression~ where, List~Property~ properties)
+Identifier getIndexName()
+QualifiedName getTableName()
+List~Identifier~ getColumns()
+Optional~Expression~ getWhere()
+List~Property~ getProperties()
+<R,C> R accept(AstVisitor visitor, C context)
+List~Node~ getChildren()
}
class Identifier {
}
class QualifiedName {
}
class Expression {
}
class Property {
+Identifier name
+Expression value
}
class AstVisitor {
+<R,C> R visitCreateVectorIndex(CreateVectorIndex node, C context)
}
class DefaultTraversalVisitor {
+Void visitCreateVectorIndex(CreateVectorIndex node, Object context)
}
class SqlFormatter_Visitor {
+Void visitCreateVectorIndex(CreateVectorIndex node, Integer indent)
}
class Analysis {
-Optional~QualifiedObjectName~ createVectorIndexTableName
+void setCreateVectorIndexTableName(QualifiedObjectName tableName)
+Optional~QualifiedObjectName~ getCreateVectorIndexTableName()
}
class StatementAnalyzer_Visitor {
+Scope visitCreateVectorIndex(CreateVectorIndex node, Optional~Scope~ scope)
}
class StatementUtils {
-Map~Class, QueryType~ queryTypes
+QueryType getQueryType(Class statementClass)
}
class QueryType {
<<enum>>
SELECT
DATA_DEFINITION
CONTROL
}
Statement <|-- CreateVectorIndex
AstVisitor <|-- DefaultTraversalVisitor
AstVisitor <|-- SqlFormatter_Visitor
AstVisitor <|-- StatementAnalyzer_Visitor
CreateVectorIndex --> Identifier
CreateVectorIndex --> QualifiedName
CreateVectorIndex --> Expression
CreateVectorIndex --> Property
StatementAnalyzer_Visitor --> Analysis
StatementUtils --> QueryType
StatementUtils ..> CreateVectorIndex
Analysis ..> QualifiedObjectName
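The node shape in the class diagram above can be sketched in a self-contained way. Identifier, QualifiedName, and Property are simplified stand-ins for the real Presto AST value types, and Expression is reduced to a String; only the field layout and getChildren() flattening are meant to mirror the diagram.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class CreateVectorIndexNodeSketch {
    // Simplified stand-ins for Presto's AST value types.
    record Identifier(String value) {}
    record QualifiedName(List<String> parts) {}
    record Property(Identifier name, String value) {}

    static final class CreateVectorIndex {
        private final Identifier indexName;
        private final QualifiedName tableName;
        private final List<Identifier> columns;
        private final Optional<String> where; // Expression simplified to String
        private final List<Property> properties;

        CreateVectorIndex(Identifier indexName, QualifiedName tableName,
                List<Identifier> columns, Optional<String> where, List<Property> properties) {
            this.indexName = indexName;
            this.tableName = tableName;
            this.columns = List.copyOf(columns);
            this.where = where;
            this.properties = List.copyOf(properties);
        }

        Identifier getIndexName() { return indexName; }
        QualifiedName getTableName() { return tableName; }
        List<Identifier> getColumns() { return columns; }
        Optional<String> getWhere() { return where; }
        List<Property> getProperties() { return properties; }

        // Children are the column identifiers, the optional WHERE expression,
        // and the WITH properties, mirroring getChildren() in the diagram.
        List<Object> getChildren() {
            List<Object> children = new ArrayList<Object>(columns);
            where.ifPresent(children::add);
            children.addAll(properties);
            return children;
        }
    }

    public static void main(String[] args) {
        CreateVectorIndex node = new CreateVectorIndex(
                new Identifier("my_index"),
                new QualifiedName(List.of("my_table")),
                List.of(new Identifier("id"), new Identifier("embedding")),
                Optional.empty(),
                List.of(new Property(new Identifier("metric"), "cosine")));
        System.out.println(node.getChildren().size()); // 3: two columns + one property
    }
}
```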
Codenotify: Notifying subscribers in CODENOTIFY files for diff 5022f6b...98cf096. No notifications.
Hey - I've found 1 issue, and left some high level feedback:
- The change to `properties` in `SqlBase.g4` to allow a trailing comma applies to all `WITH (...)` property lists, not just `CREATE VECTOR INDEX`; please double-check that this broader grammar relaxation (and the adjusted error expectations) is intentional for all existing statements that use `properties`.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The change to `properties` in `SqlBase.g4` to allow a trailing comma applies to all `WITH (...)` property lists, not just `CREATE VECTOR INDEX`; please double-check that this broader grammar relaxation (and the adjusted error expectations) is intentional for all existing statements that use `properties`.
## Individual Comments
### Comment 1
<location> `presto-parser/src/test/java/com/facebook/presto/sql/parser/TestSqlParserErrorHandling.java:83-86` </location>
<code_context>
+ {"CREATE TABLE foo () AS (VALUES 1)",
</code_context>
<issue_to_address>
**suggestion (testing):** Add negative tests for invalid CREATE VECTOR INDEX syntax to the error-handling suite
With the grammar now supporting `CREATE VECTOR INDEX` (and `VECTOR` in the expected tokens), please add negative error-handling tests for malformed vector index statements in `getStatements()`. For instance:
- `CREATE VECTOR INDEX` (missing index name and rest of statement)
- `CREATE VECTOR INDEX idx ON` (missing table)
- `CREATE VECTOR INDEX idx ON t` (missing column list)
- `CREATE VECTOR INDEX idx ON t()` / `CREATE VECTOR INDEX idx ON t(,)` (invalid column list)
These cases help verify clear, stable error messages for common mistakes and protect against grammar regressions around the new syntax.
Suggested implementation:
```java
{"CREATE TABLE foo () AS (VALUES 1)",
"line 1:19: mismatched input ')'. Expecting: 'FUNCTION', 'MATERIALIZED', 'OR', 'ROLE', 'SCHEMA', 'TABLE', 'TEMPORARY', 'TYPE', 'VECTOR', 'VIEW'"},
{"CREATE TABLE foo (*) AS (VALUES 1)",
"line 1:19: mismatched input '*'. Expecting: 'FUNCTION', 'MATERIALIZED', 'OR', 'ROLE', 'SCHEMA', 'TABLE', 'TEMPORARY', 'TYPE', 'VECTOR', 'VIEW'"},
{"CREATE VECTOR INDEX",
"line 1:20: mismatched input '<EOF>'. Expecting: <identifier>"},
{"CREATE VECTOR INDEX idx ON",
"line 1:29: mismatched input '<EOF>'. Expecting: <identifier>"},
{"CREATE VECTOR INDEX idx ON t",
"line 1:31: mismatched input '<EOF>'. Expecting: '('"},
{"CREATE VECTOR INDEX idx ON t()",
"line 1:32: mismatched input ')'. Expecting: <identifier>"},
{"CREATE VECTOR INDEX idx ON t(,)",
"line 1:32: mismatched input ','. Expecting: <identifier>"},
{"SELECT grouping(a+2) FROM (VALUES (1)) AS t (a) GROUP BY a+2",
```
The exact error column numbers and messages (especially the expected tokens like <identifier> vs a concrete token name) may differ slightly depending on the current ANTLR grammar and error handler configuration in your version of Presto. If test failures occur:
1. Run the tests to see the actual parser error messages for each of the added SQL snippets.
2. Adjust the `line 1:XX:` column indices and the `Expecting: ...` portions in each of the new test cases to match the real output exactly.
3. If your test suite uses a helper to normalize or format error messages, ensure the new expectations follow that convention (e.g., quoting identifiers or token names consistently with nearby tests).
</issue_to_address>
| {"CREATE TABLE foo () AS (VALUES 1)", | ||
| "line 1:19: mismatched input ')'. Expecting: 'FUNCTION', 'MATERIALIZED', 'OR', 'ROLE', 'SCHEMA', 'TABLE', 'TEMPORARY', 'TYPE', 'VIEW'"}, | ||
| "line 1:19: mismatched input ')'. Expecting: 'FUNCTION', 'MATERIALIZED', 'OR', 'ROLE', 'SCHEMA', 'TABLE', 'TEMPORARY', 'TYPE', 'VECTOR', 'VIEW'"}, | ||
| {"CREATE TABLE foo (*) AS (VALUES 1)", | ||
| "line 1:19: mismatched input '*'. Expecting: 'FUNCTION', 'MATERIALIZED', 'OR', 'ROLE', 'SCHEMA', 'TABLE', 'TEMPORARY', 'TYPE', 'VIEW'"}, | ||
| "line 1:19: mismatched input '*'. Expecting: 'FUNCTION', 'MATERIALIZED', 'OR', 'ROLE', 'SCHEMA', 'TABLE', 'TEMPORARY', 'TYPE', 'VECTOR', 'VIEW'"}, |
aditi-pandit left a comment:
@skyelves: Thanks for this code. Had a couple of comments.
## Files to Modify/Create
There isn't any need to repeat all the code in the doc.
### 1. UDF Receives Metadata Only

The `create_local_index` UDF does **NOT** receive actual row data. It receives:
What is the need to wrap this in a CREATE VECTOR INDEX statement?
If we create a statement, then it needs to work with all kinds of tables, etc. The code doesn't seem to be that generic.
@skyelves: Thanks for this work. It might be good if you can write a basic RFC for this. It is quite a complex piece of work that is adding new syntax etc, and also we want to work with Iceberg and specific vector indexing libraries on our side as well, so it would be good to clear out that interface.
The .md file appears to be the same file as in PR #27027 .
I feel like I would expect to see CREATE VECTOR INDEX documentation added in this PR, in the form of a .rst file in https://github.com/prestodb/presto/tree/master/presto-docs/src/main/sphinx/sql.
Summary: Pull Request resolved: prestodb#27036. Differential Revision: D91524358
        distanceMetric, indexOptions, partitionedByJson.toString());

// Build synthetic query: SELECT create_vector_index('source_table', 'col1', 'col2', 'type', 'props')
// No FROM clause — the Python script handles all data access, no table scan needed.
Since vector index creation requires scanning table data (e.g., embedding columns), delegating all data access to the Python UDF without a visible table scan prevents the analyzer from registering read dependencies on the indexed table and enforcing column-level SELECT privileges during index build. This effectively indicates that the index build process accesses the underlying data source outside Presto’s planning and execution framework, which may lead to inconsistencies with snapshot isolation, partition pruning, predicate pushdown, resource group enforcement, and access control checks. To maintain governance and consistency guarantees expected in a lakehouse execution model, it would be preferable for index build operations to execute within Presto’s planning and scheduling framework, with the underlying table scan represented in the analysed query similar to CTAS.
Sorry for the noise. This is actually a bug and I just fixed it. The analysed query should be similar to CTAS.
Thanks @skyelves for this code. Could you please add some tests in the TestAnalyzer class for the StatementAnalyzer code?
aditi-pandit left a comment:
Please add unit tests.
Please add a release note - or
Summary:
## High level design
The process for executing a CREATE VECTOR INDEX SQL statement is as follows:
1. SQL Input & Parsing:
SQL: CREATE VECTOR INDEX my_index ON my_table(id, embedding) WITH (...) UPDATING FOR ...
The Parser (SqlBase.g4) generates a CreateVectorIndex Abstract Syntax Tree (AST) node.
2. Statement Analysis:
**StatementAnalyzer.visitCreateVectorIndex() validates the source/target tables and extracts index properties.**
**This results in a structured CreateVectorIndexAnalysis object.**
3. Logical Planning & Query Generation:
• LogicalPlanner.createVectorIndexPlan() builds the core execution query:
CREATE index_table AS SELECT create_vector_index(embedding, id) FROM my_table WHERE ds BETWEEN ...
• The resulting plan tree includes:
TableFinishNode(target = CreateVectorIndexReference)
└── TableWriterNode(target = CreateVectorIndexReference)
└── query plan
4. Connector Plan Optimization (Rewriting):
PRISM: The CreateVectorIndexRewriteOptimizer detects the CreateVectorIndexReference and rewrites the plan for optimization.
ICEBERG/OTHER: Other connector-specific optimizers may fire during this phase.
5. Execution and Metadata Handling (For connectors that don't rewrite):
TableWriteInfo Routing: The CreateVectorIndexReference triggers metadata.beginCreateVectorIndex().
Local Execution & Commit: The finisher and committer use the CreateVectorIndexHandle to call metadata.finishCreateVectorIndex() and metadata.commitPageSinkAsync().
6. ConnectorMetadata SPI:
Default: The standard implementation throws NOT_SUPPORTED.
Iceberg Override: The Iceberg connector implements this SPI to create the underlying table via the begin/finish calls.
## Release Notes
Please follow release notes guidelines and fill in the release notes below.
```
== RELEASE NOTES ==
General Changes
* Add support for create-vector-index statement, which creates
vector search indexes on table columns with configurable index properties
and partition filtering via an ``UPDATING FOR`` clause.
```
Differential Revision: D91524358
Thanks, added some tests. Could you take another look?
added
aditi-pandit left a comment:
Thanks for adding the tests.
public static final class CreateVectorIndexAnalysis
{
    private final QualifiedObjectName sourceTableName;
    private final QualifiedObjectName targetTableName;
The index artifact is currently represented using QualifiedObjectName, similar to a table. Since vector indexes may also be implemented as connector-managed artifacts (e.g., external index files or metadata entries), it would be better to treat this as a logical index identifier rather than strictly a physical table. This would allow connectors to map the index name to their own storage model while keeping the engine abstraction consistent across different implementations.
Map<String, ColumnHandle> sourceColumns = metadataResolver.getColumnHandles(sourceTableHandle);
for (Identifier column : node.getColumns()) {
    if (!sourceColumns.containsKey(column.getValue())) {
        throw new SemanticException(MISSING_COLUMN, column, "Column '%s' does not exist in source table '%s'", column.getValue(), sourceTableName);
The current validation ensures that the specified columns exist in the source table, which is good. However, since the syntax allows either (embedding) or (row_id, embedding), it would be helpful to also validate the column structure. If only one column is provided, it should be validated as an embedding column rather than a row identifier. Additionally, when two columns are specified, they should follow the (row_id, embedding) order (optionally). This validation can help prevent invalid cases like (id) from passing analysis and failing later during index creation.
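The column-structure check suggested in the comment above could look roughly like the sketch below. This is an illustration only: the rule that the embedding column must have an array type is an assumption, and a real implementation would validate against Presto's type system rather than type-name strings.

```java
import java.util.List;
import java.util.Map;

public class IndexColumnValidationSketch {
    // Rough sketch of the suggested check: accept (embedding) or (row_id, embedding),
    // and require that the embedding column (the last one) has a vector-like type.
    // Using string type names here is a simplification for illustration.
    static void validateIndexColumns(List<String> columns, Map<String, String> columnTypes) {
        if (columns.isEmpty() || columns.size() > 2) {
            throw new IllegalArgumentException(
                    "Expected (embedding) or (row_id, embedding), got " + columns);
        }
        // With one column it is the embedding; with two, the last one is.
        String embedding = columns.get(columns.size() - 1);
        String type = columnTypes.get(embedding);
        if (type == null || !type.startsWith("array")) {
            throw new IllegalArgumentException(
                    "Column '" + embedding + "' is not a vector (array) column: " + type);
        }
    }

    public static void main(String[] args) {
        Map<String, String> types = Map.of("id", "bigint", "embedding", "array(real)");
        validateIndexColumns(List.of("id", "embedding"), types); // passes
        try {
            validateIndexColumns(List.of("id"), types); // (id) alone is not an embedding
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```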
    throw new SemanticException(MISSING_TABLE, node, "Source table '%s' does not exist", sourceTableName);
}

QualifiedObjectName targetTable = createQualifiedObjectName(session, node, node.getIndexName(), metadata);
Hi @skyelves, here you're creating targetTable of type QualifiedObjectName from node.getIndexName() of type QualifiedName. Are you referencing some example? Is it okay to keep using QualifiedName at the Analyzer layer? Can you help check? A quick look at the Analysis class shows several QualifiedName usages there.
Summary:
High level design
The process for executing a CREATE VECTOR INDEX SQL statement is as follows:
1. SQL Input & Parsing:
SQL: CREATE VECTOR INDEX my_index ON my_table(id, embedding) WITH (...) UPDATING FOR ...
The Parser (SqlBase.g4) generates a CreateVectorIndex Abstract Syntax Tree (AST) node.
2. Statement Analysis:
StatementAnalyzer.visitCreateVectorIndex() validates the source/target tables and extracts index properties.
This results in a structured CreateVectorIndexAnalysis object.
3. Logical Planning & Query Generation:
• LogicalPlanner.createVectorIndexPlan() builds the core execution query:
CREATE index_table AS SELECT create_vector_index(embedding, id) FROM my_table WHERE ds BETWEEN ...
• The resulting plan tree includes:
TableFinishNode(target = CreateVectorIndexReference)
└── TableWriterNode(target = CreateVectorIndexReference)
└── query plan
4. Connector Plan Optimization (Rewriting):
PRISM: The CreateVectorIndexRewriteOptimizer detects the CreateVectorIndexReference and rewrites the plan for optimization.
ICEBERG/OTHER: Other connector-specific optimizers may fire during this phase.
5. Execution and Metadata Handling (For connectors that don't rewrite):
TableWriteInfo Routing: The CreateVectorIndexReference triggers metadata.beginCreateVectorIndex().
Local Execution & Commit: The finisher and committer use the CreateVectorIndexHandle to call metadata.finishCreateVectorIndex() and metadata.commitPageSinkAsync().
6. ConnectorMetadata SPI:
Default: The standard implementation throws NOT_SUPPORTED.
Iceberg Override: The Iceberg connector implements this SPI to create the underlying table via the begin/finish calls.
Release Notes
== NO RELEASE NOTE ==
Differential Revision: D91524358
Pulled By: skyelves
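The synthetic CTAS query in step 3 of the design can be illustrated with a small string-assembly sketch. This is not Presto's LogicalPlanner, which builds a plan tree directly rather than SQL text; the method name and argument shapes here are illustrative only.

```java
import java.util.List;
import java.util.Optional;

public class IndexBuildQuerySketch {
    // Approximates the shape of the query described in step 3:
    //   CREATE index_table AS SELECT create_vector_index(embedding, id) FROM my_table WHERE ...
    // Renders SQL text only; the real planner emits a TableFinish/TableWriter plan tree.
    static String buildIndexBuildQuery(String indexTable, String sourceTable,
            List<String> columns, Optional<String> predicate) {
        StringBuilder sql = new StringBuilder()
                .append("CREATE TABLE ").append(indexTable)
                .append(" AS SELECT create_vector_index(")
                .append(String.join(", ", columns)).append(")")
                .append(" FROM ").append(sourceTable);
        predicate.ifPresent(where -> sql.append(" WHERE ").append(where));
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildIndexBuildQuery("index_table", "my_table",
                List.of("embedding", "id"),
                Optional.of("ds BETWEEN '2025-01-01' AND '2025-01-31'")));
    }
}
```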