feat: Add analysis support for CREATE VECTOR INDEX (#27036) #27036

Closed
skyelves wants to merge 1 commit into prestodb:master from skyelves:export-D91524358

Conversation

skyelves (Member) commented Jan 26, 2026

Summary:

High level design

The process for executing a CREATE VECTOR INDEX SQL statement is as follows:

  1. SQL Input & Parsing:
    • SQL: CREATE VECTOR INDEX my_index ON my_table(id, embedding) WITH (...) UPDATING FOR ...
    • The Parser (SqlBase.g4) generates a CreateVectorIndex Abstract Syntax Tree (AST) node.
  2. Statement Analysis:
    • StatementAnalyzer.visitCreateVectorIndex() validates the source/target tables and extracts index properties.
    • This results in a structured CreateVectorIndexAnalysis object.
  3. Logical Planning & Query Generation:
    • LogicalPlanner.createVectorIndexPlan() builds the core execution query:
      CREATE index_table AS SELECT create_vector_index(embedding, id) FROM my_table WHERE ds BETWEEN ...
    • The resulting plan tree includes:

      TableFinishNode(target = CreateVectorIndexReference)
      └── TableWriterNode(target = CreateVectorIndexReference)
          └── query plan
  4. Connector Plan Optimization (Rewriting):
    • PRISM: The CreateVectorIndexRewriteOptimizer detects the CreateVectorIndexReference and rewrites the plan for optimization.
    • ICEBERG/OTHER: Other connector-specific optimizers may fire during this phase.
  5. Execution and Metadata Handling (for connectors that don't rewrite):
    • TableWriteInfo Routing: The CreateVectorIndexReference triggers metadata.beginCreateVectorIndex().
    • Local Execution & Commit: The finisher and committer use the CreateVectorIndexHandle to call metadata.finishCreateVectorIndex() and metadata.commitPageSinkAsync().
  6. ConnectorMetadata SPI:
    • Default: The standard implementation throws NOT_SUPPORTED.
    • Iceberg Override: The Iceberg connector implements this SPI to create the underlying table via the begin/finish calls.
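
The default-versus-override behavior in step 6 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the interface, handle, and exception types below are hypothetical stand-ins, and the method signatures are assumptions. Only the method names beginCreateVectorIndex and finishCreateVectorIndex and the NOT_SUPPORTED default come from the description above.

```java
import java.util.Optional;

// Illustrative sketch only: every type here is a simplified, hypothetical
// stand-in and does not reproduce the real Presto SPI signatures.
final class VectorIndexSpiSketch
{
    private VectorIndexSpiSketch() {}

    // Stand-in for an engine error carrying a NOT_SUPPORTED code.
    static final class NotSupportedException
            extends RuntimeException
    {
        NotSupportedException(String message)
        {
            super(message);
        }
    }

    // Stand-in for the handle passed from the begin call to the finish call.
    static final class IndexHandle
    {
        final String indexTableName;

        IndexHandle(String indexTableName)
        {
            this.indexTableName = indexTableName;
        }
    }

    // Default methods reject the operation, so connectors opt in by overriding.
    interface ConnectorMetadataSketch
    {
        default IndexHandle beginCreateVectorIndex(String indexTableName)
        {
            throw new NotSupportedException("This connector does not support creating vector indexes");
        }

        default Optional<String> finishCreateVectorIndex(IndexHandle handle)
        {
            throw new NotSupportedException("This connector does not support creating vector indexes");
        }
    }

    // A connector that supports the feature overrides both calls.
    static final class IcebergLikeMetadata
            implements ConnectorMetadataSketch
    {
        @Override
        public IndexHandle beginCreateVectorIndex(String indexTableName)
        {
            return new IndexHandle(indexTableName);
        }

        @Override
        public Optional<String> finishCreateVectorIndex(IndexHandle handle)
        {
            return Optional.of(handle.indexTableName);
        }
    }

    public static void main(String[] args)
    {
        ConnectorMetadataSketch iceberg = new IcebergLikeMetadata();
        IndexHandle handle = iceberg.beginCreateVectorIndex("my_index");
        System.out.println(iceberg.finishCreateVectorIndex(handle).orElse("<none>"));

        // A connector that overrides nothing falls back to the rejecting defaults.
        ConnectorMetadataSketch other = new ConnectorMetadataSketch() {};
        try {
            other.beginCreateVectorIndex("my_index");
        }
        catch (NotSupportedException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Putting the rejection in default methods means existing connectors keep compiling and fail with a clear error, while a connector such as Iceberg only needs to override the begin/finish pair.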

Release Notes

== NO RELEASE NOTE ==

Differential Revision: D91524358

Pulled By: skyelves

sourcery-ai bot (Contributor) commented Jan 26, 2026

Reviewer's Guide

Adds full parser, AST, formatter, analyzer, and query-type support for a new CREATE VECTOR INDEX statement, along with tests and a design doc describing a future plan-optimizer rewrite to a UDF-based SELECT.

Sequence diagram for CREATE VECTOR INDEX statement processing

```mermaid
sequenceDiagram
    actor Client
    participant SqlParser
    participant AstBuilder
    participant AstTree as AstVisitor_AstBuilder
    participant StatementAnalyzer
    participant Analysis
    participant StatementUtils
    participant QueryDispatcher

    Client->>SqlParser: parse("CREATE VECTOR INDEX ...")
    SqlParser->>AstBuilder: visitCreateVectorIndexContext
    AstBuilder->>AstTree: visitCreateVectorIndex(context)
    AstTree-->>AstBuilder: CreateVectorIndex AST node
    AstBuilder-->>SqlParser: CreateVectorIndex
    SqlParser-->>QueryDispatcher: Statement(CreateVectorIndex)

    QueryDispatcher->>StatementUtils: getQueryType(CreateVectorIndex.class)
    StatementUtils-->>QueryDispatcher: QueryType.SELECT

    QueryDispatcher->>StatementAnalyzer: analyze(CreateVectorIndex)
    StatementAnalyzer->>Analysis: setCreateVectorIndexTableName(tableName)
    StatementAnalyzer-->>QueryDispatcher: Scope(result BOOLEAN)

    QueryDispatcher-->>Client: planned SELECT-style execution path
```
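
The classification step in this sequence (looking up a query type by statement class) could look roughly like the sketch below. It assumes a plain map from statement class to query type. The nested types are hypothetical stand-ins rather than Presto's classes, and the DropTable entry is included only for contrast; the names CreateVectorIndex, getQueryType, and QueryType.SELECT are the only parts taken from this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: simplified, hypothetical stand-ins for the
// statement classes and the query-type registry described in the guide.
final class QueryTypeLookupSketch
{
    private QueryTypeLookupSketch() {}

    enum QueryType
    {
        SELECT, DATA_DEFINITION, CONTROL
    }

    interface Statement {}

    static final class CreateVectorIndex
            implements Statement {}

    static final class DropTable
            implements Statement {}

    static final class Unregistered
            implements Statement {}

    private static final Map<Class<? extends Statement>, QueryType> QUERY_TYPES = new HashMap<>();

    static {
        // Registering the statement as SELECT routes it through the normal
        // query execution path instead of the DDL path.
        QUERY_TYPES.put(CreateVectorIndex.class, QueryType.SELECT);
        QUERY_TYPES.put(DropTable.class, QueryType.DATA_DEFINITION);
    }

    // Returns the registered query type, or empty for an unknown statement class.
    static Optional<QueryType> getQueryType(Class<? extends Statement> statementClass)
    {
        return Optional.ofNullable(QUERY_TYPES.get(statementClass));
    }

    public static void main(String[] args)
    {
        System.out.println(getQueryType(CreateVectorIndex.class));
        System.out.println(getQueryType(Unregistered.class));
    }
}
```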

Class diagram for the new CreateVectorIndex AST and related components

```mermaid
classDiagram
    class Statement {
    }

    class CreateVectorIndex {
        +Identifier indexName
        +QualifiedName tableName
        +List~Identifier~ columns
        +Optional~Expression~ where
        +List~Property~ properties
        +CreateVectorIndex(Identifier indexName, QualifiedName tableName, List~Identifier~ columns, Optional~Expression~ where, List~Property~ properties)
        +CreateVectorIndex(NodeLocation location, Identifier indexName, QualifiedName tableName, List~Identifier~ columns, Optional~Expression~ where, List~Property~ properties)
        +Identifier getIndexName()
        +QualifiedName getTableName()
        +List~Identifier~ getColumns()
        +Optional~Expression~ getWhere()
        +List~Property~ getProperties()
        +<R,C> R accept(AstVisitor visitor, C context)
        +List~Node~ getChildren()
    }

    class Identifier {
    }

    class QualifiedName {
    }

    class Expression {
    }

    class Property {
        +Identifier name
        +Expression value
    }

    class AstVisitor {
        +<R,C> R visitCreateVectorIndex(CreateVectorIndex node, C context)
    }

    class DefaultTraversalVisitor {
        +Void visitCreateVectorIndex(CreateVectorIndex node, Object context)
    }

    class SqlFormatter_Visitor {
        +Void visitCreateVectorIndex(CreateVectorIndex node, Integer indent)
    }

    class Analysis {
        -Optional~QualifiedObjectName~ createVectorIndexTableName
        +void setCreateVectorIndexTableName(QualifiedObjectName tableName)
        +Optional~QualifiedObjectName~ getCreateVectorIndexTableName()
    }

    class StatementAnalyzer_Visitor {
        +Scope visitCreateVectorIndex(CreateVectorIndex node, Optional~Scope~ scope)
    }

    class StatementUtils {
        -Map~Class, QueryType~ queryTypes
        +QueryType getQueryType(Class statementClass)
    }

    class QueryType {
        <<enum>>
        SELECT
        DATA_DEFINITION
        CONTROL
    }

    Statement <|-- CreateVectorIndex
    AstVisitor <|-- DefaultTraversalVisitor
    AstVisitor <|-- SqlFormatter_Visitor
    AstVisitor <|-- StatementAnalyzer_Visitor

    CreateVectorIndex --> Identifier
    CreateVectorIndex --> QualifiedName
    CreateVectorIndex --> Expression
    CreateVectorIndex --> Property

    StatementAnalyzer_Visitor --> Analysis
    StatementUtils --> QueryType
    StatementUtils ..> CreateVectorIndex
    Analysis ..> QualifiedObjectName
```
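
The node/visitor relationship in the class diagram can be sketched as below, under heavy simplification. Identifiers and expressions are plain strings here, the properties list is omitted, and the Formatter class with its output format is hypothetical rather than the real SqlFormatter. What it shows is the double dispatch: accept() on the node calls the visitor's visitCreateVectorIndex().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Illustrative sketch only: a stripped-down AST node and visitor, with plain
// strings standing in for Identifier, QualifiedName, and Expression.
final class CreateVectorIndexAstSketch
{
    private CreateVectorIndexAstSketch() {}

    abstract static class AstVisitor<R, C>
    {
        R visitNode(Node node, C context)
        {
            return null;
        }

        // Falls back to the generic handler unless a subclass overrides it.
        R visitCreateVectorIndex(CreateVectorIndex node, C context)
        {
            return visitNode(node, context);
        }
    }

    abstract static class Node
    {
        <R, C> R accept(AstVisitor<R, C> visitor, C context)
        {
            return visitor.visitNode(this, context);
        }
    }

    static final class CreateVectorIndex
            extends Node
    {
        private final String indexName;
        private final String tableName;
        private final List<String> columns;
        private final Optional<String> where;

        CreateVectorIndex(String indexName, String tableName, List<String> columns, Optional<String> where)
        {
            this.indexName = indexName;
            this.tableName = tableName;
            this.columns = columns;
            this.where = where;
        }

        // Double dispatch: the node selects the visitor method for its own type.
        @Override
        <R, C> R accept(AstVisitor<R, C> visitor, C context)
        {
            return visitor.visitCreateVectorIndex(this, context);
        }
    }

    // Hypothetical formatter; the real SqlFormatter output may differ.
    static final class Formatter
            extends AstVisitor<String, Void>
    {
        @Override
        String visitCreateVectorIndex(CreateVectorIndex node, Void context)
        {
            StringBuilder sql = new StringBuilder("CREATE VECTOR INDEX ")
                    .append(node.indexName)
                    .append(" ON ")
                    .append(node.tableName)
                    .append("(")
                    .append(String.join(", ", node.columns))
                    .append(")");
            node.where.ifPresent(predicate -> sql.append(" WHERE ").append(predicate));
            return sql.toString();
        }
    }

    public static void main(String[] args)
    {
        CreateVectorIndex node = new CreateVectorIndex(
                "my_index", "my_table", Arrays.asList("id", "embedding"), Optional.empty());
        System.out.println(node.accept(new Formatter(), null));
    }
}
```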

File-Level Changes

Change: Introduce CreateVectorIndex AST node and parsing/formatting support for CREATE VECTOR INDEX statements with columns, optional WHERE, and WITH properties.
Details:
  • Add CreateVectorIndex statement node class with index name, table name, columns, optional where clause, and properties.
  • Extend ANTLR grammar to recognize CREATE VECTOR INDEX syntax, allow VECTOR/INDEX as non-reserved keywords, and loosen properties list to allow optional trailing comma.
  • Implement AstBuilder.visitCreateVectorIndex to build the new AST from the parser context.
  • Teach AstVisitor and DefaultTraversalVisitor how to visit/traverse CreateVectorIndex nodes.
  • Add SqlFormatter support to render CREATE VECTOR INDEX statements, including formatting of columns, WHERE, and WITH properties.
  • Extend parser success tests to cover basic, WHERE, WITH properties, and combined CREATE VECTOR INDEX forms.
  • Update parser error-handling expectations to include VECTOR in certain CREATE TABLE error messages.
Files:
  • presto-parser/src/main/java/com/facebook/presto/sql/tree/CreateVectorIndex.java
  • presto-parser/src/main/java/com/facebook/presto/sql/parser/AstBuilder.java
  • presto-parser/src/main/java/com/facebook/presto/sql/tree/AstVisitor.java
  • presto-parser/src/main/java/com/facebook/presto/sql/tree/DefaultTraversalVisitor.java
  • presto-parser/src/main/java/com/facebook/presto/sql/SqlFormatter.java
  • presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4
  • presto-parser/src/test/java/com/facebook/presto/sql/parser/TestSqlParser.java
  • presto-parser/src/test/java/com/facebook/presto/sql/parser/TestSqlParserErrorHandling.java

Change: Wire CREATE VECTOR INDEX into analysis and classification layers so it can be treated as a SELECT-like statement and participate in planning later.
Details:
  • Extend Analysis to store the qualified table name associated with a CREATE VECTOR INDEX statement and expose setter/getter.
  • Update StatementAnalyzer to analyze CreateVectorIndex, resolving table name, validating properties, and setting an output boolean field scope.
  • Register CreateVectorIndex in StatementUtils with QueryType.SELECT so it goes through the normal query execution path instead of DDL.
Files:
  • presto-analyzer/src/main/java/com/facebook/presto/sql/analyzer/Analysis.java
  • presto-main-base/src/main/java/com/facebook/presto/sql/analyzer/StatementAnalyzer.java
  • presto-analyzer/src/main/java/com/facebook/presto/sql/analyzer/utils/StatementUtils.java

Change: Document the planned plan-optimizer rewrite of CREATE VECTOR INDEX into a metadata-only UDF SELECT invocation.
Details:
  • Add a detailed design document describing how CREATE VECTOR INDEX will later be transformed at the plan optimizer level into a SELECT that calls a Python UDF with table/column/where/properties metadata.
  • Outline phases from parsing to optimization, the shape of the CreateVectorIndexNode, the optimizer rule, and the UDF contract.
Files:
  • CREATE_VECTOR_INDEX_PLAN_OPTIMIZER.md

github-actions bot commented Jan 26, 2026

Codenotify: Notifying subscribers in CODENOTIFY files for diff 5022f6b...98cf096.

No notifications.

skyelves changed the title from "feat[Vector Search][2/n]: Add syntax support for CREATE VECTOR INDEX" to "feat: Add syntax support for CREATE VECTOR INDEX" on Jan 26, 2026
skyelves changed the title from "feat: Add syntax support for CREATE VECTOR INDEX" to "feat: Add analysis support for CREATE VECTOR INDEX" on Jan 26, 2026

sourcery-ai bot (Contributor) left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • The change to properties in SqlBase.g4 to allow a trailing comma applies to all WITH (...) property lists, not just CREATE VECTOR INDEX; please double-check that this broader grammar relaxation (and the adjusted error expectations) is intentional for all existing statements that use properties.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The change to `properties` in `SqlBase.g4` to allow a trailing comma applies to all `WITH (...)` property lists, not just `CREATE VECTOR INDEX`; please double-check that this broader grammar relaxation (and the adjusted error expectations) is intentional for all existing statements that use `properties`.

## Individual Comments

### Comment 1
<location> `presto-parser/src/test/java/com/facebook/presto/sql/parser/TestSqlParserErrorHandling.java:83-86` </location>
<code_context>
+                 {"CREATE TABLE foo () AS (VALUES 1)",
</code_context>

<issue_to_address>
**suggestion (testing):** Add negative tests for invalid CREATE VECTOR INDEX syntax to the error-handling suite

With the grammar now supporting `CREATE VECTOR INDEX` (and `VECTOR` in the expected tokens), please add negative error-handling tests for malformed vector index statements in `getStatements()`. For instance:

- `CREATE VECTOR INDEX` (missing index name and rest of statement)
- `CREATE VECTOR INDEX idx ON` (missing table)
- `CREATE VECTOR INDEX idx ON t` (missing column list)
- `CREATE VECTOR INDEX idx ON t()` / `CREATE VECTOR INDEX idx ON t(,)` (invalid column list)

These cases help verify clear, stable error messages for common mistakes and protect against grammar regressions around the new syntax.

Suggested implementation:

```java
                {"CREATE TABLE foo () AS (VALUES 1)",
                 "line 1:19: mismatched input ')'. Expecting: 'FUNCTION', 'MATERIALIZED', 'OR', 'ROLE', 'SCHEMA', 'TABLE', 'TEMPORARY', 'TYPE', 'VECTOR', 'VIEW'"},
                {"CREATE TABLE foo (*) AS (VALUES 1)",
                 "line 1:19: mismatched input '*'. Expecting: 'FUNCTION', 'MATERIALIZED', 'OR', 'ROLE', 'SCHEMA', 'TABLE', 'TEMPORARY', 'TYPE', 'VECTOR', 'VIEW'"},
                {"CREATE VECTOR INDEX",
                 "line 1:20: mismatched input '<EOF>'. Expecting: <identifier>"},
                {"CREATE VECTOR INDEX idx ON",
                 "line 1:29: mismatched input '<EOF>'. Expecting: <identifier>"},
                {"CREATE VECTOR INDEX idx ON t",
                 "line 1:31: mismatched input '<EOF>'. Expecting: '('"},
                {"CREATE VECTOR INDEX idx ON t()",
                 "line 1:32: mismatched input ')'. Expecting: <identifier>"},
                {"CREATE VECTOR INDEX idx ON t(,)",
                 "line 1:32: mismatched input ','. Expecting: <identifier>"},
                {"SELECT grouping(a+2) FROM (VALUES (1)) AS t (a) GROUP BY a+2",

```

The exact error column numbers and messages (especially the expected tokens like `<identifier>` vs a concrete token name) may differ slightly depending on the current ANTLR grammar and error handler configuration in your version of Presto. If test failures occur:
1. Run the tests to see the actual parser error messages for each of the added SQL snippets.
2. Adjust the `line 1:XX:` column indices and the `Expecting: ...` portions in each of the new test cases to match the real output exactly.
3. If your test suite uses a helper to normalize or format error messages, ensure the new expectations follow that convention (e.g., quoting identifiers or token names consistently with nearby tests).
</issue_to_address>

Comment on lines 83 to +86
  {"CREATE TABLE foo () AS (VALUES 1)",
- "line 1:19: mismatched input ')'. Expecting: 'FUNCTION', 'MATERIALIZED', 'OR', 'ROLE', 'SCHEMA', 'TABLE', 'TEMPORARY', 'TYPE', 'VIEW'"},
+ "line 1:19: mismatched input ')'. Expecting: 'FUNCTION', 'MATERIALIZED', 'OR', 'ROLE', 'SCHEMA', 'TABLE', 'TEMPORARY', 'TYPE', 'VECTOR', 'VIEW'"},
  {"CREATE TABLE foo (*) AS (VALUES 1)",
- "line 1:19: mismatched input '*'. Expecting: 'FUNCTION', 'MATERIALIZED', 'OR', 'ROLE', 'SCHEMA', 'TABLE', 'TEMPORARY', 'TYPE', 'VIEW'"},
+ "line 1:19: mismatched input '*'. Expecting: 'FUNCTION', 'MATERIALIZED', 'OR', 'ROLE', 'SCHEMA', 'TABLE', 'TEMPORARY', 'TYPE', 'VECTOR', 'VIEW'"},

aditi-pandit (Contributor) left a comment

@skyelves: Thanks for this code. I had a couple of comments.


---

## Files to Modify/Create

There isn't any need to repeat all the code in the doc.


### 1. UDF Receives Metadata Only

The `create_local_index` UDF does **NOT** receive actual row data. It receives:

What is the need to wrap this in a CREATE VECTOR INDEX statement?

If we create a statement, then it needs to work with all kinds of tables, etc. The code doesn't seem to be that generic.

aditi-pandit (Contributor) commented

@skyelves : Thanks for this work. It might be good if you can write a basic RFC for this. It is quite a complex piece of work that is adding new syntax etc, and also we want to work with Iceberg and specific vector indexing libraries on our side as well, so it would be good to clear out that interface.

steveburnett (Contributor) left a comment

The .md file appears to be the same file as in PR #27027.

I feel like I would expect to see CREATE VECTOR INDEX documentation added in this PR, in the form of a .rst file in https://github.com/prestodb/presto/tree/master/presto-docs/src/main/sphinx/sql.

skyelves added a commit to skyelves/presto that referenced this pull request Feb 19, 2026
prestodb#27036)

Summary: Pull Request resolved: prestodb#27036

Differential Revision: D91524358
skyelves added a commit to skyelves/presto that referenced this pull request Feb 24, 2026
Summary: Pull Request resolved: prestodb#27036

Differential Revision: D91524358
distanceMetric, indexOptions, partitionedByJson.toString());

// Build synthetic query: SELECT create_vector_index('source_table', 'col1', 'col2', 'type', 'props')
// No FROM clause — the Python script handles all data access, no table scan needed.
Contributor

Since vector index creation requires scanning table data (e.g., embedding columns), delegating all data access to the Python UDF without a visible table scan prevents the analyzer from registering read dependencies on the indexed table and enforcing column-level SELECT privileges during index build. This effectively indicates that the index build process accesses the underlying data source outside Presto’s planning and execution framework, which may lead to inconsistencies with snapshot isolation, partition pruning, predicate pushdown, resource group enforcement, and access control checks. To maintain governance and consistency guarantees expected in a lakehouse execution model, it would be preferable for index build operations to execute within Presto’s planning and scheduling framework, with the underlying table scan represented in the analysed query similar to CTAS.

skyelves (Member, Author)

Sorry for the noise. This was actually a bug, and I just fixed it. The analysed query should be similar to CTAS.

skyelves added a commit to skyelves/presto that referenced this pull request Mar 13, 2026
Differential Revision: D91524358
skyelves pushed a commit to skyelves/presto that referenced this pull request Mar 13, 2026
Summary: Pull Request resolved: prestodb#27036

Differential Revision: D91524358
linux-foundation-easycla bot commented Mar 13, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: skyelves / name: Ke Wang (98cf096)

@gggrace14 gggrace14 self-requested a review March 16, 2026 23:05
gggrace14 previously approved these changes Mar 16, 2026
Contributor

@aditi-pandit aditi-pandit left a comment

Thanks @skyelves for this code. Could you please add some tests to the TestAnalyzer class for the StatementAnalyzer code?

Contributor

@aditi-pandit aditi-pandit left a comment

Please add unit tests.

@steveburnett
Contributor

Please add a release note - or NO RELEASE NOTE - following the Release Notes Guidelines to pass the failing but not required CI check.

skyelves added a commit to skyelves/presto that referenced this pull request Mar 17, 2026
## Release Notes
Please follow release notes guidelines and fill in the release notes below.
```
  == RELEASE NOTES ==
  General Changes
  * Add support for create-vector-index statement, which creates
    vector search indexes on table columns with configurable index properties
    and partition filtering via an ``UPDATING FOR`` clause.
```

@skyelves
Member Author

> Please add unit tests.

Thanks, added some tests. Could you take another look?

@skyelves
Member Author

> Please add a release note - or NO RELEASE NOTE - following the Release Notes Guidelines to pass the failing but not required CI check.

added

aditi-pandit previously approved these changes Mar 17, 2026
Contributor

@aditi-pandit aditi-pandit left a comment

Thanks for adding the tests.

public static final class CreateVectorIndexAnalysis
{
    private final QualifiedObjectName sourceTableName;
    private final QualifiedObjectName targetTableName;
Contributor

The index artifact is currently represented using QualifiedObjectName, similar to a table. Since vector indexes may also be implemented as connector-managed artifacts (e.g., external index files or metadata entries), it would be better to treat this as a logical index identifier rather than strictly a physical table. This would allow connectors to map the index name to their own storage model while keeping the engine abstraction consistent across different implementations.
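
One way to picture this suggestion is the sketch below. It is not code from this PR: `LogicalIndexName` and its `resolve` method are hypothetical names, and the two mappings in the usage note are made-up examples of how a connector might interpret the same identifier.

```java
import java.util.Objects;
import java.util.function.Function;

// Hypothetical engine-side identifier: it names an index without saying
// whether the index is stored as a table, a file, or a metadata entry.
final class LogicalIndexName
{
    private final String catalog;
    private final String schema;
    private final String index;

    LogicalIndexName(String catalog, String schema, String index)
    {
        this.catalog = Objects.requireNonNull(catalog, "catalog is null");
        this.schema = Objects.requireNonNull(schema, "schema is null");
        this.index = Objects.requireNonNull(index, "index is null");
    }

    String getIndex()
    {
        return index;
    }

    // The connector, not the engine, decides what the name maps to.
    <T> T resolve(Function<LogicalIndexName, T> connectorMapping)
    {
        return connectorMapping.apply(this);
    }

    @Override
    public String toString()
    {
        return catalog + "." + schema + "." + index;
    }
}
```

A table-backed connector could call `name.resolve(n -> n.toString())` to get a table name, while a file-backed one could call `name.resolve(n -> n.getIndex() + ".idx")`. The engine-side type stays the same in both cases.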

Map<String, ColumnHandle> sourceColumns = metadataResolver.getColumnHandles(sourceTableHandle);
for (Identifier column : node.getColumns()) {
    if (!sourceColumns.containsKey(column.getValue())) {
        throw new SemanticException(MISSING_COLUMN, column, "Column '%s' does not exist in source table '%s'", column.getValue(), sourceTableName);
Contributor

The current validation ensures that the specified columns exist in the source table, which is good. However, since the syntax allows either (embedding) or (row_id, embedding), it would be helpful to also validate the column structure. If only one column is provided, it should be validated as an embedding column rather than a row identifier. Additionally, when two columns are specified, they should follow the (row_id, embedding) order (optional). This validation can help prevent invalid cases like (id) from passing analysis and failing later during index creation.
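
A rough sketch of such a structural check is shown below, written without Presto classes so it stands alone. Everything in it is an assumption made for illustration: the class name, the use of type strings, the rule that an embedding is `array(real)` or `array(double)`, and `IllegalArgumentException` in place of `SemanticException`.

```java
import java.util.List;
import java.util.Map;

// Hypothetical, engine-independent sketch; not code from this PR.
final class VectorIndexColumnCheck
{
    private VectorIndexColumnCheck() {}

    // Assumption: an embedding column is typed array(real) or array(double).
    static boolean isEmbeddingType(String type)
    {
        return type.equals("array(real)") || type.equals("array(double)");
    }

    // Accepts (embedding) or (row_id, embedding) and rejects everything else.
    static void validate(List<String> columns, Map<String, String> columnTypes)
    {
        if (columns.isEmpty() || columns.size() > 2) {
            throw new IllegalArgumentException("Expected (embedding) or (row_id, embedding), got " + columns);
        }
        for (String column : columns) {
            if (!columnTypes.containsKey(column)) {
                throw new IllegalArgumentException("Column '" + column + "' does not exist in source table");
            }
        }
        // The embedding is always the last column in either accepted form.
        String embedding = columns.get(columns.size() - 1);
        if (!isEmbeddingType(columnTypes.get(embedding))) {
            throw new IllegalArgumentException("Column '" + embedding + "' is not an embedding column");
        }
        // With two columns, the first must not itself be an embedding.
        if (columns.size() == 2 && isEmbeddingType(columnTypes.get(columns.get(0)))) {
            throw new IllegalArgumentException("Column '" + columns.get(0) + "' cannot be used as a row identifier");
        }
    }
}
```

With a source table whose columns are `id` (`bigint`) and `embedding` (`array(real)`), both `(embedding)` and `(id, embedding)` pass, while `(id)` and `(embedding, id)` are rejected during analysis.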

feilong-liu previously approved these changes Mar 18, 2026
@meta-codesync meta-codesync bot changed the title feat: Add analysis support for CREATE VECTOR INDEX (#27036) feat: Add analysis support for CREATE VECTOR INDEX (#27036) (#27036) Mar 18, 2026
    throw new SemanticException(MISSING_TABLE, node, "Source table '%s' does not exist", sourceTableName);
}

QualifiedObjectName targetTable = createQualifiedObjectName(session, node, node.getIndexName(), metadata);
Contributor

@gggrace14 gggrace14 Mar 18, 2026

Hi @skyelves, here you're creating targetTable of type QualifiedObjectName from node.getIndexName(), which is of type QualifiedName. Are you following an existing example? Is it okay to keep using QualifiedName at the Analyzer layer? Can you help check? A quick look at the Analysis class shows several uses of QualifiedName there.
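
For readers unfamiliar with the distinction, the sketch below illustrates it without using Presto classes. A parsed name carries one to three parts as written in the SQL text, while a resolved name always has a catalog, schema, and object. The helper name and the session defaults are assumptions made for this example, not the engine's implementation.

```java
import java.util.List;

// Hypothetical sketch of resolving a parsed name against session defaults.
final class NameResolutionSketch
{
    private NameResolutionSketch() {}

    // parts is the name as written: [object], [schema, object],
    // or [catalog, schema, object].
    static String resolve(List<String> parts, String sessionCatalog, String sessionSchema)
    {
        if (parts.isEmpty() || parts.size() > 3) {
            throw new IllegalArgumentException("Expected 1 to 3 name parts, got " + parts);
        }
        int n = parts.size();
        String object = parts.get(n - 1);
        // Missing parts are filled in from the session.
        String schema = n >= 2 ? parts.get(n - 2) : sessionSchema;
        String catalog = n == 3 ? parts.get(0) : sessionCatalog;
        return catalog + "." + schema + "." + object;
    }
}
```

Keeping the parsed form in the analysis object defers this resolution, whereas resolving up front fixes the catalog and schema at analysis time.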

## Release Notes
```
== NO RELEASE NOTE ==
```



Differential Revision: D91524358

Pulled By: skyelves