Thanks, @gggrace14, for the write-up. The proposal currently does not describe the lifecycle management of vector indexes with respect to underlying table mutations. In a lakehouse environment such as Iceberg, operations like appends, partition backfills, deletes, file rewrites (compaction), and snapshot expiration can make the index stale or physically invalid, especially if the index stores file-level row references. Could we clarify how index validity is expected to be maintained across such operations? Specifically:
- How is index staleness detected?
- Are indexes expected to be snapshot-bound?
- How should index cleanup or rebuild be handled after file rewrites or snapshot expiry?
- Is automatic invalidation or maintenance of index metadata in scope?
This would help ensure correctness and avoid cases where ANN search may return outdated or invalid results.
| "I need the index definition created now, but the index should only build when I instruct." | ||
|
|
||
| ```sql | ||
| CREATE VECTOR INDEX vector_index ON TABLE candidates_table(id, embedding) WITH ( |
There was a problem hiding this comment.
After creating the vector index metadata without building it for any partitions, the expected query behavior is not entirely clear. For example, if a user runs:
SELECT *
FROM candidates_table
ORDER BY embedding <-> query_vector
LIMIT 10;
when the index is defined but not yet populated, should the engine fall back to a full scan, ignore the index, emit a warning, or fail the query? It would be helpful to clarify the intended behavior in this scenario to ensure predictable query execution.
There was a problem hiding this comment.
By current design, the RFC leaves it to each individual connector to define the behavior when the index has only a schema and no data.
For example, internally at Meta the current strategy is to check a condition, build the index on the fly if the condition is met, and fall back to a full scan otherwise. We put this strategy into a ConnectorPlanOptimizer subclass.
I think it's too early to define this behavior, or any behavior, in the SPI and require every connector to respect it. A better approach is to let individual connectors contribute their implementations first, and then extract the common behavior into the SPI afterwards.
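The strategy described in this reply could be sketched roughly as below. This is only an illustration: `IndexFallbackSketch`, `ScanStrategy`, and `choose` are hypothetical names that do not exist in the Presto SPI, and the actual "condition" used at Meta is not stated in this thread, so it is modeled as an opaque boolean.

```java
/** Hypothetical sketch; none of these names are part of the RFC or the Presto SPI. */
final class IndexFallbackSketch {
    enum ScanStrategy { USE_INDEX, BUILD_ON_THE_FLY, FULL_SCAN }

    /**
     * Picks a strategy for one index partition.
     *
     * @param indexHasData      true if the index partition has already been built
     * @param buildConditionMet connector-specific condition (an assumption here)
     */
    static ScanStrategy choose(boolean indexHasData, boolean buildConditionMet) {
        if (indexHasData) {
            return ScanStrategy.USE_INDEX;
        }
        // The index is defined but empty: build it on the fly only if the
        // connector's condition holds, otherwise fall back to a full scan.
        return buildConditionMet ? ScanStrategy.BUILD_ON_THE_FLY : ScanStrategy.FULL_SCAN;
    }
}
```

A connector could call something like `choose(...)` from its own ConnectorPlanOptimizer subclass, which keeps the decision out of the SPI as proposed above.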
|
|
||
| **ON `TABLE candidates_table(id, embedding)` clause** | ||
| - `embedding` is the embedding column to create the index on. | ||
| - `id` is the unique identifier column that the user needs to provide. |
There was a problem hiding this comment.
It would be beneficial to keep the PRIMARY KEY / ROW_ID specification optional rather than mandatory for vector index creation.
From the Iceberg V3 perspective, tables with Row Level Lineage enabled can provide a stable row-level identifier via $row_id. In such cases, the identifier column can be made optional, allowing the connector to automatically fall back to $row_id when a user-defined id is not specified.
This would avoid requiring users to explicitly provide a unique identifier for tables where lineage-based row identity is already available.
There was a problem hiding this comment.
Hi @NivinCS, sure, that makes sense. I will change the RFC to make the id field optional.
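A minimal sketch of the fallback discussed in this thread, assuming the connector can tell whether row-level lineage is enabled on the table. `IndexIdColumnSketch` and `resolveIdColumn` are hypothetical names; only the `$row_id` hidden column name comes from the discussion above.

```java
import java.util.Optional;

/** Hypothetical sketch; not part of the RFC or the Presto SPI. */
final class IndexIdColumnSketch {
    static final String ROW_ID_COLUMN = "$row_id";

    /**
     * Returns the column used to identify rows in the vector index.
     * A user-supplied id wins; otherwise fall back to $row_id when lineage is available.
     */
    static String resolveIdColumn(Optional<String> userProvidedId, boolean rowLineageEnabled) {
        if (userProvidedId.isPresent()) {
            return userProvidedId.get();
        }
        if (rowLineageEnabled) {
            return ROW_ID_COLUMN;
        }
        throw new IllegalArgumentException(
                "An id column is required when the table has no row-level lineage");
    }
}
```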
| - Alternatively, use the hidden column $row_id for id and remove the requirement from users. | ||
|
|
||
| **`UPDATING FOR` clause** | ||
| - For each of the index partitions mapped to the range, an index will be created. |
There was a problem hiding this comment.
It may be beneficial to make the partition specification in UPDATE VECTOR INDEX (e.g., FOR PARTITION (...)) optional rather than mandatory.
Instead of requiring users to manually specify partition ranges for index updates, the connector could automatically detect changed partitions and perform incremental index refresh, similar to materialized view incremental refresh.
This would reduce the need for manual orchestration and improve usability for incremental index maintenance.
cc : @tdcmeehan
There was a problem hiding this comment.
By design, UPDATE VECTOR INDEX vector_index WHERE <filter> should be the Presto interface for incremental index refresh, similar to REFRESH MATERIALIZED VIEW ... WHERE <filter>. I think automatic detection of partition changes in the base table should be out of scope for Presto. For example, we could have a pub-sub service outside of Presto that detects changes based on the base table metadata. Once a change to a base table partition is detected, any automated mechanism could assemble the filter and call UPDATE VECTOR INDEX to instruct Presto to refresh the index incrementally.
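As a rough illustration of the external mechanism mentioned above, the sketch below turns a list of changed partition values into refresh statements. `IndexRefreshStatementSketch` and `refreshStatements` are hypothetical names, and how the changed partitions are detected is left out entirely. The statement shape follows the `UPDATE VECTOR INDEX ... WHERE ds = '2026-01-04'` example from the RFC and would change if the keyword is renamed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper living outside Presto; not part of the RFC. */
final class IndexRefreshStatementSketch {
    /** Builds one refresh statement per changed partition value. */
    static List<String> refreshStatements(
            String indexName, String partitionColumn, List<String> changedPartitions) {
        List<String> statements = new ArrayList<>();
        for (String value : changedPartitions) {
            // Escape single quotes so the value stays a valid SQL string literal.
            String escaped = value.replace("'", "''");
            statements.add("UPDATE VECTOR INDEX " + indexName
                    + " WHERE " + partitionColumn + " = '" + escaped + "'");
        }
        return statements;
    }
}
```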
|
|
||
| #### OSS Implementation | ||
|
|
||
| OSS implementation will be exact vector search and implement a ConnectorPlanOptimizer to rewrite to a plan with MAX_BY(). |
There was a problem hiding this comment.
It will be beneficial to include a reusable toolkit/library as part of the proposal to handle internal query rewrites for vector search operations.
In many cases, executing ANN search would otherwise require users to explicitly join the base table with the corresponding vector index using a row identifier (e.g., user-defined id or $row_id). This can expose index implementation details and make query syntax more complex, requiring users to understand how embeddings and index data are related internally.
Providing a reusable component that transparently rewrites vector search predicates (e.g., ANN_SEARCH(...)) into the appropriate join between the base table and vector index would enable a simpler and more seamless user experience. This would allow users to express ANN queries without manually specifying join logic, while the underlying connector leverages the index automatically.
Such a library could be integrated by connectors (e.g., Iceberg) to perform these rewrites behind the scenes, avoiding duplication of rewrite logic across connectors and keeping index implementation details abstracted from end users.
cc : @tdcmeehan
aditi-pandit
left a comment
There was a problem hiding this comment.
Thanks @gggrace14 for this writeup. Overall I'm in favor of this proposal.
I'm also curious, since FAISS functions were recently exposed in C++: are you planning to do any follow-up work with the library?
|
|
||
| ##### Journey 4: Search with implicit index lookup | ||
|
|
||
| "I want to run ANN search automatically using the index registered on my table." |
There was a problem hiding this comment.
Does the optimizer know about the index as such? The use of the index should be abstracted by the VECTOR_SEARCH function, no?
There was a problem hiding this comment.
Whether the candidates table has an index or not would be maintained in metadata, which is accessible to the optimizer. If the index is available, the optimizer could pick it up and rewrite the VECTOR_SEARCH query.
|
|
||
| #### Update Partitioned Index | ||
|
|
||
| ```sql |
There was a problem hiding this comment.
Are we missing the SPI for these ? Can you elaborate ?
| ```sql | ||
| SET SESSION vector_search_index = 'di:vector_index:nprobe=50'; | ||
|
|
||
| SELECT |
There was a problem hiding this comment.
Have you considered implementing VECTOR_SEARCH as a TABLE function? It is easier to express the input tables, partitioning, and parameters that way.
There was a problem hiding this comment.
Hi @aditi-pandit, yes, we do plan to implement VECTOR_SEARCH as a TABLE function, on top of the table function framework.
I also meant to follow up. In the prior TSC meeting, you mentioned that you have some experience or resources that you could share with us to implement the VECTOR_SEARCH function as a table function. Would you kindly provide more details? Thank you!
|
|
||
| #### Execution and SPI | ||
|
|
||
| Our proposal is to introduce generic SPI nodes: VectorSearchNode and VectorSearchOptions, which together provide an abstract interface for vector search. This abstract interface will only include mandatory parameters, allowing actual implementations to extend it. |
There was a problem hiding this comment.
Can you give more details about VectorSearchNode and VectorSearchOptions. Will VectorSearchNode be a PlanNode ?
There was a problem hiding this comment.
Sorry, Aditi, I think this paragraph is outdated. I'm removing this paragraph.
| - The individual table partition is large. | ||
| - A new partition is added every day or every hour. | ||
| - Past partitions are occasionally backfilled. | ||
|
|
There was a problem hiding this comment.
It would be helpful to clarify the behavior when base table partitions are dropped, since index partitions are mapped to them. Should corresponding index partitions be automatically cleaned up or require manual action?
Additionally, it may be beneficial to define lifecycle DDL such as DROP VECTOR INDEX for operational usability.
There was a problem hiding this comment.
Yea, I'm adding the DDL DROP VECTOR INDEX. A good suggestion.
I'm changing this RFC to add a getVectorSearchIndexStatus() method to the SPI class MetadataManager and ConnectorMetadata. Then the search path could decide whether to consume the index partitions given the index status.
Regarding the behavior when base table partitions are dropped, I think we should leave it to users to decide whether to drop the index partitions. In some cases, index building is quite expensive, and the user might not want to drop the index partitions until the index quality no longer meets certain criteria. We could make a suggestion in the RFC. At Meta, we currently drop the index partitions. I'm also adding this clarification to the RFC.
| Update only the index partition mapped to the base table partition ds = '2026-01-04'. | ||
|
|
||
| ```sql | ||
| UPDATE VECTOR INDEX vector_index |
There was a problem hiding this comment.
It would be helpful to clarify query behavior during index build or refresh. For example, can queries read partially built partitions?
There was a problem hiding this comment.
We can add a getVectorSearchIndexStatus() method to the SPI classes MetadataManager and ConnectorMetadata. It's better to leave it to the individual ConnectorPlanOptimizer to decide whether to pick up partially built index partitions or not.
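One way a connector might consume such a status is sketched below. Only the method name `getVectorSearchIndexStatus()` comes from this thread; its signature and return type are not defined here, so the per-partition status map, the `PartitionStatus` enum, and `usablePartitions` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; the real shape of the index status is not defined in this thread. */
final class IndexStatusSketch {
    enum PartitionStatus { NOT_BUILT, BUILDING, READY }

    /**
     * Returns the index partitions the search path may consume.
     * Partitions still being built are included only if the connector opts in.
     */
    static List<String> usablePartitions(
            Map<String, PartitionStatus> statusByPartition, boolean allowPartiallyBuilt) {
        List<String> usable = new ArrayList<>();
        for (Map.Entry<String, PartitionStatus> entry : statusByPartition.entrySet()) {
            PartitionStatus status = entry.getValue();
            boolean ready = status == PartitionStatus.READY;
            boolean partial = status == PartitionStatus.BUILDING && allowPartiallyBuilt;
            if (ready || partial) {
                usable.add(entry.getKey());
            }
        }
        return usable;
    }
}
```

Partitions left out of the result would then be handled by whatever fallback the connector chooses, such as exact search.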
| CREATE VECTOR INDEX vector_index ON TABLE candidates_table(id, embedding) WITH ( | ||
| index_type = 'ivf_rabitq4', | ||
| distance_metric = 'cosine', | ||
| index_options = 'nlist=100000,nb=8', |
There was a problem hiding this comment.
What do the parameters nlist and nb represent?
- How do these parameters affect indexing performance, recall, and query latency?
- What guidelines should users follow to choose optimal values for different dataset sizes and vector dimensions?
- Is there any automatic tuning mechanism available, or are users expected to benchmark and manually determine the best configuration?
There was a problem hiding this comment.
Hi @Dilli-Babu-Godari, generally index_options is specific to each index_type. Each connector is responsible for checking whether the index_type is supported and whether the index_options are valid for that index_type. (I'm adding this detail to the RFC.)
Specifically,
1. How do these parameters affect indexing performance, recall, and query latency?
nlist is an option for any ivf_* index and sets the number of centroids. The typical trade-off around the number of centroids also holds here. A larger nlist means smaller clusters. If the data is well distributed, this can greatly reduce the search space, and thus the number of records loaded onto each node as well as memory pressure. If the data is not well distributed, then we need to increase nprobe.
nb is an option for the RabitQ index. A higher nb provides better accuracy at the cost of additional storage, as in https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#if-maximum-compression-is-required-then-rabitq.
2. What guidelines should users follow to choose optimal values for different dataset sizes and vector dimensions?
For nlist, we can use the formula here: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index#if-below-1m-vectors-ivfk.
For nb, more bits means better accuracy, but at the cost of memory.
3. Is there any automatic tuning mechanism available, or are users expected to benchmark and manually determine the best configuration?
At the moment, we derive default values for the options in code where possible. The index_options parameter is what the user can use to customize the options. Automatic tuning is a later step, since it needs an index quality feedback loop.
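To make the option handling concrete, here is a small sketch that parses an index_options string such as 'nlist=100000,nb=8' and computes a candidate nlist range. `IndexOptionsSketch` and its methods are hypothetical and not part of the RFC. The range of 4*sqrt(N) to 16*sqrt(N) is my reading of the linked FAISS wiki section for datasets below 1M vectors; the wiki gives different guidance for larger datasets, so treat this as a starting point rather than a rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; a real connector would validate options per index_type. */
final class IndexOptionsSketch {
    /** Parses "key=value,key=value" into an ordered map of numeric options. */
    static Map<String, Long> parse(String indexOptions) {
        Map<String, Long> options = new LinkedHashMap<>();
        for (String pair : indexOptions.split(",")) {
            String[] keyValue = pair.trim().split("=", 2);
            if (keyValue.length != 2 || keyValue[0].trim().isEmpty()) {
                throw new IllegalArgumentException("Malformed index option: '" + pair + "'");
            }
            options.put(keyValue[0].trim(), Long.parseLong(keyValue[1].trim()));
        }
        return options;
    }

    /** Returns {low, high} = {4*sqrt(N), 16*sqrt(N)} for a dataset of N vectors. */
    static long[] suggestedNlistRange(long vectorCount) {
        double root = Math.sqrt((double) vectorCount);
        return new long[] {Math.round(4 * root), Math.round(16 * root)};
    }
}
```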
Hi @NivinCS, in general the design is to define only the minimum required parts as SPI and leave individual connectors maximum flexibility to define their own behavior. Specifically, with respect to how index validity is maintained, we could leave it to the individual ConnectorPlanOptimizer to detect whether an index partition is stale and to return a reasonable result when the ConnectorPlanOptimizer plans/rewrites the vector search query. For SPI, thinking through your questions, I think what could be added here is a
I could describe below what we will put into the HiveConnector to maintain index partition validity at Meta, by answering your questions above. Meta has an internal HiveConnector extension. I think you can implement the Iceberg behaviors in the counterpart classes of IcebergConnector. Again, I think these should belong to the connectors and not to the SPI.
1. How is index staleness detected?
We'll have a pub-sub service that subscribes to the metadata of the candidate table (base table). If the corresponding base table partition is updated, overwritten, or deleted, it will trigger dropping the index partition. Then in the search path, we have a VectorSearchRewritePlanOptimizer that implements ConnectorPlanOptimizer. In the VectorSearchRewritePlanOptimizer, we compare the available partitions of the index against those of the base table in the query range. If an index partition is available, we'll use it to get the kNN of that partition. Otherwise, we will build the index on the fly. Another option for an unavailable index partition is to fall back to exact search. We then aggregate across all partitions in the query range and compute the final kNN from the kNNs of each partition.
2. Are indexes expected to be snapshot-bound?
3. How should index cleanup or rebuild be handled after file rewrites or snapshot expiry?
4. Is automatic invalidation or maintenance of index metadata in scope?
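The final aggregation step in answer 1 (combining the per-partition kNN lists into one global kNN) could look roughly like the sketch below. `KnnMergeSketch`, `Neighbor`, and `mergeTopK` are hypothetical names, and the sketch assumes that a smaller distance means a closer match. It does not reflect how VectorSearchRewritePlanOptimizer is actually implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of merging per-partition kNN results into a global top k. */
final class KnnMergeSketch {
    static final class Neighbor {
        final long id;
        final double distance;

        Neighbor(long id, double distance) {
            this.id = id;
            this.distance = distance;
        }
    }

    /** Returns the k nearest neighbors overall, ordered from closest to farthest. */
    static List<Neighbor> mergeTopK(List<List<Neighbor>> perPartitionResults, int k) {
        // Max-heap on distance: the farthest of the current best k sits on top
        // and is evicted as soon as the heap grows beyond k entries.
        PriorityQueue<Neighbor> heap = new PriorityQueue<>(
                Comparator.comparingDouble((Neighbor n) -> n.distance).reversed());
        for (List<Neighbor> partition : perPartitionResults) {
            for (Neighbor neighbor : partition) {
                heap.add(neighbor);
                if (heap.size() > k) {
                    heap.poll();
                }
            }
        }
        List<Neighbor> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((Neighbor n) -> n.distance));
        return result;
    }
}
```

Because each partition contributes at most its own k candidates, the merge only has to look at k times the number of partitions in the query range.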
jja725
left a comment
There was a problem hiding this comment.
Hi @gggrace14, thanks for the proposal, and great to see you again in the community! This would definitely help Presto get involved in AI/ML workloads. I recently worked on the Lance connector and would be happy to jump in and help with the vector search integration!
| "I need to populate the vector index for 2026-01-04." | ||
|
|
||
| ```sql | ||
| UPDATE VECTOR INDEX vector_index WHERE |
There was a problem hiding this comment.
UPDATE VECTOR INDEX: This looks like a DML UPDATE but is actually a rebuild/refresh operation. REFRESH VECTOR INDEX or REBUILD VECTOR INDEX would be clearer and avoid confusion with standard SQL UPDATE semantics.
There was a problem hiding this comment.
Hi @jja725, I'm open to REFRESH VECTOR INDEX or REBUILD VECTOR INDEX, if no one else objects. I could change this RFC to use REFRESH VECTOR INDEX.
| - Add an AST `CreateVectorIndex` node that extends `Statement`. | ||
|
|
||
| ```java | ||
| public class CreateTableAsSelect extends Statement { } |
There was a problem hiding this comment.
You're right. Correcting the type now.
|
|
||
| ```sql | ||
| UPDATE VECTOR INDEX vector_index WHERE | ||
| ds = '2026-01-04' |
There was a problem hiding this comment.
This is confusing: does the WHERE clause here replace the original filter, extend it, or is it ignored? Why do we need this filter if the index is unpartitioned?
There was a problem hiding this comment.
I think the confusing part is that UPDATE VECTOR INDEX sounds like an update to the index schema. Here the semantics are actually to refresh the index partition ds = '2026-01-04'. I can change UPDATE VECTOR INDEX to REFRESH VECTOR INDEX, as in my reply to your previous comment. By design, we don't support mutating the index schema or replacing a vector index.
| Unique to Presto; critical for large-scale warehouse workloads with daily partitions | ||
|
|
||
| 3. **Pre-filtering advantage** | ||
| WHERE before search enables partition pruning (BigQuery lacks this) |
There was a problem hiding this comment.
It seems BigQuery actually does support pre-filters: https://docs.cloud.google.com/bigquery/docs/vector-index#pre-filters_and_post-filters
Hi @aditi-pandit, thank you for reviewing this RFC in the TSC meeting and here. Yes, at Meta we have so far accumulated an implementation for the HiveConnector on top of the FAISS library, for both the search path and
This RFC proposes the syntax and SPI to natively support Approximate Nearest Neighbor (ANN) vector search within Presto.