Query not using HNSW index (counting, faceted search, pagination) #619
Comments
I think the major reason is SELECT * FROM documents WHERE doc_emb <<#>> sphere($1, 0.2) ORDER BY doc_emb <#> $1; in your case?
pgvector also had a similar discussion on the distance filter problem at pgvector/pgvector#678
Thanks, I will try that. I didn't know about the ...
Update
The ...
Here is the full query:
EXPLAIN ANALYZE
WITH cte AS (
SELECT *
FROM documents
WHERE doc_emb <<#>> sphere('[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16, -0.2)
)
(
SELECT
'doc_result' AS Section,
json_build_object( 'id', id, 'src', src ) AS JSON_Value
FROM cte
ORDER BY
doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16
asc
LIMIT 5
OFFSET 0
)
UNION ALL
(
SELECT
'num_documents' AS Section,
json_build_object('count', COUNT(*)) AS JSON_Value
FROM cte
)
UNION ALL
(
-- faceted search info of author field
SELECT
'author' AS Section,
json_build_object(
'value', author,
'count', COUNT(*)
) AS JSON_Value
FROM cte
GROUP BY author
ORDER BY COUNT(*) DESC
LIMIT 20
);
Output:
Append (cost=1122406.54..1251274.29 rows=26 width=64) (actual time=2453.094..2484.139 rows=26 loops=1)
CTE cte
-> Index Scan using hnsw_index on documents (cost=0.00..1009526.62 rows=2712831 width=1204) (actual time=3.987..1526.692 rows=36742 loops=1)
Index Cond: (doc_emb <<#>> '("[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]",-0.2)'::sphere_vecf16)
-> Subquery Scan on "*SELECT* 1_1" (cost=112879.92..112879.99 rows=5 width=64) (actual time=2453.092..2453.097 rows=5 loops=1)
-> Limit (cost=112879.92..112879.94 rows=5 width=68) (actual time=2453.060..2453.064 rows=5 loops=1)
-> Sort (cost=112879.92..119662.00 rows=2712831 width=68) (actual time=2147.096..2147.097 rows=5 loops=1)
Sort Key: ((cte.doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16))
Sort Method: top-N heapsort Memory: 26kB
-> CTE Scan on cte (cost=0.00..67820.78 rows=2712831 width=68) (actual time=4.075..2134.694 rows=36742 loops=1)
-> Aggregate (cost=61038.70..61038.71 rows=1 width=64) (actual time=9.371..9.371 rows=1 loops=1)
-> CTE Scan on cte cte_1 (cost=0.00..54256.62 rows=2712831 width=0) (actual time=0.001..8.034 rows=36742 loops=1)
-> Subquery Scan on "*SELECT* 3" (cost=67828.60..67828.85 rows=20 width=64) (actual time=21.657..21.662 rows=20 loops=1)
-> Limit (cost=67828.60..67828.65 rows=20 width=104) (actual time=21.650..21.652 rows=20 loops=1)
-> Sort (cost=67828.60..67829.10 rows=200 width=104) (actual time=21.633..21.634 rows=20 loops=1)
Sort Key: (count(*)) DESC
Sort Method: top-N heapsort Memory: 28kB
-> HashAggregate (cost=67820.78..67823.28 rows=200 width=104) (actual time=16.220..20.829 rows=5197 loops=1)
Group Key: cte_2.author
Batches: 1 Memory Usage: 737kB
-> CTE Scan on cte cte_2 (cost=0.00..54256.62 rows=2712831 width=32) (actual time=0.001..8.193 rows=36742 loops=1)
Planning Time: 1.276 ms
JIT:
Functions: 21
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.568 ms, Inlining 55.023 ms, Optimization 141.159 ms, Emission 109.854 ms, Total 307.604 ms
Execution Time: 2505.256 ms
Possible ideas of improvement:
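One direction that might be worth testing, given the plan above: the CTE carries every column (width=1204), and the top-N sort evaluates doc_emb <#> ... on the rows coming out of the CTE scan. The sketch below projects only the columns the outer queries read and computes the distance once inside the CTE. The table, columns, operators and sphere() call are taken from the query above; the alias distance is introduced here for illustration, and $1 stands for the query embedding bound as a parameter. Whether this changes the timings has not been measured.

```sql
-- Sketch only, not a measured fix.
-- "distance" is a hypothetical alias; everything else comes from the query above.
WITH cte AS (
    SELECT
        id,
        src,
        author,
        doc_emb <#> $1 AS distance   -- evaluated once per row that passes the filter
    FROM documents
    WHERE doc_emb <<#>> sphere($1, -0.2)
)
(
    SELECT
        'doc_result' AS Section,
        json_build_object('id', id, 'src', src) AS JSON_Value
    FROM cte
    ORDER BY distance ASC            -- sorts on the stored value instead of the vector
    LIMIT 5
    OFFSET 0
)
UNION ALL
(
    SELECT
        'num_documents' AS Section,
        json_build_object('count', COUNT(*)) AS JSON_Value
    FROM cte
)
UNION ALL
(
    -- faceted search info of author field
    SELECT
        'author' AS Section,
        json_build_object('value', author, 'count', COUNT(*)) AS JSON_Value
    FROM cte
    GROUP BY author
    ORDER BY COUNT(*) DESC
    LIMIT 20
);
```

Because the CTE is referenced more than once, PostgreSQL 12 and later materialize it by default, so the narrower row and the stored distance are what the three branches read. Comparing EXPLAIN ANALYZE output before and after is the only way to confirm any gain.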
I want to provide search results with more information, such as how many documents were found (counting) and how many documents fall under each value of other fields (faceted search).
I have solved these use cases in a single query using a CTE. However, when I include the count of documents found, the query planner decides not to use the HNSW index, which results in very slow queries.
The question is: how can I make the query always use the HNSW index?
Context:
Table
Index
Example of simple query (use HNSW index = yes)
Visualize query plan
Example with count (use HNSW index = no)
Visualize query plan
Example with count + faceted search (use HNSW index = no)
The documents table has an author column. In this example I want to count how many documents each author has.
Visualize query plan
Side note 1
In all queries I pass the query document embedding as a PostgreSQL binary parameter and use $1 to reference it in the query.
Every query has 2 parts:
WHERE doc_emb <#> $1 < 0.2
ORDER BY doc_emb <#> $1 asc
I also want to know whether the doc_emb <#> $1 computation is reused, because the same query embedding ($1) appears in both parts. Or is it recommended to precompute a distance column?
Side note 2
OFFSET 0 is necessary because I have pagination and this is page 1; for other pages the OFFSET value will be different. If you can come up with other pagination strategies, I would appreciate it.
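One alternative to OFFSET that could be tried is keyset (seek) pagination: instead of skipping rows, each page starts after the last row of the previous page. The sketch below is an assumption-laden illustration, not something tested against pgvecto.rs. It assumes id is unique and orderable, that the client keeps the distance and id of the last row it received and sends them back as the hypothetical parameters $2 and $3, and it reuses the sphere() filter suggested in the comments. Whether the planner still chooses the HNSW index with the extra predicate would need to be checked with EXPLAIN ANALYZE.

```sql
-- Sketch only: keyset pagination over distance-ordered results.
-- $1 = query embedding (as in the thread)
-- $2 = distance of the last row on the previous page (hypothetical parameter)
-- $3 = id of the last row on the previous page (hypothetical parameter)
SELECT
    id,
    src,
    doc_emb <#> $1 AS distance       -- returned so the client can pass it back as $2
FROM documents
WHERE doc_emb <<#>> sphere($1, 0.2)
  AND (doc_emb <#> $1, id) > ($2, $3)  -- row comparison: strictly after the last seen row
ORDER BY distance ASC, id ASC        -- id breaks ties between equal distances
LIMIT 5;
```

For page 1 the row-comparison predicate is simply omitted. The trade-off is that keyset pagination only supports moving to the next page from a known position, not jumping to an arbitrary page number.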