Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query not using HNSW index (counting, faceted search, pagination) #619

Open
javiabellan opened this issue Nov 27, 2024 · 3 comments
Open
Labels
type/question 🙋 Further information is requested

Comments

@javiabellan
Copy link

javiabellan commented Nov 27, 2024

I want to provide search results with more information, like how many documents have been found (counting) and also how many documents belong to other fields (faceted search).

I have solved this use cases in a single query using a CTE. However when i want to include information about how many documents have been found, the query planner decides to not picking the HNSW index resulting in very slow queries.

The question is: How to fix it to use the HNSW index always?

Context:

Table

CREATE TABLE documents (
    id integer PRIMARY KEY,
    src text,
    author text,
    doc_emb vecf16(1000),
);

Index

CREATE INDEX hnsw_doc_index ON documents
USING vectors (doc_emb vecf16_dot_ops)
WITH (options='[indexing.hnsw]
m = 16
ef_construction = 200');

Example of simple query (use HNSW index = yes)

EXPLAIN ANALYZE
WITH cte AS
(
    SELECT *
    FROM documents
    WHERE doc_emb <#> $1 < 0.2
)
(
    SELECT
        'doc_result' AS Section,
        json_build_object( 'id', id, 'src', src ) AS JSON_Value
    FROM cte
    ORDER BY doc_emb <#> $1 asc
    LIMIT 5
    OFFSET 0
);
Limit  (cost=0.00..2.91 rows=5 width=68) (actual time=4.240..4.914 rows=5 loops=1)
  ->  Index Scan using hnsw_doc_index on documents  (cost=0.00..1044978.07 rows=1794161 width=68) (actual time=4.238..4.912 rows=5 loops=1)
        Order By: (doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16)
        Filter: ((doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16) < '0.2'::double precision)
Planning Time: 1.260 ms
Execution Time: 5.040 ms

Visualize query plan

Example with count (use HNSW index = no)

EXPLAIN ANALYZE
WITH cte AS (
    SELECT *
    FROM documents
    WHERE doc_emb <#> $1 < 0.2
)
(
    SELECT
        'doc_result' AS Section,
        json_build_object( 'id', id, 'src', src ) AS JSON_Value
    FROM cte
    ORDER BY doc_emb <#> $1 asc
    LIMIT 5
    OFFSET 0
)
UNION ALL
(
    SELECT
        'num_documents' AS Section,
        json_build_object('count', COUNT(*)) AS JSON_Value
    FROM cte
);
Append  (cost=1110658.65..1151027.38 rows=6 width=64) (actual time=40454.851..40464.023 rows=6 loops=1)
  CTE cte
    ->  Seq Scan on documents  (cost=0.00..1036004.26 rows=1794161 width=1208) (actual time=1.287..39950.807 rows=37641 loops=1)
          Filter: ((doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16) < '0.2'::double precision)
          Rows Removed by Filter: 5378463
  ->  Subquery Scan on "*SELECT* 1_1"  (cost=74654.39..74654.46 rows=5 width=64) (actual time=40454.850..40454.854 rows=5 loops=1)
        ->  Limit  (cost=74654.39..74654.41 rows=5 width=68) (actual time=40454.665..40454.667 rows=5 loops=1)
              ->  Sort  (cost=74654.39..79139.80 rows=1794161 width=68) (actual time=40213.111..40213.112 rows=5 loops=1)
                    Sort Key: ((cte.doc_emb <#> '[0.010192871, -0.001909256, ... , -0.033477783, -0.015335083]'::vecf16))
                    Sort Method: top-N heapsort  Memory: 27kB
                    ->  CTE Scan on cte  (cost=0.00..44854.03 rows=1794161 width=68) (actual time=1.355..40198.036 rows=37641 loops=1)
  ->  Aggregate  (cost=40368.62..40368.64 rows=1 width=64) (actual time=9.162..9.163 rows=1 loops=1)
        ->  CTE Scan on cte cte_1  (cost=0.00..35883.22 rows=1794161 width=0) (actual time=0.001..7.846 rows=37641 loops=1)
Planning Time: 1.464 ms
JIT:
  Functions: 12
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.109 ms, Inlining 53.273 ms, Optimization 104.714 ms, Emission 83.744 ms, Total 242.840 ms
Execution Time: 40486.784 ms

Visualize query plan

Example with count + faceted search (use HNSW index = no)

documents table has author column. In this example i want to count how many documents has each author.

EXPLAIN ANALYZE
WITH cte AS
(
    SELECT *
    FROM documents
    WHERE doc_emb <#> $1 < 0.2
)
(
    SELECT
        'doc_result' AS Section,
        json_build_object( 'id', id, 'src', src ) AS JSON_Value
    FROM cte
    ORDER BY doc_emb <#> $1 asc
    LIMIT 5
    OFFSET 0
)
UNION ALL
(
    SELECT
        'num_documents' AS Section,
        json_build_object('count', COUNT(*)) AS JSON_Value
    FROM cte
)
UNION ALL
(
    -- faceted search info of author field
    SELECT
        'author' AS Section,
        json_build_object(
            'value', author,
            'count', COUNT(*)
        ) AS JSON_Value
    FROM cte
    GROUP BY author
    ORDER BY COUNT(*) DESC
    LIMIT 20
)
-- more faceted search fields here...
;
Append  (cost=1110658.65..1195889.58 rows=26 width=64) (actual time=45792.299..45828.560 rows=26 loops=1)
  CTE cte
    ->  Seq Scan on documents  (cost=0.00..1036004.26 rows=1794161 width=1208) (actual time=1.415..45124.253 rows=37641 loops=1)
          Filter: ((doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16) < '0.2'::double precision)
          Rows Removed by Filter: 5378463
  ->  Subquery Scan on "*SELECT* 1_1"  (cost=74654.39..74654.46 rows=5 width=64) (actual time=45792.293..45792.299 rows=5 loops=1)
        ->  Limit  (cost=74654.39..74654.41 rows=5 width=68) (actual time=45792.231..45792.234 rows=5 loops=1)
              ->  Sort  (cost=74654.39..79139.80 rows=1794161 width=68) (actual time=45449.864..45449.866 rows=5 loops=1)
                    Sort Key: ((cte.doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16))
                    Sort Method: top-N heapsort  Memory: 27kB
                    ->  CTE Scan on cte  (cost=0.00..44854.03 rows=1794161 width=68) (actual time=1.470..45429.721 rows=37641 loops=1)
  ->  Aggregate  (cost=40368.62..40368.64 rows=1 width=64) (actual time=11.863..11.863 rows=1 loops=1)
        ->  CTE Scan on cte cte_1  (cost=0.00..35883.22 rows=1794161 width=0) (actual time=0.002..10.550 rows=37641 loops=1)
  ->  Subquery Scan on "*SELECT* 3"  (cost=44861.85..44862.10 rows=20 width=64) (actual time=24.379..24.383 rows=20 loops=1)
        ->  Limit  (cost=44861.85..44861.90 rows=20 width=104) (actual time=24.365..24.368 rows=20 loops=1)
              ->  Sort  (cost=44861.85..44862.35 rows=200 width=104) (actual time=24.359..24.361 rows=20 loops=1)
                    Sort Key: (count(*)) DESC
                    Sort Method: top-N heapsort  Memory: 28kB
                    ->  HashAggregate  (cost=44854.02..44856.52 rows=200 width=104) (actual time=18.820..23.519 rows=5304 loops=1)
                          Group Key: cte_2.author
                          Batches: 1  Memory Usage: 737kB
                          ->  CTE Scan on cte cte_2  (cost=0.00..35883.22 rows=1794161 width=32) (actual time=0.001..11.229 rows=37641 loops=1)
Planning Time: 1.583 ms
JIT:
  Functions: 21
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.918 ms, Inlining 58.359 ms, Optimization 163.785 ms, Emission 120.338 ms, Total 344.398 ms
Execution Time: 45850.317 ms

Visualize query plan

Side note 1

In all queries i pass the query document embedding as a postgre binary parameter and I use $1 to reference it in the query.

Every query has 2 parts:

  • The filtering part in the common CTE: WHERE doc_emb <#> $1 < 0.2
  • The sorting part in the "doc_results" section: ORDER BY doc_emb <#> $1 asc

I want to know also if the doc_emb <#> $1 computation is reused because i have the same query embedding ($1) in both parts. Or is recomended to precompute a distance column ?

Side note 2

OFFSET 0 is neccesary becase i have pagination and this is page 1, for other pages the OFFSET value will be different. If you came up with other pagination strategies I will appreciete it

@VoVAllen
Copy link
Member

I think the major reason is SELECT * FROM documents WHERE doc_emb <#> $1 < 0.2 cannot go through the index. This is a known limitation of postgres. Our solution is to introduce a new operator sphere. So can you try

SELECT *  FROM documents WHERE doc_emb <<#>> sphere($1, 0.2) ORDER BY doc_emb <#> $1;

in your case? WHERE doc_emb <<#>> sphere($1, 0.2) is used to filter the vector within range 0.2 to make it work with postgres's index.

@VoVAllen
Copy link
Member

pgvector also had a similar discussion on the distance filter problem at pgvector/pgvector#678

@javiabellan
Copy link
Author

javiabellan commented Nov 27, 2024

Thanks, I will try that. I didn't know about the <<#>> operator.


Update

The <<#>> operator did indeed force to use the index, however the whole query is not as fast as expected (takes 2.5s). The same query without counting and without author facet takes 2.569 ms (1000x faster). Any idea to improve the performance ?

Here is the full query:

EXPLAIN ANALYZE
WITH cte AS (
    SELECT *
    FROM documents
    WHERE doc_emb <<#>> sphere('[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16, -0.2)
)
(
    SELECT
        'doc_result' AS Section,
        json_build_object( 'id', id, 'src', src ) AS JSON_Value
    FROM cte
    ORDER BY
        doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16
        asc
    LIMIT 5
    OFFSET 0
)
UNION ALL
(
    SELECT
        'num_documents' AS Section,
        json_build_object('count', COUNT(*)) AS JSON_Value
    FROM cte
)
UNION ALL
(
    -- faceted search info of author field
    SELECT
        'author' AS Section,
        json_build_object(
            'value', author,
            'count', COUNT(*)
        ) AS JSON_Value
    FROM cte
    GROUP BY author
    ORDER BY COUNT(*) DESC
    LIMIT 20
);

Output:

Append  (cost=1122406.54..1251274.29 rows=26 width=64) (actual time=2453.094..2484.139 rows=26 loops=1)
  CTE cte
    ->  Index Scan using hnsw_index on documents  (cost=0.00..1009526.62 rows=2712831 width=1204) (actual time=3.987..1526.692 rows=36742 loops=1)
          Index Cond: (doc_emb <<#>> '("[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]",-0.2)'::sphere_vecf16)
  ->  Subquery Scan on "*SELECT* 1_1"  (cost=112879.92..112879.99 rows=5 width=64) (actual time=2453.092..2453.097 rows=5 loops=1)
        ->  Limit  (cost=112879.92..112879.94 rows=5 width=68) (actual time=2453.060..2453.064 rows=5 loops=1)
              ->  Sort  (cost=112879.92..119662.00 rows=2712831 width=68) (actual time=2147.096..2147.097 rows=5 loops=1)
                    Sort Key: ((cte.doc_emb <#> '[0.010192871, -0.001909256, ..., -0.033477783, -0.015335083]'::vecf16))
                    Sort Method: top-N heapsort  Memory: 26kB
                    ->  CTE Scan on cte  (cost=0.00..67820.78 rows=2712831 width=68) (actual time=4.075..2134.694 rows=36742 loops=1)
  ->  Aggregate  (cost=61038.70..61038.71 rows=1 width=64) (actual time=9.371..9.371 rows=1 loops=1)
        ->  CTE Scan on cte cte_1  (cost=0.00..54256.62 rows=2712831 width=0) (actual time=0.001..8.034 rows=36742 loops=1)
  ->  Subquery Scan on "*SELECT* 3"  (cost=67828.60..67828.85 rows=20 width=64) (actual time=21.657..21.662 rows=20 loops=1)
        ->  Limit  (cost=67828.60..67828.65 rows=20 width=104) (actual time=21.650..21.652 rows=20 loops=1)
              ->  Sort  (cost=67828.60..67829.10 rows=200 width=104) (actual time=21.633..21.634 rows=20 loops=1)
                    Sort Key: (count(*)) DESC
                    Sort Method: top-N heapsort  Memory: 28kB
                    ->  HashAggregate  (cost=67820.78..67823.28 rows=200 width=104) (actual time=16.220..20.829 rows=5197 loops=1)
                          Group Key: cte_2.author
                          Batches: 1  Memory Usage: 737kB
                          ->  CTE Scan on cte cte_2  (cost=0.00..54256.62 rows=2712831 width=32) (actual time=0.001..8.193 rows=36742 loops=1)
Planning Time: 1.276 ms
JIT:
  Functions: 21
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.568 ms, Inlining 55.023 ms, Optimization 141.159 ms, Emission 109.854 ms, Total 307.604 ms
Execution Time: 2505.256 ms

Visualize plan

Captura de pantalla 2024-11-27 a las 20 34 33

Possible ideas of improvement:

  • Make range search faster with sphere operator. I think range search is the issue here because is not well optimized.
  • Choose another index algorithm like ivf or even flat that are better for range search (where thousand of vectors are retrieved) as oppose to hnsw that is designed for fast top-k search (with small k) im not sure about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/question 🙋 Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants