validate-index: Implement a function to validate index data structures (#208)

* validate-index: Implement a function to validate index data structures

Example:

```
CREATE EXTENSION lantern;
CREATE TABLE small_world (
    id SERIAL PRIMARY KEY,
    v REAL[3]
);
INSERT INTO small_world (v) VALUES ('{0,0,1}'), ('{0,1,0}');
CREATE INDEX ON small_world USING hnsw (v);

SELECT _lantern_internal.validate_index('small_world_v_idx');
```

The output of the last command:
```
INFO:  validate_index() start for small_world_v_idx
INFO:  index_header = HnswIndexHeaderPage(version=1 vector_dim=3 m=16 ef_construction=128 ef=64 metric_kind=1 num_vectors=2 last_data_block=2 blockmap_page_groups=0)
INFO:  blocks_nr=3 nodes_nr=2
INFO:  blocks for: header 1 blockmap 1 nodes 1
INFO:  nodes per block: last block 2
INFO:  level=0: nodes 2 directed neighbor edges 2 min neighbors 1 max neighbors 1
INFO:  validate_index() done, no issues found.
 validate_index
----------------

(1 row)
```

To see the indexes that could be passed to the function:
```
postgres=# \d small_world;
                            Table "public.small_world"
 Column |  Type   | Collation | Nullable |                 Default
--------+---------+-----------+----------+-----------------------------------------
 id     | integer |           | not null | nextval('small_world_id_seq'::regclass)
 v      | real[]  |           |          |
Indexes:
    "small_world_pkey" PRIMARY KEY, btree (id)
    "small_world_v_idx" hnsw (v)
```

This patch also adds the validate_index() call to existing tests.
Because hnsw_generate_new_level() uses an RNG, the number of levels
assigned to newly INSERTed nodes is not deterministic, so validate_index()
output may change between runs: it prints the number of nodes
for each level. If you see sporadic test failures due to different
validate_index() info output, please remove the validate_index() call
from the test.
Another solution would be to add an option to validate_index() that
controls whether the additional info is printed with elog().
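
To illustrate why the per-level node counts can differ between runs, here is a minimal sketch of randomized level assignment in the style commonly used for HNSW graphs. It is not the code of hnsw_generate_new_level(): the names sketch_next_u32 and sketch_generate_level, the xorshift generator, and the 1/m promotion probability are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: xorshift32 pseudo-random generator.
 * The state must be non-zero. */
static uint32_t sketch_next_u32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical helper: draw a level for a new node.
 * Each additional level is kept with probability of roughly 1/m
 * (assumption; the modulo introduces a negligible bias), so the
 * result depends on the generator state and therefore on the seed. */
static int sketch_generate_level(uint32_t *state, uint32_t m, int max_level)
{
    int level = 0;

    if (m == 0)
        return 0;
    while (level < max_level && sketch_next_u32(state) % m == 0)
        level++;
    return level;
}
```

With a fixed seed the sequence of levels is reproducible; with a seed that varies between runs, the number of nodes on each level varies too, which is what makes the detailed validate_index() output unstable in tests.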

* src/hnsw/validate_index: run clang-format

* src/hnsw/validate_index: use unsigned batch_size and group_node_first_index
They are compared with, and used in the same expressions as, other unsigned variables anyway.
There is no good reason for them to be signed.

* src/hnsw/validate_index: change PRIu64 to ul

Reference: https://gitlab.com/wireshark/wireshark/-/issues/17895

* src/hnsw/validate_index: remove dangling " " after clang-format

* src/hnsw/validate_index: include access/heapam.h instead of access/relation.h for PostgreSQL 11

* src/hnsw/validate_index: clang-format

* src/hnsw/validate_index: make elog(INFO, ...) prints optional and enabled by default

This is required because some tests are building the HNSW index in a non-deterministic way.

* test: make validate_index() output deterministic

* src/hnsw/validate_index: use ldb_invariant() instead of assert()

* src/hnsw/validate_index: reduce the scope of what's done in LDB_VI_READ_NODE_CHUNK() macro

* src/hnsw/validate_index: validate vn_dim properly

* src/hnsw/validate_index: add a comment about assumptions and storage format for struct ldb_vi_node

* src/hnsw/validate_index: describe what vi here is

* src/hnsw/validate_index: cast ldb_HnswGetM() to uint32 to compare with HnswIndexHeaderPage.m

* use FirstOffsetNumber and OffsetNumberNext() in the loop over page
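
The last item refers to the usual PostgreSQL idiom for walking the line pointers of a page. A minimal standalone sketch follows. The typedef and the two macros are local stand-ins that mirror the definitions in PostgreSQL's storage/off.h so the example compiles on its own, and sketch_count_items is a hypothetical name. In extension code the real headers would be included instead, and the upper bound would typically come from PageGetMaxOffsetNumber(page).

```c
#include <stdint.h>

/* Local stand-ins mirroring PostgreSQL's storage/off.h (assumption:
 * used here only so that the sketch is self-contained). */
typedef uint16_t OffsetNumber;
#define FirstOffsetNumber ((OffsetNumber) 1)
#define OffsetNumberNext(offsetNumber) ((OffsetNumber) (1 + (offsetNumber)))

/* Hypothetical helper: visit every item offset on a page whose
 * highest used offset is maxoff, and count the visited items.
 * Offsets start at 1, so maxoff == 0 means an empty page. */
static uint32_t sketch_count_items(OffsetNumber maxoff)
{
    uint32_t     items_nr = 0;
    OffsetNumber off;

    for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
        items_nr++;
    return items_nr;
}
```

Using the named constant and macro instead of a bare 1 and off++ keeps the loop consistent with the rest of the PostgreSQL code base and makes the 1-based offset convention explicit.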
medvied authored Oct 25, 2023
1 parent 05336cf commit 8788729
Showing 26 changed files with 1,003 additions and 7 deletions.
4 changes: 4 additions & 0 deletions sql/lantern.sql
@@ -30,6 +30,10 @@ CREATE OPERATOR <-> (
);

CREATE SCHEMA _lantern_internal;

CREATE FUNCTION _lantern_internal.validate_index(index regclass, print_info boolean DEFAULT true) RETURNS VOID
AS 'MODULE_PATHNAME', 'lantern_internal_validate_index' LANGUAGE C STABLE STRICT PARALLEL UNSAFE;

-- operator classes
CREATE OR REPLACE FUNCTION _lantern_internal._create_ldb_operator_classes(access_method_name TEXT) RETURNS BOOLEAN AS $$
DECLARE
11 changes: 11 additions & 0 deletions src/hnsw.c
@@ -19,6 +19,7 @@
#include "hnsw/options.h"
#include "hnsw/scan.h"
#include "hnsw/utils.h"
#include "hnsw/validate_index.h"
#include "hnsw/vector.h"
#include "usearch.h"

@@ -358,6 +359,16 @@ Datum vector_l2sq_dist(PG_FUNCTION_ARGS)
PG_RETURN_FLOAT8((double)vector_dist(a, b, usearch_metric_l2sq_k));
}

PGDLLEXPORT PG_FUNCTION_INFO_V1(lantern_internal_validate_index);
Datum lantern_internal_validate_index(PG_FUNCTION_ARGS)
{
Oid indrelid = PG_GETARG_OID(0);
bool print_info = PG_GETARG_BOOL(1);

ldb_validate_index(indrelid, print_info);
PG_RETURN_VOID();
}

/*
* Get data type for give oid
* */
687 changes: 687 additions & 0 deletions src/hnsw/validate_index.c

Large diffs are not rendered by default.

25 changes: 25 additions & 0 deletions src/hnsw/validate_index.h
@@ -0,0 +1,25 @@
#ifndef LDB_HNSW_VALIDATE_INDEX_H
#define LDB_HNSW_VALIDATE_INDEX_H

#include <postgres.h>

/*
* This function checks integrity of the data structures in the index relation.
*
* How it works:
* - it creates ldb_vi_block for each block of the index relation and
* ldb_vi_node for each node inside the index relation;
* - it loads all blockmap groups and analyzes mappings between nodes and
* blocks;
* - it loads all the nodes with their neighbors;
* - it also prints statistics about blocks and nodes, which is useful for
* understanding of what's inside the index;
* - it assumes that PostgreSQL-level data structures are intact (i.e. the page
* header and the mapping between offsets and items is correct for each page);
* - in case if a corruption of the data structure is found the function prints
* an error message with details about the place and surrounding data
* structures.
*/
void ldb_validate_index(Oid indrelid, bool print_info);

#endif
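
Since the diff of validate_index.c is not rendered here, the following is only a rough sketch of the kind of check and statistics the comment above describes for a single level of the graph. It assumes a simplified in-memory layout, not the on-disk format used by the index. The types sketch_node and sketch_level_stats and the function sketch_validate_level are hypothetical names that do not come from the patch.

```c
#include <stdint.h>

/* Hypothetical in-memory view of one node on one level. */
typedef struct
{
    const uint32_t *neighbors;    /* ids of the node's neighbors */
    uint32_t        neighbors_nr; /* number of entries in neighbors */
} sketch_node;

/* Hypothetical per-level statistics, similar in spirit to the
 * "level=0: nodes ... directed neighbor edges ..." info line. */
typedef struct
{
    uint32_t nodes_nr;
    uint64_t edges_nr;
    uint32_t min_neighbors;
    uint32_t max_neighbors;
} sketch_level_stats;

/* Check that every neighbor id refers to an existing node and collect
 * statistics. Returns 0 if no issue is found, -1 on the first neighbor
 * id that is out of range. */
static int sketch_validate_level(const sketch_node *nodes, uint32_t nodes_nr, sketch_level_stats *stats)
{
    stats->nodes_nr = nodes_nr;
    stats->edges_nr = 0;
    stats->min_neighbors = nodes_nr > 0 ? UINT32_MAX : 0;
    stats->max_neighbors = 0;

    for (uint32_t i = 0; i < nodes_nr; i++) {
        const sketch_node *node = &nodes[i];

        for (uint32_t j = 0; j < node->neighbors_nr; j++) {
            if (node->neighbors[j] >= nodes_nr)
                return -1; /* dangling reference: corrupted graph */
        }
        stats->edges_nr += node->neighbors_nr;
        if (node->neighbors_nr < stats->min_neighbors)
            stats->min_neighbors = node->neighbors_nr;
        if (node->neighbors_nr > stats->max_neighbors)
            stats->max_neighbors = node->neighbors_nr;
    }
    return 0;
}
```

For two nodes that list each other as their only neighbor, this sketch reports 2 nodes, 2 directed edges, and a minimum and maximum of 1 neighbor, which has the same shape as the level=0 line in the example output of the commit message.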
25 changes: 21 additions & 4 deletions test/expected/ext_relocation.out
@@ -34,14 +34,15 @@ WHERE d.deptype = 'e' AND e.extname = 'lantern'
ORDER BY 1, 3;
extschema | proname | proschema
-----------+------------------------------+-------------------
schema1 | validate_index | _lantern_internal
schema1 | _create_ldb_operator_classes | _lantern_internal
schema1 | ldb_generic_dist | schema1
schema1 | ldb_generic_dist | schema1
schema1 | l2sq_dist | schema1
schema1 | hnsw_handler | schema1
schema1 | cos_dist | schema1
schema1 | hamming_dist | schema1
schema1 | l2sq_dist | schema1
(7 rows)
schema1 | cos_dist | schema1
schema1 | ldb_generic_dist | schema1
(8 rows)

-- show all the extension operators
SELECT ne.nspname AS extschema, op.oprname, np.nspname AS proschema
@@ -70,6 +71,14 @@ CREATE INDEX hnsw_index ON small_world USING hnsw(v) WITH (dim=3);
INFO: done init usearch index
INFO: inserted 8 elements
INFO: done saving 8 vectors
SELECT _lantern_internal.validate_index('hnsw_index', false);
INFO: validate_index() start for hnsw_index
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

\set ON_ERROR_STOP off
-- lantern does not support relocation.
-- Postgres will not allow it to support this since its objects span over more than one schema
@@ -99,6 +108,14 @@ CREATE INDEX hnsw_index2 ON small_world USING hnsw(v) WITH (dim=3);
INFO: done init usearch index
INFO: inserted 8 elements
INFO: done saving 8 vectors
SELECT _lantern_internal.validate_index('hnsw_index2', false);
INFO: validate_index() start for hnsw_index2
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

\set ON_ERROR_STOP off
-- extension function cannot be found without schema-qualification
SELECT l2sq_dist(ARRAY[1.0, 2.0, 3.0], ARRAY[4.0, 5.0, 6.0]);
9 changes: 9 additions & 0 deletions test/expected/hnsw_config.out
@@ -52,3 +52,12 @@ SHOW hnsw.init_k;
10
(1 row)

-- Validate the index data structures
SELECT _lantern_internal.validate_index('small_world_v_idx', false);
INFO: validate_index() start for small_world_v_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

9 changes: 9 additions & 0 deletions test/expected/hnsw_correct.out
@@ -47,3 +47,12 @@ WHERE
---------+---------------+------------------+-----------------+--------------------
(0 rows)

-- Validate the index data structures
SELECT _lantern_internal.validate_index('small_world_v_idx', false);
INFO: validate_index() start for small_world_v_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

32 changes: 32 additions & 0 deletions test/expected/hnsw_cost_estimate.out
@@ -69,6 +69,14 @@ DEBUG: LANTERN - ---------------------
t
(1 row)

SELECT _lantern_internal.validate_index('empty_idx', false);
INFO: validate_index() start for empty_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

DROP INDEX empty_idx;
-- Case 1, more data in index.
-- Should see higher cost than Case 0.
@@ -89,6 +97,14 @@ DEBUG: LANTERN - ---------------------
t
(1 row)

SELECT _lantern_internal.validate_index('hnsw_idx', false);
INFO: validate_index() start for hnsw_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

DROP INDEX hnsw_idx;
-- Case 2, higher M.
-- Should see higher cost than Case 1.
@@ -109,6 +125,14 @@ DEBUG: LANTERN - ---------------------
t
(1 row)

SELECT _lantern_internal.validate_index('hnsw_idx', false);
INFO: validate_index() start for hnsw_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

DROP INDEX hnsw_idx;
-- Case 3, higher ef.
-- Should see higher cost than Case 2.
@@ -129,4 +153,12 @@ DEBUG: LANTERN - ---------------------
t
(1 row)

SELECT _lantern_internal.validate_index('hnsw_idx', false);
INFO: validate_index() start for hnsw_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

DROP INDEX hnsw_idx;
24 changes: 24 additions & 0 deletions test/expected/hnsw_create.out
@@ -34,6 +34,14 @@ SELECT * FROM ldb_get_indexes('sift_base1k');
sift_base1k_v_idx | 632 kB | CREATE INDEX sift_base1k_v_idx ON public.sift_base1k USING hnsw (v) WITH (dim='128', m='4') | 632 kB
(1 row)

SELECT _lantern_internal.validate_index('sift_base1k_v_idx', false);
INFO: validate_index() start for sift_base1k_v_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

-- Validate that index creation works with a larger number of vectors
\ir utils/sift10k_array.sql
CREATE TABLE IF NOT EXISTS sift_base10k (
@@ -54,6 +62,14 @@ EXPLAIN (COSTS FALSE) SELECT * FROM sift_base10k order by v <-> :'v4444' LIMIT 1
Order By: (v <-> '{55,61,11,4,5,2,13,24,65,49,13,9,23,37,94,38,54,11,14,14,40,31,50,44,53,4,0,0,27,17,8,34,12,10,4,4,22,52,68,53,9,2,0,0,2,116,119,64,119,2,0,0,2,30,119,119,116,5,0,8,47,9,5,60,7,7,10,23,56,50,23,5,28,68,6,18,24,65,50,9,119,75,3,0,1,8,12,85,119,11,4,6,8,9,5,74,25,11,8,20,18,12,2,21,11,90,25,32,33,15,2,9,84,67,8,4,22,31,11,33,119,30,3,6,0,0,0,26}'::real[])
(3 rows)

SELECT _lantern_internal.validate_index('hnsw_idx', false);
INFO: validate_index() start for hnsw_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

--- Validate that M values inside the allowed range [2, 128] do not throw an error
CREATE INDEX ON small_world USING hnsw (v) WITH (M=2);
INFO: done init usearch index
@@ -117,6 +133,14 @@ INSERT INTO small_world4 (id, vector) VALUES
('000', '{1,0,0,0}'),
('001', '{1,0,0,1}'),
('010', '{1,0,1,0}');
SELECT _lantern_internal.validate_index('small_world4_hnsw_idx', false);
INFO: validate_index() start for small_world4_hnsw_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

-- without the index, I can change the dimension of a vector element
DROP INDEX small_world4_hnsw_idx;
UPDATE small_world4 SET vector = '{0,0,0}' WHERE id = '001';
8 changes: 8 additions & 0 deletions test/expected/hnsw_create_expr.out
@@ -68,6 +68,14 @@ CREATE INDEX ON test_table USING hnsw (int_to_fixed_binary_real_array(id)) WITH
INFO: done init usearch index
INFO: inserted 3 elements
INFO: done saving 3 vectors
SELECT _lantern_internal.validate_index('test_table_int_to_fixed_binary_real_array_idx', false);
INFO: validate_index() start for test_table_int_to_fixed_binary_real_array_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

\set ON_ERROR_STOP off
-- This should result in an error that dimensions does not match
CREATE INDEX ON test_table USING hnsw (int_to_dynamic_binary_real_array(id)) WITH (M=2);
32 changes: 32 additions & 0 deletions test/expected/hnsw_dist_func.out
@@ -239,3 +239,35 @@ SELECT ROUND(hamming_dist(v, '{0,0}')::numeric, 2) FROM extra_small_world_ham OR
4.00
(4 rows)

SELECT _lantern_internal.validate_index('small_world_l2_v_idx', false);
INFO: validate_index() start for small_world_l2_v_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

SELECT _lantern_internal.validate_index('small_world_cos_v_idx', false);
INFO: validate_index() start for small_world_cos_v_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

SELECT _lantern_internal.validate_index('small_world_ham_v_idx', false);
INFO: validate_index() start for small_world_ham_v_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

SELECT _lantern_internal.validate_index('extra_small_world_ham_v_idx', false);
INFO: validate_index() start for extra_small_world_ham_v_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

24 changes: 24 additions & 0 deletions test/expected/hnsw_index_from_file.out
@@ -30,6 +30,14 @@ CREATE INDEX hnsw_l2_index ON sift_base1k USING hnsw (v) WITH (_experimental_ind
INFO: done init usearch index
INFO: done loading usearch index
INFO: done saving 1000 vectors
SELECT _lantern_internal.validate_index('hnsw_l2_index', false);
INFO: validate_index() start for hnsw_l2_index
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

SELECT * FROM ldb_get_indexes('sift_base1k');
indexname | size | indexdef | total_index_size
---------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------+------------------
@@ -94,6 +102,14 @@ CREATE INDEX hnsw_cos_index ON sift_base1k USING hnsw (v) WITH (_experimental_in
INFO: done init usearch index
INFO: done loading usearch index
INFO: done saving 1000 vectors
SELECT _lantern_internal.validate_index('hnsw_cos_index', false);
INFO: validate_index() start for hnsw_cos_index
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

SELECT * FROM ldb_get_indexes('sift_base1k');
indexname | size | indexdef | total_index_size
----------------+--------+------------------------------------------------------------------------------------------------------------------------------------------------+------------------
@@ -142,6 +158,14 @@ CREATE INDEX hnsw_l2_index ON sift_base1k USING hnsw (v) WITH (_experimental_ind
INFO: done init usearch index
INFO: done loading usearch index
INFO: done saving 1000 vectors
SELECT _lantern_internal.validate_index('hnsw_l2_index', false);
INFO: validate_index() start for hnsw_l2_index
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

-- This should not throw error, but the first result will not be 0 as vector 777 is deleted from the table
SELECT ROUND(l2sq_dist(v, :'v777')::numeric, 2) FROM sift_base1k order by v <-> :'v777' LIMIT 10;
round
26 changes: 25 additions & 1 deletion test/expected/hnsw_insert.out
@@ -15,6 +15,14 @@ CREATE INDEX ON small_world USING hnsw (v) WITH (dim=3);
INFO: done init usearch index
INFO: inserted 0 elements
INFO: done saving 0 vectors
SELECT _lantern_internal.validate_index('small_world_v_idx', false);
INFO: validate_index() start for small_world_v_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

-- Insert rows with valid vector data
INSERT INTO small_world (v) VALUES ('{0,0,1}'), ('{0,1,0}');
INSERT INTO small_world (v) VALUES (NULL);
@@ -101,6 +109,14 @@ LIMIT 10;
Order By: (v <-> '{0,0,0}'::real[])
(3 rows)

SELECT _lantern_internal.validate_index('small_world_v_idx', false);
INFO: validate_index() start for small_world_v_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)

-- Test the index with a larger number of vectors
CREATE TABLE sift_base10k (
id SERIAL PRIMARY KEY,
@@ -112,10 +128,18 @@ INFO: inserted 0 elements
INFO: done saving 0 vectors
\COPY sift_base10k (v) FROM '/tmp/lantern/vector_datasets/siftsmall_base_arrays.csv' WITH CSV;
SELECT v AS v4444 FROM sift_base10k WHERE id = 4444 \gset
EXPLAIN (COSTS FALSE) SELECT * FROM sift_base10k order by v <-> :'v4444'
EXPLAIN (COSTS FALSE) SELECT * FROM sift_base10k order by v <-> :'v4444';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using hnsw_idx on sift_base10k
Order By: (v <-> '{55,61,11,4,5,2,13,24,65,49,13,9,23,37,94,38,54,11,14,14,40,31,50,44,53,4,0,0,27,17,8,34,12,10,4,4,22,52,68,53,9,2,0,0,2,116,119,64,119,2,0,0,2,30,119,119,116,5,0,8,47,9,5,60,7,7,10,23,56,50,23,5,28,68,6,18,24,65,50,9,119,75,3,0,1,8,12,85,119,11,4,6,8,9,5,74,25,11,8,20,18,12,2,21,11,90,25,32,33,15,2,9,84,67,8,4,22,31,11,33,119,30,3,6,0,0,0,26}'::real[])
(2 rows)

SELECT _lantern_internal.validate_index('hnsw_idx', false);
INFO: validate_index() start for hnsw_idx
INFO: validate_index() done, no issues found.
validate_index
----------------

(1 row)
