Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validate-index: implement a function to validate index data structures #208

Merged
merged 17 commits into from
Oct 25, 2023

Conversation

medvied
Copy link
Contributor

@medvied medvied commented Oct 17, 2023

Example:

CREATE EXTENSION lantern;
CREATE TABLE small_world (
    id SERIAL PRIMARY KEY,
    v REAL[2]
);
INSERT INTO small_world (v) VALUES ('{0,0,1}'), ('{0,1,0}');
CREATE INDEX ON small_world USING hnsw (v);

SELECT _lantern_internal.validate_index('small_world_v_idx');

The output of the last command:

INFO:  validate_index() start for small_world_v_idx with Oid=25770
INFO:  index_header = HnswIndexHeaderPage(version=1 vector_dim=3 m=16 ef_construction=128 ef=64 metric_kind=1 num_vectors=2 last_data_block=2 blockmap_page_groups=0)
INFO:  blocks_nr=3 nodes_nr=2
INFO:  blocks for: header 1 blockmap 1 nodes 1
INFO:  nodes per block: last block 2
INFO:  level=0: nodes 2 directed neighbor edges 2 min neighbors 1 max neighbors 1
INFO:  validate_index() done, no issues found.
 validate_index 
----------------
 
(1 row)

To see the indexes that could be passed to the function:

postgres=# \d small_world;
                            Table "public.small_world"
 Column |  Type   | Collation | Nullable |                 Default
--------+---------+-----------+----------+-----------------------------------------
 id     | integer |           | not null | nextval('small_world_id_seq'::regclass)
 v      | real[]  |           |          |
Indexes:
    "small_world_pkey" PRIMARY KEY, btree (id)
    "small_world_v_idx" hnsw (v)

This patch also adds validate_index() calls to the existing tests.

Example:

```
CREATE EXTENSION lantern;
CREATE TABLE small_world (
    id SERIAL PRIMARY KEY,
    v REAL[2]
);
INSERT INTO small_world (v) VALUES ('{0,0,1}'), ('{0,1,0}');
CREATE INDEX ON small_world USING hnsw (v);

SELECT _lantern_internal.validate_index('small_world_v_idx');
```

The output of the last command:
```
INFO:  validate_index() start for small_world_v_idx
INFO:  index_header = HnswIndexHeaderPage(version=1 vector_dim=3 m=16 ef_construction=128 ef=64 metric_kind=1 num_vectors=2 last_data_block=2 blockmap_page_groups=0)
INFO:  blocks_nr=3 nodes_nr=2
INFO:  blocks for: header 1 blockmap 1 nodes 1
INFO:  nodes per block: last block 2
INFO:  level=0: nodes 2 directed neighbor edges 2 min neighbors 1 max neighbors 1
INFO:  validate_index() done, no issues found.
 validate_index
----------------

(1 row)
```

To see the indexes that could be passed to the function:
```
postgres=# \d small_world;
                            Table "public.small_world"
 Column |  Type   | Collation | Nullable |                 Default
--------+---------+-----------+----------+-----------------------------------------
 id     | integer |           | not null | nextval('small_world_id_seq'::regclass)
 v      | real[]  |           |          |
Indexes:
    "small_world_pkey" PRIMARY KEY, btree (id)
    "small_world_v_idx" hnsw (v)
```

This patch also adds the validate_index() call to existing tests.
Because of use of RNG in hnsw_generate_new_level() the number of levels
in the newly INSERTed nodes is not deterministic, and validate_index()
output may change between runs, because it prints the number of nodes
for each level. If you see a sporadic test failures due to different
validate_index() info output please remove the validate_index() call
from the test.
Another solution would be to add an option validate_index() to tell if
elog() for the additional info is needed.
…ndex

They are compared and are used in the same expressions as other unsigned variables anyway.
There is no good reason for them to be signed.
…bled by default

This is required because some tests are building the HNSW index in a non-deterministic way.
Copy link
Contributor

@Ngalstyan4 Ngalstyan4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! need to do another pass over ldb_vi_analyze_blockmap but overall looks good

#include "hnsw/validate_index.h"

#include <access/heapam.h> /* relation_open */
#include <assert.h> /* assert */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use ldb_invariant in all places where assert is used?
Assertion failures do not propagate to the client, so during development it is harder to see the output of an assert failure than an ldb_invariant failure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


enum ldb_vi_block_type
{
LDB_VI_BLOCK_UNKNOWN,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does vi stand for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validate index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment at the beginning of the file.

return true;
}

#define LDB_VI_READ_NODE_CHUNK(_chunk, _tape, _tape_pos, _tape_size) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the motivation for making this a macro, as opposed to a regular function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To have precise name of whatever is being read in case of a read failure:
247 "Error reading " #_chunk
Also to add a chunk size:
243 size_t _chunk_size = sizeof(_chunk);

Alternatives: to pass a separate string parameter (error-prone and duplicate of _chunk parameter) or just to tell that "node read has failed" without being specific (worse error reporting, longer time to figure out what's wrong etc.). Without a macro the chunk size has to be passed anyway (which would be an error-prone duplicate of code).

I've changed the macro to be just a thing that adds #_chunk and the chunk size to the list of the function parameters. This would significantly simplify the macro with no undesirable side effects. It would even have a benefit that the tape_pos would be updated explicitly in the function, not somewhere inside a macro.

Thanks for highlighting a part of the code that could be substantially improved.

}
}

static bool ldb_vi_read_node_chunk(void *chunk, size_t chunk_size, void *tape, unsigned tape_pos, unsigned tape_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this is never called outside of the macro definition below. Would it make sense to include the macro body in the function definition here and get rid of the synonymous macro?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the macro and posted a rationale for the macro. Please review.

vi_node->vn_neighbors[ level ] = neighbors;
}
/* the vector of floats is at the end */
/* XXX it's not clear why vn_dim != dimensions */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is vn_dim the one from usearch?
I think usearch stores that variable in terms of number of bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code to reflect this.

struct ldb_vi_node *vi_nodes;

/* the code here doesn't change the index, so AccessShareLock is enough */
index = relation_open(indrelid, AccessShareLock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there validation necessary here?
what happens when the function is called with an invalid Oid? or with an Oid of a regular table and not an index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there validation necessary here?

Not sure. Do you have something specific in mind?
There is no error code returned and it seems like the function reports failures in form of elog(ERROR, ...).

what happens when the function is called with an invalid Oid?

ERROR: could not open relation with OID 31251

or with an Oid of a regular table and not an index?

ERROR: Invalid HnswIndexHeaderPage.magicNumber (page 0, got 7c9fc0, expected a47e20db)

There is also a case when a non-existent relation is passed as the first parameter to validate_index() in SQL:

SELECT _lantern_internal.validate_index('small_world_nonexistent_index', false);
ERROR:  relation "small_world_nonexistent_index" does not exist at character 41

index_header->blockmap_page_groups);
}

blocks_nr = RelationGetNumberOfBlocksInFork(index, MAIN_FORKNUM);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be INIT_FORKNUM, once we implement support for unlogged tables?
is it possible to read from the opened relation whether the relation is on MAIN or INIT fork?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be INIT_FORKNUM, once we implement support for unlogged tables?

It could. I can add the fork number as a parameter for this function. Or check for INIT fork and use it if it exists. What approach would you like to see here?

is it possible to read from the opened relation whether the relation is on MAIN or INIT fork?

It is. It's just a matter of passing a parameter across the code. Do you want me to add the parameter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no. it is good as is for now! just confirming that adding unlogged tables would not add a lot of work for this function

};

/* represents a stored usearch node */
struct ldb_vi_node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: assumes internal state about usearch.
COuld you write a comment about this?
We will soon move to usearch v3 and may need to come update this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note.

@Ngalstyan4
Copy link
Contributor

@medvied , Could you rebase the PR so recently merged ASAN and update tests will run on this branch?

@medvied
Copy link
Contributor Author

medvied commented Oct 25, 2023

@Ngalstyan4: done.

@Ngalstyan4 Ngalstyan4 merged commit 8788729 into lanterndata:main Oct 25, 2023
@medvied medvied deleted the validate-index branch October 25, 2023 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants