
Dynamically scale usearch index on construction #122

Merged: 4 commits into lanterndata:main, Sep 10, 2023

Conversation

ezra-varady (Collaborator)

This replaces the fixed-size allocation in BuildIndex with dynamic scaling during tuple insertion. It alters cost estimates slightly, but not significantly.
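
The grow-on-insert pattern, pieced together from the diff snippets quoted later in this thread (error handling abbreviated):

    usearch_error_t error = NULL;
    // Before each insertion: if the index is at capacity, double the reservation
    size_t capacity = usearch_capacity(buildstate->usearch_index, &error);
    if(capacity == usearch_size(buildstate->usearch_index, &error)) {
        usearch_reserve(buildstate->usearch_index, 2 * capacity, &error);
    }
    usearch_add(buildstate->usearch_index, label, vector, usearch_scalar, &error);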

var77 (Collaborator) commented Sep 7, 2023

Hi @ezra-varady thanks for the improvement!
I have an idea about the initial index size reservation. Maybe we can estimate the row count and set the initial index size based on that?
I wrote a simple PoC and tested it on 1-6k rows; it seems to work okay, but I am not sure about bigger data sizes and the cost of these operations.

    BlockNumber numBlocks = RelationGetNumberOfBlocks(heap);
    uint32_t    estimated_row_count = 0;
    if(numBlocks > 0) {
        // Read the first block
        Buffer buffer = ReadBufferExtended(heap, MAIN_FORKNUM, 0, RBM_NORMAL, NULL);
        // Lock buffer so there won't be any new writes during this operation
        LockBuffer(buffer, BUFFER_LOCK_SHARE);
        // This is like converting block buffer to Page struct
        Page         page = BufferGetPage(buffer);
        // Getting the maximum tuple index on the page
        OffsetNumber offset = PageGetMaxOffsetNumber(page);
        // Estimating the row count in the table.
        // There are 3 cases:
        // 1 - new data is added and numBlocks increases. In this case the estimate will be lower than the
        //     actual number (we need to check and grow the index)
        // 2 - the last page is not fully occupied (the most likely case). In this case the estimate will be
        //     a bit higher than the actual number
        // 3 - the last page is fully occupied and we get exactly the number of rows in the table (a very
        //     rare case, I think)
        estimated_row_count = offset * numBlocks;
        // Unlock and release buffer
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buffer);
    }
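
For context, this estimate then sizes the initial reservation (see the diff quoted below):

    usearch_reserve(buildstate->usearch_index, estimated_row_count, &error);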

cc: @Ngalstyan4

ezra-varady (Collaborator, Author)

Good idea! Let me incorporate this

Ngalstyan4 (Contributor) left a comment:

Good work!

src/hnsw/build.c (outdated):
estimated_row_count = offset * numBlocks;
// Unlock and release buffer
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buffer);
Ngalstyan4 (Contributor):

There is a function called `UnlockReleaseBuffer` that combines these two. That should be preferred.
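
For reference, `UnlockReleaseBuffer` (declared in Postgres' storage/bufmgr.h) collapses the pair into one call:

    // Equivalent to LockBuffer(buffer, BUFFER_LOCK_UNLOCK); ReleaseBuffer(buffer);
    UnlockReleaseBuffer(buffer);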

src/hnsw/build.c (outdated):
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buffer);
}
usearch_reserve(buildstate->usearch_index, estimated_row_count, &error);
assert(error == NULL);
Ngalstyan4 (Contributor):

Can you change this to `if (error != NULL) elog(ERROR, ...)`? Make sure any necessary cleanup is done before the elog (e.g. usearch_index probably needs to be closed).

The reason we treat this particular error differently:
This call likely triggers a large memory allocation, so it is more likely than most to fail.
We will soon deal with such failures in a more comprehensive manner, but since this one is particularly likely, we should expect error to be non-NULL.
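
A minimal sketch of the suggested pattern; `usearch_free` as the cleanup call is an assumption here, and the exact cleanup depends on what buildstate holds at this point:

    usearch_reserve(buildstate->usearch_index, estimated_row_count, &error);
    if(error != NULL) {
        // elog(ERROR, ...) performs a non-local exit, so release the index first to avoid a leak
        usearch_error_t free_error = NULL;
        usearch_free(buildstate->usearch_index, &free_error);
        elog(ERROR, "failed to reserve memory for usearch index: %s", error);
    }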

src/hnsw/build.c (outdated):
if(buildstate->usearch_index != NULL) usearch_add(buildstate->usearch_index, label, vector, usearch_scalar, &error);
if(buildstate->usearch_index != NULL) {
size_t capacity;
capacity = usearch_capacity(buildstate->usearch_index, &error);
Ngalstyan4 (Contributor):

Do the declaration and assignment on the same line.
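
That is:

    size_t capacity = usearch_capacity(buildstate->usearch_index, &error);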

src/hnsw/build.c (outdated):
// 1 - new data is added and numBlocks increases. In this case the estimate will be lower than the actual
//     number (we need to check and grow the index)
// 2 - the last page is not fully occupied (the most likely case). In this case the estimate will be a bit
//     higher than the actual number
// 3 - the last page is fully occupied and we get exactly the number of rows in the table (a very rare
//     case, I think)
Ngalstyan4 (Contributor):

In case 1, do you mean new blocks are added between usearch_reserve() and the index insertions?

And there actually are a lot more cases, no? E.g. some of the tuples in the blocks could be deleted but not vacuumed, some blocks could hold variable-length TOASTed data so the offset-based heuristic above would fail, etc.

var77 (Collaborator):

Yes, actually true: for deleted but not vacuumed rows we will allocate somewhat more memory than needed, and the TOASTed rows will be handled correctly by the resizing logic, I guess.

size_t capacity;
capacity = usearch_capacity(buildstate->usearch_index, &error);
if(capacity == usearch_size(buildstate->usearch_index, &error)) {
usearch_reserve(buildstate->usearch_index, 2 * capacity, &error);
var77 (Collaborator):

Would reserving (capacity + the estimated size of one page) instead of doubling give some benefit here?
@Ngalstyan4 @ezra-varady

ezra-varady (Collaborator, Author) commented Sep 8, 2023

Fixed the issues mentioned above. @var77 I'm not sure; can you say more? Also, I think Varik's code is accurate enough that we never have to resize after the first guess in the current tests, so we may want to find a way to force this case.

var77 (Collaborator) commented Sep 8, 2023

> Fixed the issues mentioned above. @var77 I'm not sure; can you say more? Also, I think Varik's code is accurate enough that we never have to resize after the first guess in the current tests, so we may want to find a way to force this case.

I was thinking that if the estimates are accurate we may be off by at most one page, but I just talked with Narek and it seems TOASTed columns may work differently, so we should check how the estimates behave there. I will try to read about TOAST columns and run some tests during the day.

Ngalstyan4 merged commit caa3252 into lanterndata:main on Sep 10, 2023.