
Dynamically scale usearch index on construction #122

Merged: 4 commits into lanterndata:main, Sep 10, 2023

Conversation

ezra-varady (Collaborator)

This replaces the fixed-size allocation in BuildIndex with dynamic scaling during tuple insertion. It alters cost estimates slightly, but not significantly.
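
The grow-on-insert pattern, pieced together from the diff snippets quoted later in this thread (error handling abbreviated):

    usearch_error_t error = NULL;
    // Before each insertion: if the index is at capacity, double the reservation
    size_t capacity = usearch_capacity(buildstate->usearch_index, &error);
    if(capacity == usearch_size(buildstate->usearch_index, &error)) {
        usearch_reserve(buildstate->usearch_index, 2 * capacity, &error);
    }
    usearch_add(buildstate->usearch_index, label, vector, usearch_scalar, &error);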

var77 (Collaborator) commented Sep 7, 2023

Hi @ezra-varady thanks for the improvement!
I have an idea about the initial index size reservation. Maybe we can estimate the row count and set the initial index size based on that?
I wrote a simple PoC and tested it on 1-6k rows; it seems to work okay, but I am not sure about bigger data sizes and the cost of these operations.

    BlockNumber numBlocks = RelationGetNumberOfBlocks(heap);
    uint32_t    estimated_row_count = 0;
    if(numBlocks > 0) {
        // Read the first block
        Buffer buffer = ReadBufferExtended(heap, MAIN_FORKNUM, 0, RBM_NORMAL, NULL);
        // Lock buffer so there won't be any new writes during this operation
        LockBuffer(buffer, BUFFER_LOCK_SHARE);
        // This is like converting block buffer to Page struct
        Page         page = BufferGetPage(buffer);
        // Getting the maximum tuple index on the page
        OffsetNumber offset = PageGetMaxOffsetNumber(page);
        // Estimating the row count in the table.
        // There are 3 cases:
        // 1 - new data is added and numBlocks increases. In this case the estimate will be lower than the
        //     actual number (we need to check and grow the index)
        // 2 - the last page is not fully occupied (the most likely case). In this case the estimate will be
        //     a bit higher than the actual number
        // 3 - the last page is fully occupied and we get exactly the number of rows in the table (a very
        //     rare case, I think)
        estimated_row_count = offset * numBlocks;
        // Unlock and release buffer
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buffer);
    }
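
For context, this estimate then sizes the initial reservation (see the diff quoted below):

    usearch_reserve(buildstate->usearch_index, estimated_row_count, &error);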

cc: @Ngalstyan4

ezra-varady (Collaborator, Author)

Good idea! Let me incorporate this

Ngalstyan4 (Contributor) left a comment:

Good work!

src/hnsw/build.c (outdated):
estimated_row_count = offset * numBlocks;
// Unlock and release buffer
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buffer);
Ngalstyan4 (Contributor):

There is a function called `UnlockReleaseBuffer` that combines these two. That should be preferred.
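
For reference, `UnlockReleaseBuffer` (declared in Postgres' storage/bufmgr.h) collapses the pair into one call:

    // Equivalent to LockBuffer(buffer, BUFFER_LOCK_UNLOCK); ReleaseBuffer(buffer);
    UnlockReleaseBuffer(buffer);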

src/hnsw/build.c (outdated):
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
ReleaseBuffer(buffer);
}
usearch_reserve(buildstate->usearch_index, estimated_row_count, &error);
assert(error == NULL);
Ngalstyan4 (Contributor):

Can you change this to `if (error != NULL) elog(ERROR, ...)`? Make sure any necessary cleanup is done before the elog (e.g. usearch_index probably needs to be closed).

The reason we treat this particular error differently:
This call likely triggers a large memory allocation, so it is more likely than most to fail.
We will soon deal with such failures in a more comprehensive manner, but since this one is particularly likely, we should expect error to be non-NULL.
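
A minimal sketch of the suggested pattern; `usearch_free` as the cleanup call is an assumption here, and the exact cleanup depends on what buildstate holds at this point:

    usearch_reserve(buildstate->usearch_index, estimated_row_count, &error);
    if(error != NULL) {
        // elog(ERROR, ...) performs a non-local exit, so release the index first to avoid a leak
        usearch_error_t free_error = NULL;
        usearch_free(buildstate->usearch_index, &free_error);
        elog(ERROR, "failed to reserve memory for usearch index: %s", error);
    }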

src/hnsw/build.c (outdated):
if(buildstate->usearch_index != NULL) usearch_add(buildstate->usearch_index, label, vector, usearch_scalar, &error);
if(buildstate->usearch_index != NULL) {
size_t capacity;
capacity = usearch_capacity(buildstate->usearch_index, &error);
Ngalstyan4 (Contributor):

Do the declaration and assignment on the same line.
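
That is:

    size_t capacity = usearch_capacity(buildstate->usearch_index, &error);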

src/hnsw/build.c (outdated):
// 1 - new data is added and numBlocks increases. In this case the estimate will be lower than the actual
//     number (we need to check and grow the index)
// 2 - the last page is not fully occupied (the most likely case). In this case the estimate will be a bit
//     higher than the actual number
// 3 - the last page is fully occupied and we get exactly the number of rows in the table (a very rare
//     case, I think)
Ngalstyan4 (Contributor):

In case 1, do you mean new blocks are added between usearch_reserve() and the index insertions?

And there actually are a lot more cases, no? E.g. some of the tuples in the blocks could be deleted but not vacuumed, some blocks could hold variable-length TOASTed data so the offset-based heuristic above would fail, etc.

var77 (Collaborator):

Yes, actually true: for deleted but not vacuumed rows we will allocate somewhat more memory than needed, and the TOASTed rows will be handled correctly by the resizing logic, I guess.

size_t capacity;
capacity = usearch_capacity(buildstate->usearch_index, &error);
if(capacity == usearch_size(buildstate->usearch_index, &error)) {
usearch_reserve(buildstate->usearch_index, 2 * capacity, &error);
var77 (Collaborator):

Would reserving (capacity + the estimated size of one page) instead of doubling give some benefit here?
@Ngalstyan4 @ezra-varady

ezra-varady (Collaborator, Author) commented Sep 8, 2023

Fixed the issues mentioned above. @var77 I'm not sure; can you say more? Also, I think Varik's code is accurate enough that we never have to resize after the first guess in the current tests, so we may want to find a way to force this case.

var77 (Collaborator) commented Sep 8, 2023

> Fixed the issues mentioned above. @var77 I'm not sure; can you say more? Also, I think Varik's code is accurate enough that we never have to resize after the first guess in the current tests, so we may want to find a way to force this case.

I was thinking that if the estimates are accurate we may be off by at most one page, but I just talked with Narek and it seems TOASTed columns may work differently, so we should check how the estimates behave there. I will try to read about TOAST columns and run some tests during the day.

Ngalstyan4 merged commit caa3252 into lanterndata:main on Sep 10, 2023.