Dynamically scale usearch index on construction #122
Conversation
…ing a fixed size. This slightly changes cost estimates (<1%)
Hi @ezra-varady, thanks for the improvement!

```c
BlockNumber numBlocks = RelationGetNumberOfBlocks(heap);
uint32_t    estimated_row_count = 0;
if(numBlocks > 0) {
    // Read the first block of the heap
    Buffer buffer = ReadBufferExtended(heap, MAIN_FORKNUM, 0, RBM_NORMAL, NULL);
    // Share-lock the buffer so there won't be any new writes during this operation
    LockBuffer(buffer, BUFFER_LOCK_SHARE);
    // Interpret the block buffer as a Page struct
    Page page = BufferGetPage(buffer);
    // Get the maximum tuple offset on the page
    OffsetNumber offset = PageGetMaxOffsetNumber(page);
    // Estimate the row count of the table. There are 3 cases:
    // 1 - new data is added and numBlocks increases. In this case the estimate will be
    //     lower than the actual number (we need to check and grow the index size)
    // 2 - the last page is not fully occupied (the most likely case). In this case the
    //     estimate will be a bit too high
    // 3 - the last page is fully occupied and we get exactly the number of rows the
    //     table has (a very rare case, I think)
    estimated_row_count = offset * numBlocks;
    // Unlock and release the buffer
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buffer);
}
```

cc: @Ngalstyan4
Good idea! Let me incorporate this.
Good work!
src/hnsw/build.c
Outdated
```c
    estimated_row_count = offset * numBlocks;
    // Unlock and release buffer
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buffer);
```
There is a function called `UnlockReleaseBuffer` that combines these two. That should be preferred.
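For reference, PostgreSQL's `UnlockReleaseBuffer` takes the buffer and performs both steps; a minimal sketch of the replacement:

```c
// Equivalent to LockBuffer(buffer, BUFFER_LOCK_UNLOCK) followed by ReleaseBuffer(buffer)
UnlockReleaseBuffer(buffer);
```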
src/hnsw/build.c
Outdated
```c
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buffer);
}
usearch_reserve(buildstate->usearch_index, estimated_row_count, &error);
assert(error == NULL);
```
Can you change this to `if (error != NULL) elog(ERROR, ...)`? Make sure any cleanup that is necessary is done before the elog (e.g. usearch_index probably needs to be closed).
The reason we treat this particular error differently: this call likely triggers a large memory allocation, so it is more likely than most to fail.
We will soon deal with such failures in a more comprehensive manner, but since this one is particularly likely, we should expect error to be non-NULL.
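For illustration, a minimal sketch of the suggested error handling, assuming usearch's C API `usearch_free` is the right cleanup call here (the exact cleanup needed in build.c may differ):

```c
usearch_reserve(buildstate->usearch_index, estimated_row_count, &error);
if(error != NULL) {
    // Clean up before elog(ERROR), which performs a non-local exit and
    // would otherwise leak the usearch index allocation.
    // Use a separate error slot so the original message is preserved.
    usearch_error_t free_error = NULL;
    usearch_free(buildstate->usearch_index, &free_error);
    buildstate->usearch_index = NULL;
    elog(ERROR, "failed to reserve capacity for usearch index: %s", error);
}
```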
src/hnsw/build.c
Outdated
```c
if(buildstate->usearch_index != NULL) usearch_add(buildstate->usearch_index, label, vector, usearch_scalar, &error);
if(buildstate->usearch_index != NULL) {
    size_t capacity;
    capacity = usearch_capacity(buildstate->usearch_index, &error);
```
Do the declaration and assignment on the same line.
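Applied to the snippet above, the suggested form would be:

```c
size_t capacity = usearch_capacity(buildstate->usearch_index, &error);
```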
src/hnsw/build.c
Outdated
```c
    // 1 - new data is added and numBlocks gets increased. In this case the estimate will be
    //     lower than the actual number (we need to check and increase the index size)
    // 2 - the last page is not fully occupied (this is most likely to happen). In this case
    //     the estimate will be a bit too high
    // 3 - the last page is fully occupied and we get exactly the number of rows the table
    //     has (a very rare case, I think)
```
In case 1, do you mean new blocks are added between usearch_reserve()
and index insertions?
And there actually are a lot more cases, no? E.g. some of the tuples in the blocks could be deleted but not vacuumed, some blocks could hold variable-length TOASTed data so the offset-based heuristic above would fail, etc.
Yes, actually true: for deleted but not yet vacuumed rows we will allocate a bit more memory than needed, and the TOASTed rows will be handled correctly by the resizing logic, I guess.
```c
size_t capacity;
capacity = usearch_capacity(buildstate->usearch_index, &error);
if(capacity == usearch_size(buildstate->usearch_index, &error)) {
    usearch_reserve(buildstate->usearch_index, 2 * capacity, &error);
```
Would reallocating memory of (capacity + estimated size of one page) give some benefit here?
@Ngalstyan4 @ezra-varady
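For illustration, the alternative being asked about might look like the sketch below; `rows_per_page` is a hypothetical variable standing for the per-page tuple estimate from the construction step (it does not exist in this PR). Doubling keeps the amortized reallocation cost constant, while page-sized steps lower peak over-allocation at the price of more frequent reserves:

```c
size_t capacity = usearch_capacity(buildstate->usearch_index, &error);
if(capacity == usearch_size(buildstate->usearch_index, &error)) {
    // Grow by roughly one page's worth of rows instead of doubling
    // (rows_per_page is hypothetical: the per-page estimate from above)
    usearch_reserve(buildstate->usearch_index, capacity + rows_per_page, &error);
}
```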
… log, cleanup/clarify
Fixed the issues mentioned above. @var77 I'm not sure, can you say more? Also, I think Varik's code is accurate enough that we never have to resize after the first guess in the current tests, so we may want to find a way to force this case.
I was thinking that if the estimations are accurate, we may be off by at most 1 page, but I just talked with Narek and it seems the TOAST-ed …
This replaces the fixed-size allocation in `BuildIndex` with scaling during tuple insertion. It alters cost estimates slightly, but not significantly.