
Prune blobs can OOM #6559

Closed
michaelsproul opened this issue Nov 1, 2024 · 4 comments
Labels: database, optimization (Something to make Lighthouse run more efficiently)
@michaelsproul (Member)

Description
Description

We have a bug report from AllNodes about a Lighthouse node using 77 GB of RAM when switching from --prune-blobs false to --prune-blobs true.

They helpfully shared a jemalloc memory dump, which shows that the allocations are within do_atomically_with_block_and_blobs_cache, in particular in the get_blobs call under the partition.

michaelsproul added the optimization and database labels on Nov 1, 2024
@michaelsproul (Member, Author)

We call get_blobs so that we can undo the write to the blobs DB in the case where the write to the hot DB fails.
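A minimal sketch of why this pattern can OOM (all names here are hypothetical stand-ins, not Lighthouse's actual types): to make the cross-DB write revertible, every blob being deleted is first loaded into memory, so pruning a large batch holds every pruned blob at once.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the blobs DB key and value types.
type Key = u64;
type Blob = Vec<u8>;

// Revertible delete: load each blob before deleting it so the write can
// be undone if a subsequent hot-DB write fails. Memory use is proportional
// to the total size of all blobs deleted in one batch.
fn prune_revertible(
    blobs_db: &mut HashMap<Key, Blob>,
    keys: &[Key],
    hot_db_write_ok: bool,
) -> Result<(), ()> {
    // `get_blobs` analogue: keep a copy of everything we delete.
    let mut undo: Vec<(Key, Blob)> = Vec::new();
    for k in keys {
        if let Some(blob) = blobs_db.remove(k) {
            undo.push((*k, blob)); // every pruned blob held in memory here
        }
    }
    if hot_db_write_ok {
        Ok(())
    } else {
        // Revert the blobs-DB write.
        for (k, blob) in undo {
            blobs_db.insert(k, blob);
        }
        Err(())
    }
}
```

With blobs measured in hundreds of kilobytes each, a first-time prune over months of history adds up to the multi-gigabyte allocations seen in the report.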

I would like to consider refactoring and deleting large parts of do_atomically_with_block_and_blobs_cache.

@michaelsproul
Copy link
Member Author

Alternative strategy:

- Do away with the oldest_blob_slot. It's kind of a pain anyway.
- Pruning proceeds by iterating all the blobs and deleting the ones prior to the data availability period.
- On nodes that are already pruning there aren't really that many blobs, and we can avoid holding too many in memory: load one at a time, check whether it needs to be deleted, stage it for deletion.
- For the OP's case of pruning a DB with lots of blobs this will also work fine, and will only run slowly once (the first time switching from --prune-blobs false to true).
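The streaming approach above can be sketched like this (hypothetical types, not Lighthouse's actual store API): only the keys of stale blobs are staged, never the blob values, so peak memory is bounded by the number of keys rather than the size of the blobs.

```rust
use std::collections::BTreeMap;

type Slot = u64;
type Blob = Vec<u8>;

// Streaming prune: walk the blobs DB in slot order, keeping only the key
// of each blob to delete, never the blob itself. Memory use is O(number
// of staged keys), independent of blob sizes. Returns the number pruned.
fn prune_streaming(blobs_db: &mut BTreeMap<Slot, Blob>, da_boundary: Slot) -> usize {
    // Stage keys for deletion; blobs before the data availability boundary
    // are examined one at a time and their values are never copied.
    let stale: Vec<Slot> = blobs_db
        .range(..da_boundary)
        .map(|(slot, _blob)| *slot)
        .collect();
    let n = stale.len();
    for slot in stale {
        blobs_db.remove(&slot);
    }
    n
}
```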

I think this will allow us to get rid of all the complexity around reverting blob DB transactions and coordinating writes across databases. I think the only other place it's currently used is when importing blocks. In this case we can write the blobs first, and then write the block, so that we maintain the invariant:

block in hot_db && blobs_in_da_period --> blobs for block in blobs_db
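The write ordering can be sketched as follows (the Store struct and field names are invented for illustration): because blobs land before the block, a crash between the two writes leaves only orphaned blobs (harmless, prunable later), never a visible block whose blobs are missing.

```rust
use std::collections::HashMap;

type Root = [u8; 32];

// Hypothetical two-database store; no atomic cross-DB transaction exists.
struct Store {
    hot_db: HashMap<Root, Vec<u8>>,   // block root -> block bytes
    blobs_db: HashMap<Root, Vec<u8>>, // block root -> blobs bytes
}

impl Store {
    // Preserves the invariant: if a block is in hot_db (and within the DA
    // period), its blobs are in blobs_db.
    fn import_block(&mut self, root: Root, block: Vec<u8>, blobs: Vec<u8>) {
        // Phase 1: blobs first.
        self.blobs_db.insert(root, blobs);
        // Phase 2: block second. Once the block is visible, its blobs are
        // guaranteed to be present already.
        self.hot_db.insert(root, block);
    }
}
```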

@eserilev (Collaborator)

eserilev commented Nov 2, 2024

I can pick this one up.

I think it'd be nice to add a delete_while method to the KeyValueStore trait, just like in the slasher, for cases like this. It would also allow us to use extract_if for redb, which is optimized for deleting a range of values.
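A rough sketch of what such a method could look like (the trait and signature here are invented for illustration; Lighthouse's actual KeyValueStore trait differs): the predicate sees only keys, so the backend can delete an ordered prefix of a column without ever loading values.

```rust
use std::collections::BTreeMap;

// Hypothetical minimal key-value trait with a delete_while method.
trait KeyValueStore {
    // Delete keys in `column`, in order, while `f` returns true; stop at
    // the first non-matching key. Returns the number of keys deleted.
    fn delete_while(&mut self, column: &str, f: &mut dyn FnMut(&[u8]) -> bool) -> usize;
}

// Toy in-memory backend keyed by (column, key).
struct MemStore {
    data: BTreeMap<(String, Vec<u8>), Vec<u8>>,
}

impl KeyValueStore for MemStore {
    fn delete_while(&mut self, column: &str, f: &mut dyn FnMut(&[u8]) -> bool) -> usize {
        // Stage matching keys; values are never touched.
        let mut doomed = Vec::new();
        for ((col, key), _) in self.data.iter() {
            if col.as_str() != column {
                continue;
            }
            if f(key.as_slice()) {
                doomed.push((col.clone(), key.clone()));
            } else {
                break; // keys are ordered, so we can stop early
            }
        }
        let n = doomed.len();
        for k in doomed {
            self.data.remove(&k);
        }
        n
    }
}
```

A real backend would implement this with a range delete or an extract_if-style iterator instead of staging keys, but the trait surface would be the same.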

@michaelsproul (Member, Author)

Fixed now by:
