
Prune blobs can OOM #6559

Closed
michaelsproul opened this issue Nov 1, 2024 · 4 comments
Labels: database, optimization (Something to make Lighthouse run more efficiently)
@michaelsproul (Member)

Description
Description

We have a bug report from AllNodes about a Lighthouse node using 77 GB of RAM when switching from --prune-blobs false to --prune-blobs true.

They helpfully shared a jemalloc memory dump, which shows that the allocations are within do_atomically_with_block_and_blobs_cache, in particular in the get_blobs call under the partition.

michaelsproul added the optimization and database labels on Nov 1, 2024
@michaelsproul (Member, Author)

We call get_blobs so that we can undo the write to the blobs DB in the case where the write to the hot DB fails.
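A minimal sketch of why this pattern can OOM (all names here are hypothetical stand-ins, not Lighthouse's actual types): to make the cross-DB write revertible, every blob being deleted is first loaded into memory, so pruning a large batch holds every pruned blob at once.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the blobs DB key and value types.
type Key = u64;
type Blob = Vec<u8>;

// Revertible delete: load each blob before deleting it so the write can
// be undone if a subsequent hot-DB write fails. Memory use is proportional
// to the total size of all blobs deleted in one batch.
fn prune_revertible(
    blobs_db: &mut HashMap<Key, Blob>,
    keys: &[Key],
    hot_db_write_ok: bool,
) -> Result<(), ()> {
    // `get_blobs` analogue: keep a copy of everything we delete.
    let mut undo: Vec<(Key, Blob)> = Vec::new();
    for k in keys {
        if let Some(blob) = blobs_db.remove(k) {
            undo.push((*k, blob)); // every pruned blob held in memory here
        }
    }
    if hot_db_write_ok {
        Ok(())
    } else {
        // Revert the blobs-DB write.
        for (k, blob) in undo {
            blobs_db.insert(k, blob);
        }
        Err(())
    }
}
```

With blobs measured in hundreds of kilobytes each, a first-time prune over months of history adds up to the multi-gigabyte allocations seen in the report.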

I would like to consider refactoring and deleting large parts of do_atomically_with_block_and_blobs_cache.

@michaelsproul
Copy link
Member Author

Alternative strategy:

- Do away with the oldest_blob_slot. It's kind of a pain anyway.
- Pruning proceeds by iterating all the blobs and deleting the ones prior to the data availability period.
- On nodes that are already pruning there aren't really that many blobs, and we can avoid holding too many in memory: load one at a time, check whether it needs to be deleted, stage it for deletion.
- For the OP's case of pruning a DB with lots of blobs this will also work fine, and will only run slowly once (the first time switching from --prune-blobs false to true).
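The streaming approach above can be sketched like this (hypothetical types, not Lighthouse's actual store API): only the keys of stale blobs are staged, never the blob values, so peak memory is bounded by the number of keys rather than the size of the blobs.

```rust
use std::collections::BTreeMap;

type Slot = u64;
type Blob = Vec<u8>;

// Streaming prune: walk the blobs DB in slot order, keeping only the key
// of each blob to delete, never the blob itself. Memory use is O(number
// of staged keys), independent of blob sizes. Returns the number pruned.
fn prune_streaming(blobs_db: &mut BTreeMap<Slot, Blob>, da_boundary: Slot) -> usize {
    // Stage keys for deletion; blobs before the data availability boundary
    // are examined one at a time and their values are never copied.
    let stale: Vec<Slot> = blobs_db
        .range(..da_boundary)
        .map(|(slot, _blob)| *slot)
        .collect();
    let n = stale.len();
    for slot in stale {
        blobs_db.remove(&slot);
    }
    n
}
```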

I think this will allow us to get rid of all the complexity around reverting blob DB transactions and coordinating writes across databases. I think the only other place it's currently used is when importing blocks. In this case we can write the blobs first, and then write the block, so that we maintain the invariant:

block in hot_db && blobs_in_da_period --> blobs for block in blobs_db
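The write ordering can be sketched as follows (the Store struct and field names are invented for illustration): because blobs land before the block, a crash between the two writes leaves only orphaned blobs (harmless, prunable later), never a visible block whose blobs are missing.

```rust
use std::collections::HashMap;

type Root = [u8; 32];

// Hypothetical two-database store; no atomic cross-DB transaction exists.
struct Store {
    hot_db: HashMap<Root, Vec<u8>>,   // block root -> block bytes
    blobs_db: HashMap<Root, Vec<u8>>, // block root -> blobs bytes
}

impl Store {
    // Preserves the invariant: if a block is in hot_db (and within the DA
    // period), its blobs are in blobs_db.
    fn import_block(&mut self, root: Root, block: Vec<u8>, blobs: Vec<u8>) {
        // Phase 1: blobs first.
        self.blobs_db.insert(root, blobs);
        // Phase 2: block second. Once the block is visible, its blobs are
        // guaranteed to be present already.
        self.hot_db.insert(root, block);
    }
}
```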

@eserilev (Collaborator)

eserilev commented Nov 2, 2024

I can pick this one up.

I think it'd be nice to add a delete_while method to the KeyValueStore trait, just like in the slasher, for cases like this. It would also allow us to use extract_if for redb, which is optimized for deleting a range of values.
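A rough sketch of what such a method could look like (the trait and signature here are invented for illustration; Lighthouse's actual KeyValueStore trait differs): the predicate sees only keys, so the backend can delete an ordered prefix of a column without ever loading values.

```rust
use std::collections::BTreeMap;

// Hypothetical minimal key-value trait with a delete_while method.
trait KeyValueStore {
    // Delete keys in `column`, in order, while `f` returns true; stop at
    // the first non-matching key. Returns the number of keys deleted.
    fn delete_while(&mut self, column: &str, f: &mut dyn FnMut(&[u8]) -> bool) -> usize;
}

// Toy in-memory backend keyed by (column, key).
struct MemStore {
    data: BTreeMap<(String, Vec<u8>), Vec<u8>>,
}

impl KeyValueStore for MemStore {
    fn delete_while(&mut self, column: &str, f: &mut dyn FnMut(&[u8]) -> bool) -> usize {
        // Stage matching keys; values are never touched.
        let mut doomed = Vec::new();
        for ((col, key), _) in self.data.iter() {
            if col.as_str() != column {
                continue;
            }
            if f(key.as_slice()) {
                doomed.push((col.clone(), key.clone()));
            } else {
                break; // keys are ordered, so we can stop early
            }
        }
        let n = doomed.len();
        for k in doomed {
            self.data.remove(&k);
        }
        n
    }
}
```

A real backend would implement this with a range delete or an extract_if-style iterator instead of staging keys, but the trait surface would be the same.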

@michaelsproul (Member, Author)

Fixed now by:
