
Wrong Partitioning? #3

Open
mrd0ll4r opened this issue Dec 2, 2024 · 9 comments

@mrd0ll4r
Contributor

mrd0ll4r commented Dec 2, 2024

Hello, it's me again!

I finally got around to giving the whole thing a spin. While partitioning, I noticed that the migration partitions by repo instead of by collection (which is what the README says). I played around a bit with a not-so-large number of records, but still:

Script to un-partition:

BEGIN;

alter table records rename to records_old;

create table records
(like records_old including defaults);

alter sequence records_id_seq owned by records.id;

with moved_rows as (
        delete from records_old r
        returning r.*
)
insert into records select * from moved_rows;
drop table records_old;

COMMIT;

and then run docker compose up update-db-schema to re-create the indices correctly.
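
(For reference: judging by the index names that appear in the query plans below, the composite index presumably covers (repo, collection, rkey). If you wanted to recreate it by hand instead of re-running the schema migration, it would look roughly like the sketch below; the column order and the lack of a uniqueness constraint are assumptions, not the project's actual definition.)

create index if not exists idx_repo_record_key on records (repo, collection, rkey);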

No partitioning

Pros (maybe)

Cons (maybe)

  • Queries filtering by repo or collection will probably be slow

Query performance

bluesky=# SELECT COUNT(*) FROM records;
 count
--------
 888622
(1 row)

bluesky=# EXPLAIN ANALYZE SELECT COUNT(*) FROM records WHERE collection='app.bsky.feed.post';
                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=37811.76..37811.78 rows=1 width=8) (actual time=67.560..67.561 rows=1 loops=1)
   ->  Index Only Scan using idx_repo_record_key on records  (cost=0.42..37472.83 rows=135574 width=0) (actual time=0.138..61.037 rows=133835 loops=1)
         Index Cond: (collection = 'app.bsky.feed.post'::text)
         Heap Fetches: 0
 Planning Time: 0.250 ms
 Execution Time: 67.593 ms
(6 rows)

Partition by repo

BEGIN;

alter table records rename to records_old;

create table records
(like records_old including defaults)
partition by hash (repo);

alter sequence records_id_seq owned by records.id;

do $$
begin
for i in 0..15 loop
   execute 'create table records_' || i || ' partition of records for values with (modulus 16, remainder ' || i || ')';
end loop;
end $$;

with moved_rows as (
        delete from records_old r
        returning r.*
)
insert into records select * from moved_rows;
drop table records_old;

COMMIT;

Pros (maybe)

  • Backfilling one repo touches only one partition
  • 16 partitions to write to / query from in parallel
  • Partitions all end up roughly the same size

Cons (maybe)

  • No partition pruning by collection, so per-collection queries have to touch every partition

Query performance

bluesky=# EXPLAIN ANALYZE SELECT COUNT(*) FROM records WHERE collection='app.bsky.feed.post';
                                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=38140.00..38140.01 rows=1 width=8) (actual time=73.662..73.670 rows=1 loops=1)
   ->  Append  (cost=0.41..37877.62 rows=104954 width=0) (actual time=0.031..67.642 rows=133835 loops=1)
         ->  Index Only Scan using records_0_repo_collection_rkey_idx on records_0 records_1  (cost=0.41..2537.39 rows=8227 width=0) (actual time=0.031..3.955 rows=8132 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_1_repo_collection_rkey_idx on records_1 records_2  (cost=0.41..1957.87 rows=9082 width=0) (actual time=0.027..3.437 rows=9107 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_2_repo_collection_rkey_idx on records_2 records_3  (cost=0.41..2439.50 rows=8632 width=0) (actual time=0.034..3.707 rows=8620 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_3_repo_collection_rkey_idx on records_3 records_4  (cost=0.41..1582.57 rows=6913 width=0) (actual time=0.027..2.432 rows=6922 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_4_repo_collection_rkey_idx on records_4 records_5  (cost=0.41..1229.93 rows=4963 width=0) (actual time=0.025..2.102 rows=4963 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_5_repo_collection_rkey_idx on records_5 records_6  (cost=0.41..1962.92 rows=6149 width=0) (actual time=0.179..3.378 rows=6202 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_6_repo_collection_rkey_idx on records_6 records_7  (cost=0.41..1849.34 rows=5618 width=0) (actual time=0.029..3.009 rows=5611 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_7_repo_collection_rkey_idx on records_7 records_8  (cost=0.41..1718.71 rows=6338 width=0) (actual time=0.051..2.535 rows=6340 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_8_repo_collection_rkey_idx on records_8 records_9  (cost=0.41..1540.59 rows=7652 width=0) (actual time=0.116..2.568 rows=7681 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_9_repo_collection_rkey_idx on records_9 records_10  (cost=0.41..2601.65 rows=15127 width=0) (actual time=0.026..4.496 rows=15138 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_10_repo_collection_rkey_idx on records_10 records_11  (cost=0.42..4614.40 rows=566 width=0) (actual time=0.027..6.468 rows=6859 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_11_repo_collection_rkey_idx on records_11 records_12  (cost=0.41..2128.93 rows=260 width=0) (actual time=0.027..3.935 rows=11692 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_12_repo_collection_rkey_idx on records_12 records_13  (cost=0.42..3487.70 rows=11513 width=0) (actual time=0.034..5.376 rows=11536 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_13_repo_collection_rkey_idx on records_13 records_14  (cost=0.42..3772.36 rows=13434 width=0) (actual time=0.065..5.868 rows=13539 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_14_repo_collection_rkey_idx on records_14 records_15  (cost=0.42..2939.74 rows=360 width=0) (actual time=0.027..4.406 rows=7974 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
         ->  Index Only Scan using records_15_repo_collection_rkey_idx on records_15 records_16  (cost=0.41..989.23 rows=120 width=0) (actual time=0.025..1.598 rows=3519 loops=1)
               Index Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Fetches: 0
 Planning Time: 1.849 ms
 Execution Time: 73.747 ms
(52 rows)
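
(By contrast, a query restricted to a single repo should be pruned down to a single partition under this scheme. A minimal sketch; the DID is just a placeholder:)

EXPLAIN SELECT COUNT(*) FROM records WHERE repo = 'did:plc:placeholder';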

Partition by collection

BEGIN;

alter table records rename to records_old;

create table records
(like records_old including defaults)
partition by hash (collection);

alter sequence records_id_seq owned by records.id;

do $$
begin
for i in 0..15 loop
   execute 'create table records_' || i || ' partition of records for values with (modulus 16, remainder ' || i || ')';
end loop;
end $$;

with moved_rows as (
        delete from records_old r
        returning r.*
)
insert into records select * from moved_rows;
drop table records_old;

COMMIT;

Pros (maybe)

  • Per-collection queries only touch one partition

Cons (maybe)

  • Insertions are scattered across all partitions (e.g. when backfilling a single repo)
  • Partitions have vastly different sizes

Query performance

bluesky=# EXPLAIN ANALYZE SELECT COUNT(*) FROM records WHERE collection='app.bsky.feed.post';
                                                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=6988.16..6988.17 rows=1 width=8) (actual time=28.228..33.945 rows=1 loops=1)
   ->  Gather  (cost=6987.84..6988.15 rows=3 width=8) (actual time=28.157..33.934 rows=4 loops=1)
         Workers Planned: 3
         Workers Launched: 3
         ->  Partial Aggregate  (cost=5987.84..5987.85 rows=1 width=8) (actual time=17.989..17.990 rows=1 loops=4)
               ->  Parallel Index Only Scan using records_11_repo_collection_rkey_idx on records_11 records  (cost=0.42..5879.91 rows=43173 width=0) (actual time=0.182..15.952 rows=33459 loops=4)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
                     Heap Fetches: 0
 Planning Time: 0.893 ms
 Execution Time: 34.000 ms
(10 rows)
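
(The size skew mentioned above is easy to quantify with a quick catalog query; a sketch, assuming the partitions keep the records_N naming from the script:)

select relname, pg_size_pretty(pg_total_relation_size(oid)) as total_size
from pg_class
where relname ~ '^records_[0-9]+$'
order by pg_total_relation_size(oid) desc;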

Partition by repo, then sub-partition by collection

BEGIN;

alter table records rename to records_old;

create table records
(like records_old including defaults)
partition by hash (repo);

alter sequence records_id_seq owned by records.id;

do $$
begin
for i in 0..15 loop
   execute 'create table records_' || i || ' partition of records for values with (modulus 16, remainder ' || i || ') partition by hash (collection)';
for j in 0..15 loop
   execute 'create table records_' || i || '_' || j || ' partition of records_' || i || ' for values with (modulus 16, remainder ' || j || ')';
end loop;
end loop;
end $$;

with moved_rows as (
        delete from records_old r
        returning r.*
)
insert into records select * from moved_rows;
drop table records_old;

COMMIT;

Pros (maybe)

  • Best of both worlds?

Cons (maybe)

  • Looks a bit ugly, and we still get vastly different partition sizes, depending on which collection they hold

Query performance

bluesky=# EXPLAIN ANALYZE SELECT COUNT(*) FROM records WHERE collection='app.bsky.feed.post';
                                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=7684.04..7684.05 rows=1 width=8) (actual time=49.355..49.365 rows=1 loops=1)
   ->  Append  (cost=337.28..7682.36 rows=671 width=0) (actual time=1.461..44.356 rows=133835 loops=1)
         ->  Bitmap Heap Scan on records_0_11 records_1  (cost=337.28..468.23 rows=41 width=0) (actual time=1.460..4.555 rows=8132 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=659
               ->  Bitmap Index Scan on records_0_11_repo_collection_rkey_idx  (cost=0.00..337.27 rows=41 width=0) (actual time=1.312..1.312 rows=8132 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_1_11 records_2  (cost=376.60..518.20 rows=46 width=0) (actual time=1.548..3.588 rows=9107 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=540
               ->  Bitmap Index Scan on records_1_11_repo_collection_rkey_idx  (cost=0.00..376.59 rows=46 width=0) (actual time=1.467..1.467 rows=9107 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_2_11 records_3  (cost=356.95..491.49 rows=43 width=0) (actual time=0.902..2.916 rows=8620 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=577
               ->  Bitmap Index Scan on records_2_11_repo_collection_rkey_idx  (cost=0.00..356.94 rows=43 width=0) (actual time=0.839..0.839 rows=8620 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_3_11 records_4  (cost=288.21..398.23 rows=35 width=0) (actual time=0.723..2.390 rows=6922 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=507
               ->  Bitmap Index Scan on records_3_11_repo_collection_rkey_idx  (cost=0.00..288.20 rows=35 width=0) (actual time=0.673..0.673 rows=6922 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_4_11 records_5  (cost=209.51..290.06 rows=25 width=0) (actual time=0.506..1.686 rows=4963 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=360
               ->  Bitmap Index Scan on records_4_11_repo_collection_rkey_idx  (cost=0.00..209.50 rows=25 width=0) (actual time=0.465..0.465 rows=4963 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_5_11 records_6  (cost=258.81..356.23 rows=31 width=0) (actual time=0.630..2.101 rows=6202 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=461
               ->  Bitmap Index Scan on records_5_11_repo_collection_rkey_idx  (cost=0.00..258.80 rows=31 width=0) (actual time=0.578..0.578 rows=6202 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_6_11 records_7  (cost=234.37..326.66 rows=28 width=0) (actual time=0.628..1.679 rows=5611 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=491
               ->  Bitmap Index Scan on records_6_11_repo_collection_rkey_idx  (cost=0.00..234.36 rows=28 width=0) (actual time=0.577..0.578 rows=5611 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_7_11 records_8  (cost=263.84..362.73 rows=32 width=0) (actual time=0.478..1.412 rows=6340 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=412
               ->  Bitmap Index Scan on records_7_11_repo_collection_rkey_idx  (cost=0.00..263.83 rows=32 width=0) (actual time=0.443..0.443 rows=6340 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_8_11 records_9  (cost=317.90..438.97 rows=38 width=0) (actual time=0.522..1.703 rows=7681 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=607
               ->  Bitmap Index Scan on records_8_11_repo_collection_rkey_idx  (cost=0.00..317.89 rows=38 width=0) (actual time=0.471..0.472 rows=7681 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_9_11 records_10  (cost=629.96..868.42 rows=76 width=0) (actual time=1.023..3.236 rows=15138 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=1066
               ->  Bitmap Index Scan on records_9_11_repo_collection_rkey_idx  (cost=0.00..629.94 rows=76 width=0) (actual time=0.929..0.930 rows=15138 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_10_11 records_11  (cost=287.73..395.55 rows=34 width=0) (actual time=0.459..1.459 rows=6859 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=534
               ->  Bitmap Index Scan on records_10_11_repo_collection_rkey_idx  (cost=0.00..287.73 rows=34 width=0) (actual time=0.414..0.414 rows=6859 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_11_11 records_12  (cost=479.99..656.96 rows=58 width=0) (actual time=0.806..2.479 rows=11692 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=693
               ->  Bitmap Index Scan on records_11_11_repo_collection_rkey_idx  (cost=0.00..479.98 rows=58 width=0) (actual time=0.746..0.746 rows=11692 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_12_11 records_13  (cost=474.82..661.57 rows=58 width=0) (actual time=0.843..2.707 rows=11536 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=946
               ->  Bitmap Index Scan on records_12_11_repo_collection_rkey_idx  (cost=0.00..474.81 rows=58 width=0) (actual time=0.747..0.747 rows=11536 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_13_11 records_14  (cost=561.97..775.67 rows=68 width=0) (actual time=0.936..2.925 rows=13539 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=989
               ->  Bitmap Index Scan on records_13_11_repo_collection_rkey_idx  (cost=0.00..561.95 rows=68 width=0) (actual time=0.848..0.848 rows=13539 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_14_11 records_15  (cost=332.10..461.79 rows=40 width=0) (actual time=0.556..1.803 rows=7974 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=743
               ->  Bitmap Index Scan on records_14_11_repo_collection_rkey_idx  (cost=0.00..332.09 rows=40 width=0) (actual time=0.494..0.494 rows=7974 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
         ->  Bitmap Heap Scan on records_15_11 records_16  (cost=150.68..208.24 rows=18 width=0) (actual time=0.238..0.736 rows=3519 loops=1)
               Recheck Cond: (collection = 'app.bsky.feed.post'::text)
               Heap Blocks: exact=244
               ->  Bitmap Index Scan on records_15_11_repo_collection_rkey_idx  (cost=0.00..150.67 rows=18 width=0) (actual time=0.214..0.214 rows=3519 loops=1)
                     Index Cond: (collection = 'app.bsky.feed.post'::text)
 Planning Time: 8.813 ms
 Execution Time: 49.515 ms
(84 rows)

Opinions?

mrd0ll4r changed the title from "Wrong Partitioning" to "Wrong Partitioning?" on Dec 2, 2024
@imax9000
Contributor

Yeah, I haven't updated the readme yet 😅

I switched to hash partitioning by repo to allow for more partitions and to keep them more evenly sized. But even with that, on my hardware I was only able to ingest up to 2k events/s.

@mrd0ll4r
Contributor Author

Hmm. I'm running the dual partitioning now, with 16 × 16 partitions, i.e., 256 in total. I haven't checked the metrics, but I think it's fully caught up and everything has been synced after approx. a week or so. The hardware is (supposedly) a 2TB NVMe SSD, on which I run ZFS with compression. I don't use ScyllaDB because I don't trust these NoSQL databases, and I don't want to start indexing from scratch.

The database is a healthy 940 GB on disk with a compression ratio of 2.8 😅 well, it works...
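
(For comparison, the logical size as Postgres sees it, i.e. before ZFS compression, can be checked with the usual catalog function; assuming the database is called bluesky, as in the psql prompts above:)

select pg_size_pretty(pg_database_size('bluesky'));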

@imax9000
Contributor

Oh, nice! Do you run prometheus and grafana to keep track of it? What kind of performance numbers did you get?

(I recently got a 4TB NVMe SSD, but my indexer is still running on a SATA SSD)

@mrd0ll4r
Contributor Author

Do you run prometheus and grafana to keep track of it?

nope 🙃 I usually run them in the same docker compose setup, but I haven't added them here yet.

Which version of Grafana does the dashboard work with? Any plugins etc. I need? (In the best case, do you have a docker compose snippet flying around for that?)

@imax9000
Contributor

There's an outdated compose file here: https://github.com/uabluerail/indexer/blob/main/metrics/docker-compose.yml

I didn't use it, since I'm running an instance for everything I have at home and just added it to the config. This is my dashboard: https://github.com/uabluerail/indexer/blob/main/dashboards/indexer.json (also quite outdated), with some scripts to manage it in Grafana: https://github.com/uabluerail/indexer/blob/main/Makefile#L95

@mrd0ll4r
Contributor Author

Aha! I set all that up. The dashboard is very nice! Some of the hardware stuff is specific to your setup, but that's fine :)

I noticed I had a delay of approx. 24 hours to some of the official PDSs; most of the others were only a few hours behind. I turned off the record indexer to catch up overnight, and I'm almost caught up now. In general, performance varies quite a bit:

  • Sometimes I get 1k commits/s total on the official PDSs and 3k commits/s on 3rd party (which include some newer official ones)
  • Usually I get 500 commits/s total on the official PDSs and maybe 400 commits/s on 3rd party
  • When autovacuum kicks in, I get 300-400 commits/s on official PDSs and 300 commits/s on 3rd party

(this was all without the record indexer running, which will probably slow things down a bit)

I guess I could cut down on autovacuum times by increasing partitioning by DID, which should make the record table partitions smaller overall?

In the first few hours I still had the record indexer running and got up to 150k repos/h total, but usually something more like 70k repos/h.

Unfortunately, the machine sometimes OOM-kills Postgres, which wreaks havoc on everything, and most notably means autovacuum is canceled (and restarted once everything is back up). I've adjusted oom_score_adj to -200 for postgres and 200 for the other components, and that seems to help a bit -- sometimes it kills the record indexer, not Postgres.
In general, I believe this is an unfortunate interaction of the ZFS ARC and the OOM killer, as described here: openzfs/zfs#10255
I've adjusted zfs_arc_sys_free to 3 GiB, which might have helped a bit, not sure. I might brute-force limit the ARC to, e.g., 10 GiB, but that requires a reboot. In general, htop is clever enough to understand the ARC and usually shows ample available memory, but I believe the OOM killer does not. The machine has 32GiB of memory in total. Maybe I've also messed up the memory limits for Postgres, which I now have at:

    command: [
      "-c", "max_connections=500",
      "-c", "max_parallel_workers_per_gather=8",
      "-c", "shared_buffers=8GB",
      "-c", "work_mem=2GB",
      "-c", "max_wal_size=4GB"
      ]
    shm_size: '10gb'
    oom_score_adj: -200

I've also reduced the maximum number of connections the record indexer's DB pool keeps to 200, in the hope that it might reduce Postgres's memory usage and make it less attractive to the OOM killer... but I'm poking around in the dark a bit, tbh.
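
(To take some of the guesswork out of it, the memory-related settings Postgres actually ended up with can be double-checked from inside the database, e.g.:)

select name, setting, unit
from pg_settings
where name in ('shared_buffers', 'work_mem', 'maintenance_work_mem', 'max_connections', 'max_wal_size');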

@imax9000
Contributor

imax9000 commented Dec 15, 2024

3k commits/s on 3rd party (which include some newer official ones)

Hm, I guess I've updated the regex since then to properly categorize PDSs. I've committed the current version of my dashboard just now, so you can check it out too.

I believe this is an unfortunate interaction of the ZFS ARC and the OOM killer

Yeah, even though ZFS should free up some memory under pressure, to the virtual memory subsystem it looks like "allocated" instead of "buffers", so it can't treat it as available for reuse.

I've adjusted zfs_arc_sys_free to 3 GiB, which might have helped a bit, not sure. I might brute-force limit the ARC to, e.g., 10 GiB, but that requires a reboot.

I'm limiting the ARC to 4GB on my machine; I don't want ZFS to eat too much RAM. And no, you don't have to reboot, you can adjust it at runtime:

sudo sh -c 'echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max'

I've also adjusted the max. number of connections the DB pool of the record indexer keeps to 200 in hopes that might reduce memory usage of Postgres and make it less attractive to the OOM killer... but I'm poking around in the dark a bit, tbh.

The compose file still has max_connections=1000 though. I removed a bunch of memory-related params in commit 48f0429, where I added the ScyllaDB container, the idea being that Postgres is now only used for PDS and repo metadata 😅 But since you're still using it for the records too, it makes sense for you to put them back in your docker-compose.override.yml (do note, however, that command gets replaced, not merged, so you need to copy over the params you want to keep from the main file; it's also worth checking that it all looks good with docker compose config before restarting things).

Sometimes I get 1k commits/s total on the official PDSs and 3k commits/s on 3rd party (which include some newer official ones)

4k commits/s is not too bad, should be good enough at least until the next big influx of users, and maybe even for some time afterwards.

I turned off the record indexer to catch up overnight, and I'm almost caught up now.

Did record-indexer finish the backfill? If it did, the load from it should be minimal and shouldn't impact the consumer too much. What I was doing was taking turns between the consumer and the record-indexer: let the consumer catch up, switch to the record-indexer for the night, rinse and repeat.

@mrd0ll4r
Contributor Author

Hello again! I've kept an eye on the dashboards and tweaked a few settings here and there. I'm now caught up and sailing along (until the disks are full, anyway).

This is my current postgres configuration:

    command: [
      "-c", "max_connections=500",
      "-c", "max_parallel_workers_per_gather=8",
      "-c", "shared_buffers=10GB",
      "-c", "work_mem=1GB",
      "-c", "maintenance_work_mem=1GB",
      "-c", "vacuum_buffer_usage_limit=128MB",
      "-c", "autovacuum_vacuum_cost_delay=0",
      "-c", "max_wal_size=4GB"
      ]
    shm_size: '16gb'
    oom_score_adj: -200

It still died occasionally while backfilling, but it's stable now.

I also turned full page writes off, following some tutorial about Postgres on ZFS:

ALTER SYSTEM set full_page_writes=off;

Weirdly, while backfilling, the bottleneck seemed to be the disk. I'm running this on some cloud VM, which supposedly has NVMes. I got pretty much exactly 40 MB/s of disk writes, which looked suspiciously like throttling was going on somewhere. I guess I'll go for bare metal next time. Or maybe I should play around with wider partitioning? Do you know whether I can un-partition and re-partition in place? Otherwise the disk won't be large enough, I'm afraid...

Anyway, I did some CSV exports, which took a respectable 13 hours for all follows, likes, user profiles, and post languages. 😊
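
(Such an export boils down to one big COPY per collection; a rough sketch via psql, where the content column name is an assumption about the schema:)

\copy (select repo, rkey, content from records where collection = 'app.bsky.graph.follow') to 'follows.csv' with (format csv, header)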

@imax9000
Contributor

I also turned full page writes off, according to some tutorial about Postgres on ZFS:

Hm, yeah, it probably makes sense especially with compression enabled, since Postgres's idea of block boundaries wouldn't match reality.

Overall you want it to never OOM :) It might be worth tweaking the RAM limits for the other containers: record-indexer can be spiky (it loads a full repo into memory at once, so a large repo can take a lot of memory), so reducing its limit will cause it to OOM earlier instead of triggering the kernel into killing something else.
