feat: three phased startup order #4399
on the deployment of 2023-06-01 it was noticed that eviction_task got started during initial activation. this is possible, and perfectly fine, because random_init_delay can get a low number. this commit changes it so that eviction_task is also let continue only after initial load has completed.
i feel bad that we have to repeat it but ... if it was a "scheduler" then we'd have to still wait on it.
now the initialization sequence or order is:
1. load local tenants
2. do initial logical sizes per walreceivers
3. background tasks

additionally, background tasks are delayed by random init delay.
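A minimal sketch of that ordering, assuming nothing about the real pageserver code: `channel`, `Completion`, `Barrier` and `run_startup` below are hypothetical std-only stand-ins, and plain threads replace the async tasks. Dropping a `Completion` releases everyone waiting on the matching `Barrier`, which is enough to force the three phases to run in order even when the later phases are spawned first.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Shared state between one Completion and any number of Barrier clones.
struct Shared {
    done: Mutex<bool>,
    cv: Condvar,
}

/// Held by the phase that must finish first; dropping it releases all waiters.
pub struct Completion(Arc<Shared>);

/// Cloneable handle that later phases block on.
#[derive(Clone)]
pub struct Barrier(Arc<Shared>);

/// Hypothetical constructor for a linked Completion/Barrier pair.
pub fn channel() -> (Completion, Barrier) {
    let shared = Arc::new(Shared {
        done: Mutex::new(false),
        cv: Condvar::new(),
    });
    (Completion(shared.clone()), Barrier(shared))
}

impl Drop for Completion {
    fn drop(&mut self) {
        *self.0.done.lock().unwrap() = true;
        self.0.cv.notify_all();
    }
}

impl Barrier {
    /// Blocks until the matching Completion has been dropped.
    pub fn wait(&self) {
        let mut done = self.0.done.lock().unwrap();
        while !*done {
            done = self.0.cv.wait(done).unwrap();
        }
    }
}

/// Runs the three phases and returns the order in which they executed.
pub fn run_startup() -> Vec<&'static str> {
    let log: Arc<Mutex<Vec<&'static str>>> = Arc::new(Mutex::new(Vec::new()));
    let (load_done, load_barrier) = channel();
    let (sizes_done, sizes_barrier) = channel();

    // Phase 3 is spawned first on purpose: it must still run last.
    let background = {
        let log = log.clone();
        thread::spawn(move || {
            sizes_barrier.wait();
            log.lock().unwrap().push("background tasks");
        })
    };

    // Phase 2 waits for the initial tenant load.
    let sizes = {
        let log = log.clone();
        thread::spawn(move || {
            load_barrier.wait();
            log.lock().unwrap().push("initial logical sizes");
            drop(sizes_done); // releases phase 3
        })
    };

    // Phase 1 runs on the calling thread.
    log.lock().unwrap().push("load local tenants");
    drop(load_done); // releases phase 2

    sizes.join().unwrap();
    background.join().unwrap();
    let order = log.lock().unwrap().clone();
    order
}
```

Calling `run_startup()` returns the phase names in execution order. The random init delay mentioned above is left out of the sketch; it would be an extra sleep inside the background task after its barrier is released.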
Only a few test failures :) (44 -> 1 -> hopefully to 0 with f92d0ae) |
1000 tests run: 960 passed, 0 failed, 40 skipped (full report)
Flaky tests (2)
Postgres 15
The comment gets automatically updated with the latest test results
76d0ef1 at 2023-06-06T16:58:09.039Z :recycle: |
refactoring is nasty, because addition of this third parameter was done by making the InitializationOrder mutable, taking out the initial_tenant_load completion "tracker", and then passing on a borrow to InitializationOrder.
with cancellation. this was another oversight.
earlier I was thinking it needed changes, but it doesn't. it was just my bug which caused the need.
just as a safeguard, in case activation breaks tenant or timeline.
took some finding out. initially I remembered that the safekeepers might be down, but it's just the lack of updates, and so of connection attempts, which causes the lack of connections.
include comment per review. Co-authored-by: Christian Schwarz <[email protected]>
Could do that on a separate PR, but sounds trivial to delay. |
better comment Co-authored-by: Christian Schwarz <[email protected]>
Co-authored-by: Christian Schwarz <[email protected]>
We already delay, because the |
as discussed on #4399 (comment)
Discussed this off-github, metrics collection doesn't wait for anything (except mutexes, http POST), so all timelines which haven't yet had their logical sizes computed will get them queued up practically at the same time. |
Created a follow-up to my comment about tenant::mgr Uninitialized handling: #4433 EDIT: now I have read #4399 (comment), so this is not a pressing issue.
Oops, forgot to answer this. It's just a guess, Harrison-Stetson method. Safekeepers push updates to storage_broker every 1s or so, so we'd have some chance to get these, then to connect to safekeepers, and then to initiate the initial logical size calculations. Hopefully the first seconds spent on tenant loading will allow us to see the first updates on storage_broker, and then the logical size calculations will be waiting on the barrier. When the 10s runs out, we might not have finished every initial size calculation, but at least the tasks are queued up, delaying background tasks "naturally".
I don't think the cloud-e2e timeout is caused by the changes, because the pageservers are not restarted, and background tasks get to start immediately because the pageservers are empty.
Initial logical size calculation could still hinder our fast startup efforts in #4397. See #4183. In deployment of 2023-06-06 about 200 initial logical sizes were calculated on hosts which took the longest to complete initial load (12s).

Implements the three step/tier initialization ordering described in #4397:
1. load local tenants
2. do initial logical sizes per walreceivers for 10s
3. background tasks

Ordering is controlled by:
- waiting on `utils::completion::Barrier`s on background tasks
- having one attempt for each Timeline to do initial logical size calculation
- `pageserver/src/bin/pageserver.rs` releasing background jobs after timeout or completion of initial logical size calculation

The timeout is there just to safeguard in case a legitimate non-broken timeline initial logical size calculation goes long. The timeout is configurable, by default 10s, which I think would be fine for production systems. In the test cases I've been looking at, it seems that these steps are completed as fast as possible.

Co-authored-by: Christian Schwarz <[email protected]>
Refactor the `!completed` to be about `Option<_>` instead, side-stepping any boolean true/false or false/true. As discussed on #4399 (comment)
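The refactored code itself is not shown in this conversation, so the following is only a general illustration of the idea, with hypothetical names (`wait_for_initial_sizes`, `describe`) and a plain `std::sync::mpsc` channel standing in for whatever the pageserver really uses. Returning an `Option<_>` makes the call site say `Some` or `None` instead of relying on the reader remembering which way a `completed` flag was negated.

```rust
use std::sync::mpsc;
use std::time::Duration;

/// Hypothetical helper: waits for a "number of sizes calculated" message.
/// Returns Some(count) if one arrives within `timeout`, and None if the
/// wait timed out (or the sender went away without reporting).
pub fn wait_for_initial_sizes(rx: &mpsc::Receiver<usize>, timeout: Duration) -> Option<usize> {
    rx.recv_timeout(timeout).ok()
}

/// The caller matches on the Option, so there is no `!completed` to misread.
pub fn describe(outcome: Option<usize>) -> String {
    match outcome {
        Some(n) => format!("completed {} initial logical size calculations", n),
        None => "timed out; releasing background jobs anyway".to_string(),
    }
}
```

With a boolean, `if !completed { ... }` and `if timed_out { ... }` look alike at a glance and are easy to swap; with the `Option`, the `Some` arm also carries the value that only exists on the completed path.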
Problem
Initial logical size calculation could still hinder our fast startup efforts in #4397. See #4183. In deployment of 2023-06-06 about 200 initial logical sizes were calculated on hosts which took the longest to complete initial load (12s).
Summary of changes
Implements the three step/tier initialization ordering described in #4397:
1. load local tenants
2. do initial logical sizes per walreceivers for 10s
3. background tasks

Ordering is controlled by:
- waiting on `utils::completion::Barrier`s on background tasks
- having one attempt for each Timeline to do initial logical size calculation
- `pageserver/src/bin/pageserver.rs` releasing background jobs after timeout or completion of initial logical size calculation

The timeout is there just to safeguard in case a legitimate non-broken timeline initial logical size calculation goes long. The timeout is configurable, by default 10s, which I think would be fine for production systems. In the test cases I've been looking at, it seems that these steps are completed as fast as possible.
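The "timeout or completion" release can be sketched as follows. This is a std-only approximation under stated assumptions, not the code in `pageserver/src/bin/pageserver.rs`: `Pending`, `Attempt`, `start_attempt` and `wait_or_timeout` are hypothetical names, and a condition variable stands in for the async machinery. Each timeline holds one `Attempt` while its initial logical size is being calculated; the startup code waits until every attempt has been dropped or the timeout elapses, whichever comes first.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Tracks how many initial logical size calculations are still running.
pub struct Pending {
    count: Mutex<usize>,
    cv: Condvar,
}

/// One attempt per timeline; dropping it marks that attempt as finished.
pub struct Attempt(Arc<Pending>);

impl Pending {
    pub fn new() -> Arc<Self> {
        Arc::new(Pending {
            count: Mutex::new(0),
            cv: Condvar::new(),
        })
    }
}

/// Registers a new in-flight calculation.
pub fn start_attempt(pending: &Arc<Pending>) -> Attempt {
    *pending.count.lock().unwrap() += 1;
    Attempt(pending.clone())
}

impl Drop for Attempt {
    fn drop(&mut self) {
        *self.0.count.lock().unwrap() -= 1;
        self.0.cv.notify_all();
    }
}

/// Blocks until all attempts are done or `timeout` elapses.
/// Returns how many attempts were still running at release time.
pub fn wait_or_timeout(pending: &Pending, timeout: Duration) -> usize {
    let guard = pending.count.lock().unwrap();
    let (guard, _timeout_result) = pending
        .cv
        .wait_timeout_while(guard, timeout, |running| *running > 0)
        .unwrap();
    *guard
}
```

A caller would pass the configured timeout (10s by default, per the description above) and release the background jobs once `wait_or_timeout` returns, regardless of whether the returned count is zero.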