feat: three phased startup order #4399

koivunej · 2023-06-01T17:00:14Z

Problem

Initial logical size calculation could still hinder our fast startup efforts in #4397. See #4183. In the deployment of 2023-06-06, about 200 initial logical sizes were calculated on the hosts which took the longest to complete initial load (12s).

Summary of changes

Implements the three step/tier initialization ordering described in #4397:

1. load local tenants
2. do initial logical sizes per walreceivers for 10s
3. background tasks

Ordering is controlled by:

- waiting on `utils::completion::Barrier`s on background tasks
- having one attempt for each Timeline to do initial logical size calculation
- `pageserver/src/bin/pageserver.rs` releasing background jobs after timeout or completion of initial logical size calculation

The timeout is there just as a safeguard in case the initial logical size calculation of a legitimate, non-broken timeline takes a long time. The timeout is configurable, by default 10s, which I think would be fine for production systems. In the test cases I've been looking at, it seems that these steps are completed as fast as possible.
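For illustration only, the ordering above can be sketched with plain threads and a channel. This is not the pageserver code: `Completion`, `wait_all_dropped` and `startup` are hypothetical names, and the real implementation uses `utils::completion::Barrier`s in async code. The sketch assumes that the size calculation phase counts as finished once every token handed to a worker has been dropped, and that background tasks are released on completion or timeout, whichever comes first.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical completion token: dropping it marks one unit of work as done.
struct Completion {
    _tx: Sender<()>,
}

/// Returns true if every `Completion` was dropped before `timeout` elapsed.
/// Nothing is ever sent on the channel, so the receiver only observes
/// a disconnect (all tokens dropped) or a timeout.
fn wait_all_dropped(rx: &Receiver<()>, timeout: Duration) -> bool {
    matches!(rx.recv_timeout(timeout), Err(RecvTimeoutError::Disconnected))
}

/// Runs the three phases in order and reports whether phase 2 finished in time.
fn startup(size_calcs: Vec<Duration>, timeout: Duration) -> (bool, Vec<&'static str>) {
    // Phase 1: load local tenants (nothing to do in this sketch).
    let mut phases = vec!["load local tenants"];

    // Phase 2: one attempt per "timeline", each holding a token while it works.
    let (tx, rx) = mpsc::channel::<()>();
    let workers: Vec<_> = size_calcs
        .into_iter()
        .map(|work| {
            let token = Completion { _tx: tx.clone() };
            thread::spawn(move || {
                thread::sleep(work); // stands in for one size calculation
                drop(token);
            })
        })
        .collect();
    drop(tx); // from here on, only the workers keep the channel alive

    let all_done = wait_all_dropped(&rx, timeout);
    phases.push("initial logical sizes");

    // Phase 3: background tasks start after completion or timeout.
    phases.push("background tasks");

    for worker in workers {
        worker.join().unwrap();
    }
    (all_done, phases)
}
```

Calling `startup(vec![Duration::from_millis(10); 3], Duration::from_secs(10))` would report that all calculations finished in time, while a calculation slower than the timeout only changes the returned flag and does not block phase 3.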

Checklist before requesting a review

- I have performed a self-review of my code.
- If it is a core feature, I have added thorough tests.
- Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
- If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

On the deployment of 2023-06-01 it was noticed that eviction_task got started during initial activation. This is possible, and perfectly fine, because random_init_delay can produce a low number. This commit fixes it so that the eviction task is also allowed to continue only after the initial load has completed.


koivunej · 2023-06-01T17:26:27Z

Only a few test failures :) (44 -> 1 -> hopefully to 0 with f92d0ae)

github-actions · 2023-06-01T17:27:04Z

1000 tests run: 960 passed, 0 failed, 40 skipped (full report)

Flaky tests (2)

Postgres 15

test_remote_storage_upload_queue_retries[local_fs]: ✅ debug
test_threshold_based_eviction: ✅ debug

The comment gets automatically updated with the latest test results: 76d0ef1 at 2023-06-06T16:58:09.039Z


koivunej · 2023-06-06T15:18:01Z

IMO while `tenant::mgr::TENANTS` is in state `Uninitialized`, we should either fail with some HTTP status that indicates "come back later", or not expose the HTTP API at all.

Could do that on a separate PR, but sounds trivial to delay.
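A minimal sketch of the first option, assuming a two-state lifecycle and assuming that "come back later" maps to 503 Service Unavailable (the comment does not name a status code). `TenantsState` and `tenant_request_status` are hypothetical names for illustration, not the actual `tenant::mgr` types.

```rust
/// Hypothetical mirror of the tenant manager lifecycle discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TenantsState {
    /// Initial tenant load has not finished yet.
    Uninitialized,
    /// Tenants are loaded and requests can be served.
    Open,
}

/// Picks the HTTP status for a tenant-related request.
/// 503 is an assumption of this sketch; 200 stands in for "handle normally".
fn tenant_request_status(state: TenantsState) -> u16 {
    match state {
        TenantsState::Uninitialized => 503,
        TenantsState::Open => 200,
    }
}
```

With this shape, a handler would check the state first and return early while the manager is still `Uninitialized`.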


koivunej · 2023-06-06T16:43:15Z

> IMO while the tenant::mgr::TENANTS is in state Uninitialized, we should fail with some HTTP status that indicates come back later, or, not expose the HTTP API at all.
>
> Could do that on a separate PR, but sounds trivial to delay.

We already delay: `init_tenant_mgr` runs inside a `BACKGROUND_RUNTIME.block_on`, so it blocks after we have bound to the HTTP port. After `init_tenant_mgr`, `TENANTS` is `Open`. Only later do we start hyper+routerify to actually accept new connections, so we are all good already.


koivunej · 2023-06-06T17:13:49Z

The limiters here are the other background tasks (initial compaction) and the eviction task; there is no semaphore limiting these. Then on the next round, metrics collection will have the logical sizes. I don't think there's any problem here.

The problem is that metrics collection is N(=1) tenant at a time right now. So, if there is at least one tenant without an active compute, then this PR will delay background task launch needlessly, until metrics collection does get_current_logical_size() on that tenant's timelines. Due to the N=1, that might take a long time.

Discussed this off-GitHub: metrics collection doesn't wait for anything (except mutexes and the HTTP POST), so all timelines which haven't yet had their logical sizes computed will get them queued up at practically the same time.

problame · 2023-06-06T17:15:47Z

Created follow-up to my comment about tenant::mgr Uninitialized handling: #4433

EDIT: now I read #4399 (comment) so this is not a pressing issue.

koivunej · 2023-06-06T20:33:07Z

> Is the 10s value a random value or based on the reconnect timeout (+ backoff?) of the active computes?

Oops, forgot to answer this. It's just a guess, Harrison-Stetson method. Safekeepers push updates to storage_broker every 1s or so, so we'd have some chances to get these, then to connect to safekeepers, and then to initiate the initial logical size calculations. Hopefully the first seconds spent on tenant loading will allow us to see the first updates on storage_broker, and then the logical size calculations will be waiting on the barrier.

When the 10s runs out, we might not have finished every init size calculation, but at least the tasks are queued up, delaying background tasks "naturally".

koivunej · 2023-06-07T08:18:06Z

I don't think the cloud-e2e timeout is caused by these changes, because the pageservers are not restarted, and background tasks get to start immediately because the pageservers are empty.


koivunej added 3 commits June 1, 2023 16:49

doc: repeat the doc comment in places

f4a8c87

i feel bad that we have to repeat it but ... if it was a "scheduler" then we'd have to still wait on it.

feat: delay background tasks after init logical sizes

05dcf2b

now the initialization sequence or order is:

1. load local tenants
2. do initial logical sizes per walreceivers
3. background tasks

additionally, background tasks are delayed by random init delay.

koivunej requested review from a team as code owners June 1, 2023 17:00

koivunej requested review from lubennikovaav and shanyp and removed request for a team June 1, 2023 17:00

koivunej added 2 commits June 1, 2023 20:24

chore: rustfmt

217d5ad

fix: make background task max delay configurable

a7f7fe9

fix: make initial background job delay cancellable

5f47cd9

koivunej marked this pull request as draft June 1, 2023 17:53

koivunej added 2 commits June 1, 2023 21:00

test: no I was paying attention with PageserverConf

bfe303d

test: lower limit for background_task_maximum_delay

f92d0ae

koivunej commented Jun 1, 2023

pageserver/src/tenant/timeline.rs (outdated)

koivunej added 9 commits June 1, 2023 22:31

fix: introduce the logical sizes can start

aa30c93

refactoring is nasty, because addition of this third parameter was done by making the InitializationOrder mutable, taking out the initial_tenant_load completion "tracker", and then passing on a borrow to InitializationOrder.

fix: make tenant task wait cancellable

9eae42d

fix: make eviction task cancellable

f1132e9

fix: make initial logical size calc wait cancellable

c6ad1c3

fix: make metrics and dube use background_tasks barrier

ab771a1

with cancellation. this was another oversight.

chore: redundant clone

365f983

fix: millis logging

66cd42b

revert: threshold_based_eviction test does not need changes

0814626

earlier I was thinking it needed changes, but it doesn't. it was just my bug which caused the need.

fix: drop the Timeline token on transitions from Active

a62c1c4

just as a safeguard, in case activation breaks tenant or timeline.

koivunej marked this pull request as ready for review June 1, 2023 20:09

koivunej added 2 commits June 1, 2023 23:38

fix: drop completion when transitioning to Stopping or Broken

8722655

test: smaller scope for background_task_maximum_delay

f65507c

took some finding out. initially I remembered that the safekeepers might be down, but it's just the lack of updates, and so of connections, which causes the lack of connecting.

koivunej and others added 6 commits June 6, 2023 17:53

doc: move comment about initial logical sizes can start

a8a5d79

refactor: move scopeguard to be like others

5105594

doc: move comment into PageserverConf

b7f5640

refactor: rename shutdown to shutdown_pageserver

8cd2841

include comment per review. Co-authored-by: Christian Schwarz <[email protected]>

doc: Timeline::init_log_s_a: created/loaded

0b32879

Merge branch 'main' into try_speedup_startup4

a8531b5

koivunej and others added 4 commits June 6, 2023 18:19

doc: test_runner/regress/test_disk_usage_eviction.py

868c413

better comment Co-authored-by: Christian Schwarz <[email protected]>

refactor: completed: bool => Option

4916fc2

chore: unify logging without sentences

d8162a6

doc: better wording

76d0ef1

Co-authored-by: Christian Schwarz <[email protected]>

koivunej requested a review from problame June 6, 2023 16:14

koivunej added a commit that referenced this pull request Jun 6, 2023

refactor: to pattern of await after timeout

d4870c5

as discussed on #4399 (comment)

koivunej mentioned this pull request Jun 6, 2023

refactor: to pattern of await after timeout #4432

Merged

problame approved these changes Jun 6, 2023


problame mentioned this pull request Jun 6, 2023

while tenant mgr is Uninitialized, fail / delay any HTTP requests concerning tenants #4433

Open

koivunej enabled auto-merge (squash) June 6, 2023 18:47

koivunej merged commit 5761190 into main Jun 7, 2023

koivunej deleted the try_speedup_startup4 branch June 7, 2023 11:29

koivunej mentioned this pull request Jun 27, 2023

test: cleanup workaround #4380

Draft

koivunej added a commit that referenced this pull request Jun 28, 2023

refactor: to pattern of await after timeout (#4432)

02ef246

Refactor the `!completed` to be about `Option<_>` instead, side-stepping any boolean true/false or false/true. As discussed on #4399 (comment)
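One plausible reading of this refactor, sketched with std threads instead of the async code in the repository: rather than returning a `completed: bool` whose polarity is easy to invert, the timed wait returns `None` when the work finished in time and `Some(pending)` when the timeout fired first, so the caller can release background tasks and then await the late completion. `wait_with_timeout` and `run` are hypothetical names, and the event strings exist only for this illustration.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Returns `None` if the work finished within `timeout`, or hands back the
/// still-pending receiver as `Some(rx)` if the timeout fired first.
fn wait_with_timeout(rx: Receiver<()>, timeout: Duration) -> Option<Receiver<()>> {
    match rx.recv_timeout(timeout) {
        Err(RecvTimeoutError::Timeout) => Some(rx),
        Ok(()) | Err(RecvTimeoutError::Disconnected) => None,
    }
}

/// Simulates one piece of work taking `work` and a startup willing to wait `timeout`.
fn run(work: Duration, timeout: Duration) -> Vec<&'static str> {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        thread::sleep(work); // stands in for the initial size calculations
        let _ = tx.send(());
    });

    let mut events = Vec::new();
    let pending = wait_with_timeout(rx, timeout);

    // Background tasks are released in both cases.
    events.push("background tasks released");

    // "Await after timeout": only needed when the timed wait gave up early.
    if let Some(rx) = pending {
        let _ = rx.recv();
        events.push("late completion observed");
    }

    worker.join().unwrap();
    events
}
```

The `Option` makes the timed-out case carry the thing that still has to be awaited, so there is no separate flag to keep in sync with it.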

koivunej mentioned this pull request Jul 19, 2023

remote_storage: revisit throttling/ratelimiting #3698

Open

This was referenced Aug 4, 2023

add pageserver SLO for startup performance: tenant load & time-to-active #4083

Open

Excessive tenant initial load times on large pageserver after restart #4025

Closed
