Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: three phased startup order #4399

Merged
merged 36 commits into from
Jun 7, 2023
Merged

feat: three phased startup order #4399

merged 36 commits into from
Jun 7, 2023

Conversation

koivunej
Copy link
Member

@koivunej koivunej commented Jun 1, 2023

Problem

Initial logical size calculation could still hinder our fast startup efforts in #4397. See #4183. In deployment of 2023-06-06 about a 200 initial logical sizes were calculated on hosts which took the longest to complete initial load (12s).

Summary of changes

Implements the three step/tier initialization ordering described in #4397:

  1. load local tenants
  2. do initial logical sizes per walreceivers for 10s
  3. background tasks

Ordering is controlled by:

  • waiting on utils::completion::Barriers on background tasks
  • having one attempt for each Timeline to do initial logical size calculation
  • pageserver/src/bin/pageserver.rs releasing background jobs after timeout or completion of initial logical size calculation

The timeout is there just to safeguard in case a legitimate non-broken timeline initial logical size calculation goes long. The timeout is configurable, by default 10s, which I think would be fine for production systems. In the test cases I've been looking at, it seems that these steps are completed as fast as possible.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

koivunej added 3 commits June 1, 2023 16:49
on the deployment of 2023-06-01 it was noticed that eviction_task got
started during initial activation. this is possible because
random_init_delay can get a low number, and perfectly fine.

this commit fixes it so, that it's also let continue only after initial
load has completed.
i feel bad that we have to repeat it but ... if it was a "scheduler"
then we'd have to still wait on it.
now the initialization sequence or order is:
1. load local tenants
2. do initial logical sizes per walreceivers
3. background tasks

additionally, background tasks are delayed by random init delay.
@koivunej koivunej requested review from a team as code owners June 1, 2023 17:00
@koivunej koivunej requested review from lubennikovaav and shanyp and removed request for a team June 1, 2023 17:00
@koivunej
Copy link
Member Author

koivunej commented Jun 1, 2023

Only a few test failures :) (44 -> 1 -> hopefully to 0 with f92d0ae)

@github-actions
Copy link

github-actions bot commented Jun 1, 2023

1000 tests run: 960 passed, 0 failed, 40 skipped (full report)


Flaky tests (2)

Postgres 15

  • test_remote_storage_upload_queue_retries[local_fs]: ✅ debug
  • test_threshold_based_eviction: ✅ debug
The comment gets automatically updated with the latest test results
76d0ef1 at 2023-06-06T16:58:09.039Z :recycle:

@koivunej koivunej marked this pull request as draft June 1, 2023 17:53
koivunej added 9 commits June 1, 2023 22:31
refactoring is nasty, because addition of this third parameter was done
by making the InitializationOrder mutable, taking out the
initial_tenant_load completion "tracker", and then passing on a borrow
to InitializationOrder.
with cancellation. this was another oversight.
earlier I was thinking it needed changes, but it no. it was just my bug
which caused the need.
just as a safeguard, in case activation breaks tenant or timeline.
@koivunej koivunej marked this pull request as ready for review June 1, 2023 20:09
koivunej added 2 commits June 1, 2023 23:38
took some finding out, initially remembered the safekeepers might be
down, but it's just the lack of updates and so connections which causes
lack of connecting.
@koivunej
Copy link
Member Author

koivunej commented Jun 6, 2023

IMO while the tenant::mgr::TENANTS is in state Uninitialized, we should fail with some HTTP status that indicates come back later, or, not expose the HTTP API at all.

Could do that on a separate PR, but sounds trivial to delay.

@koivunej koivunej requested a review from problame June 6, 2023 16:14
@koivunej
Copy link
Member Author

koivunej commented Jun 6, 2023

IMO while the tenant::mgr::TENANTS is in state Uninitialized, we should fail with some HTTP status that indicates come back later, or, not expose the HTTP API at all.

Could do that on a separate PR, but sounds trivial to delay.

We already delay, because the init_tenant_mgr is done inside a BACKGROUND_RUNTIME.block_on so it will block after having bound to the http port. After init_tenant_mgr the TENANTS is Open. Later we will start the hyper+routerify to actually accept new connections, so we are all good already.

koivunej added a commit that referenced this pull request Jun 6, 2023
@koivunej
Copy link
Member Author

koivunej commented Jun 6, 2023

The limiters to this is other background tasks (initial compaction), eviction task, there is no semaphore for limiting these. Then on next round metrics collection will have the logical sizes. I don't think there's any problem here.

The problem is that metrics collection is N(=1) tenant at a time right now. So, if there is at least one tenant without an active compute, then this PR will delay background task launch needlessly, until metrics collection does get_current_logical_size() on that tenant's timelines. Due to the N=1, that might take a long time.

Discussed this off-github, metrics collection doesn't wait for anything (except mutexes, http POST), so all timelines which haven't yet had their logical sizes computed will get them queued up practically at the same time.

@problame
Copy link
Contributor

problame commented Jun 6, 2023

Created follow-up to my comment about tenant::mgr Uninitialized handling: #4433

EDIT: now I read #4399 (comment) so this is not a pressing issue.

@koivunej koivunej enabled auto-merge (squash) June 6, 2023 18:47
@koivunej
Copy link
Member Author

koivunej commented Jun 6, 2023

Is the 10s value a random value or based on the reconnect timeout (+ backoff?) of the active computes?

Oops forgot to answer to this. It's just a guess, Harrison-Stetson method. Safekeepers push updates to storage_broker every 1s or so, so we'd have some chances to get these, and then to connect to safekeepers, and then to initiate the init logical size calculations. Hopefully the first seconds doing the tenant loading will allow us to see the first updates on storage_broker, and then the logical sizes will be awaiting on the barrier.

When the 10s runs out, we might not have finished every init size calculation, but at least the tasks are queued up, delaying background tasks "naturally".

@koivunej
Copy link
Member Author

koivunej commented Jun 7, 2023

I don't think the cloud-e2e timeout is caused by the changes, because the pageservers are not restarted, and background tasks get to start immediatedly because the pageservers are empty.

@koivunej koivunej merged commit 5761190 into main Jun 7, 2023
@koivunej koivunej deleted the try_speedup_startup4 branch June 7, 2023 11:29
awestover pushed a commit that referenced this pull request Jun 14, 2023
Initial logical size calculation could still hinder our fast startup
efforts in #4397. See #4183. In deployment of 2023-06-06
about a 200 initial logical sizes were calculated on hosts which
took the longest to complete initial load (12s).

Implements the three step/tier initialization ordering described in
#4397:
1. load local tenants
2. do initial logical sizes per walreceivers for 10s
3. background tasks

Ordering is controlled by:
- waiting on `utils::completion::Barrier`s on background tasks
- having one attempt for each Timeline to do initial logical size
calculation
- `pageserver/src/bin/pageserver.rs` releasing background jobs after
timeout or completion of initial logical size calculation

The timeout is there just to safeguard in case a legitimate non-broken
timeline initial logical size calculation goes long. The timeout is
configurable, by default 10s, which I think would be fine for production
systems. In the test cases I've been looking at, it seems that these
steps are completed as fast as possible.

Co-authored-by: Christian Schwarz <[email protected]>
koivunej added a commit that referenced this pull request Jun 28, 2023
Refactor the `!completed` to be about `Option<_>` instead, side-stepping
any boolean true/false or false/true. As discussed on
#4399 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants