Add neon.primary_is_running GUC. #6705
2478 tests run: 2357 passed, 0 failed, 121 skipped (full report)
Flaky tests (2)
Postgres 14
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results
1908692 at 2024-02-23T13:29:22.947Z
For the sake of the archives, please explain the problem in the commit message too, and perhaps open a GitHub issue to track it. IIUC the problem is that:
A hot standby waits for a RUNNING_XACTS or shutdown checkpoint record before it allows connections. If the primary is not running, those records will never arrive. This PR adds a GUC to tell the standby that there is no primary running. It doesn't actually do anything with it. Will that be a follow-up PR? Even with this flag, this can still happen:
So just checking that a primary is running when the standby starts isn't necessarily enough. A GUC doesn't feel like a very good mechanism to deliver this information. Whether a primary is running or not is very ephemeral. Perhaps the neon extension should ask the control plane for that directly with an API call?
A hot-standby replica waits for the primary only if it was not started after a normal shutdown (wasShutdown is true in xlogrecovery.c). This behaviour was changed only by my PR #6357. I am also going to handle the added GUC in this PR.
The primary has to be restarted in any case. Actually, the replica has no choice other than to wait for the primary once it starts recovery, that is, receiving WAL from the primary.
I do not know a better solution.
The code making decision whether to set
I do not understand why sending a request to the control plane is better than sending this information through the compute spec and then a GUC. Also, an extra round-trip doesn't speed up startup.
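To make the startup behaviour discussed above concrete, here is a minimal Python sketch of the decision. It is only a toy model, not the code in xlogrecovery.c: the names `StandbyAction` and `standby_startup_action` are hypothetical, and it assumes the standby learns whether the primary is running from a single boolean, as the proposed neon.primary_is_running GUC would provide.

```python
from enum import Enum


class StandbyAction(Enum):
    """Hypothetical outcomes of hot-standby startup (illustration only)."""

    WAIT_FOR_RUNNING_XACTS = "wait for a RUNNING_XACTS record from the primary"
    ACCEPT_CONNECTIONS = "treat startup as after a clean shutdown and accept connections"


def standby_startup_action(primary_is_running: bool) -> StandbyAction:
    """Toy model of the decision described in this thread.

    Assumption: if the primary is running, a RUNNING_XACTS record will
    eventually arrive, so the standby can wait for it. If no primary is
    running, that record never comes, so the standby behaves as if it
    started from a shutdown checkpoint (wasShutdown = true).
    """
    if primary_is_running:
        return StandbyAction.WAIT_FOR_RUNNING_XACTS
    return StandbyAction.ACCEPT_CONNECTIONS


if __name__ == "__main__":
    for running in (True, False):
        print(f"primary_is_running={running}: {standby_startup_action(running).value}")
```

Note that this model only captures the state at startup; it does not address the case where the primary stops or starts after the standby has made its decision.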
Hmm, looks like we can get into a deadlock here. Imagine this situation:
Are we OK with this?
Until #6712 is merged, we don't actually write the shutdown checkpoint record even on a clean shutdown, so you'll get the above with a clean shutdown too.
I see. Can you include those changes here, please, so that this can be reviewed, tested and committed as one unit? Needs tests.
So we currently have this bug:
Please create a Python test case for that. This PR should then fix it.
There is such a test, test_runner/regress/test_replication_start.py, in PR #6357:
I can shuffle changes between PRs as you prefer. So #6357 solves the problem of spawning a replica while the primary is alive, and this PR (in conjunction with the corresponding PR for the control plane) solves the problem of spawning a replica while the primary is not alive.
Thank you. The general rule should be: one bug, one PR. PR includes all code changes required to fix the bug, and a test to reproduce it, but nothing else.
The point of this PR is to fix that bug, right?
According to the description of #6357, it is about propagating apply_lsn from safekeepers to pageservers, to postpone GC. That seems completely unrelated to this.
Gotcha.
Yes, it does. We need regression tests for these things. If we go with a GUC, it's easy to set the GUC in a test to simulate how the control plane would set it. To summarize, for this PR:
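As a rough illustration of the point about tests, a test could render the setting the same way the control plane would before starting the replica. The sketch below rests on assumptions: `replica_config_lines` is a hypothetical helper, not part of the neon test fixtures, and it assumes neon.primary_is_running is a boolean GUC.

```python
def replica_config_lines(primary_is_running: bool) -> list[str]:
    """Build postgresql.conf-style lines for a hot-standby replica.

    Hypothetical helper: it simulates what the control plane might write
    into the replica's configuration. neon.primary_is_running is assumed
    to be a boolean GUC, written as on/off.
    """
    value = "on" if primary_is_running else "off"
    return [
        "hot_standby = on",
        f"neon.primary_is_running = {value}",
    ]


if __name__ == "__main__":
    # Simulate starting a replica while no primary is running.
    print("\n".join(replica_config_lines(primary_is_running=False)))
```

A regression test could then start the replica with `primary_is_running=False` and check that it accepts connections without a primary ever being started.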
Done
e2e tests will not pass before https://github.com/neondatabase/cloud/pull/10183 is merged
…mary is not alive (#364)
* Set wasShutdown=true during hot-standby replica startup only when primary is not alive
* Report fatal error if hot standby replica is started with oldestActiveXid=0
Postgres part of neondatabase/neon#6705
Co-authored-by: Konstantin Knizhnik
…mary is not alive (#363)
* Set wasShutdown=true during hot-standby replica startup only when primary is not alive
* Report fatal error if hot standby replica is started with oldestActiveXid=0
Postgres part of neondatabase/neon#6705
Co-authored-by: Konstantin Knizhnik
…mary is not alive (#365)
* Set wasShutdown=true during hot-standby replica startup only when primary is not alive
* Report fatal error if hot standby replica is started with oldestActiveXid=0
Postgres part of neondatabase/neon#6705
Co-authored-by: Konstantin Knizhnik
Co-authored-by: Heikki Linnakangas
140a02a to 5bc9c53
We set it for neon replica, if primary is running
Co-authored-by: Heikki Linnakangas
795d162 to 05aa7f3
05aa7f3 to 1908692
ref #7204
Corresponding cloud PR is https://github.com/neondatabase/cloud/pull/10183