Use postgres protocol for postgres <-> walkeeper
communication
#366
Conversation
I've now added the walproposer modifications in postgres -- that's sitting in neondatabase/postgres#60. Still untested. EDIT: It built fine on my machine - I'll have to figure out what it's missing in CI. Meanwhile... here's a summary of what's new :)

Things making this implementation more complicated than before

Changing the protocol to work over libpq fundamentally only directly required a few changes:
The extra work comes in because the operations now had many different ways they could be asynchronous -- instead of just using
Basically, the logic is significantly more complicated.

The new implementation

To handle the complexity from above, there are now more pieces to a state! Instead of just the previous
So the new polling function is then roughly:

```
def AdvancePollState(walkeeper, events):
    while "first iteration" or walkeeper.pollState == "no waiting required":
        AssertEventsMatchesExpected(events, walkeeper.sockWaitState)

        # The first big switch statement
        if requiresIOpolling(walkeeper.pollState):
            DoPollingLogic(walkeeper)
            if "still requires IO polling":
                return

        # Labeled as "ExecuteNextProtocolState"
        #
        # Also internally a big switch statement, handling the "starting" logic for every state.
        DoStateLogic(walkeeper)

        events = None  # only one event actually occurred
```

Further improvements / Open questions

There's currently what feels like a lot of repetition inside

Also: assuming that the structure of
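The loop above can be modeled as a small runnable state machine. This is a toy sketch only: the states (`connecting`, `reading`, `flushing`, `idle`), the `io_pending` flag, and the transitions are invented for illustration and do not mirror the real walproposer states.

```python
# Toy model of the two-switch polling loop sketched above.
# All state names and transitions are illustrative.

class Walkeeper:
    def __init__(self):
        self.poll_state = "connecting"  # current protocol state
        self.io_pending = True          # whether IO is still outstanding

def requires_io_polling(state):
    return state in ("connecting", "reading", "flushing")

def do_polling_logic(wk):
    # First big switch: finish the outstanding IO for the current state.
    # In this toy, IO always completes immediately.
    wk.io_pending = False

def do_state_logic(wk):
    # "ExecuteNextProtocolState": advance to the next protocol state and
    # mark whether it needs IO before it can proceed.
    transitions = {"connecting": "reading", "reading": "flushing",
                   "flushing": "idle"}
    wk.poll_state = transitions[wk.poll_state]
    wk.io_pending = wk.poll_state != "idle"

def advance_poll_state(wk, events):
    first = True
    while first or not wk.io_pending:
        first = False
        if wk.poll_state == "idle":
            return
        if requires_io_polling(wk.poll_state):
            do_polling_logic(wk)
            if wk.io_pending:
                return  # still waiting on IO; resume on the next event
        do_state_logic(wk)
        events = None  # only one event actually occurred
```

Each call either makes progress through states or returns early while IO is outstanding, which is the core property the two-switch design is after.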
Woo-hoo! Congrats on passing the current test suite. I tried to run the script https://gist.github.com/kelvich/706757e001012bab0605c01e1b5ea1c6 on this branch and got the same issues that were fixed a few days ago in the main branch. Which is expected, given that this branch (and the corresponding postgres branch) was not rebased on main. I'm +1 on committing this in its current form after a rebase, and addressing refactoring in follow-up PRs.
The remaining failures are on the following tests:
As far as I can tell, these are unrelated to this PR -- @kelvich, if these look like the same errors to you, perhaps this is ready to be merged anyway? Edit: Looks like CI re-ran when I changed this PR to no longer be a draft, and the errors happened to not occur this time.
Yes, I think it can be merged after squashing commits. Tests are also ok here. Just to be clear, I still strongly feel the C part ought to be significantly refactored to be simpler, but let's merge for the sake of working on #315 and other things.
Most of the work here was done on the postgres side. There's more information in the commit message there. (See: neondatabase/postgres@04cfa32.)

On the WAL acceptor side, we're now expecting 'START_WAL_PUSH' to initialize the WAL keeper protocol. Everything else is mostly the same, with the only real difference being that protocol messages are now discrete CopyData messages sent over the postgres protocol. For the sake of documentation, the full set of these messages is:

    <- recv: START_WAL_PUSH query
    <- recv: server info from postgres (type `ServerInfo`)
    -> send: walkeeper info (type `SafeKeeperInfo`)
    <- recv: vote info (type `RequestVote`)
       if node id mismatch:
           -> send: self node id (type `NodeId`); exit
    -> send: confirm vote (with node id) (type `NodeId`)
    loop:
        <- recv: info and maybe WAL block (type `SafeKeeperRequest` + bytes)
        (break loop if done)
        -> send: confirm receipt (type `SafeKeeperResponse`)
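Since each protocol message now travels as a single CopyData packet, the framing itself is simple: a type byte `'d'`, then a 4-byte big-endian length that counts itself plus the payload. Here's a hedged sketch of that wire format; the payload layouts (`ServerInfo`, `SafeKeeperInfo`, and so on) are not shown and the function names are invented for illustration.

```python
import struct

def frame_copydata(payload: bytes) -> bytes:
    # CopyData packet: 'd', then Int32 length covering the length
    # field itself (4 bytes) plus the payload.
    return b"d" + struct.pack(">I", len(payload) + 4) + payload

def unframe_copydata(buf: bytes) -> bytes:
    # Inverse of the above: check the type byte, read the length,
    # and return just the payload.
    assert buf[0:1] == b"d", "not a CopyData message"
    (length,) = struct.unpack(">I", buf[1:5])
    return buf[5:1 + length]
```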
Force-pushed from eddcb09 to ccb099d
Single failure in CI - this seems unrelated to the PR, but I'm not sure. If it is related, there's something to do with the change that's causing a WAL acceptor to have exited earlier than expected.
Actually this seems to just be the same issue that #417 is addressing. Edit to add: in light of this, going to go ahead with merging.
On 13 August 2021 19:53:28 EEST, Max Sharnoff ***@***.***> wrote:
Single failure in CI - this *seems* unrelated to the PR, but I'm not sure. If it is related, there's something to do with the change that's causing a WAL acceptor to have exited earlier than expected.
```
=================================== FAILURES ===================================
________________________________ test_restarts _________________________________
batch_others/test_wal_acceptor.py:88: in test_restarts
failed_node.stop()
fixtures/zenith_fixtures.py:550: in stop
pid = read_pid(pidfile_path)
fixtures/zenith_fixtures.py:516: in read_pid
return int(Path(path).read_text())
E ValueError: invalid literal for int() with base 10: ''
```
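The traceback shows `read_pid` doing a bare `int()` on the pidfile contents, which raises `ValueError` when the file exists but is still empty. Here's a hedged sketch of a more tolerant variant -- the function name comes from the traceback, but the retry behavior and its parameters are invented for illustration, not taken from the real fixtures:

```python
import time
from pathlib import Path

def read_pid(path, retries=5, delay=0.2):
    # Retry briefly if the pidfile exists but hasn't been written yet,
    # instead of failing on int('') immediately. Retry count and delay
    # are illustrative.
    for _ in range(retries):
        text = Path(path).read_text().strip()
        if text:
            return int(text)
        time.sleep(delay)
    raise RuntimeError(f"pidfile {path} stayed empty after {retries} reads")
```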
Sounds like #417
- Heikki
This PR will encompass the first of two changes for #117.
Note: This is a work-in-progress; the postgres-side hasn't been updated yet, and so it is expected to not work yet.
Notes on messages:
Because we're now using `CopyData` messages, there are now distinct boundaries for messages where there weren't before. What this means is that it's worth clarifying where those boundaries are -- here for now, and then officially within `walkeeper/README_PROTO.md` or somewhere else when they're decided upon.

The messages are, from `walkeeper`'s perspective:
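However the exact message set shakes out, the boundaries themselves fall out of the CopyData framing. Here's a hedged sketch of how a receiver can split a buffered byte stream back into discrete messages; this is wire-format handling only and is illustrative, not the walkeeper's actual parsing code.

```python
import struct

def split_copydata_stream(buf: bytes):
    # Walk ('d', Int32 length) frames; the length field counts itself
    # (4 bytes) plus the payload. Returns the complete messages plus
    # any unconsumed trailing bytes (an incomplete frame).
    msgs, pos = [], 0
    while pos + 5 <= len(buf):
        assert buf[pos:pos + 1] == b"d", "not a CopyData message"
        (length,) = struct.unpack(">I", buf[pos + 1:pos + 5])
        end = pos + 1 + length
        if end > len(buf):
            break  # incomplete trailing frame; wait for more bytes
        msgs.append(buf[pos + 5:end])
        pos = end
    return msgs, buf[pos:]
```

This is the property the comment is after: the receiver never has to guess where one protocol message ends and the next begins.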