CI failure in test_restart_compute test #409

Closed
hlinnaka opened this issue Aug 11, 2021 · 6 comments
Labels
c/storage/safekeeper Component: storage: safekeeper

Comments

@hlinnaka

This happened as part of PR #405, but seems unrelated to the code changes:

https://app.circleci.com/pipelines/github/zenithdb/zenith/1098/workflows/a752b1e3-2771-405d-92d4-6231d002357d/jobs/5938

=================================== FAILURES ===================================
__________________________ test_restart_compute[True] __________________________
batch_others/test_restart_compute.py:57: in test_restart_compute
    with closing(pg.connect()) as conn:
fixtures/zenith_fixtures.py:106: in connect
    conn = psycopg2.connect(self.connstr(**kwargs))
../../.local/share/virtualenvs/test_runner-sNYEMCQ_/lib/python3.8/site-packages/psycopg2/__init__.py:122: in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
E   psycopg2.OperationalError: server closed the connection unexpectedly
E   	This probably means the server terminated abnormally
E   	before or while processing the request.
@hlinnaka

@knizhnik, is this the same as, or related to, issue #395?

@kelvich

kelvich commented Aug 11, 2021

It looks like this is caused by this morning's fixes to #395. @arssher is investigating the problem itself; meanwhile, we decided to roll back those commits and commit a fixed version later.

@knizhnik

> @knizhnik, is this the same as, or related to, issue #395?

The fixes I made this morning seem to be very small and obvious:

  1. Align the prev LSN to 8 bytes
  2. Update the walkeeper state not only on disk, but also in memory

I cannot reproduce the problem locally. Maybe the problem is caused by an outdated version of the postgres submodule...
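
To illustrate the first fix, here is a minimal sketch of rounding an LSN up to an 8-byte boundary. It assumes an LSN is a plain byte offset and that the alignment rounds up; the thread does not say which direction the actual fix uses. The name `align_lsn` is hypothetical and is not taken from the zenith code.

```python
def align_lsn(lsn: int, alignment: int = 8) -> int:
    """Round lsn up to the next multiple of alignment.

    Assumption: alignment is a power of two (8 in the fix described above).
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Add alignment - 1, then clear the low bits to land on a boundary.
    return (lsn + alignment - 1) & ~(alignment - 1)


if __name__ == "__main__":
    for lsn in (0, 1, 8, 13):
        print(hex(lsn), "->", hex(align_lsn(lsn)))
```

An already aligned value is returned unchanged, so applying the function twice gives the same result as applying it once.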

@arssher arssher added the c/storage/safekeeper Component: storage: safekeeper label Aug 12, 2021
@arssher

arssher commented Aug 12, 2021

The failure is due to a "could not receive data from WAL stream" error in walproposer during recovery, which forces it to restart over and over while trying to fetch WAL from the most advanced safekeeper (so WAL doesn't reach the pageserver => get_page_at_lsn hangs => auth fails after a timeout). I suspect this is caused by incorrect COPY protocol termination in replication.rs (there should be a dance like CopyDone followed by one or two CommandCompletes, which we don't do), but I'm not exactly sure.
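
For reference, the termination sequence mentioned above can be sketched at the byte level. This is only an illustration of the PostgreSQL wire framing (a one-byte message type, a big-endian 32-bit length that counts itself, then the payload), not the code in replication.rs. The helper names and the "COPY 0" command tag are assumptions chosen for the example.

```python
import struct


def frame(msg_type: bytes, payload: bytes = b"") -> bytes:
    """Frame one backend message: type byte, int32 length (includes itself), payload."""
    return msg_type + struct.pack("!I", len(payload) + 4) + payload


def copy_done() -> bytes:
    """CopyDone ('c') carries no payload."""
    return frame(b"c")


def command_complete(tag: str) -> bytes:
    """CommandComplete ('C') carries a null-terminated command tag."""
    return frame(b"C", tag.encode("ascii") + b"\x00")


def copy_termination(tag: str = "COPY 0") -> bytes:
    """CopyDone followed by a single CommandComplete (tag is illustrative)."""
    return copy_done() + command_complete(tag)


if __name__ == "__main__":
    print(copy_termination().hex())
```

A receiver that sees the stream end without these messages cannot tell a clean end of COPY mode from a dropped connection, which would be consistent with the error described here.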

Rolling back 0ee2e16 yesterday was quite a stupid action on my part; that commit has nothing to do with this (#409) problem. It is a floating bug, and it is just a coincidence that the test passed on my laptop and in CI after the revert. Anyway, 1) 0ee2e16 is itself incorrect, and 2) the whole thing is rewritten in #315. Today I can't reproduce "could not receive data from WAL stream" either. I think it is better now to get back to the plan of merging #366, then #315, and then reconsidering this; otherwise we are kind of searching for new bugs among known ones.

@kelvich

kelvich commented Aug 12, 2021

> I think it is better now to get back to the plan of merging #366, then #315 and then reconsidering this, otherwise we are kind of searching for new bugs among known ones.

+1

@kelvich

kelvich commented Sep 2, 2021

The main ticket for related activities is #395; closing as a duplicate.

@kelvich kelvich closed this as completed Sep 2, 2021