Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

try to be more intelligent in WalAcceptor.stop #417

Merged
merged 1 commit into from
Aug 16, 2021

Conversation

LizardWizzard
Copy link
Contributor

Original error reported by @hlinnaka:

=================================== FAILURES ===================================
________________________________ test_restarts _________________________________
batch_others/test_wal_acceptor.py:88: in test_restarts
    failed_node.stop()
fixtures/zenith_fixtures.py:550: in stop
    pid = read_pid(pidfile_path)
fixtures/zenith_fixtures.py:516: in read_pid
    return int(Path(path).read_text())
E   ValueError: invalid literal for int() with base 10: ''

This can occur when pidfile exists but actual pid value is not yet written to it. This PR attempts to detect this error, wait for half a second and try to read pid value again. I haven't been able to reproduce it locally despite running this test for ~120 times.

Also, I've added a bunch of typing sugar to wal acceptor fixtures

@arssher
Copy link
Contributor

arssher commented Aug 12, 2021

If this is indeed a problem as it seems to be, I'd better wait in WalAcceptor->start for the pid file actually become valid (and remove it before the start if it exists -- this doesn't protect us from concurrent start, but should be enough for test purposes).

Though probability of it on not-CPU-hungry machine must be really small -- it means safekeeper couldn't fill pid while test driver completely performed INSERT.

@LizardWizzard
Copy link
Contributor Author

@arssher Maybe even better solution is to use IDENTIFY_SYSTEM query to wal acceptor as a readiness criteria, what do you think? So in wal acceptor start, wait for connection and IDENTIFY_SYSTEM query to succeed before going further

@arssher
Copy link
Contributor

arssher commented Aug 12, 2021

Such approach is an option, but 1) we don't necessarily know ztimelineid on the moment of acceptor start 2) tests don't actually need "ready" acceptor, so sounds more complicated than needed, but possible.

@LizardWizzard LizardWizzard force-pushed the feature/heal-test-restarts branch from fc10714 to 4470780 Compare August 13, 2021 09:47
@LizardWizzard
Copy link
Contributor Author

I decided to go with pid checking, general health checking mechanism is a good thing on its own, but it is not needed yet

@arssher
Copy link
Contributor

arssher commented Aug 13, 2021

Without removing pid file before start we risk going through the same race (reaching stop() before pid was written to the file), try to stop safekeeper using obsolete pid and then leaving it to prowl around. Starting another one later will fail to lock the control file.

Ideally safekeeper should cleanup pidfile on its own, but while we don't have this please add the line removing the pidfile in start and feel free to merge this.

@LizardWizzard LizardWizzard force-pushed the feature/heal-test-restarts branch from 73f9fa9 to c4e93fc Compare August 16, 2021 10:11
@LizardWizzard LizardWizzard force-pushed the feature/heal-test-restarts branch from c4e93fc to 27079fe Compare August 16, 2021 11:02
@LizardWizzard LizardWizzard merged commit 0c4ab80 into main Aug 16, 2021
@LizardWizzard LizardWizzard deleted the feature/heal-test-restarts branch August 16, 2021 11:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants