Integration tests should sometimes cleanly shut nodes down #35585

Open
DaveCTurner opened this issue Nov 15, 2018 · 13 comments
Labels
:Delivery/Build, >enhancement, Team:Delivery

Comments

@DaveCTurner
Contributor

Today in our real-node integration tests we shut nodes down with kill -9 $PID, i.e. sending SIGKILL to the process, which shuts it down immediately. Although it is very important to test the behaviour of nodes shutting down like this, I think the fact that we always use SIGKILL means we're failing to properly cover the normal shutdown procedure that happens in response to a less aggressive signal.

I think we should also send SIGTERM in these tests.
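For reference, the two shutdown paths look roughly like this at the shell level (a minimal sketch; `$PID` is assumed to hold the node's process id):

```sh
# Hard stop: the JVM gets no chance to run its normal shutdown procedure;
# the process is terminated immediately.
kill -KILL "$PID"    # equivalent to kill -9 "$PID"

# Graceful stop: the node is asked to shut down and runs its normal
# shutdown procedure before exiting.
kill -TERM "$PID"    # equivalent to a plain kill "$PID"
```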

@DaveCTurner added the :Delivery/Build and team-discuss labels Nov 15, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@alpar-t
Contributor

alpar-t commented Nov 15, 2018

Test clusters will terminate with TERM by default.
(I just wrote the code for it and will have a PR shortly.)

@alpar-t
Contributor

alpar-t commented Nov 21, 2018

After discussing with the team, we reached the conclusion that we do want to have tests that use TERM at some point. When the tests are ported to the testclusters plugin this will be easier to do, so we want to delay this until then.
We kill with KILL by default in the interest of time, because we are not interested in what happens after the test.

@alpar-t
Contributor

alpar-t commented Sep 5, 2019

@DaveCTurner do you think we should send TERM for a specific list of tests, or just implement some tests specifically for this?

@DaveCTurner
Contributor Author

Do we have access to deterministic randomness at the point where we're making the choice? I think all tests should work correctly whether the nodes are stopped with TERM or KILL (or a mixture), so the best approach would be to choose randomly on each test, weighting KILL more heavily since we prefer it for speed reasons.
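As a sketch of what that could look like (purely illustrative; `TEST_SEED` and `PID` are placeholders assumed to be supplied by the test harness):

```sh
# Illustrative only: choose a shutdown signal per test from a reproducible
# seed, weighting KILL more heavily because it is faster.
SEED="${TEST_SEED:-12345}"   # assumed to come from the test framework's seed
RANDOM="$SEED"               # bash re-seeds $RANDOM when you assign to it
if (( RANDOM % 10 < 8 )); then
  SIGNAL=KILL                # ~80%: fast, covers the abrupt-shutdown path
else
  SIGNAL=TERM                # ~20%: covers the clean-shutdown path
fi
kill -s "$SIGNAL" "$PID"
```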

@jasontedor
Member

I would prefer that we not introduce that kind of randomness into the build. We used to have that kind of randomness all over the place in the build (which GC to use, which JVM flags to use, etc.) and it made debugging a nightmare.

What is the plan for "testing" that we work correctly with SIGTERM? We don't want the build to wait forever, but if we introduce a timeout, then we are adding another vector for flaky tests where we end up failing the build because the process didn't shut down quickly enough after receiving SIGTERM. Are there post-assertions that we would be making, or only that the process eventually shut down?

@DaveCTurner
Contributor Author

DaveCTurner commented Sep 5, 2019

The thing I think we should be checking is that node(s) that were shut down with SIGTERM come up and join the cluster again: particularly full-cluster-restart and rolling-upgrade. I think we don't currently have any that restart a cluster without upgrading, but if we did then the same thing would apply there too.

Sorry, this was not clear from the OP. We don't really care about the behaviour of a node that shuts down never to start again, so SIGKILL is fine for them.

I don't think failing to shut down within a timeout should be considered a failure; in fact I think we should also be checking that a SIGTERM followed by a SIGKILL results in correct behaviour on a restart.
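A minimal sketch of the kind of check meant here, using the cluster health API's wait_for_nodes parameter after restarting a node that was stopped with SIGTERM (`EXPECTED_NODES` is a placeholder for the cluster size, and the default HTTP port is assumed):

```sh
# Illustrative only: after restarting the node, wait for it to rejoin the
# cluster; fail if the expected node count is not reached within the timeout.
curl -fsS "http://localhost:9200/_cluster/health?wait_for_nodes=${EXPECTED_NODES}&timeout=60s" \
  || { echo "node did not rejoin the cluster after a clean shutdown"; exit 1; }
```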

@alpar-t
Contributor

alpar-t commented Sep 5, 2019

@DaveCTurner all the bwc tests have a variant where they go from the current version to the current version, with no actual upgrade in between.

It would be nice to have focused tests. Correct me if I'm wrong, but it would seem that in this case the upgrade might be relevant, while the tests run against the nodes are not really relevant, since we are just looking to make sure the cluster is formed. Full cluster restart vs. rolling doesn't seem particularly relevant either.

Thus we could have a bwc test similar to full cluster restart that kills nodes with TERM instead of KILL and runs a single test to make sure the cluster is formed.

@DaveCTurner
Contributor Author

Ah I'm pleased to hear that we're running CURRENT -> CURRENT non-upgrade upgrades, thanks for that.

I'm on the fence here. I think we want to be checking a lot more than just the cluster is formed. We run quite a lot of code in reaction to a SIGTERM, some of which makes changes on disk. I think we shouldn't be skipping over this in all real-node tests. But on the other hand we are shutting nodes down cleanly in EsIntegTestCase-based tests so maybe the gap isn't so big.

@rjernst
Member

rjernst commented Sep 11, 2019

Thus we could have a bwc test similar to full cluster restart that kills nodes with TERM instead of KILL and runs a single test to make sure the cluster is formed.

We could do this similarly to how we alter the distribution tested for qa projects: have a setting on the testcluster that changes how the node is shut down, and change that in a dedicated CI job for the bwc tests. I would be happy with this if we only tested CURRENT -> CURRENT for the alternate shutdown method, so the dedicated CI job should run fast.

@alpar-t
Contributor

alpar-t commented Sep 12, 2019

Similarly to my take in #46378, I think these should be dedicated tasks for the projects where it makes sense. They can be tied to bwcTest only so they only get executed in the bwc test matrix and not in intake and PR builds.

I think it should be fairly obvious from the build code what all the cases we cover are, allowing CI to be dumber in this respect.

The way I see it, we should have "axes" in CI that are applicable to all tests. These are things like JDK version, the OS we are running on, additional JVM arguments, and running with or without dedicated master nodes. And we should have something that expresses how often something gets run, the latter being expressed in the build model, similarly to how we run check on PR and intake builds and bwcTest less often.
Any setup beyond this, like the fact that we have to run bwcTest in a matrix job for performance reasons, is an exception to the rule, and we need to be very careful to do it only if absolutely necessary; making sure we don't optimize early is key in this regard.
Also, these optimizations should be done in a way in which most people don't have to care about them. An example here is how we do the checkPart1 and checkPart2 split automatically. Bwc builds are not a good example, because one has to take care to set up the right tasks for things to be picked up in CI, and we should fix that with a "bwc test" plugin that abstracts that away.

@rjernst
Member

rjernst commented Sep 12, 2019

I would be fine with a dedicated test for SIGTERM behavior (or SIGKILL behavior, since really we expect in the normal case that a node is shut down cleanly before restarting). But then it should be a test on its own, asserting things we expect to have happened in e.g. the SIGKILL case (like: did we properly fsync at the right time?). If we want dedicated tests, they should be dedicated, not duplicates of every test we have that serves various purposes unrelated to the things we are trying to check.

@rjernst added the Team:Core/Infra label May 4, 2020
@droberts195
Contributor

in fact I think we should also be checking that a SIGTERM followed by a SIGKILL results in correct behaviour on a restart.

Yes, definitely, because this is exactly how nodes are shut down in cloud: first a SIGTERM is sent, but after a timeout a SIGKILL is sent if ES is still running. I believe the timeout has been changed over the years. Ideally the tests should use the same timeout that cloud is using or a shorter one to catch the cases where killing part way through graceful termination causes a problem that would affect cloud users if not found in testing.
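In shell terms, the shutdown sequence described here is roughly the following (a sketch only; `GRACE_SECONDS` stands in for whatever timeout cloud actually uses, which as noted has changed over the years):

```sh
# Illustrative only: SIGTERM first, then SIGKILL if the node is still running
# after the grace period has elapsed.
kill -TERM "$PID"
for _ in $(seq 1 "$GRACE_SECONDS"); do
  kill -0 "$PID" 2>/dev/null || exit 0   # the node exited cleanly in time
  sleep 1
done
kill -KILL "$PID"                        # still running: terminate it hard
```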

@mark-vieira added the Team:Delivery label and removed the Team:Core/Infra label Nov 11, 2020