Integration tests should sometimes cleanly shut nodes down #35585

Open
DaveCTurner opened this issue Nov 15, 2018 · 13 comments
Labels
:Delivery/Build, >enhancement, Team:Delivery

Comments

@DaveCTurner
Contributor

Today in our real-node integration tests we shut nodes down with kill -9 $PID, i.e. sending SIGKILL to the process, which shuts it down immediately. Although it is very important to test the behaviour of nodes shutting down like this, I think the fact that we always use SIGKILL means we're failing to properly cover the normal shutdown procedure that happens in response to a less aggressive signal.

I think we should also send SIGTERM in these tests.
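For reference, the two shutdown paths look roughly like this at the shell level (a minimal sketch; `$PID` is assumed to hold the node's process id):

```sh
# Hard stop: the JVM gets no chance to run its normal shutdown procedure;
# the process is terminated immediately.
kill -KILL "$PID"    # equivalent to kill -9 "$PID"

# Graceful stop: the node is asked to shut down and runs its normal
# shutdown procedure before exiting.
kill -TERM "$PID"    # equivalent to a plain kill "$PID"
```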

@DaveCTurner added the :Delivery/Build and team-discuss labels Nov 15, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@alpar-t
Contributor

alpar-t commented Nov 15, 2018

Test clusters will terminate with TERM by default.
(I just wrote the code for it and will have a PR shortly.)

@alpar-t
Contributor

alpar-t commented Nov 21, 2018

After discussing with the team, we reached the conclusion that we do want to have tests that use TERM at some point. When the tests are ported to the testclusters plugin this will be easier to do, so we want to delay this until then.
We kill with KILL by default in the interest of time, because we are not interested in what happens after the test.

@alpar-t
Contributor

alpar-t commented Sep 5, 2019

@DaveCTurner do you think we should send TERM for a specific list of tests, or just implement some tests specifically for this?

@DaveCTurner
Contributor Author

Do we have access to deterministic randomness at the point where we're making the choice? I think all tests should work correctly whether the nodes are stopped with TERM or KILL (or a mixture), so the best approach would be to choose randomly on each test, weighting KILL more heavily since we prefer it for speed reasons.
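As a sketch of what that could look like (purely illustrative; `TEST_SEED` and `PID` are placeholders assumed to be supplied by the test harness):

```sh
# Illustrative only: choose a shutdown signal per test from a reproducible
# seed, weighting KILL more heavily because it is faster.
SEED="${TEST_SEED:-12345}"   # assumed to come from the test framework's seed
RANDOM="$SEED"               # bash re-seeds $RANDOM when you assign to it
if (( RANDOM % 10 < 8 )); then
  SIGNAL=KILL                # ~80%: fast, covers the abrupt-shutdown path
else
  SIGNAL=TERM                # ~20%: covers the clean-shutdown path
fi
kill -s "$SIGNAL" "$PID"
```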

@jasontedor
Member

I would prefer that we not introduce that kind of randomness into the build. We used to have that kind of randomness all over the place in the build (which GC to use, which JVM flags to use, etc.) and it made debugging a nightmare.

What is the plan for "testing" that we work correctly with SIGTERM? We don't want the build to wait forever, but if we introduce a timeout, then we are adding another vector for flaky tests where we end up failing the build because the process didn't shut down quickly enough after receiving SIGTERM. Are there post-assertions that we would be making, or only that the process eventually shut down?

@DaveCTurner
Contributor Author

DaveCTurner commented Sep 5, 2019

The thing I think we should be checking is that node(s) that were shut down with SIGTERM come up and join the cluster again: particularly full-cluster-restart and rolling-upgrade. I think we don't currently have any that restart a cluster without upgrading, but if we did then the same thing would apply there too.

Sorry, this was not clear from the OP. We don't really care about the behaviour of a node that shuts down never to start again, so SIGKILL is fine for them.

I don't think failing to shut down within a timeout should be considered a failure; in fact I think we should also be checking that a SIGTERM followed by a SIGKILL results in correct behaviour on a restart.
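A minimal sketch of the kind of check meant here, using the cluster health API's wait_for_nodes parameter after restarting a node that was stopped with SIGTERM (`EXPECTED_NODES` is a placeholder for the cluster size, and the default HTTP port is assumed):

```sh
# Illustrative only: after restarting the node, wait for it to rejoin the
# cluster; fail if the expected node count is not reached within the timeout.
curl -fsS "http://localhost:9200/_cluster/health?wait_for_nodes=${EXPECTED_NODES}&timeout=60s" \
  || { echo "node did not rejoin the cluster after a clean shutdown"; exit 1; }
```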

@alpar-t
Contributor

alpar-t commented Sep 5, 2019

@DaveCTurner all the bwc tests have a variant where they go from the current version to the current version, with no actual upgrade in between.

It would be nice to have focused tests. Correct me if I'm wrong, but it would seem that in this case the upgrade might be relevant, while the tests run against the nodes are not really relevant, since we are just looking to make sure the cluster is formed. Full cluster restart vs. rolling doesn't seem particularly relevant either.

Thus we could have a bwc test similar to full cluster restart that kills nodes with TERM instead of KILL and runs a single test to make sure the cluster is formed.

@DaveCTurner
Contributor Author

Ah I'm pleased to hear that we're running CURRENT -> CURRENT non-upgrade upgrades, thanks for that.

I'm on the fence here. I think we want to be checking a lot more than just the cluster is formed. We run quite a lot of code in reaction to a SIGTERM, some of which makes changes on disk. I think we shouldn't be skipping over this in all real-node tests. But on the other hand we are shutting nodes down cleanly in EsIntegTestCase-based tests so maybe the gap isn't so big.

@rjernst
Member

rjernst commented Sep 11, 2019

Thus we could have a bwc test similar to full cluster restart that kills nodes with TERM instead of KILL and runs a single test to make sure the cluster is formed.

We could do this similarly to how we alter the distribution tested for qa projects: have a setting on the testcluster that changes how the node is shut down, and change that in a dedicated CI job for the bwc tests. I would be happy with this if we only tested CURRENT -> CURRENT for the alternate shutdown method, so the dedicated CI job should run fast.

@alpar-t
Contributor

alpar-t commented Sep 12, 2019

Similarly to my take in #46378, I think these should be dedicated tasks for the projects where it makes sense. They can be tied to bwcTest only so they only get executed in the bwc test matrix and not in intake and PR builds.

I think it should be fairly obvious from the build code what all the cases we cover are, allowing CI to be dumber in this respect.

The way I see it, we should have "axes" in CI that are applicable to all tests. These are things like JDK version, the OS we are running on, additional JVM arguments, and running with or without dedicated master nodes. And we should have something that expresses how often something gets run, the latter being expressed in the build model, similarly to how we run check on PR and intake builds and bwcTest less often.
Any setup beyond this, like the fact that we have to run bwcTest in a matrix job for performance reasons, is an exception to the rule, and we need to be very careful to do it only if absolutely necessary; making sure we don't optimize early is key in this regard.
Also, these optimizations should be done in a way in which most people don't have to care about them. An example here is how we do the checkPart1 and checkPart2 split automatically. Bwc builds are not a good example, because one has to take care to set up the right tasks for things to be picked up in CI, and we should fix that with a "bwc test" plugin that abstracts that away.

@rjernst
Member

rjernst commented Sep 12, 2019

I would be fine with a dedicated test for SIGTERM behavior (or SIGKILL behavior, since really we expect in the normal case that a node is shut down cleanly before restarting). But then it should be a test on its own, asserting things we expect to have happened in e.g. the SIGKILL case (like: did we properly fsync at the right time?). If we want dedicated tests, they should be dedicated, not duplicates of every test we have that serves various purposes unrelated to the things we are trying to check.

@rjernst added the Team:Core/Infra label May 4, 2020
@droberts195
Contributor

in fact I think we should also be checking that a SIGTERM followed by a SIGKILL results in correct behaviour on a restart.

Yes, definitely, because this is exactly how nodes are shut down in cloud: first a SIGTERM is sent, but after a timeout a SIGKILL is sent if ES is still running. I believe the timeout has been changed over the years. Ideally the tests should use the same timeout that cloud is using or a shorter one to catch the cases where killing part way through graceful termination causes a problem that would affect cloud users if not found in testing.
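In shell terms, the shutdown sequence described here is roughly the following (a sketch only; `GRACE_SECONDS` stands in for whatever timeout cloud actually uses, which as noted has changed over the years):

```sh
# Illustrative only: SIGTERM first, then SIGKILL if the node is still running
# after the grace period has elapsed.
kill -TERM "$PID"
for _ in $(seq 1 "$GRACE_SECONDS"); do
  kill -0 "$PID" 2>/dev/null || exit 0   # the node exited cleanly in time
  sleep 1
done
kill -KILL "$PID"                        # still running: terminate it hard
```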

@mark-vieira added the Team:Delivery label and removed the Team:Core/Infra label Nov 11, 2020