Integration tests should sometimes cleanly shut nodes down #35585
Comments
Pinging @elastic/es-core-infra
Test clusters will terminate with …
After discussing with the team we reached the conclusion that we do want to have tests that use …
@DaveCTurner do you think we should send …
Do we have access to deterministic randomness at the point we're making the choice? I think all tests should work correctly whether the nodes are stopped with …
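The seeded choice being discussed could be sketched in bash as follows; the `TESTS_SEED` variable and the signal-picking logic are illustrative only, not part of the actual build:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: derive the shutdown signal from a fixed seed so a
# failing run can be reproduced. Assigning to bash's built-in RANDOM
# variable seeds its PRNG deterministically.
SEED="${TESTS_SEED:-12345}"
RANDOM=$SEED
if (( RANDOM % 2 == 0 )); then
  SIGNAL=TERM   # graceful shutdown path
else
  SIGNAL=KILL   # immediate termination path
fi
echo "chosen signal: SIG${SIGNAL}"
```

Re-running with the same seed always picks the same signal, which is what would make a randomized choice debuggable.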
I would prefer that we not introduce that kind of randomness into the build. We used to have that kind of randomness all over the place in the build (which GC to use, which JVM flags to use, etc.) and it made debugging a nightmare. What is the plan with "testing" that we work correctly with …
The thing I think we should be checking is that node(s) that were shut down with …

Sorry, this was not clear from the OP. We don't really care about the behaviour of a node that shuts down never to start again, so I don't think failing to shut down within a timeout should be considered a failure; in fact I think we should also be checking that a …
@DaveCTurner all the bwc tests have a version where they go from current version to current version with no actual upgrade in-between. It would be nice to have focused tests. Correct me if I'm wrong, but it would seem that in this case the upgrade might be relevant, whereas the tests run against the nodes are not really relevant, as we are just looking to make sure the cluster is formed. Full cluster restart vs rolling doesn't seem particularly relevant either. Thus we could have a bwc test, similar to full cluster restart, that kills nodes with …
Ah, I'm pleased to hear that we're running …

I'm on the fence here. I think we want to be checking a lot more than just that the cluster is formed. We run quite a lot of code in reaction to a …
We could do this similarly to how we alter the distribution tested for qa projects: by having a setting on the testcluster that changes how the node is shut down, and changing that in a dedicated CI job for the bwc tests. I would be happy with this if we only tested CURRENT -> CURRENT for the alternate shutdown method, so the dedicated CI job should run fast.
Similarly to my take in #46378, I think these should be dedicated tasks for the projects where it makes sense. They can be tied to …

I think it should be fairly obvious from the build code what all the cases we cover are, allowing CI to be dumber in this respect. The way I see it, we should have "axes" in CI which are applicable to all tests. These are things like JDK version, the OS on which we are running, additional JVM arguments, and running with dedicated master nodes or not. And we should have something that expresses how …
I would be fine with a dedicated test for SIGTERM behavior (or SIGKILL behavior, since really we expect in the normal case for a node to be shut down cleanly before restarting). But then it should be a test on its own, asserting things we expect to have happened in e.g. the SIGKILL case (like: did we properly fsync at the right time?). If we want dedicated tests, they should be dedicated, not duplicates of every existing test that serves purposes unrelated to the things we are trying to check.
Yes, definitely, because this is exactly how nodes are shut down in cloud: first a SIGTERM is sent, but after a timeout a SIGKILL is sent if ES is still running. I believe the timeout has been changed over the years. Ideally the tests should use the same timeout that cloud is using, or a shorter one, to catch the cases where killing part way through graceful termination causes a problem that would affect cloud users if not found in testing.
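The graceful-then-forceful sequence described above can be sketched as follows; the 90-second default is an illustrative placeholder, not the timeout cloud actually uses:

```shell
#!/usr/bin/env bash
# Sketch of a SIGTERM-then-SIGKILL shutdown. $1 is the node's process id;
# $2 is the grace period in seconds (placeholder default, not cloud's value).
PID=$1
TIMEOUT="${2:-90}"

kill -TERM "$PID"    # ask the node to shut down cleanly
for ((i = 0; i < TIMEOUT; i++)); do
  # kill -0 sends no signal; it only checks whether the process still exists
  kill -0 "$PID" 2>/dev/null || exit 0
  sleep 1
done
echo "still running after ${TIMEOUT}s, escalating to SIGKILL" >&2
kill -KILL "$PID"
```

Running the tests with a timeout at or below cloud's would surface exactly the mid-termination kill scenario the comment describes.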
Today in our real-node integration tests we shut nodes down with `kill -9 $PID`, i.e. sending `SIGKILL` to the process, which shuts it down immediately. Although it is very important to test the behaviour of nodes shutting down like this, I think the fact that we always use `SIGKILL` means we're failing to properly cover the normal shutdown procedure that happens in response to a less aggressive signal. I think we should also send `SIGTERM` in these tests.