
Ensure that containers do not get stuck in stopping #22641

Merged: 1 commit into containers:main on May 13, 2024

Conversation

@mheon (Member) commented May 8, 2024

The scenario for inducing this is as follows:

  1. Start a container with a long stop timeout and a PID1 that ignores SIGTERM
  2. Use podman stop to stop that container
  3. Simultaneously, in another terminal, run `kill -9 $(pidof podman)` (the container is now in ContainerStateStopping)
  4. Now kill that container's Conmon with SIGKILL.
  5. No commands are able to move the container from Stopping to Stopped now.

The cause is a logic bug in our exit-file handling logic. Conmon being dead without an exit file causes no change to the state. Add handling for this case that tries to clean up, including stopping the container if it still seems to be running.
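A minimal shell sketch of the reproduction described above, assuming a local podman and an image whose shell runs as PID1 (the image, container name, and timeout are illustrative):

    # Terminal 1: start a container with a long stop timeout whose PID1 ignores SIGTERM
    podman run -d --name stuckctr --stop-timeout 240 \
        alpine sh -c 'trap "" TERM; while :; do sleep 1; done'

    # Terminal 1: this hangs for the full timeout because SIGTERM is ignored
    podman stop stuckctr

    # Terminal 2: kill the hanging podman stop; the container is now in "stopping"
    kill -9 $(pidof podman)

    # Terminal 2: now kill the container's conmon as well
    kill -9 $(podman inspect --format '{{.State.ConmonPid}}' stuckctr)

    # Before this fix, no command could move the container out of "stopping"
    podman inspect --format '{{.State.Status}}' stuckctr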

Fixes #19629

Does this PR introduce a user-facing change?

Fixed a bug where Systemd-managed containers could be stuck in the Stopping state, unable to be restarted, if systemd killed the unit before `podman stop` finished stopping the container ([#19629](https://github.com/containers/podman/issues/19629)).

@mheon (Member, Author) commented May 8, 2024

Still needs a test

@openshift-ci bot added the `approved` label (PR has been approved by an approver from all required OWNERS files) on May 8, 2024
@Luap99 (Member) left a comment:

logic looks good

Comment on lines 1459 to 1475
if err := unix.Kill(c.state.PID, 0); err == nil {
// We have a runaway container, unmanaged by
// Conmon. Invoke OCI runtime stop.
// Use 0 timeout for immediate SIGKILL as things
// have gone seriously wrong.
// Ignore the error from stopWithAll, it's just
// a cgroup check - more important that we
// continue.
// If we wanted to be really fancy here, we
// could open a pidfd on container PID1 before
// this to get the real exit code... But I'm not
// that dedicated.
all, _ := c.stopWithAll()
if err := c.ociRuntime.StopContainer(c, 0, all); err != nil {
logrus.Errorf("Error stopping container %s after Conmon exited prematurely: %w", c.ID(), err)
}
}
Member:

I am not aware of any case where conmon can leak the container's main process, since it uses PDEATHSIG, so I would like the comment to make clear that this is likely only relevant in the pid=host case, where processes are not killed because there is no PID namespace doing it for us.
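A small, hypothetical illustration of that case: with --pid=host the container shares the host PID namespace, so there is no per-container PID namespace whose teardown would kill the remaining processes if conmon dies (container name and image are illustrative):

    # No private PID namespace here; nothing reaps the container process for us
    podman run -d --pid=host --name hostpid-ctr alpine sleep 999
    podman inspect --format 'ctr={{.State.Pid}} conmon={{.State.ConmonPid}}' hostpid-ctr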

mheon (Member, Author):

I've heard from support of this happening even with PDEATHSIG set. Not sure of the exact circumstances, but my trust in Conmon's ability to take the container down on being killed is not absolute.

Comment on lines 1493 to 1495
if err := c.runtime.state.AddContainerExitCode(c.ID(), c.state.ExitCode); err != nil {
logrus.Errorf("Error saving container %s exit code after Conmon exited prematurely: %w", c.ID(), err)
}
Member:

we don't have a real exit code so should we actually store this -1 exit code?

mheon (Member, Author):

Needed for certain cases in podman run - otherwise we'd pop up an angry, duplicate error about not having a sensible exit code.

Member:

Mhh, yeah, I guess it keeps things more consistent internally, given the many assumptions about having the exit code available, even when it is sort of a lie.

@mheon force-pushed the handle_stopping_loop branch from 8afca63 to 766764f on May 8, 2024 19:16
@@ -1465,4 +1465,34 @@ search | $IMAGE |
is "$output" "Error.*: $expect" "podman emits useful diagnostic when no entrypoint is set"
}

@test "podman run - stopping loop" {

Comment on lines 1479 to 1493
while :; do
sleep 0.5
run_podman logs testctr
if [[ "$output" =~ READY ]]; then
break
fi
done
Member:

LGTM but since you have to repush anyway, please use wait_for_ready testctr instead of this loop
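For reference, the suggested replacement collapses the loop into a single call to the system-test helper (assuming wait_for_ready polls the container's logs for READY, as the other tests in this suite rely on):

    # helpers.bash: blocks until "READY" appears in the container's logs
    wait_for_ready testctr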

@edsantiago (Member) left a comment:

OMG wait no

run_podman inspect --format '{{ .State.ConmonPid }}' testctr
conmon_pid=$output

$(PODMAN) stop testctr &
Member:

Curly braces, not parens

Comment on lines 1487 to 1488
kill -9 $(stop_pid)
kill -9 $(conmon_pid)
Member:

here too
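For reference, the corrected lines use curly-brace parameter expansion instead of command substitution, matching what landed in the later revision below:

    ${PODMAN} stop testctr &
    stop_pid=$!

    kill -9 ${stop_pid}
    kill -9 ${conmon_pid}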

@mheon force-pushed the handle_stopping_loop branch from 766764f to 5b00546 on May 8, 2024 19:57
@test "podman run - stopping loop" {
skip_if_remote "this doesn't work with the REST service"

run_podman run -t -d --name testctr --stop-timeout 240 $IMAGE sh -c 'trap "echo READY" INT; while:; do sleep 0.1; done'
Member:

Need a space between while and the colon.

You can run this yourself via

$ ./bats --rootless 030:"stopping loop"

(I include --rootless simply to avoid the sudo password thing, because there's no difference here anyway)

Member:

You also need to s/INT/TERM/, and possibly nuke the -t, but I still can't get this test to work.
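Applying those edits (TERM instead of INT, a space before the colon, no -t) would give something like the following sketch; the final version further below took a different approach and dropped the trap entirely:

    run_podman run -d --name testctr --stop-timeout 240 $IMAGE \
        sh -c 'trap "echo READY" TERM; while :; do sleep 0.1; done'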

mheon (Member, Author):

Added another wait_for_ready to make sure sh -c is good to go before we trigger the first stop - maybe that'll do it. Will check if it fixed things tomorrow.

Member:

Oh it's much worse than that. I lied: you can't use wait_for_ready because that function "helpfully" checks container status and bails if != running.

I'm working on something. Will let you know.

Member:

This passes, but I suspect it's not quite exactly what you had hoped for. In particular, the -t0 in the last podman stop. I can't make anything work without that.

@test "podman run - stopping loop" {
    skip_if_remote "this doesn't work with the REST service"

    run_podman run -d --name testctr --stop-timeout 240 $IMAGE sh -c 'echo READY; sleep 999'
    cid="$output"
    wait_for_ready testctr

    run_podman inspect --format '{{ .State.ConmonPid }}' testctr
    conmon_pid=$output

    ${PODMAN} stop testctr &
    stop_pid=$!

    timeout=20
    while :;do
        sleep 0.5
        run_podman container inspect --format '{{.State.Status}}' testctr
        if [[ "$output" = "stopping" ]]; then
            break
        fi
        timeout=$((timeout - 1))
        if [[ $timeout == 0 ]]; then
            run_podman ps -a
            die "Timed out waiting for testctr to acknowledge signal"
        fi
    done

    kill -9 ${stop_pid}
    kill -9 ${conmon_pid}

    PODMAN_TIMEOUT=5 run_podman 125 stop -t0 testctr
    is "$output" "Error: container $cid conmon exited prematurely, exit code could not be retrieved: internal libpod error" "correct error on missing conmon"

    # This should be safe because stop is guaranteed to call cleanup?
    run_podman inspect --format "{{ .State.Status }}" testctr
    is "$output" "stopped" "container has successfully transitioned to exited state after stop"
}

@mheon force-pushed the handle_stopping_loop branch 2 times, most recently from 3b2338b to d474b2d on May 9, 2024 13:55
@mheon (Member, Author) commented May 9, 2024

Oof. The cleanup in ContainerStop() only happens if there was no error, so the error we're expecting prevents it, and prevents container cleanup. This is an actual code bug: we need to trigger cleanup manually in this case, otherwise it won't happen.
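In test terms, the symptom is what the later assertions check: after the failing podman stop, the container would remain in "stopped" (still mounted) rather than "exited" unless cleanup runs. A hypothetical manual recovery, assuming podman container cleanup accepts a container that is already stopped, would have been:

    # Hypothetical workaround prior to this change: trigger cleanup by hand,
    # then confirm the container has reached the exited state
    podman container cleanup testctr
    podman inspect --format '{{.State.Status}}' testctr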

The scenario for inducing this is as follows:
1. Start a container with a long stop timeout and a PID1 that
   ignores SIGTERM
2. Use `podman stop` to stop that container
3. Simultaneously, in another terminal, kill -9 `pidof podman`
   (the container is now in ContainerStateStopping)
4. Now kill that container's Conmon with SIGKILL.
5. No commands are able to move the container from Stopping to
   Stopped now.

The cause is a logic bug in our exit-file handling logic. Conmon
being dead without an exit file causes no change to the state.
Add handling for this case that tries to clean up, including
stopping the container if it still seems to be running.

Fixes containers#19629

Signed-off-by: Matt Heon <[email protected]>
@mheon force-pushed the handle_stopping_loop branch from d474b2d to 3fa8e98 on May 9, 2024 15:17
@mheon (Member, Author) commented May 9, 2024

This is ready now


# This should be safe because stop is guaranteed to call cleanup?
run_podman inspect --format "{{ .State.Status }}" testctr
is "$output" "exited" "container has successfully transitioned to exited state after stop"
Member:

Hmmm... I'm nervous. In my version, this line said stopped. Yours says exited. Is that because of your c.cleanup() change? If it isn't, is it possible that we're hitting that old bug where sometimes podman sets one and sometimes the other? I'm running a test loop now, see if I ever get stopped.

mheon (Member, Author):

Yes. We want exited. That indicates that cleanup has run properly, the container is unmounted, networking is torn down. stopped means we're still mounted, and we don't want to be stuck there.

@mheon (Member, Author) commented May 10, 2024

@containers/podman-maintainers PTAL

@rhatdan (Member) commented May 13, 2024

/lgtm

@openshift-ci bot added the `lgtm` label (PR is ready to be merged) on May 13, 2024
@Luap99 (Member) left a comment:

LGTM
But I must say the stop logic is really not nice to follow/understand. I would not be surprised if there are more corner cases.

The fact that we now have a second place where the exited event is generated and the state is reset doesn't make things better either.


# Unclear why `-t0` is required here, works locally without.
# But it shouldn't hurt and does make the test pass...
PODMAN_TIMEOUT=5 run_podman 125 stop -t0 testctr
Member:

@edsantiago Why PODMAN_TIMEOUT=5?

Member:

Leftover artifact from my testing: I got tired of ^Cing on every single iteration (until I hit upon the -t0 trick). I just copy-pasted what I had, and didn't notice that it included PODMAN_TIMEOUT. When I noticed it in the PR, I thought it probably doesn't hurt anything. Does it?

Member:

It is confusing to readers who ask why this is here.

openshift-ci bot (Contributor) commented May 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Luap99, mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Luap99,edsantiago,mheon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot bot merged commit 0c09421 into containers:main on May 13, 2024
89 of 91 checks passed
@stale-locking-app bot added the `locked - please file new issue/PR` label on Aug 12, 2024
@stale-locking-app bot locked as resolved and limited conversation to collaborators on Aug 12, 2024