Ensure that containers do not get stuck in stopping #22641
Conversation
Still needs a test.
logic looks good
libpod/container_internal.go
Outdated
if err := unix.Kill(c.state.PID, 0); err == nil {
    // We have a runaway container, unmanaged by
    // Conmon. Invoke OCI runtime stop.
    // Use 0 timeout for immediate SIGKILL as things
    // have gone seriously wrong.
    // Ignore the error from stopWithAll, it's just
    // a cgroup check - more important that we
    // continue.
    // If we wanted to be really fancy here, we
    // could open a pidfd on container PID1 before
    // this to get the real exit code... But I'm not
    // that dedicated.
    all, _ := c.stopWithAll()
    if err := c.ociRuntime.StopContainer(c, 0, all); err != nil {
        logrus.Errorf("Error stopping container %s after Conmon exited prematurely: %v", c.ID(), err)
    }
}
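The liveness probe in this hunk has a direct shell analogue that may help readers unfamiliar with the signal-0 trick; a minimal sketch (the PID value is a made-up placeholder):

```bash
# Signal 0 performs the existence/permission check without delivering a
# signal -- the shell equivalent of unix.Kill(c.state.PID, 0) above.
pid=4242   # hypothetical container PID1
if kill -0 "$pid" 2>/dev/null; then
    echo "process $pid still running: runaway container, force a stop"
fi
```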
I am not aware of any case where conmon can leak the container main process, since it uses PDEATHSIG, so I'd like the comment to make clear that this is likely only relevant in the pid=host case, where processes are not killed because there is no pid namespace to do it for us.
I've heard from support of this happening even with PDEATHSIG set. Not sure of the exact circumstances, but my trust in Conmon's ability to take the container down on being killed is not absolute.
libpod/container_internal.go
Outdated
if err := c.runtime.state.AddContainerExitCode(c.ID(), c.state.ExitCode); err != nil {
    logrus.Errorf("Error saving container %s exit code after Conmon exited prematurely: %v", c.ID(), err)
}
We don't have a real exit code, so should we actually store this -1 exit code?
Needed for certain cases in `podman run`; otherwise we'd pop up an angry, duplicate error about not having a sensible exit code.
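For context, `podman run` propagates the container's exit status to the calling shell, so some code has to be on record; a tiny illustration (`$IMAGE` standing in for any image):

```bash
podman run --rm $IMAGE false
echo $?   # 1 -- the container's exit code becomes podman run's exit code
```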
Mhh, yeah, I guess it keeps things more consistent internally, given the many assumptions on having the exit code available, even when it is sort of a lie.
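The stored value is also what `podman inspect` surfaces; a quick illustration, assuming a container named testctr that hit this path:

```bash
# The -1 placeholder discussed above is presumably what callers see:
podman inspect --format '{{ .State.ExitCode }}' testctr   # -1
```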
Force-pushed from 8afca63 to 766764f
@@ -1465,4 +1465,34 @@ search | $IMAGE |
is "$output" "Error.*: $expect" "podman emits useful diagnostic when no entrypoint is set"
}

@test "podman run - stopping loop" {
@edsantiago PTAL
test/system/030-run.bats
Outdated
while :; do
    sleep 0.5
    run_podman logs testctr
    if [[ "$output" =~ READY ]]; then
        break
    fi
done
LGTM, but since you have to repush anyway, please use `wait_for_ready testctr` instead of this loop.
OMG wait no
test/system/030-run.bats
Outdated
run_podman inspect --format '{{ .State.ConmonPid }}' testctr
conmon_pid=$output

$(PODMAN) stop testctr &
Curly braces, not parens
test/system/030-run.bats
Outdated
kill -9 $(stop_pid)
kill -9 $(conmon_pid)
here too
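The distinction matters because `$(...)` is command substitution, not variable expansion; a quick sketch of the failure mode (variable value assumed):

```bash
stop_pid=1234
kill -9 $(stop_pid)   # tries to RUN `stop_pid` as a command: "stop_pid: command not found"
kill -9 ${stop_pid}   # expands the variable: equivalent to kill -9 1234
```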
Force-pushed from 766764f to 5b00546
test/system/030-run.bats
Outdated
@test "podman run - stopping loop" { | ||
skip_if_remote "this doesn't work with with the REST service" | ||
|
||
run_podman run -t -d --name testctr --stop-timeout 240 $IMAGE sh -c 'trap "echo READY" INT; while:; do sleep 0.1; done' |
Need a space between `while` and the colon.
You can run this yourself via
$ ./bats --rootless 030:"stopping loop"
(I include `--rootless` simply to avoid the sudo password thing, because there's no difference here anyway.)
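The space is load-bearing: without it the shell reads `while:` as an ordinary command word rather than the `while` keyword; a minimal demonstration:

```bash
sh -c 'while:; do sleep 0.1; done'         # syntax error: "do" unexpected
sh -c 'while :; do echo ok; break; done'   # correct: `:` is the no-op builtin used as the condition
```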
You also need to `s/INT/TERM/`, and possibly nuke the `-t`, but I still can't get this test to work.
Added another `wait_for_ready` to make sure `sh -c` is good to go before we trigger the first stop; maybe that'll do it. Will check if it fixed things tomorrow.
Oh, it's much worse than that. I lied: you can't use `wait_for_ready` because that function "helpfully" checks container status and bails if != running.
I'm working on something. Will let you know.
This passes, but I suspect it's not quite exactly what you had hoped for. In particular, the `-t0` in the last `podman stop`: I can't make anything work without that.
@test "podman run - stopping loop" {
skip_if_remote "this doesn't work with with the REST service"
run_podman run -d --name testctr --stop-timeout 240 $IMAGE sh -c 'echo READY; sleep 999'
cid="$output"
wait_for_ready testctr
run_podman inspect --format '{{ .State.ConmonPid }}' testctr
conmon_pid=$output
${PODMAN} stop testctr &
stop_pid=$!
timeout=20
while :;do
sleep 0.5
run_podman container inspect --format '{{.State.Status}}' testctr
if [[ "$output" = "stopping" ]]; then
break
fi
timeout=$((timeout - 1))
if [[ $timeout == 0 ]]; then
run_podman ps -a
die "Timed out waiting for testctr to acknowledge signal"
fi
done
kill -9 ${stop_pid}
kill -9 ${conmon_pid}
PODMAN_TIMEOUT=5 run_podman 125 stop -t0 testctr
is "$output" "Error: container $cid conmon exited prematurely, exit code could not be retrieved: internal libpod error" "correct error on missing conmon"
# This should be safe because stop is guaranteed to call cleanup?
run_podman inspect --format "{{ .State.Status }}" testctr
is "$output" "stopped" "container has successfully transitioned to exited state after stop"
}
Force-pushed from 3b2338b to d474b2d
Oof. The cleanup in
The scenario for inducing this is as follows:
1. Start a container with a long stop timeout and a PID1 that ignores SIGTERM
2. Use `podman stop` to stop that container
3. Simultaneously, in another terminal, kill -9 `pidof podman` (the container is now in ContainerStateStopping)
4. Now kill that container's Conmon with SIGKILL.
5. No commands are able to move the container from Stopping to Stopped now.

The cause is a logic bug in our exit-file handling logic. Conmon being dead without an exit file causes no change to the state. Add handling for this case that tries to clean up, including stopping the container if it still seems to be running.

Fixes containers#19629

Signed-off-by: Matt Heon <[email protected]>
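A hedged end-to-end repro of those steps, mirroring the test added below (container name and the ignore-TERM command are illustrative):

```bash
# 1. Long stop timeout, PID1 that ignores SIGTERM
podman run -d --name stuckctr --stop-timeout 240 $IMAGE \
    sh -c 'trap "" TERM; while :; do sleep 1; done'

# 2. Start a stop that will hang waiting out the ignored SIGTERM
podman stop stuckctr &
stop_pid=$!

# 3. Kill the stopping podman process (container is now in "stopping")
kill -9 "$stop_pid"

# 4. Kill the container's conmon
conmon_pid=$(podman inspect --format '{{ .State.ConmonPid }}' stuckctr)
kill -9 "$conmon_pid"

# 5. Before this fix, nothing could move the container out of "stopping"
podman inspect --format '{{ .State.Status }}' stuckctr
```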
Force-pushed from d474b2d to 3fa8e98
This is ready now.
# This should be safe because stop is guaranteed to call cleanup?
run_podman inspect --format "{{ .State.Status }}" testctr
is "$output" "exited" "container has successfully transitioned to exited state after stop"
Hmmm... I'm nervous. In my version, this line said `stopped`. Yours says `exited`. Is that because of your `c.cleanup()` change? If it isn't, is it possible that we're hitting that old bug where sometimes podman sets one and sometimes the other? I'm running a test loop now, to see if I ever get `stopped`.
Yes. We want `exited`. That indicates that cleanup has run properly: the container is unmounted and networking is torn down. `stopped` means we're still mounted, and we don't want to be stuck there.
@containers/podman-maintainers PTAL
/lgtm
LGTM.
But I must say the stop logic is really not nice to follow/understand. I would not be surprised if there are more corner cases.
The fact that we now have a second place where the exited event is generated and the state is reset doesn't make things better either.
# Unclear why `-t0` is required here, works locally without.
# But it shouldn't hurt and does make the test pass...
PODMAN_TIMEOUT=5 run_podman 125 stop -t0 testctr
@edsantiago Why PODMAN_TIMEOUT=5?
Leftover artifact from my testing: I got tired of ^C-ing on every single iteration (until I hit upon the `-t0` trick). I just copy-pasted what I had, and didn't notice that it included `PODMAN_TIMEOUT`. When I noticed it in the PR, I thought it probably doesn't hurt anything. Does it?
It is confusing to readers who ask why it is here.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: edsantiago, Luap99, mheon.
Merged commit 0c09421 into containers:main
The scenario for inducing this is as follows:
1. Start a container with a long stop timeout and a PID1 that ignores SIGTERM
2. Use `podman stop` to stop that container
3. Simultaneously, in another terminal, kill -9 `pidof podman` (the container is now in ContainerStateStopping)
4. Now kill that container's Conmon with SIGKILL.
5. No commands are able to move the container from Stopping to Stopped now.

The cause is a logic bug in our exit-file handling logic. Conmon being dead without an exit file causes no change to the state. Add handling for this case that tries to clean up, including stopping the container if it still seems to be running.
Fixes #19629
Does this PR introduce a user-facing change?