podman stuck stopping #19629
By any chance, have you looked at running your service via Quadlet under systemd?
I had not, mostly due to not knowing of the tool's existence; however, it also claims to use the same generation routines. I am not sure whether that would have changed the output, but once I get cycles I can test it out.
We are deprecating podman generate systemd and moving to Quadlet. Quadlet has enhanced features compared to generate systemd, so I would be interested to know whether the same failure happens there.
we are ??????????????????
Yes, we want people to use Quadlet.
Deprecation in the sense of ... Nonetheless, I don't see how using Quadlet would fix this issue. Let's focus on the Podman bug.
@FamousL if possible, can you share the exact systemd unit?
A quick fix might be to add the following line to the unit:
MotionEye, basically just the `podman generate systemd` output with the PIDFile directive removed. It should be noted that your fix would not work, as the Stopping state blocks a stop command. [Unit] [Service] [Install]
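For reference, here is a minimal sketch of what a `podman generate systemd`-style unit looks like with the PIDFile directive removed. The container name, paths, and values are illustrative assumptions, not the reporter's actual file:

```ini
# Hypothetical example only; roughly the shape of a unit produced by
# `podman generate systemd --name motioneye`, minus the PIDFile= line
# that was removed as a workaround.
[Unit]
Description=Podman container-motioneye.service
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStart=/usr/bin/podman start motioneye
ExecStop=/usr/bin/podman stop -t 10 motioneye
ExecStopPost=/usr/bin/podman stop -t 10 motioneye
Type=forking
# PIDFile=%t/containers/overlay-containers/<id>/userdata/conmon.pid   <- line removed as workaround

[Install]
WantedBy=default.target
```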
Currently I have it set to use Quadlet, as even with this file it eventually ended up getting stuck as well.
Can you also share the command you used to create the container? |
podman run --name="motioneye" |
This is how I have the container file set for Quadlet currently: [Container] [Service] [Install]
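For context, a minimal sketch of a Quadlet `.container` file in that shape; the image, port, and volume paths below are placeholders, not the reporter's actual configuration:

```ini
# Hypothetical motioneye.container sketch
# (e.g. ~/.config/containers/systemd/motioneye.container).
# Quadlet turns this into a systemd service at daemon-reload time.
[Container]
ContainerName=motioneye
Image=docker.io/ccrisan/motioneye:master-amd64
PublishPort=8765:8765
Volume=/srv/motioneye/etc:/etc/motioneye
Volume=/srv/motioneye/media:/var/lib/motioneye

[Service]
Restart=on-failure
TimeoutStartSec=300

[Install]
WantedBy=default.target
```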
Which problems do you encounter with the Quadlet file? Heads up: I will be out next week, so my next reply may take some time.
I haven't had a problem with Quadlet yet; I added it as a reference point.
@mheon, what I suspect (but hard to reproduce) is:
@mheon is there a trick to detect when a container hasn't been started yet after a reboot? That would ultimately allow us to recover and correct the state during start and then go on with starting the container.
Reboot should force state back to either Configured or Stopped. Stopping is not a valid state after a reboot. If that's happening, it's a bug in our state refresh code; the container should be forced to the Stopped state when we reset the states after reboot.
I realize that the power this suggestion would pose might be too much; however, is there a way to add a force-stop? For example, a quick scan to force all processes related to the container to exit, then force a stopped state? In this case there are no such processes (the server had just booted, after all, and the initial start command failed due to an invalid state).
That's what ...
@vrothberg I checked, and what you're describing is not possible. If the container is in any state but Exited at reboot time, it is forced to the Configured state. Ref: https://github.com/containers/podman/blob/main/libpod/container_internal.go#L606-L608
I guess I was thinking about something that would be too unsafe to implement. My train of thought was more along the lines of telling the system that it doesn't know where it currently is and to reflect the stopped state instead. |
but that was the first thing I did, and the Stopping state remained. It wasn't until I did a second reboot that the container came back to a workable ... state ...
Yep, and that's definitely a bug, though I'm still not clear what's going on. I wonder if the reboot isn't being detected - possible if our temporary directory is not on a tmpfs. Can you verify this? (You can get the temporary files directory with ...)
time="2023-09-08T10:27:57-04:00" level=debug msg="Using tmp dir /run/libpod" using mount to verify /run is indeed tmpfs (as some people do weird things) |
So, somehow, either the state refresh code is not running (it should be the first thing that happens after a fresh reboot; it ensures that the database is clean), or something is forcing the container back into the Stopping state immediately after reboot.
Hi everyone, |
Hit this issue as well. Checked tmp via the link in comment #19629 (comment) and found it's using the /tmp filesystem (not tmpfs). Moved the folder
@jskacel tmp is tmpfs for me; I suppose I should have mentioned that in the original post.
@mheon, maybe something for the next bug week. |
Sure. If anyone has a solid reproducer for this, I'd love to see it; at present, the biggest issue with tracking it down is that I cannot replicate it myself. |
Hi @mheon, I can always reproduce the same issue with a podman-generated systemd file for a FreeIPA container. Here's the generated systemd file (with the PIDFile directive):
And here is the podman run command for this container:
As already mentioned by other people in this issue, removing PIDFile does the trick. |
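For completeness, applying that workaround might look like the following; the unit name and path are assumptions and will differ per setup:

```sh
# Sketch: strip the PIDFile= directive from a generated unit and reload systemd.
# The unit path below is hypothetical; adjust it to your own generated file.
sed -i '/^PIDFile=/d' /etc/systemd/system/container-freeipa.service
systemctl daemon-reload
```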
Hi, has anyone found a solution to the above problem? I'm facing the same issue with podman 4.6 where containers go into a stopping state after every reboot. The only solution I have found is to use "rm -f" to remove the stuck container and then create a fresh container on every boot. |
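For concreteness, that "remove and recreate on boot" workaround looks roughly like this; the container name and image are placeholders, not a recommended setup:

```sh
# Sketch of the workaround described above: force-remove the stuck container,
# then create a fresh one. "mycontainer" and the image are placeholders.
podman rm -f mycontainer 2>/dev/null || true
podman run -d --name mycontainer docker.io/library/alpine:latest sleep infinity
```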
This happens intermittently for me too, though I rarely reboot, so it may be as much as half the time the RHEL 8.9 host is rebooted. I am running several containers in a single pod using the podman-kube@$.service with podman version 4.6.1.
Please retest with podman 5.0 |
@Luap99 What would you say about adding checks |
I'd like to understand the sequence of events first; unless we know what is happening, I am not a fan of patches that try to fix it in some other place. And just going by the comments above, the only mentioned podman versions are 4.6 and older, so I'd like to see first whether this is actually still an issue with 5.0.
The scenario for inducing this is as follows:
1. Start a container with a long stop timeout and a PID1 that ignores SIGTERM
2. Use `podman stop` to stop that container
3. Simultaneously, in another terminal, kill -9 `pidof podman` (the container is now in ContainerStateStopping)
4. Now kill that container's Conmon with SIGKILL.
5. No commands are able to move the container from Stopping to Stopped now.
The cause is a logic bug in our exit-file handling logic. Conmon being dead without an exit file causes no change to the state. Add handling for this case that tries to clean up, including stopping the container if it still seems to be running.
Fixes containers#19629
Signed-off-by: Matt Heon <[email protected]>

The scenario for inducing this is as follows:
1. Start a container with a long stop timeout and a PID1 that ignores SIGTERM
2. Use `podman stop` to stop that container
3. Simultaneously, in another terminal, kill -9 `pidof podman` (the container is now in ContainerStateStopping)
4. Now kill that container's Conmon with SIGKILL.
5. No commands are able to move the container from Stopping to Stopped now.
The cause is a logic bug in our exit-file handling logic. Conmon being dead without an exit file causes no change to the state. Add handling for this case that tries to clean up, including stopping the container if it still seems to be running.
Fixes containers#19629
Addresses: https://issues.redhat.com/browse/ACCELFIX-250
Signed-off-by: Matt Heon <[email protected]>
Signed-off-by: tomsweeneyredhat <[email protected]>
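Translated into concrete commands, the reproducer above might look like the sketch below; the image, container name, and timeout are assumptions for illustration only:

```sh
# Hypothetical walk-through of the reproducer in the commit message above.
# Any PID 1 that ignores SIGTERM will do; the specifics here are illustrative.
podman run -d --name stuck-demo --stop-timeout 120 \
    docker.io/library/alpine:latest sh -c 'trap "" TERM; sleep infinity'

# Begin stopping the container; it hangs for the full timeout because
# PID 1 ignores SIGTERM.
podman stop stuck-demo &

# Kill podman mid-stop, leaving the container in ContainerStateStopping...
kill -9 $(pidof podman)
# ...then kill the container's conmon as well.
kill -9 $(pgrep -f 'conmon.*stuck-demo')

# On affected versions, no command can now move the container out of Stopping.
podman ps -a --filter name=stuck-demo
```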
Issue Description
When using the systemd service file generated by `podman generate systemd`, server reboots often get the container stuck in an invalid state, "stopping". A container in this state cannot be started, stopped, or otherwise manipulated beyond removal. It is possible to rescue the container if the server is rebooted an additional time (requiring two reboots to return to a normal state: the initial reboot putting the container in this state, plus one more to get it out of it).
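Observing the stuck state after such a reboot might look roughly like this; the container name "motioneye" is assumed from the rest of the thread:

```sh
# Sketch: check the container state after a reboot.
podman ps -a --filter name=motioneye --format '{{.Names}} {{.Status}}'

# While the container is reported as Stopping, attempts to manage it fail;
# only removal works.
podman start motioneye
podman stop motioneye
```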
To combat this, I removed the PIDFile directive from the systemd service file, as another issue (I'm not seeing it offhand) mentioned that the PID file being removed before the container was killed properly caused some unexpected behavior.
This seems to have solved the issue (three reboots in, so a small sample size; further testing required).
podman version
Client: Podman Engine
Version: 4.6.0
API Version: 4.6.0
Go Version: go1.20.6
Built: Fri Jul 21 08:23:26 2023
OS/Arch: linux/amd64
rpm -q podman
podman-4.6.0-1.fc38.x86_64
Steps to reproduce the issue
Describe the results you received
Container was left in an unusable state when using the generated systemd service file.
Describe the results you expected
podman generate systemd should provide a file that will run the service without additional modifications.
podman info output
Podman in a container
No
Privileged Or Rootless
Privileged
Upstream Latest Release
No
Additional environment details
Host is a bare metal box in my own lab.
Additional information
Until removal of the PIDFile directive, this occurred every time a reboot was issued for kernel updates.