podman stuck stopping #19629

Closed
FamousL opened this issue Aug 15, 2023 · 37 comments · Fixed by #22641
Labels: kind/bug, locked - please file new issue/PR

Comments

FamousL commented Aug 15, 2023

Issue Description

When using the systemd service file generated by podman generate systemd, server reboots often leave the container stuck in an invalid state, "stopping". A container in this state cannot be started, stopped, or otherwise manipulated beyond removal.

It is possible to rescue the container if the server is rebooted an additional time (requiring 2 reboots to return to a normal state: the initial reboot that puts the container in this state, plus one more to get it out).

To combat this I removed the PIDFile directive from the systemd service file, as another issue (not seeing it off hand) mentioned that the PID file being removed before the container is properly killed caused some unexpected behavior.

This seems to have solved the issue (3 reboots in, so a small sample size; further testing is required).
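
For reference, the workaround amounts to deleting the PIDFile= line from the generated unit; a hedged sketch (the unit file name here is illustrative, not the exact one from this system):

# Drop the PIDFile directive from the generated unit and reload systemd.
sudo sed -i '/^PIDFile=/d' /etc/systemd/system/container-motioneye.service
sudo systemctl daemon-reload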

podman version
Client: Podman Engine
Version: 4.6.0
API Version: 4.6.0
Go Version: go1.20.6
Built: Fri Jul 21 08:23:26 2023
OS/Arch: linux/amd64

rpm -q podman
podman-4.6.0-1.fc38.x86_64

Steps to reproduce the issue

  1. Create a container and run it as a daemon (run -d)
  2. Create a service file using podman generate systemd
  3. Put the service file in place, enable it, etc.
  4. Reboot
  5. Observe the podman container stuck "stopping"
  6. Reboot again to restore normal function
  7. Remove the PIDFile directive
  8. Reboot
  9. Observe the podman container is no longer "stopping"
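
In shell form, steps 1-3 correspond roughly to the following (container name and image are illustrative, not from this report):

# 1. Create a detached container ("myapp" and the image are placeholders).
podman run -d --name myapp docker.io/library/alpine sleep 1000000
# 2. Generate a systemd unit for it (by default this includes a PIDFile directive).
podman generate systemd --name myapp > /etc/systemd/system/container-myapp.service
# 3. Set the service in place and enable it.
systemctl daemon-reload
systemctl enable container-myapp.service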

Describe the results you received

The container was left in an unusable state when using the generated systemd service file.

Describe the results you expected

podman generate systemd should provide a file that will run the service without additional modifications.

podman info output

podman info
host:
  arch: amd64
  buildahVersion: 1.31.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 75.61
    systemPercent: 6.97
    userPercent: 17.41
  cpus: 24
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: server
    version: "38"
  eventLogger: journald
  freeLocks: 1997
  hostname: (hostname)
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.4.10-200.fc38.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 19607564288
  memTotal: 135142043648
  networkBackend: cni
  networkBackendInfo:
    backend: cni
    dns:
      package: podman-plugins-4.6.0-1.fc38.x86_64
      path: /usr/libexec/cni/dnsname
      version: |-
        CNI dnsname plugin
        version: 1.3.1
        commit: unknown
    package: |-
      podman-plugins-4.6.0-1.fc38.x86_64
      containernetworking-plugins-1.3.0-2.fc38.x86_64
    path: /usr/libexec/cni
  ociRuntime:
    name: crun
    package: crun-1.8.6-1.fc38.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.6
      commit: 73f759f4a39769f60990e7d225f561b4f4f06bcf
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20230625.g32660ce-1.fc38.x86_64
    version: |
      pasta 0^20230625.g32660ce-1.fc38.x86_64
      Copyright Red Hat
      GNU Affero GPL version 3 or later <https://www.gnu.org/licenses/agpl-3.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-12.fc38.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 2h 35m 2.00s (Approximately 0.08 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 19
    paused: 0
    running: 17
    stopped: 2
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 107321753600
  graphRootUsed: 98293788672
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 139
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.6.0
  Built: 1689942206
  BuiltTime: Fri Jul 21 08:23:26 2023
  GitCommit: ""
  GoVersion: go1.20.6
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.0

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

No

Additional environment details

Host is a bare metal box in my own lab.

Additional information

Until removal of the PIDFile directive, this occurred every time a reboot was issued for kernel updates.

FamousL added the kind/bug label Aug 15, 2023

rhatdan (Member) commented Aug 16, 2023

By any chance have you looked at running your service via quadlet under systemd?

https://www.redhat.com/sysadmin/quadlet-podman

FamousL (Author) commented Aug 16, 2023

I had not, mostly because I did not know of the tool's existence; however, it also claims to use the same generation routines. I am not sure whether that would have changed the output, but once I get cycles I can test it out.

rhatdan (Member) commented Aug 16, 2023

We are deprecating podman generate systemd and moving to Quadlet. Quadlet has enhanced features over generate systemd, so I would be interested to know whether the same failure happens.

baude (Member) commented Aug 17, 2023

We are deprecating podman generate systemd

we are ??????????????????

rhatdan (Member) commented Aug 17, 2023

Yes, we want people to use quadlet.

vrothberg (Member) commented

Deprecation in the sense of podman generate systemd staying around but being de-emphasized in favor of Quadlet; it will only receive backports and security fixes.

Nonetheless, I don't see how using Quadlet would fix this issue. Let's focus on the Podman bug.

vrothberg (Member) commented

@FamousL if possible, can you share the exact systemd unit?

vrothberg (Member) commented Sep 8, 2023

A quick fix is possibly to add the following line to the unit:

ExecStartPre=/usr/bin/podman stop  \
        -t 0 $CONTAINER_ID

FamousL (Author) commented Sep 8, 2023

Motioneye; basically just the generated file with the PIDFile directive removed.

It should be noted that your fix would not work, as the "stopping" state blocks a stop command.

[Unit]
Description=Podman container-c1ffb40417c4e278141326f892f071322f0ba21ff3159c08cacfde42a2e5d133.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=/run/containers/storage

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStart=/usr/bin/podman start c1ffb40417c4e278141326f892f071322f0ba21ff3159c08cacfde42a2e5d133
ExecStop=/usr/bin/podman stop \
	-t 10 c1ffb40417c4e278141326f892f071322f0ba21ff3159c08cacfde42a2e5d133
ExecStopPost=/usr/bin/podman stop \
	-t 10 c1ffb40417c4e278141326f892f071322f0ba21ff3159c08cacfde42a2e5d133
Type=forking

[Install]
WantedBy=default.target

FamousL (Author) commented Sep 8, 2023

Currently I have it set to use Quadlet, as even with this file it eventually ended up getting stuck as well.

vrothberg (Member) commented

Can you also share the command you used to create the container?

FamousL (Author) commented Sep 8, 2023

podman run --name="motioneye" \
  -p 8765:8765 \
  --hostname="motioneye" \
  -v /etc/localtime:/etc/localtime:ro \
  -v /etc/motioneye:/etc/motioneye \
  -v /var/lib/motioneye:/var/lib/motioneye \
  --restart="always" \
  --detach=true \
  ccrisan/motioneye:master-amd64

FamousL (Author) commented Sep 8, 2023

This is how I have the container file set for Quadlet currently:

[Unit]
Description=The motioneye container
After=local-fs.target

[Container]
Image=docker.io/ccrisan/motioneye:master-amd64
PublishPort=8765:8765
Volume=/etc/localtime:/etc/localtime:ro
Volume=/etc/motioneye:/etc/motioneye
Volume=/var/lib/motioneye:/var/lib/motioneye

[Service]
Restart=always

[Install]
# Start by default on boot
WantedBy=multi-user.target default.target
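
For reference, a minimal sketch of how such a .container file is activated under rootful Quadlet (per podman-systemd.unit(5); the file name is illustrative):

# Install the unit where the Quadlet generator looks for it.
sudo cp motioneye.container /etc/containers/systemd/
# daemon-reload runs the generator, which produces motioneye.service.
sudo systemctl daemon-reload
sudo systemctl start motioneye.service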

vrothberg (Member) commented

Which problems do you encounter with the Quadlet file? Heads up: I will be out next week, so my next reply may take some time.

FamousL (Author) commented Sep 8, 2023

I haven't had a problem with Quadlet yet; I added it as a reference point.

vrothberg (Member) commented

@mheon, what I suspect (though it is hard to reproduce) is:

  • The systemd unit gets nuked on shutdown during podman stop, leaving the container in the stopping state.
  • podman start is unable to recover from that.

@mheon is there a trick to detect when a container hasn't been started yet after a reboot? That would ultimately allow us to recover and correct the state during start and then go on with starting the container. podman stop is capable of recovering from that state, but that won't help such systemd units, including podman-restart.service.

mheon (Member) commented Sep 8, 2023

Reboot should force the state back to either Configured or Stopped. Stopping is not a valid state after a reboot. If that's happening, it's a bug in our state refresh code; the container should be forced to the Stopped state when we reset states after a reboot.

FamousL (Author) commented Sep 8, 2023

I realize the power this suggestion would grant might be too much; however, is there a way to add a force-stop? For example, a quick scan to force all processes related to the container to exit, then force a stopped state?

In this case there are no such processes (the server had just booted, after all, and the initial start command failed due to an invalid state).

mheon (Member) commented Sep 8, 2023

That's what podman stop -t 0 does. Unfortunately, you can always interrupt Podman during that (SIGKILL, for example), so we need intermediate steps as part of stopping to ensure that Podman being forced down midway through stopping does not put the container in a bad state - which is where we get the Stopping state.
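
For example (container name is a placeholder):

# A zero timeout skips the grace period, so the container is killed immediately.
podman stop -t 0 myapp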

mheon (Member) commented Sep 8, 2023

@vrothberg I checked, and what you're describing is not possible. If the container is in any state but Exited at reboot time, it is forced to the Configured state. Ref: https://github.com/containers/podman/blob/main/libpod/container_internal.go#L606-L608

FamousL (Author) commented Sep 8, 2023

I guess I was thinking of something that would be too unsafe to implement. My train of thought was more along the lines of telling the system that it doesn't know where it currently is, and having it reflect the stopped state instead.

mheon (Member) commented Sep 8, 2023

podman stop should be safe to run on a container in any state we track (except maybe paused containers? it's been a while since I checked that codepath), so it does do what you're describing - stopping a running, stopped, stopping, etc. container should always result in an Exited container.

FamousL (Author) commented Sep 8, 2023

But that was the first thing I did, and the stopping state remained. It wasn't until I did a second reboot that the container came back to a workable state.

mheon (Member) commented Sep 8, 2023

Yep, and that's definitely a bug, though I'm still not clear what's going on.

I wonder if the reboot isn't being detected - possible if our temporary directory is not on a tmpfs. Can you verify this? (You can get the temporary files directory with podman info --log-level=debug 2>&1 | grep "tmp dir".)

FamousL (Author) commented Sep 8, 2023

time="2023-09-08T10:27:57-04:00" level=debug msg="Using tmp dir /run/libpod"

Using mount to verify /run is indeed tmpfs (as some people do weird things):
tmpfs on /run type tmpfs (rw,nosuid,nodev,size=26394932k,nr_inodes=819200,mode=755,inode64)

mheon (Member) commented Sep 8, 2023

So, somehow, either the state refresh code is not running (it should be the first thing that happens after a fresh reboot; it ensures that the database is clean), or something is forcing the container back into the Stopping state immediately after reboot.

aklira commented Sep 27, 2023

Hi everyone,
Just to let you know that I am experiencing the exact same problem as the one described by @FamousL, and the workaround he suggested (removing the PIDFile directive from the systemd service file) seems to work for me.
But I would be very interested in understanding the cause of the problem and why the workaround fixes the issue.
I am using podman 4.2.0 running on Ubuntu 20.04 LTS.

jskacel commented Sep 30, 2023

Hit this issue as well. Checked the tmp dir via the link in comment #19629 (comment) and found it's using the /tmp filesystem (not tmpfs). Moved the folder /tmp/podman-run-<userid>/libpod/tmp and did a reboot; containers are up now :)
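
A hedged sketch of that check and workaround for a rootless setup (the path follows the comment above; <userid> becomes your numeric UID):

# Find podman's temporary files directory.
podman info --log-level=debug 2>&1 | grep "tmp dir"
# Confirm which filesystem it lives on; it should be tmpfs.
df -T /tmp/podman-run-$(id -u)/libpod/tmp
# Move it aside so the next boot looks like a fresh reboot to podman.
mv /tmp/podman-run-$(id -u)/libpod/tmp /tmp/podman-run-$(id -u)/libpod/tmp.bak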

FamousL (Author) commented Sep 30, 2023

@jskacel tmp is tmpfs for me; I suppose I should have mentioned that in the original post.

vrothberg (Member) commented

@mheon, maybe something for the next bug week.

mheon (Member) commented Oct 5, 2023

Sure.

If anyone has a solid reproducer for this, I'd love to see it; at present, the biggest issue with tracking it down is that I cannot replicate it myself.

spomata commented Nov 4, 2023

Hi @mheon, I can always reproduce the same issue with a podman-generated systemd file for a FreeIPA container.
Using podman version 4.4.1 running on RHEL (Rocky Linux) 8, I get a container in "stopping" state solidly at every reboot.

Here's the generated systemd file (with the PIDFile directive):

# container-freeipa-server-container.service
# autogenerated by Podman 4.4.1
# Sat Nov  4 16:20:29 CET 2023

[Unit]
Description=Podman container-freeipa-server-container.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=/run/containers/storage

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStart=/usr/bin/podman start freeipa-server-container
ExecStop=/usr/bin/podman stop  \
	-t 10 freeipa-server-container
ExecStopPost=/usr/bin/podman stop  \
	-t 10 freeipa-server-container
PIDFile=/run/containers/storage/overlay-containers/7a228f82f5e5a2df2b276e672e865bc5bd2727db1efb27670286cf3f1db69828/userdata/conmon.pid
Type=forking

[Install]
WantedBy=default.target

And here the podman run command for this container:

podman run --name freeipa-server-container -d -e IPA_SERVER_IP=$MY_IP \
-h $MY_HOSTNAME --read-only -p 80:80 -p 443:443 -p 389:389 -p 636:636 -p 88:88 -p 464:464 \
-p 8443:8443 -p 88:88/udp -p 464:464/udp -v /var/lib/ipa-data:/data:Z freeipa/freeipa-server:fedora-38

As already mentioned by other people in this issue, removing PIDFile does the trick.

AkashRajvanshi commented
Hi, has anyone found a solution to the above problem? I'm facing the same issue with podman 4.6, where containers go into a stopping state after every reboot. The only solution I have found is to use rm -f to remove the stuck container and then create a fresh container on every boot.

jawadapl commented
This happens intermittently for me too. I rarely reboot, but it may happen as much as half the time the RHEL 8.9 host is rebooted. I am running several containers in a single pod using the podman-kube@.service template with podman version 4.6.1.

Luap99 (Member) commented Apr 4, 2024

Please retest with podman 5.0

mheon (Member) commented Apr 24, 2024

@Luap99 What would you say about adding checks to syncContainer() to verify the container and conmon PIDs are still alive? It'll cost us (we run syncContainer everywhere), but sending a signal 0 to a PID should be pretty quick, and it should handle these "stuck running but not-really" cases in a definitive way.
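
The "signal 0" probe can be sketched in shell; in Podman this check would live in Go inside syncContainer(), and CONMON_PID here is a hypothetical PID for illustration:

# kill -0 delivers no signal; it only reports whether the PID exists
# and whether we are allowed to signal it.
CONMON_PID=12345
if kill -0 "$CONMON_PID" 2>/dev/null; then
    echo "conmon ($CONMON_PID) is still alive"
else
    echo "conmon ($CONMON_PID) is gone; the recorded state is stale"
fi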

Luap99 (Member) commented Apr 25, 2024

@Luap99 What would you say about adding checks to syncContainer() to verify the container and conmon PIDs are still alive? It'll cost us (we run syncContainer everywhere), but sending a signal 0 to a PID should be pretty quick, and it should handle these "stuck running but not-really" cases in a definitive way.

I'd like to understand the sequence of events first; unless we know what is happening, I am not a fan of patches that try to fix it in some other place.

And just going by the comments above, the only podman versions mentioned are 4.6 and older, so I'd like to see first whether this is actually still an issue with 5.0.

mheon added a commit to mheon/libpod that referenced this issue May 9, 2024
The scenario for inducing this is as follows:
1. Start a container with a long stop timeout and a PID1 that
   ignores SIGTERM
2. Use `podman stop` to stop that container
3. Simultaneously, in another terminal, kill -9 `pidof podman`
   (the container is now in ContainerStateStopping)
4. Now kill that container's Conmon with SIGKILL.
5. No commands are able to move the container from Stopping to
   Stopped now.

The cause is a logic bug in our exit-file handling logic. Conmon
being dead without an exit file causes no change to the state.
Add handling for this case that tries to clean up, including
stopping the container if it still seems to be running.

Fixes containers#19629

Signed-off-by: Matt Heon <[email protected]>
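
In shell form, that reproducer looks roughly like this (image, names, and timings are illustrative):

# 1. A container whose PID1 ignores SIGTERM, with a long stop timeout.
podman run -d --name sleeper --stop-timeout 60 \
    docker.io/library/alpine sh -c 'trap "" TERM; sleep 1000000'
# 2.-3. Begin the stop, then SIGKILL the podman client mid-stop.
podman stop sleeper & STOP_PID=$!
sleep 2
kill -9 "$STOP_PID"    # the container is now in ContainerStateStopping
# 4. Kill the container's conmon as well.
kill -9 "$(podman inspect --format '{{.State.ConmonPid}}' sleeper)"
# 5. Before the fix in #22641, no command could move it out of "stopping".
podman ps -a --filter name=sleeper
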
TomSweeneyRedHat pushed a commit to TomSweeneyRedHat/podman that referenced this issue Jun 24, 2024
(Same commit message as above, with the additions:
Addresses: https://issues.redhat.com/browse/ACCELFIX-250
Signed-off-by: tomsweeneyredhat <[email protected]>)
stale-locking-app bot locked this conversation as resolved and limited it to collaborators Aug 12, 2024