podman stuck stopping #19629

Closed
FamousL opened this issue Aug 15, 2023 · 37 comments · Fixed by #22641
Labels: kind/bug, locked - please file new issue/PR

Comments

FamousL commented Aug 15, 2023

Issue Description

When using the systemd service file generated by podman generate systemd, server reboots often leave the container stuck in an invalid state, "stopping". A container in this state cannot be started, stopped, or otherwise manipulated beyond removal.

It is possible to rescue the container if the server is rebooted an additional time (requiring 2 reboots to return to a normal state: the initial reboot that puts the container in this state, plus one more to get it out).

To combat this I removed the PIDFile directive from the systemd service file, as another issue (not seeing it off hand) mentioned that the PID file being removed before the container is properly killed caused some unexpected behavior.

This seems to have solved the issue (3 reboots in, so a small sample size; further testing is required).
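
For reference, the workaround amounts to deleting the PIDFile= line from the generated unit; a hedged sketch (the unit file name here is illustrative, not the exact one from this system):

# Drop the PIDFile directive from the generated unit and reload systemd.
sudo sed -i '/^PIDFile=/d' /etc/systemd/system/container-motioneye.service
sudo systemctl daemon-reload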

podman version
Client: Podman Engine
Version: 4.6.0
API Version: 4.6.0
Go Version: go1.20.6
Built: Fri Jul 21 08:23:26 2023
OS/Arch: linux/amd64

rpm -q podman
podman-4.6.0-1.fc38.x86_64

Steps to reproduce the issue

  1. Create a container and run it as a daemon (run -d)
  2. Create a service file using podman generate systemd
  3. Put the service file in place, enable it, etc.
  4. Reboot
  5. Observe the podman container stuck "stopping"
  6. Reboot again to restore normal function
  7. Remove the PIDFile directive
  8. Reboot
  9. Observe the podman container is no longer "stopping"
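
In shell form, steps 1-3 correspond roughly to the following (container name and image are illustrative, not from this report):

# 1. Create a detached container ("myapp" and the image are placeholders).
podman run -d --name myapp docker.io/library/alpine sleep 1000000
# 2. Generate a systemd unit for it (by default this includes a PIDFile directive).
podman generate systemd --name myapp > /etc/systemd/system/container-myapp.service
# 3. Set the service in place and enable it.
systemctl daemon-reload
systemctl enable container-myapp.service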

Describe the results you received

The container was left in an unusable state when using the generated systemd service file.

Describe the results you expected

podman generate systemd should provide a file that will run the service without additional modifications.

podman info output

podman info
host:
  arch: amd64
  buildahVersion: 1.31.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 75.61
    systemPercent: 6.97
    userPercent: 17.41
  cpus: 24
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: server
    version: "38"
  eventLogger: journald
  freeLocks: 1997
  hostname: (hostname)
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.4.10-200.fc38.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 19607564288
  memTotal: 135142043648
  networkBackend: cni
  networkBackendInfo:
    backend: cni
    dns:
      package: podman-plugins-4.6.0-1.fc38.x86_64
      path: /usr/libexec/cni/dnsname
      version: |-
        CNI dnsname plugin
        version: 1.3.1
        commit: unknown
    package: |-
      podman-plugins-4.6.0-1.fc38.x86_64
      containernetworking-plugins-1.3.0-2.fc38.x86_64
    path: /usr/libexec/cni
  ociRuntime:
    name: crun
    package: crun-1.8.6-1.fc38.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.6
      commit: 73f759f4a39769f60990e7d225f561b4f4f06bcf
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20230625.g32660ce-1.fc38.x86_64
    version: |
      pasta 0^20230625.g32660ce-1.fc38.x86_64
      Copyright Red Hat
      GNU Affero GPL version 3 or later <https://www.gnu.org/licenses/agpl-3.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-12.fc38.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 2h 35m 2.00s (Approximately 0.08 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 19
    paused: 0
    running: 17
    stopped: 2
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 107321753600
  graphRootUsed: 98293788672
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 139
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.6.0
  Built: 1689942206
  BuiltTime: Fri Jul 21 08:23:26 2023
  GitCommit: ""
  GoVersion: go1.20.6
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.0

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

No

Additional environment details

Host is a bare metal box in my own lab.

Additional information

Until removal of the PIDFile directive, this occurred every time a reboot was issued for kernel updates.

FamousL added the kind/bug label Aug 15, 2023

rhatdan (Member) commented Aug 16, 2023

By any chance have you looked at running your service via quadlet under systemd?

https://www.redhat.com/sysadmin/quadlet-podman

FamousL (Author) commented Aug 16, 2023

I had not, mostly because I did not know of the tool's existence; however, it also claims to use the same generation routines. I am not sure whether that would have changed the output, but once I get cycles I can test it out.

rhatdan (Member) commented Aug 16, 2023

We are deprecating podman generate systemd and moving to Quadlet. Quadlet has enhanced features over generate systemd, so I would be interested to know whether the same failure happens.

baude (Member) commented Aug 17, 2023

We are deprecating podman generate systemd

we are ??????????????????

rhatdan (Member) commented Aug 17, 2023

Yes, we want people to use quadlet.

vrothberg (Member) commented

Deprecation in the sense of podman generate systemd staying around but being de-emphasized in favor of Quadlet; it will only receive backports and security fixes.

Nonetheless, I don't see how using Quadlet would fix this issue. Let's focus on the Podman bug.

vrothberg (Member) commented

@FamousL if possible, can you share the exact systemd unit?

vrothberg (Member) commented Sep 8, 2023

A quick fix is possibly to add the following line to the unit:

ExecStartPre=/usr/bin/podman stop  \
        -t 0 $CONTAINER_ID

FamousL (Author) commented Sep 8, 2023

Motioneye; basically just the generated file with the PIDFile directive removed.

It should be noted that your fix would not work, as the "stopping" state blocks a stop command.

[Unit]
Description=Podman container-c1ffb40417c4e278141326f892f071322f0ba21ff3159c08cacfde42a2e5d133.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=/run/containers/storage

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStart=/usr/bin/podman start c1ffb40417c4e278141326f892f071322f0ba21ff3159c08cacfde42a2e5d133
ExecStop=/usr/bin/podman stop \
	-t 10 c1ffb40417c4e278141326f892f071322f0ba21ff3159c08cacfde42a2e5d133
ExecStopPost=/usr/bin/podman stop \
	-t 10 c1ffb40417c4e278141326f892f071322f0ba21ff3159c08cacfde42a2e5d133
Type=forking

[Install]
WantedBy=default.target

FamousL (Author) commented Sep 8, 2023

Currently I have it set to use Quadlet, as even with this file it eventually ended up getting stuck as well.

vrothberg (Member) commented

Can you also share the command you used to create the container?

FamousL (Author) commented Sep 8, 2023

podman run --name="motioneye" \
  -p 8765:8765 \
  --hostname="motioneye" \
  -v /etc/localtime:/etc/localtime:ro \
  -v /etc/motioneye:/etc/motioneye \
  -v /var/lib/motioneye:/var/lib/motioneye \
  --restart="always" \
  --detach=true \
  ccrisan/motioneye:master-amd64

FamousL (Author) commented Sep 8, 2023

This is how I have the container file set for Quadlet currently:

[Unit]
Description=The motioneye container
After=local-fs.target

[Container]
Image=docker.io/ccrisan/motioneye:master-amd64
PublishPort=8765:8765
Volume=/etc/localtime:/etc/localtime:ro
Volume=/etc/motioneye:/etc/motioneye
Volume=/var/lib/motioneye:/var/lib/motioneye

[Service]
Restart=always

[Install]
# Start by default on boot
WantedBy=multi-user.target default.target
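
For reference, a minimal sketch of how such a .container file is activated under rootful Quadlet (per podman-systemd.unit(5); the file name is illustrative):

# Install the unit where the Quadlet generator looks for it.
sudo cp motioneye.container /etc/containers/systemd/
# daemon-reload runs the generator, which produces motioneye.service.
sudo systemctl daemon-reload
sudo systemctl start motioneye.service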

vrothberg (Member) commented

Which problems do you encounter with the Quadlet file? Heads up: I will be out next week, so my next reply may take some time.

FamousL (Author) commented Sep 8, 2023

I haven't had a problem with Quadlet yet; I added it as a reference point.

vrothberg (Member) commented

@mheon, what I suspect (though it is hard to reproduce) is:

  • The systemd unit gets nuked on shutdown during podman stop, leaving the container in the stopping state.
  • podman start is unable to recover from that.

@mheon is there a trick to detect when a container hasn't been started yet after a reboot? That would ultimately allow us to recover and correct the state during start and then go on with starting the container. podman stop is capable of recovering from that state, but that won't help such systemd units, including podman-restart.service.

mheon (Member) commented Sep 8, 2023

Reboot should force the state back to either Configured or Stopped. Stopping is not a valid state after a reboot. If that's happening, it's a bug in our state refresh code; the container should be forced to the Stopped state when we reset states after a reboot.

FamousL (Author) commented Sep 8, 2023

I realize the power this suggestion would grant might be too much; however, is there a way to add a force-stop? For example, a quick scan to force all processes related to the container to exit, then force a stopped state?

In this case there are no such processes (the server had just booted, after all, and the initial start command failed due to an invalid state).

mheon (Member) commented Sep 8, 2023

That's what podman stop -t 0 does. Unfortunately, you can always interrupt Podman during that (SIGKILL, for example), so we need intermediate steps as part of stopping to ensure that Podman being forced down midway through stopping does not put the container in a bad state - which is where we get the Stopping state.
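
For example (container name is a placeholder):

# A zero timeout skips the grace period, so the container is killed immediately.
podman stop -t 0 myapp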

mheon (Member) commented Sep 8, 2023

@vrothberg I checked, and what you're describing is not possible. If the container is in any state but Exited at reboot time, it is forced to the Configured state. Ref: https://github.com/containers/podman/blob/main/libpod/container_internal.go#L606-L608

FamousL (Author) commented Sep 8, 2023

I guess I was thinking of something that would be too unsafe to implement. My train of thought was more along the lines of telling the system that it doesn't know where it currently is, and having it reflect the stopped state instead.

mheon (Member) commented Sep 8, 2023

podman stop should be safe to run on a container in any state we track (except maybe paused containers? it's been a while since I checked that codepath), so it does do what you're describing - stopping a running, stopped, stopping, etc. container should always result in an Exited container.

FamousL (Author) commented Sep 8, 2023

But that was the first thing I did, and the stopping state remained. It wasn't until I did a second reboot that the container came back to a workable state.

mheon (Member) commented Sep 8, 2023

Yep, and that's definitely a bug, though I'm still not clear what's going on.

I wonder if the reboot isn't being detected - possible if our temporary directory is not on a tmpfs. Can you verify this? (You can get the temporary files directory with podman info --log-level=debug 2>&1 | grep "tmp dir".)

FamousL (Author) commented Sep 8, 2023

time="2023-09-08T10:27:57-04:00" level=debug msg="Using tmp dir /run/libpod"

Using mount to verify /run is indeed tmpfs (as some people do weird things):
tmpfs on /run type tmpfs (rw,nosuid,nodev,size=26394932k,nr_inodes=819200,mode=755,inode64)

mheon (Member) commented Sep 8, 2023

So, somehow, either the state refresh code is not running (it should be the first thing that happens after a fresh reboot; it ensures that the database is clean), or something is forcing the container back into the Stopping state immediately after reboot.

aklira commented Sep 27, 2023

Hi everyone,
Just to let you know that I am experiencing the exact same problem as the one described by @FamousL, and the workaround he suggested (removing the PIDFile directive from the systemd service file) seems to work for me.
But I would be very interested in understanding the cause of the problem and why the workaround fixes the issue.
I am using podman 4.2.0 running on Ubuntu 20.04 LTS.

jskacel commented Sep 30, 2023

Hit this issue as well. Checked the tmp dir via the link in comment #19629 (comment) and found it's using the /tmp filesystem (not tmpfs). Moved the folder /tmp/podman-run-<userid>/libpod/tmp and did a reboot; containers are up now :)
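
A hedged sketch of that check and workaround for a rootless setup (the path follows the comment above; <userid> becomes your numeric UID):

# Find podman's temporary files directory.
podman info --log-level=debug 2>&1 | grep "tmp dir"
# Confirm which filesystem it lives on; it should be tmpfs.
df -T /tmp/podman-run-$(id -u)/libpod/tmp
# Move it aside so the next boot looks like a fresh reboot to podman.
mv /tmp/podman-run-$(id -u)/libpod/tmp /tmp/podman-run-$(id -u)/libpod/tmp.bak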

FamousL (Author) commented Sep 30, 2023

@jskacel tmp is tmpfs for me; I suppose I should have mentioned that in the original post.

vrothberg (Member) commented

@mheon, maybe something for the next bug week.

mheon (Member) commented Oct 5, 2023

Sure.

If anyone has a solid reproducer for this, I'd love to see it; at present, the biggest issue with tracking it down is that I cannot replicate it myself.

spomata commented Nov 4, 2023

Hi @mheon, I can always reproduce the same issue with a podman-generated systemd file for a FreeIPA container.
Using podman version 4.4.1 running on RHEL (Rocky Linux) 8, I get a container in "stopping" state solidly at every reboot.

Here's the generated systemd file (with the PIDFile directive):

# container-freeipa-server-container.service
# autogenerated by Podman 4.4.1
# Sat Nov  4 16:20:29 CET 2023

[Unit]
Description=Podman container-freeipa-server-container.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=/run/containers/storage

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStart=/usr/bin/podman start freeipa-server-container
ExecStop=/usr/bin/podman stop  \
	-t 10 freeipa-server-container
ExecStopPost=/usr/bin/podman stop  \
	-t 10 freeipa-server-container
PIDFile=/run/containers/storage/overlay-containers/7a228f82f5e5a2df2b276e672e865bc5bd2727db1efb27670286cf3f1db69828/userdata/conmon.pid
Type=forking

[Install]
WantedBy=default.target

And here the podman run command for this container:

podman run --name freeipa-server-container -d -e IPA_SERVER_IP=$MY_IP \
-h $MY_HOSTNAME --read-only -p 80:80 -p 443:443 -p 389:389 -p 636:636 -p 88:88 -p 464:464 \
-p 8443:8443 -p 88:88/udp -p 464:464/udp -v /var/lib/ipa-data:/data:Z freeipa/freeipa-server:fedora-38

As already mentioned by other people in this issue, removing PIDFile does the trick.

AkashRajvanshi commented
Hi, has anyone found a solution to the above problem? I'm facing the same issue with podman 4.6, where containers go into a stopping state after every reboot. The only solution I have found is to use rm -f to remove the stuck container and then create a fresh container on every boot.

jawadapl commented
This happens intermittently for me too. I rarely reboot, but it may happen as much as half the time the RHEL 8.9 host is rebooted. I am running several containers in a single pod using the podman-kube@.service template with podman version 4.6.1.

Luap99 (Member) commented Apr 4, 2024

Please retest with podman 5.0

mheon (Member) commented Apr 24, 2024

@Luap99 What would you say about adding checks to syncContainer() to verify the container and conmon PIDs are still alive? It'll cost us (we run syncContainer everywhere), but sending a signal 0 to a PID should be pretty quick, and it should handle these "stuck running but not-really" cases in a definitive way.
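
The "signal 0" probe can be sketched in shell; in Podman this check would live in Go inside syncContainer(), and CONMON_PID here is a hypothetical PID for illustration:

# kill -0 delivers no signal; it only reports whether the PID exists
# and whether we are allowed to signal it.
CONMON_PID=12345
if kill -0 "$CONMON_PID" 2>/dev/null; then
    echo "conmon ($CONMON_PID) is still alive"
else
    echo "conmon ($CONMON_PID) is gone; the recorded state is stale"
fi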

Luap99 (Member) commented Apr 25, 2024

@Luap99 What would you say about adding checks to syncContainer() to verify the container and conmon PIDs are still alive? It'll cost us (we run syncContainer everywhere), but sending a signal 0 to a PID should be pretty quick, and it should handle these "stuck running but not-really" cases in a definitive way.

I'd like to understand the sequence of events first; unless we know what is happening, I am not a fan of patches that try to fix it in some other place.

And just going by the comments above, the only podman versions mentioned are 4.6 and older, so I'd like to see first whether this is actually still an issue with 5.0.

mheon added a commit to mheon/libpod that referenced this issue May 9, 2024
The scenario for inducing this is as follows:
1. Start a container with a long stop timeout and a PID1 that
   ignores SIGTERM
2. Use `podman stop` to stop that container
3. Simultaneously, in another terminal, kill -9 `pidof podman`
   (the container is now in ContainerStateStopping)
4. Now kill that container's Conmon with SIGKILL.
5. No commands are able to move the container from Stopping to
   Stopped now.

The cause is a logic bug in our exit-file handling logic. Conmon
being dead without an exit file causes no change to the state.
Add handling for this case that tries to clean up, including
stopping the container if it still seems to be running.

Fixes containers#19629

Signed-off-by: Matt Heon <[email protected]>
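
In shell form, that reproducer looks roughly like this (image, names, and timings are illustrative):

# 1. A container whose PID1 ignores SIGTERM, with a long stop timeout.
podman run -d --name sleeper --stop-timeout 60 \
    docker.io/library/alpine sh -c 'trap "" TERM; sleep 1000000'
# 2.-3. Begin the stop, then SIGKILL the podman client mid-stop.
podman stop sleeper & STOP_PID=$!
sleep 2
kill -9 "$STOP_PID"    # the container is now in ContainerStateStopping
# 4. Kill the container's conmon as well.
kill -9 "$(podman inspect --format '{{.State.ConmonPid}}' sleeper)"
# 5. Before the fix in #22641, no command could move it out of "stopping".
podman ps -a --filter name=sleeper
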
TomSweeneyRedHat pushed a commit to TomSweeneyRedHat/podman that referenced this issue Jun 24, 2024
(Same commit message as above, with the additions:
Addresses: https://issues.redhat.com/browse/ACCELFIX-250
Signed-off-by: tomsweeneyredhat <[email protected]>)
stale-locking-app bot locked this conversation as resolved and limited it to collaborators Aug 12, 2024