[v4.9-rhel] Ensure that containers do not get stuck in stopping #23088

@TomSweeneyRedHat TomSweeneyRedHat commented Jun 24, 2024

The scenario for inducing this is as follows:

  1. Start a container with a long stop timeout and a PID1 that ignores SIGTERM
  2. Use podman stop to stop that container
  3. Simultaneously, in another terminal, run kill -9 $(pidof podman) (the container is now in ContainerStateStopping)
  4. Now kill that container's Conmon with SIGKILL.
  5. No commands are able to move the container from Stopping to Stopped now.

The cause is a bug in our exit-file handling: when Conmon is dead and no exit file exists, the container state is never updated. This PR adds handling for that case, which tries to clean up, including stopping the container if it still appears to be running.
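
As a rough illustration of the missing case, here is a minimal Go sketch of the decision described above. It is not the code in this PR: `State`, `Probe`, `Action`, and `Resolve` are hypothetical names invented for the example, and the real libpod logic involves locking, OCI runtime calls, and cleanup that are left out. The only point is the third branch, where Conmon is dead and no exit file exists.

```go
package main

import "fmt"

// State is a simplified, hypothetical stand-in for a container state.
type State int

const (
	Running State = iota
	Stopping
	Stopped
)

func (s State) String() string {
	switch s {
	case Running:
		return "Running"
	case Stopping:
		return "Stopping"
	default:
		return "Stopped"
	}
}

// Probe holds the facts a state sync would have to gather (assumed inputs).
type Probe struct {
	ConmonAlive    bool // is the container's Conmon process still running?
	ExitFileFound  bool // did Conmon write an exit file?
	ContainerAlive bool // does the container's PID1 still seem to be running?
}

// Action is what the sync step should do next.
type Action struct {
	Next          State
	ReadExitFile  bool // take the exit code from the exit file
	KillContainer bool // force-stop the container before cleaning up
}

// Resolve decides how to move a container forward from its current state.
func Resolve(cur State, p Probe) Action {
	switch {
	case cur == Stopped:
		// Nothing left to do.
		return Action{Next: Stopped}
	case p.ExitFileFound:
		// Normal path: Conmon recorded the exit, so finish the stop.
		return Action{Next: Stopped, ReadExitFile: true}
	case p.ConmonAlive:
		// Conmon is still supervising the container; keep waiting.
		return Action{Next: cur}
	default:
		// Conmon is dead and left no exit file. Without this branch the
		// state never changes; instead, clean up and kill the container
		// if it still seems to be running.
		return Action{Next: Stopped, KillContainer: p.ContainerAlive}
	}
}

func main() {
	// The scenario from the steps above: Stopping, Conmon killed, no exit file.
	a := Resolve(Stopping, Probe{ContainerAlive: true})
	fmt.Printf("next=%s kill=%v\n", a.Next, a.KillContainer)
}
```

In this sketch no exit code can be recovered in the last branch, so a real implementation would have to choose how to report it; that choice is outside what the description above specifies.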

Fixes #19629

Addresses: https://issues.redhat.com/browse/ACCELFIX-250

Does this PR introduce a user-facing change?

None

The scenario for inducing this is as follows:
1. Start a container with a long stop timeout and a PID1 that
   ignores SIGTERM
2. Use `podman stop` to stop that container
3. Simultaneously, in another terminal, kill -9 `pidof podman`
   (the container is now in ContainerStateStopping)
4. Now kill that container's Conmon with SIGKILL.
5. No commands are able to move the container from Stopping to
   Stopped now.

The cause is a logic bug in our exit-file handling logic. Conmon
being dead without an exit file causes no change to the state.
Add handling for this case that tries to clean up, including
stopping the container if it still seems to be running.

Fixes containers#19629

Addresses: https://issues.redhat.com/browse/ACCELFIX-250

Signed-off-by: Matt Heon <[email protected]>
Signed-off-by: tomsweeneyredhat <[email protected]>
@TomSweeneyRedHat TomSweeneyRedHat added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2024
@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 24, 2024
@TomSweeneyRedHat
Member Author

Added the hold to make sure we get a Jira Card assigned and attached to this.

@openshift-ci openshift-ci bot added release-note-none and removed do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Jun 24, 2024
@TomSweeneyRedHat TomSweeneyRedHat added do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jun 24, 2024
@TomSweeneyRedHat
Member Author

@edsantiago and/or @cevich I keep running into the below error on the "Validate rawhide Build" test. Other than pressing "try again", is there anything else to do?

Repositories loaded.
Failed to resolve the transaction:
Problem: package python3-libs-3.12.0-2.fc40.i686 requires libtirpc.so.3, but none of the providers can be installed
  - package python3-libs-3.12.0-2.fc40.i686 requires libtirpc.so.3(TIRPC_0.3.0), but none of the providers can be installed
  - package libtirpc-1.3.4-1.rc3.fc41.i686 requires libgssapi_krb5.so.2, but none of the providers can be installed

@edsantiago
Member

Seems to be this

validate)
showrun dnf install -y $PACKAGE_DOWNLOAD_DIR/python3*.rpm

Can you try removing that dnf line? And maybe removing it from the other two places that also dnf it? At least that might get you past Validate and onto other tests. (Which of course might fail)

@cevich
Member

cevich commented Jun 27, 2024

Another suggestion: Remove the rawhide validation (or possibly ALL rawhide everything) from CI, I don't think old-rawhide is useful on a RHEL release branch.

The Rawhide tests were failing all over the place in the v4.9-rhel
branch. They are only necessary for main, so remove them.

Signed-off-by: tomsweeneyredhat <[email protected]>
@TomSweeneyRedHat TomSweeneyRedHat changed the title [WIP][v4.9-rhel] Ensure that containers do not get stuck in stopping [v4.9-rhel] Ensure that containers do not get stuck in stopping Jun 27, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 27, 2024
@TomSweeneyRedHat TomSweeneyRedHat removed the do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. label Jun 27, 2024
@TomSweeneyRedHat
Member Author

Turning off Rawhide seems to have done the trick. Ready for review and happy green tests buttons. Ditto #23089

Member

@Luap99 Luap99 left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 28, 2024
Contributor

openshift-ci bot commented Jun 28, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99, TomSweeneyRedHat

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Luap99,TomSweeneyRedHat]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit b699052 into containers:v4.9-rhel Jun 28, 2024
54 checks passed
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 27, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Sep 27, 2024
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • lgtm: Indicates that a PR is ready to be merged.
  • locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.
  • release-note-none