Adds ability to only SIGTERM the immediate process #10

Merged 2 commits into fpco:master on Oct 12, 2021

Conversation

@denibertovic (Collaborator)

This is useful in certain situations (with multiprocessing apps such as
gunicorn or celery) where we want to signal to the "main" process
only and let it handle gracefully terminating the "worker" processes.

If the "main" process does not gracefully shut down the "worker"
processes within the given timeout, we still make sure to send SIGKILL
to all remaining processes (-1).

Hey @snoyberg, do you mind going over this and letting me know what you think? Did I miss anything / could this be done in a better way?
I'm not crazy about the flag name but couldn't think of anything more suitable. Also, I didn't bump the major version, since this is behind a feature flag and wouldn't impact existing users.

@snoyberg (Member) commented on Jun 4, 2021

Overall this looks good, but I'm wondering if it actually needs a flag. Maybe the best behavior would be:

  1. SIGTERM immediate child
  2. Sleep for timeout
  3. SIGTERM all processes (-1)
  4. Sleep for timeout
  5. SIGKILL all processes

It seems overall better for all use cases. What do you think?
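
For illustration, here is a minimal Haskell sketch of that escalation (the `escalate` helper and its arguments are hypothetical, not the actual pid1 code; the real implementation also has to reap children and return early once everything has exited):

```haskell
import Control.Concurrent (threadDelay)
import System.Posix.Signals (sigKILL, sigTERM, signalProcess)
import System.Posix.Types (ProcessID)

-- The escalation sequence proposed above: TERM the immediate child,
-- wait, TERM everything (-1), wait again, then KILL whatever remains.
escalate :: ProcessID -> Int -> IO ()
escalate child timeoutSecs = do
  let pause = threadDelay (timeoutSecs * 1000000) -- threadDelay takes microseconds
  signalProcess sigTERM child -- 1. SIGTERM the immediate child
  pause                       -- 2. sleep for timeout
  signalProcess sigTERM (-1)  -- 3. SIGTERM all processes
  pause                       -- 4. sleep for timeout
  signalProcess sigKILL (-1)  -- 5. SIGKILL all remaining processes
```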

@denibertovic (Collaborator, Author)

@snoyberg I think I'm inclined to agree with you. The only issue I see is that it might be surprising that, in rare situations, the effective wait will be double the timeout set on the command line, because of the two SIGTERM attempts before the SIGKILL. But I guess that can be documented.

The reason I didn't go with that approach initially is that it's a breaking change, and I wasn't sure if it would impact existing users (although I can't yet see how).
That, and I was basing my implementation on what tini does: I believe tini uses child process groups for this use case. That said, I see no difference between sending a signal to the child process group and sending it to "everything" (-1), since they should cover the same set of processes inside a container. Right?

Let me try and implement the proposed solution.

@denibertovic (Collaborator, Author)

@snoyberg I totally forgot about this. I actually made the change locally, but then vacation came along, it slipped my mind, and I never pushed it.

Anyway, I've pushed a new commit that does what we discussed above: send SIGTERM to the immediate child -> wait for the timeout -> send SIGTERM to all children -> wait for the timeout (again) -> send SIGKILL to all remaining children. I ran some local testing and it seems to work fine for my use case.

I pushed the change as a new commit so we can still choose between the two approaches, and it makes things easier to revert if we decide to. But if you want, let me know and I can squash the two commits.

There's one thing I'm not too sure about. In a scenario where we don't have a well-behaved process (one that ignores the TERM signal), we will essentially be allowing it double the timeout before killing it. I'm not sure whether this is a problem, since the default timeout is 5 seconds, meaning we would wait 10 seconds by default. Perhaps we should change the default timeout to 3 seconds, so that doubling it to 6 is closer to the original behavior. Or would that be even more confusing for users?

Anyway, I know you've got your hands full at the moment, so don't feel pressure to respond to this. :)

@snoyberg (Member) left a review:

Minor comment update, otherwise LGTM

Instead, we will signal TERM to the immediate child process by default.
After that, we wait for the `timeout` period and signal TERM to all children.
Finally, we wait for another `timeout` period and signal KILL to all remaining processes
(in case they didn't exit on TERM by then).

It's very important to set `-t`/`--timeout` appropriately, based on how long you want to allow your processes for graceful shutdown.
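
As a usage sketch, in a Docker image this could look like the following (illustrative only: the `/sbin/pid1` path follows the fpco/pid1 image convention, and the 30-second timeout and gunicorn command are arbitrary examples; only the `-t`/`--timeout` flag comes from the paragraph above):

```dockerfile
# Run pid1 as PID 1 with a 30-second graceful-shutdown window per phase;
# gunicorn receives the initial TERM and shuts down its own workers.
ENTRYPOINT ["/sbin/pid1", "-t", "30"]
CMD ["gunicorn", "app:app"]
```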
@snoyberg (Member) left a review:

LGTM!

@snoyberg merged commit 6f2e150 into fpco:master on Oct 12, 2021
@denibertovic (Collaborator, Author)

@snoyberg awesome, thanks! When you get the chance, can you do a GH release (with the binary) and push the new Docker image?

@snoyberg (Member)

If I made you a maintainer here, would you be up for creating the GH release?

Docker Hub should already be updated: https://hub.docker.com/layers/fpco/pid1/0.1.3.0/images/sha256-05541270069465ac854d1fe79e68cd0a7df64fd42ec42aa5049bed14b82678a4?context=explore

@denibertovic (Collaborator, Author)

@snoyberg sure thing, I can do that.

@denibertovic (Collaborator, Author)

@snoyberg just a reminder about the repo permissions so I can make a release.

@snoyberg (Member)

Sorry, my mistake, I thought I'd added you already! Added now.

@denibertovic (Collaborator, Author)

Hey @snoyberg, forgot about this again... here's the release: https://github.com/fpco/pid1/releases/tag/pid1-0.1.3.0

The process for building a static binary didn't work for me anymore (`static-base/Dockerfile` no longer builds), so I had to use some Nix magic to get it to produce a static binary. I used @nh2's approach suggested here: #1 (comment)

I'll make a separate PR with the changes, as it might be useful; it's certainly more reproducible.
If I find the time, I'll try to add a GitHub Actions pipeline, since Travis doesn't work anymore and it would be useful to have this build/release work automated again.
