-
Notifications
You must be signed in to change notification settings - Fork 225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fails mpi-operator early if access to list or watch objects is denied #619
Conversation
I wouldn't make it an option. |
The thought process for making it an option was to have the Right now the
Waiting for Making the "fail-fast" optional (and defaulting it to The use this request is trying to address: when it'll be known upfront that the permission errors are not transient, and you want the Regarding kubelet or the Deployment Controller handling this, it seems that they would only handle this based on a readiness probe, which currently – once the leader is elected – will always come back as healthy. Per my other comment, all of this could technically be handled in a more robust readiness/health probe, but that poses other challenges (namely that |
pkg/controller/mpi_job_controller.go
Outdated
cache.DefaultWatchErrorHandler(r, err) | ||
|
||
if errors.IsUnauthorized(err) || errors.IsForbidden(err) { | ||
klog.Fatalf("Unable to sync cache for informer %s: %s. Exiting.", name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... Doesn't seem like it's possible to return an error
:
SetWatchErrorHandler
take a WatchErrorHandler
– which does not return anything. If SetWatchErrorHandler
fails, the code already returns an error
for that.
With regard to Replacing Fatal Calls, klog.ErrorS
and klog.FlushAndExit
is in later version of klog
, but the operator is using v1
, which doesn't have those methods ... 😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a mistake: SetWatchErrorHandler
was calling klog.Fatalf()
as well. Changed it to return an error
if it fails.
Exploring some other ways a graceful shutdown can be initiated if the error handler goes get invoked (without calling klog.Fatalf()
):
- Upgrade to
klog/v2
- Call
kubeapiserver.RequestShutdown()
Regarding kubelet or the Deployment Controller handling this, it seems that they would only handle this based on a readiness probe, which currently – once the leader is elected – will always come back as healthy. What I mean is that if you enable fail-fast, we would have the kubelet restart the container or deployment controller to recreate the pod. So effectively, the E2E behavior is that the system continues trying to start the manager. And that E2E behavior is pretty similar to the existing behavior, except that it retries without exiting the binary. Users are not losing anything. Instead, they are getting more visibility into any possible failures, as they will see pods being recreated or restarted, causing them to investigate. |
b597ebd
to
c7fda6e
Compare
Oh, I see what you mean: in effect the pod is exiting and thus being forced into an unhealthy state ... Do you suggest then that the option be completely removed? This will cause the "current behavior" to change ... |
I don't expect anyone to be relying on the "current behavior". The new behavior just moves the retry to a higher level controller, as opposed to being retried within the binary. So yes, I think we can just remove the option. |
I'm just trying not to break stuff 😄
Removed |
cache.DefaultWatchErrorHandler(r, err) | ||
|
||
if errors.IsUnauthorized(err) || errors.IsForbidden(err) { | ||
klog.Fatalf("Unable to sync cache for informer %s: %s. Requesting controller to exit.", name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return error here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the where an error
can't be returned!
SetWatchErrorHandler
takes a WatchErrorHandler
which is defined as:
type WatchErrorHandler func(r *Reflector, err error)
I.e. its args are (*Reflector, error
) – and returns nothing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh true.
Still, please don't use Fatalf
. Follow this guide https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/migration-to-structured-logging.md#replacing-fatal-calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To follow this guide, klog
needs to be upgraded to klog/v2
. The methods klog.ErrorS()
and klog.FlushAndExit()
are not available in klog/v1.0.0
.
Should klog
be updated to klog/v2
for this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:(
Let's leave that for a follow up
/lgtm
/approve
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please squash
Absolutely, I appreciate the amount of consideration for backwards-compatibility. |
38299ef
to
5b2f51a
Compare
…idden or unauthorized to resources
5b2f51a
to
f8b77a4
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Currently the
mpi-operator
waits for informer caches to sync before being ready to do work.If access to any objects fail due to permission issues, the failure is simply logged, and until or unless the access issues is resolved, the
mpi-operator
remains in this waiting state indefinitely.This request adds an option to fatally fail the
mpi-operator
if the operator is unable to list or watch the objects due to Forbidden or Unauthorized errors.The option defaults to
false
so that existing functionality / behavior is maintained.