-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Small timout for worker notification #242
Conversation
It seems at the moment if a worker re-establishes notify stream (for example, if network flips or proxy breaks the connection) then we can see "no worker registered with this name" errors. This change makes Notifier to wait for 30 seconds before failing, at the time of calling `Notifier#Notify` we know such worker exists. PS not sure if we need to make the timeout configurable.
@@ -47,6 +51,17 @@ func (watcher *Notifier) Register(ctx context.Context, worker string) (chan *rpc | |||
|
|||
func (watcher *Notifier) Notify(ctx context.Context, worker string, msg *rpc.WatchInstruction) error { | |||
slot, ok := watcher.workers.Load(worker) | |||
|
|||
deadline := time.Now().Add(watcher.workerWaitTimeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about simply respecting the ctx
here?
We already have a configurable timeout which defaults to 30 seconds and translates to a context.Context
:
orchard/internal/controller/api_vms_ip.go
Lines 29 to 30 in 88fba80
waitContext, waitContextCancel := context.WithTimeout(ctx, time.Duration(wait)*time.Second) | |
defer waitContextCancel() |
You'll just need to pass it to Notify()
and respect it while waiting for the worker to re-connect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like 23e36df?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are a couple of other Notify()
invocations that need the proper context supplied to them, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But not all of them have config for wait. Will see how to pass it around
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually start liking this idea less and less particurally because of this use:
orchard/internal/controller/scheduler/scheduler.go
Lines 352 to 356 in 581de32
if err := scheduler.notifier.Notify(context.Background(), affectedWorker, &rpc.WatchInstruction{ | |
Action: &rpc.WatchInstruction_SyncVmsAction{}, | |
}); err != nil { | |
scheduler.logger.Errorf("Failed to reactively sync VMs on worker %s: %v", affectedWorker, err) | |
} |
With just waiting for context such use in the future might end up in a infinity loop. What do you think of passing a timeout
duration explicitly to Notify
and respect both context and the timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted back. I think there should be some default on the worker notify channel reconnection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With just waiting for context such use in the future might end up in a infinity loop.
Passing a time-bounded context should to the trick.
But overall it's a two sided coin: with the changes currently in this PR, we'll needlessly wait 30 seconds for each network-flipped worker in the scheduler, and there could be a lot of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
23e36df
to
1e6ebd4
Compare
It seems at the moment if a worker re-establishes notify stream (for example, if network flips or proxy breaks the connection) then we can see "no worker registered with this name" errors.
This change makes Notifier to wait for 30 seconds before failing, at the time of calling
Notifier#Notify
we know such worker exists.PS not sure if we need to make the timeout configurable.