Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small timout for worker notification #242

Merged
merged 4 commits into from
Feb 6, 2025
Merged

Conversation

fkorotkov
Copy link
Contributor

It seems at the moment if a worker re-establishes notify stream (for example, if network flips or proxy breaks the connection) then we can see "no worker registered with this name" errors.

This change makes Notifier to wait for 30 seconds before failing, at the time of calling Notifier#Notify we know such worker exists.

PS not sure if we need to make the timeout configurable.

It seems at the moment if a worker re-establishes notify stream (for example, if network flips or proxy breaks the connection) then we can see "no worker registered with this name" errors.

This change makes Notifier to wait for 30 seconds before failing, at the time of calling `Notifier#Notify` we know such worker exists.

PS not sure if we need to make the timeout configurable.
@fkorotkov fkorotkov requested a review from edigaryev February 5, 2025 13:15
@@ -47,6 +51,17 @@ func (watcher *Notifier) Register(ctx context.Context, worker string) (chan *rpc

func (watcher *Notifier) Notify(ctx context.Context, worker string, msg *rpc.WatchInstruction) error {
slot, ok := watcher.workers.Load(worker)

deadline := time.Now().Add(watcher.workerWaitTimeout)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about simply respecting the ctx here?

We already have a configurable timeout which defaults to 30 seconds and translates to a context.Context:

waitContext, waitContextCancel := context.WithTimeout(ctx, time.Duration(wait)*time.Second)
defer waitContextCancel()

You'll just need to pass it to Notify() and respect it while waiting for the worker to re-connect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like 23e36df?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are a couple of other Notify() invocations that need the proper context supplied to them, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But not all of them have config for wait. Will see how to pass it around

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually start liking this idea less and less particurally because of this use:

if err := scheduler.notifier.Notify(context.Background(), affectedWorker, &rpc.WatchInstruction{
Action: &rpc.WatchInstruction_SyncVmsAction{},
}); err != nil {
scheduler.logger.Errorf("Failed to reactively sync VMs on worker %s: %v", affectedWorker, err)
}

With just waiting for context such use in the future might end up in a infinity loop. What do you think of passing a timeout duration explicitly to Notify and respect both context and the timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted back. I think there should be some default on the worker notify channel reconnection.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With just waiting for context such use in the future might end up in a infinity loop.

Passing a time-bounded context should to the trick.

But overall it's a two sided coin: with the changes currently in this PR, we'll needlessly wait 30 seconds for each network-flipped worker in the scheduler, and there could be a lot of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See aa71c03 but what about these two places 4d27b84

@fkorotkov fkorotkov force-pushed the fedor-worker-notify-timeout branch from 23e36df to 1e6ebd4 Compare February 6, 2025 11:26
@fkorotkov fkorotkov enabled auto-merge (squash) February 6, 2025 15:31
@fkorotkov fkorotkov merged commit 86f0afb into main Feb 6, 2025
3 of 4 checks passed
@fkorotkov fkorotkov deleted the fedor-worker-notify-timeout branch February 6, 2025 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants