Small timeout for worker notification #242

fkorotkov · 2025-02-05T13:15:11Z

It seems that at the moment, if a worker re-establishes its notify stream (for example, if the network flips or a proxy breaks the connection), we can see "no worker registered with this name" errors.

This change makes Notifier wait up to 30 seconds before failing, since at the time of calling `Notifier#Notify` we know such a worker exists.

PS: not sure whether we need to make the timeout configurable.

edigaryev · 2025-02-05T16:50:41Z

internal/controller/notifier/notifier.go

@@ -47,6 +51,17 @@ func (watcher *Notifier) Register(ctx context.Context, worker string) (chan *rpc

 func (watcher *Notifier) Notify(ctx context.Context, worker string, msg *rpc.WatchInstruction) error {
 	slot, ok := watcher.workers.Load(worker)
+
+	deadline := time.Now().Add(watcher.workerWaitTimeout)


What do you think about simply respecting the ctx here?

We already have a configurable timeout which defaults to 30 seconds and translates to a context.Context:

orchard/internal/controller/api_vms_ip.go

Lines 29 to 30 in 88fba80

	waitContext, waitContextCancel := context.WithTimeout(ctx, time.Duration(wait)*time.Second)
	defer waitContextCancel()

You'll just need to pass it to Notify() and respect it while waiting for the worker to re-connect.

Something like 23e36df?

Yes, there are a couple of other Notify() invocations that need the proper context supplied to them, though.

But not all of them have a config for wait. I'll see how to pass it around.

I'm actually starting to like this idea less and less, particularly because of this use:

orchard/internal/controller/scheduler/scheduler.go

Lines 352 to 356 in 581de32

	if err := scheduler.notifier.Notify(context.Background(), affectedWorker, &rpc.WatchInstruction{
		Action: &rpc.WatchInstruction_SyncVmsAction{},
	}); err != nil {
		scheduler.logger.Errorf("Failed to reactively sync VMs on worker %s: %v", affectedWorker, err)
	}

With just waiting for the context, such a use might in the future end up in an infinite loop. What do you think of passing a timeout duration explicitly to Notify and respecting both the context and the timeout?

Reverted. I think there should be some default timeout for the worker notify channel reconnection.

With just waiting for the context, such a use might in the future end up in an infinite loop.

Passing a time-bounded context should do the trick.

But overall it's a two-sided coin: with the changes currently in this PR, we'll needlessly wait 30 seconds for each network-flipped worker in the scheduler, and there could be a lot of them.

See aa71c03, but what about these two places: 4d27b84?

fkorotkov requested a review from edigaryev February 5, 2025 13:15

edigaryev reviewed Feb 5, 2025

fkorotkov force-pushed the fedor-worker-notify-timeout branch from 23e36df to 1e6ebd4 Compare February 6, 2025 11:26

fkorotkov added 3 commits February 6, 2025 08:56

Wait via context

aa71c03

Make sure all contexts for Notify is time bounded

4d27b84

Lint issues

e5fcf91

fkorotkov enabled auto-merge (squash) February 6, 2025 15:31

edigaryev approved these changes Feb 6, 2025

fkorotkov merged commit 86f0afb into main Feb 6, 2025
3 of 4 checks passed

fkorotkov deleted the fedor-worker-notify-timeout branch February 6, 2025 17:30

edigaryev mentioned this pull request Feb 6, 2025

Worker notification improvements #246

Merged
