More Versioning flake fixes #7436
base: main
Conversation
tests/worker_deployment_test.go
- time.Sleep(1 * time.Millisecond)
+ time.Sleep(10 * time.Millisecond)
The flakes seen in the WorkerDeploymentSuite occurred because the second goroutine's updates happened before the first one's. I was a little surprised this was possible, given that there was a 1 millisecond gap between starting the two goroutines.
I therefore increased the time gap between the goroutines and saw the errors resolved. I have also run the tests in CI, and they pass.
Is the 1st one `go s.pollFromDeployment(ctx, tv)` and the second one the `go func() {` on line 441?
The Go scheduler doesn't necessarily pick up a goroutine immediately. It will park them; and once it decides to pause the current goroutine, it picks a random one that is unblocked to continue. So you can never count on any ordering there.
Furthermore, increasing the sleep is not a sure way to keep this non-flaky. You need some kind of signal/wait condition to know you're good to continue. Anything based on sleeps is either brittle, slow or both, unfortunately.
Do you think you can replace the Sleep?
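A minimal sketch of the signal/wait approach suggested above (hypothetical names, not the actual test code): instead of sleeping and hoping the scheduler ran the first goroutine, the first goroutine closes a channel once it has done its work, and the second goroutine blocks on that channel before proceeding.

```go
package main

import (
	"fmt"
	"sync"
)

// runOrdered starts two goroutines and enforces their ordering with a
// channel instead of time.Sleep: the second goroutine blocks until the
// first signals it is done by closing firstDone.
func runOrdered() []string {
	var order []string
	var mu sync.Mutex
	record := func(s string) {
		mu.Lock()
		order = append(order, s)
		mu.Unlock()
	}

	firstDone := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		record("first update sent")
		close(firstDone) // signal: safe for the second goroutine to proceed
	}()
	go func() {
		defer wg.Done()
		<-firstDone // wait for the signal instead of sleeping
		record("second update sent")
	}()
	wg.Wait()
	return order
}

func main() {
	fmt.Println(runOrdered())
}
```

Closing a channel establishes a happens-before edge in the Go memory model, so this ordering holds regardless of how the scheduler interleaves the goroutines, which a sleep cannot guarantee.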
Oh! This comment had me shocked: "The Go scheduler doesn't necessarily pick up a goroutine immediately. It will park them; and once it decides to pause the current goroutine, it picks a random one that is unblocked to continue. So you can never count on any ordering there."
For some reason, I thought the scheduler placed goroutines in a queue and picked the one that was scheduled earliest. I will replace the sleep here and use signals to enforce the ordering I need :)
Note: in this context, the first goroutine is defined on lines 600-610 and the second goroutine on lines 616-626.
Thinking about it a bit more, this isn't as easy as I first thought. The aim of this function is to send two update requests such that both pass the update validator's checks, but the second request is caught by the update handler's validator check. Because of this constraint, I can't send the requests sequentially.
What I want is the following:
The first goroutine's request passes the validator, enters the handler, and gives up control. The second goroutine's request then passes the validator (it will, since the first request has not fully completed and has not changed the state) and gives up control as well. However, the second request can never complete successfully: control returns to the first goroutine, which changes the state. When control switches back to the second goroutine, it re-checks the state and realizes it cannot proceed because the state has changed.
From what you just mentioned, I gather that a definitive ordering guarantee between the two goroutines may not be possible.
Update handler:
func (d *WorkflowRunner) handleSetRampingVersion(ctx workflow.Context, args *deploymentspb.SetRampingVersionArgs) (*deploymentspb.SetRampingVersionResponse, error) {
Resolution: I thought about this for quite a bit and realized there is no way around it. I decided to remove the sleep and stop relying on a deterministic ordering in my test cases. Instead, I now check whether either one of the updates completed, which should resolve the flakes.
Thank you @stephanos for making me aware about this!
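A hypothetical sketch of such an order-independent assertion (names and error are invented for illustration), assuming that of two racing non-idempotent updates exactly one should win while the other is rejected by the state check:

```go
package main

import (
	"errors"
	"fmt"
)

// errStaleState stands in for the handler rejecting a request whose
// state check fails (the real test would see a service error instead).
var errStaleState = errors.New("state changed by concurrent update")

// exactlyOneSucceeded is the order-independent check: with two racing
// non-idempotent updates, one must succeed and the other must fail,
// but which is which is up to the scheduler.
func exactlyOneSucceeded(err1, err2 error) bool {
	return (err1 == nil) != (err2 == nil)
}

func main() {
	fmt.Println(exactlyOneSucceeded(nil, errStaleState)) // one winner
	fmt.Println(exactlyOneSucceeded(nil, nil))           // both succeeded: invalid
}
```

Because the assertion is symmetric in the two errors, the test passes no matter which goroutine the scheduler runs first.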
@@ -2033,15 +2036,18 @@ func (s *Versioning3Suite) waitForDeploymentDataPropagation(
 			delete(remaining, pt)
 		}
 	}
+	if unversionedRamp {
I am placing this block of code above the version-presence checks because there was a subtle error in how this function was written.
Consider a task queue that had version "v1" synced to it, followed by a sync of an unversioned ramp to the same task queue. This function could not properly assert that the unversioned ramp sync had propagated to all task-queue partitions if tv.Version was already present in the task queue's user data.
This can be seen in TestDescribeTaskQueueVersioningInfo:
s.syncTaskQueueDeploymentData(tv, true, 0, false, t1, tqTypeAct) // sync "tv" to tqTypeAct
// Now ramp to unversioned
s.syncTaskQueueDeploymentData(tv, false, 10, true, t2, tqTypeAct) // sync "unversioned" to tqTypeAct
s.waitForDeploymentDataPropagation(tv, true, tqTypeAct)
The previous version of this function would incorrectly return early because version tv was already present in tqTypeAct, even though we want it to return only after the unversioned field has been populated.
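The ordering bug can be reduced to the following sketch (the type and function names are hypothetical stand-ins, not the real user-data structures): the unversioned-ramp check must run before the version-presence check, otherwise a partition that already contains tv.Version satisfies the wait prematurely.

```go
package main

import "fmt"

// userData is a hypothetical stand-in for a partition's user data.
type userData struct {
	hasVersion      bool // tv.Version already synced (e.g. at t1)
	unversionedRamp bool // unversioned ramp field populated (e.g. at t2)
}

// propagated reports whether the data we are waiting for has arrived.
// When waiting for an unversioned ramp, only the ramp field matters;
// checking version presence first would return true for a partition
// that was versioned earlier but has not yet received the ramp.
func propagated(d userData, waitForUnversionedRamp bool) bool {
	if waitForUnversionedRamp {
		return d.unversionedRamp
	}
	return d.hasVersion
}

func main() {
	// Version synced at t1, still waiting for the unversioned ramp from t2:
	fmt.Println(propagated(userData{hasVersion: true}, true)) // keep waiting
}
```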
tests/worker_deployment_test.go
@@ -621,7 +621,7 @@ func (s *WorkerDeploymentSuite) TestSetRampingVersion_ConcurrentUpdates_NonIdemp
 	Version: tv.DeploymentVersionString(),
 	Percentage: 5,
 	ConflictToken: cT,
-	Identity: tv.Any().String(), // note: different identity
+	Identity: tv.Any().String(), // note: different identity making this request different from the first one.
Is it necessary for it to be a different identity?
If it is, I'd recommend tv.ClientIdentity() + "-OtherClient", so you preserve the original client ID (it contains meta info that points to this test, which can be helpful when debugging).
> Is it necessary for it to be a different identity?

Yes! The idea behind this specific test case is to have two concurrent requests come in with different identities but the same conflict token. I wanted to verify that a request carrying a stale conflict token does not go ahead and change the state of the entity workflow; the different identity lets the test check whether the stale request changed state or not.
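The conflict-token semantics being tested can be sketched as follows (a toy model with invented names; the real token is an opaque value carried by the update request, not an int):

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict token is stale")

// entityState is a hypothetical stand-in for the entity workflow's state.
type entityState struct {
	rampPercentage float32
	rampIdentity   string
	conflictToken  int // toy token; the real one is opaque bytes
}

// setRamp applies an update only if the caller's token matches the
// current one; a successful update rotates the token, so any request
// still holding the old token is rejected without changing state.
func (s *entityState) setRamp(pct float32, token int, identity string) error {
	if token != s.conflictToken {
		return errConflict // stale token: state must not change
	}
	s.rampPercentage = pct
	s.rampIdentity = identity
	s.conflictToken++ // state changed; previously issued tokens are now stale
	return nil
}

func main() {
	s := &entityState{conflictToken: 1}
	fmt.Println(s.setRamp(5, 1, "client-A"))  // first update: succeeds
	fmt.Println(s.setRamp(10, 1, "client-B")) // same token, different identity: rejected
}
```

The different identity is what makes the two requests distinguishable: if the stale second request had mutated state, the recorded identity (or percentage) would betray it.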
What changed?
Fixing the following tests:
Why?
How did you test it?
Potential risks
Documentation
Is hotfix candidate?