More Versioning flake fixes #7436

Shivs11 · 2025-03-06T15:58:12Z

What changed?

Fixing the following tests:

TestDescribeTaskQueueVersioningInfo
TestSetCurrentVersion_ConcurrentUpdates_NonIdempotentRequests
TestSetRampingVersion_ConcurrentUpdates_NonIdempotentRequests

Why?

To hopefully not see them in the flaky test reports.

How did you test it?

Ran all of them in CI 50 times with different DBs, and all passed.
Ran them locally as well.

Potential risks

Documentation

Is hotfix candidate?

Shivs11 · 2025-03-06T16:24:45Z

tests/worker_deployment_test.go

-	time.Sleep(1 * time.Millisecond)
+	time.Sleep(10 * time.Millisecond)


The flakes seen in the WorkerDeploymentSuite occurred because the second goroutine's updates happened before the first one's. I was a little surprised that this was possible, given the 1 millisecond gap between the two goroutines.

Thus, I increased the time gap between the goroutines, which resolved the errors. I have also run the tests in CI, and they passed.

Is the 1st one go s.pollFromDeployment(ctx, tv) and the second one go func() { on line 441?

The Go scheduler doesn't necessarily pick up a goroutine immediately. It will park them; and once it decides to pause the current goroutine, it picks a random one that is unblocked to continue. So you can never count on any ordering there.

Furthermore, increasing the sleep is not a sure way to keep this non-flaky. You need some kind of signal/wait condition to know you're good to continue. Anything based on sleeps is either brittle, slow or both, unfortunately.

Do you think you can replace the Sleep?
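For illustration, here is a minimal sketch of a signal-based handoff between two goroutines. It is not the code from this PR; `runOrdered`, `firstDone`, and the recorded event names are hypothetical. It only shows the general pattern of having the second goroutine wait on a channel instead of relying on a sleep.

```go
package main

import (
	"fmt"
	"sync"
)

// runOrdered starts two goroutines and uses a channel, not a sleep, to
// guarantee that the first one has finished its step before the second
// one performs its own. All names here are hypothetical.
func runOrdered() []string {
	var (
		mu     sync.Mutex
		events []string
		wg     sync.WaitGroup
	)
	record := func(e string) {
		mu.Lock()
		defer mu.Unlock()
		events = append(events, e)
	}

	// firstDone is closed once the first goroutine has done its work.
	firstDone := make(chan struct{})

	wg.Add(2)
	go func() {
		defer wg.Done()
		record("first")
		close(firstDone)
	}()
	go func() {
		defer wg.Done()
		<-firstDone // blocks until the first goroutine signals
		record("second")
	}()
	wg.Wait()
	return events
}

func main() {
	fmt.Println(runOrdered()) // prints "[first second]"
}
```

The ordering holds no matter which goroutine the scheduler runs first, because the second goroutine cannot get past the channel receive until the first one closes the channel.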

Oh! This comment had me shocked: "The Go scheduler doesn't necessarily pick up a goroutine immediately. It will park them; and once it decides to pause the current goroutine, it picks a random one that is unblocked to continue. So you can never count on any ordering there."

For some reason, I thought the scheduler placed goroutines in a queue and picked the one that was scheduled earliest. I will replace the sleep here and use signals to get the ordering I need :)

Note: in this context, the first goroutine is defined on lines 600-610 and the second goroutine on lines 616-626.

Thinking about it a bit more, this isn't as easy as I first thought. The aim of this function is to send two update requests such that both pass the update validator's checks, but the second request gets caught by the update handler's own validation check. Because of this constraint, I can't send these requests sequentially.

What I desire is the following:

The first goroutine's request passes the validator, enters the handler, and gives up control. Then the second goroutine's request passes the validator (it will, since the first request has not completed and has not yet changed the state) and also gives up control. However, the second request should never complete successfully: control goes back to the first goroutine's request, which changes the state. When control switches back to the second request, it checks the state again and realizes it can't proceed because the state has changed.

From what you just mentioned, I gather that a definitive ordering guarantee between the two goroutines may not be possible.

Update Handler -

temporal/service/worker/workerdeployment/workflow.go

Line 251 in d238a83

func (d *WorkflowRunner) handleSetRampingVersion(ctx workflow.Context, args *deploymentspb.SetRampingVersionArgs) (*deploymentspb.SetRampingVersionResponse, error) {

Resolution: I thought about this for quite a bit and realized there is no way around it. I decided to remove the sleep and stop relying on a deterministic ordering in my test cases. Instead, I now check whether either one of the two requests completed, which should hopefully resolve the flakes.

Thank you @stephanos for making me aware of this!
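As a rough sketch of an order-agnostic assertion (one possible reading of "check whether either one completed", not the code from this PR), the test can fire both requests with no ordering between them and then assert on the combined outcome. `state`, `update`, `errConflict`, and `raceTwoUpdates` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errConflict = errors.New("conflict token mismatch")

// state is a hypothetical stand-in for the entity guarded by a conflict token.
type state struct {
	mu    sync.Mutex
	token int
}

// update applies the change only if the caller's token is still current.
func (s *state) update(token int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token != s.token {
		return errConflict
	}
	s.token++
	return nil
}

// raceTwoUpdates fires two updates that share the same token, with no
// ordering between them, and counts the outcomes.
func raceTwoUpdates() (succeeded, rejected int) {
	s := &state{}
	token := s.token
	errs := make(chan error, 2)

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			errs <- s.update(token)
		}()
	}
	wg.Wait()
	close(errs)

	for err := range errs {
		switch {
		case err == nil:
			succeeded++
		case errors.Is(err, errConflict):
			rejected++
		}
	}
	return succeeded, rejected
}

func main() {
	ok, rejected := raceTwoUpdates()
	fmt.Println(ok, rejected) // prints "1 1"
}
```

The assertion does not care which goroutine wins, only that exactly one update went through and the other was rejected, so scheduler ordering cannot make it flaky.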

Shivs11 · 2025-03-06T17:05:40Z

tests/versioning_3_test.go

@@ -2033,15 +2036,18 @@ func (s *Versioning3Suite) waitForDeploymentDataPropagation(
 						delete(remaining, pt)
 					}
 				}
+				if unversionedRamp {


Placing this block of code above the version-presence checks, since there was a small error in the way this function was written.

Consider a task queue that had version "v1" synced to it, followed by an unversioned ramp sync. If tv.Version was already present in the task queue's user data, this function would not let us properly assert that the unversioned ramp sync had propagated to all of the task queue's partitions.

This can be seen in TestDescribeTaskQueueVersioningInfo:

s.syncTaskQueueDeploymentData(tv, true, 0, false, t1, tqTypeAct) // sync "tv" to tqTypeAct
// Now ramp to unversioned
s.syncTaskQueueDeploymentData(tv, false, 10, true, t2, tqTypeAct) // sync "unversioned" to tqTypeAct
s.waitForDeploymentDataPropagation(tv, true, tqTypeAct)

The previous version of this function would incorrectly return early because version tv is present in tqTypeAct, even though we want it to return only after the unversioned field has been populated.
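The effect of the check ordering can be sketched with a simplified model. This is an assumption-laden illustration, not the real helper: `partitionData`, `propagatedOld`, and `propagatedNew` are hypothetical, and the actual function polls every partition rather than inspecting a single struct.

```go
package main

import "fmt"

// partitionData is a hypothetical, simplified view of one partition's user data.
type partitionData struct {
	versions        map[string]bool
	unversionedRamp bool
}

// propagatedOld models the ordering described as buggy: the version-presence
// check runs first and short-circuits, so the ramp is never inspected.
func propagatedOld(d partitionData, version string, wantUnversionedRamp bool) bool {
	if d.versions[version] {
		return true
	}
	return wantUnversionedRamp && d.unversionedRamp
}

// propagatedNew models the reordered check: when the caller waits for an
// unversioned ramp, only the ramp field decides the result.
func propagatedNew(d partitionData, version string, wantUnversionedRamp bool) bool {
	if wantUnversionedRamp {
		return d.unversionedRamp
	}
	return d.versions[version]
}

func main() {
	// "v1" was synced earlier; the unversioned ramp has not arrived yet.
	d := partitionData{versions: map[string]bool{"v1": true}}

	fmt.Println("old:", propagatedOld(d, "v1", true)) // reports done too early
	fmt.Println("new:", propagatedNew(d, "v1", true)) // still waiting

	d.unversionedRamp = true // the ramp sync reaches the partition
	fmt.Println("new after ramp:", propagatedNew(d, "v1", true))
}
```

With the old ordering, a version synced earlier makes the wait look finished before the ramp data arrives; with the ramp check first, the wait finishes only once the ramp field is set.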

stephanos · 2025-03-06T20:01:00Z

tests/worker_deployment_test.go

-	time.Sleep(1 * time.Millisecond)
+	time.Sleep(10 * time.Millisecond)


Is the 1st one go s.pollFromDeployment(ctx, tv) and the second one go func() { on line 441?

The Go scheduler doesn't necessarily pick up a goroutine immediately. It will park them; and once it decides to pause the current goroutine, it picks a random one that is unblocked to continue. So you can never count on any ordering there.

Furthermore, increasing the sleep is not a sure way to keep this non-flaky. You need some kind of signal/wait condition to know you're good to continue. Anything based on sleeps is either brittle, slow or both, unfortunately.

Do you think you can replace the Sleep?

stephanos · 2025-03-06T20:04:51Z

tests/worker_deployment_test.go

@@ -621,7 +621,7 @@ func (s *WorkerDeploymentSuite) TestSetRampingVersion_ConcurrentUpdates_NonIdemp
 			Version:        tv.DeploymentVersionString(),
 			Percentage:     5,
 			ConflictToken:  cT,
-			Identity:       tv.Any().String(), // note: different identity
+			Identity:       tv.Any().String(), // note: different identity making this request different from the first one.


Is it necessary for it to be a different identity?

If it is, I'd recommend tv.ClientIdentity() + "-OtherClient" so you preserve the original client ID (it contains meta info that points to this test, which can be helpful when debugging).

"Is it necessary for it to be a different identity?"

Yes! The idea for this specific test case was to have two concurrent requests come in with different identities but the same conflict token. I wanted to check that a request with a stale conflict token doesn't go ahead and change the state of the entity workflow. The different identity helps the test verify whether the stale request changed state or not.

Shivs11 added 3 commits March 5, 2025 13:28

initial attempt to fixing the flakes

cda20ed

DescribeTaskQueueInfo fix

b02c840

update comment

0a58112

Shivs11 commented Mar 6, 2025


minor changes

2e8b894

Shivs11 commented Mar 6, 2025


Shivs11 marked this pull request as ready for review March 6, 2025 17:34

Shivs11 requested a review from a team as a code owner March 6, 2025 17:34

stephanos reviewed Mar 6, 2025


removed sleep from concurrent tests in worker-deployment suite

3371cff
