Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More Versioning flake fixes #7436

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

Shivs11
Copy link
Member

@Shivs11 Shivs11 commented Mar 6, 2025

What changed?

Fixing the following tests:

  • TestDescribeTaskQueueVersioningInfo
  • TestSetCurrentVersion_ConcurrentUpdates_NonIdempotentRequests
  • TestSetRampingVersion_ConcurrentUpdates_NonIdempotentRequests
  • TestSetCurrentVersion_ConcurrentUpdates_NonIdempotentRequests

Why?

  • To hopefully not see them in the flaky test reports.

How did you test it?

  • Ran all of them in CI 50 times with different DB's and all passed.
  • Ran locally as well.

Potential risks

Documentation

Is hotfix candidate?

time.Sleep(1 * time.Millisecond)
time.Sleep(10 * time.Millisecond)
Copy link
Member Author

@Shivs11 Shivs11 Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flakes seen in the WorkerDeploymentSuite were because the second go-routine's updates happened before the first one. I was a little surprised that this was possible given there was a 1 millisecond time-gap between the two go-routines.

Thus, I increased the time gap between the go-routine's to see the errors resolved. Moreover, have also run them in the CI and they have passed.

Copy link
Contributor

@stephanos stephanos Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the 1st one go s.pollFromDeployment(ctx, tv) and the second one go func() { on line 441?

The Go scheduler doesn't necessarily pick up a goroutine immediately. It will park them; and once it decides to pause the current goroutine, it picks a random one that is unblocked to continue. So you can never count on any ordering there.

Furthermore, increasing the sleep is not a sure way to keep this non-flaky. You need some kind of signal/wait condition to know you're good to continue. Anything based on sleeps is either brittle, slow or both, unfortunately.

Do you think you can replace the Sleep?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! This comment had me shocked: The Go scheduler doesn't necessarily pick up a goroutine immediately. It will park them; and once it decides to pause the current goroutine, it picks a random one that is unblocked to continue. So you can never count on any ordering there

For some reason, I thought of the scheduler placing the go-routines in a queue and picking the one that was scheduled earlier. I will replace the sleep here and use signals to account for the ordering I need :)

Note - In this context, the first go-routine is defined from Lines 600 - 610 and the second go-routine is defined from lines 616 - 626.

Copy link
Member Author

@Shivs11 Shivs11 Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it a bit more, this doesn't seem to be as easy as I first thought it could be - The aim of this function is to send in two update requests such that both pass the update validator's checks but the second request gets caught in the update handler's validator check. Because of this constraint, I can't send in these requests sequentially.

What I desire is the following:

The first go-routine's request passes the validator, enters the handlers and gives up control. Then, the second go-routine passes the validator (it will since the first request is not fully complete and has not changed the state), and then gives up control. However, the second go-routine's request shall never complete successfully since the control goes back to the first go-routine and it changes state. On switching control back to the second go-routine, it conducts a state check again and realizes it can't proceed since the state has changed.

From what you just mentioned, I gauge that having a definitive ordering guarantee between the two go-routines may not be possible.

Update Handler -

func (d *WorkflowRunner) handleSetRampingVersion(ctx workflow.Context, args *deploymentspb.SetRampingVersionArgs) (*deploymentspb.SetRampingVersionResponse, error) {

Copy link
Member Author

@Shivs11 Shivs11 Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolution: thought about this for quite a bit and realized there is no way around this. Made a decision to remove sleep and rely on a deterministic ordering for my test cases. Instead, I now check if either one of them completed which shall hopefully resolve the flakes.

Thank you @stephanos for making me aware about this!

@@ -2033,15 +2036,18 @@ func (s *Versioning3Suite) waitForDeploymentDataPropagation(
delete(remaining, pt)
}
}
if unversionedRamp {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

placing this block of code above the version-presence checks since there was a tiny error in the way this function was written.

Consider a task-queue which had version "v1" synced to it and was followed by the task-queue having an un-versioned ramp sync. This function would not properly allow us to assert whether all the task-queue partitions had the unversioned ramp sync propagated if the tv.Version was already present in the task-queue's user data.

This can be seen in TestDescribeTaskQueueVersioningInfo:

s.syncTaskQueueDeploymentData(tv, true, 0, false, t1, tqTypeAct) // sync "tv" to tqTypeAct

// Now ramp to unversioned
s.syncTaskQueueDeploymentData(tv, false, 10, true, t2, tqTypeAct) // sync "unversioned" to tqTypeAct
s.waitForDeploymentDataPropagation(tv, true, tqTypeAct)

The previous version of this function would incorrectly return since version tv is present in tqTypeAct even though we want it to return after the un-versioned field has been populated.

@Shivs11 Shivs11 marked this pull request as ready for review March 6, 2025 17:34
@Shivs11 Shivs11 requested a review from a team as a code owner March 6, 2025 17:34
time.Sleep(1 * time.Millisecond)
time.Sleep(10 * time.Millisecond)
Copy link
Contributor

@stephanos stephanos Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the 1st one go s.pollFromDeployment(ctx, tv) and the second one go func() { on line 441?

The Go scheduler doesn't necessarily pick up a goroutine immediately. It will park them; and once it decides to pause the current goroutine, it picks a random one that is unblocked to continue. So you can never count on any ordering there.

Furthermore, increasing the sleep is not a sure way to keep this non-flaky. You need some kind of signal/wait condition to know you're good to continue. Anything based on sleeps is either brittle, slow or both, unfortunately.

Do you think you can replace the Sleep?

@@ -621,7 +621,7 @@ func (s *WorkerDeploymentSuite) TestSetRampingVersion_ConcurrentUpdates_NonIdemp
Version: tv.DeploymentVersionString(),
Percentage: 5,
ConflictToken: cT,
Identity: tv.Any().String(), // note: different identity
Identity: tv.Any().String(), // note: different identity making this request different from the first one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary for it to be a different identity?

If it is, I'd recommend tv.ClientIdentity() + "-OtherClient" so you preserve the original client ID (since it contains meta info that points to this test; which can be helpful when debugging).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary for it to be a different identity

yes! The idea for this specific test case was to have two concurrent requests come in which had different identities but the same conflict token - I wanted to check for the case of a request having a stale conflict token doesn't go ahead and change the state of the entity workflow - The different identity helps the test verify if the stale request changed state or not

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants