
[RayService] e2e for checking the readiness of head Pods for both pending / active clusters #2806

Merged · 2 commits · Jan 25, 2025

Conversation

@rueian (Contributor) commented on Jan 22, 2025

Why are these changes needed?

Resolves #2787

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@rueian force-pushed the e2e-rayservice-readiness branch from 2f29285 to c314ca1 on January 22, 2025 23:40
@rueian marked this pull request as ready for review on January 22, 2025 23:43
Comment on lines 100 to 106
g.Eventually(func(g Gomega) int {
heads, err := test.Client().Core().CoreV1().Pods(namespace.Name).List(test.Ctx(), metav1.ListOptions{
LabelSelector: "ray.io/serve=true",
})
g.Expect(err).NotTo(HaveOccurred())
return len(heads.Items)
}, TestTimeoutShort).Should(Equal(2))
@rueian (Contributor, Author):

Since we manually delete the label from the active cluster before the upgrade, this condition ensures we check the readiness for both active and pending clusters.
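
For context, a minimal sketch of that manual label removal, assuming the same test scaffolding as the snippet above (test.Client(), test.Ctx(), a Gomega g); headPodName is a hypothetical placeholder for the active cluster's head Pod:

// Sketch: drop the ray.io/serve label from the active cluster's head Pod so the
// Eventually block above only passes once the operator re-adds the label to ready
// heads on both the active and pending clusters.
head, err := test.Client().Core().CoreV1().Pods(namespace.Name).Get(test.Ctx(), headPodName, metav1.GetOptions{})
g.Expect(err).NotTo(HaveOccurred())
delete(head.Labels, "ray.io/serve")
_, err = test.Client().Core().CoreV1().Pods(namespace.Name).Update(test.Ctx(), head, metav1.UpdateOptions{})
g.Expect(err).NotTo(HaveOccurred())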

@kevin85421 (Member) left a comment:

Discussed offline:

  1. Create a RayService.
  2. Wait until the active cluster is running and has X endpoints.
  3. Trigger a zero downtime upgrade. The pending cluster should take a long time to become ready (e.g., add an init container that sleeps; see the sketch after this list).
  4. Make the proxy actor on the active cluster's head Pod fail so it can't receive requests.
  5. Check that the number of endpoints becomes X-1.
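
A minimal sketch of the delay in step 3, assuming plain corev1 types rather than the apply-configuration helpers used elsewhere in this test; addSleepingInitContainer and its arguments are illustrative:

package e2e

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// addSleepingInitContainer appends an init container that sleeps for the given
// number of seconds, keeping the pending cluster's head Pod not-ready long enough
// for the test to observe the readiness of both clusters during the upgrade.
func addSleepingInitContainer(podSpec *corev1.PodSpec, seconds int) {
	podSpec.InitContainers = append(podSpec.InitContainers, corev1.Container{
		Name:    "sleep",
		Image:   "busybox:1.36",
		Command: []string{"sh", "-c", fmt.Sprintf("sleep %d", seconds)},
	})
}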

serveConfig = strings.Replace(serveConfig, "factor: 5", "factor: 3", -1)

// modify EnableInTreeAutoscaling to trigger a zero downtime upgrade.
rayService.Spec.RayClusterSpec.EnableInTreeAutoscaling = ptr.To[bool](true)
A reviewer (Member) commented:

RayVersion seems to be a safer option.
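
As an illustration of that alternative (a sketch only; the version string here is an example, and rayService is the rayv1 object used in the surrounding test):

// Sketch: bumping RayVersion is another RayClusterSpec change that triggers a
// zero downtime upgrade without touching the autoscaler settings.
rayService.Spec.RayClusterSpec.RayVersion = "2.41.0"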

@rueian (Contributor, Author):
The newly added init container can already trigger the zero downtime upgrade.

@rueian force-pushed the e2e-rayservice-readiness branch 2 times, most recently from 4bd018c to ba5bea2 on January 24, 2025 06:54
@@ -24,10 +24,6 @@ func TestRayServiceInPlaceUpdate(t *testing.T) {

rayServiceAC := rayv1ac.RayService(rayServiceName, namespace.Name).WithSpec(rayServiceSampleYamlApplyConfiguration())

// TODO: This test will fail on Ray 2.40.0. Pin the Ray version to 2.9.0 as a workaround. Need to remove this after the issue is fixed.
rayServiceAC.Spec.RayClusterSpec.WithRayVersion("2.9.0")
@rueian (Contributor, Author):

2.41.0 works now.

@MortalHappiness (Member) left a comment:

LGTM

@kevin85421 (Member):

I will open a follow-up PR if needed.

@kevin85421 merged commit a612670 into ray-project:master on Jan 25, 2025
24 checks passed