
Add topology spread constraints test for RayCluster #2472

Conversation

@YoussefEssDS (Contributor) commented Oct 24, 2024

This PR adds a test to verify the functionality of topology spread constraints in a RayCluster setup. The test ensures that worker pods respect the topology spread constraints defined in the cluster's YAML file, particularly focusing on distributing pods across nodes while adhering to maxSkew and avoiding node overloading.
Thanks to @nemo9cby, @liuxsh9, @Superskyyy and @Bye-legumes for the help!

The modifications include:

  • A new YAML file with topology spread constraints applied to the worker groups (a sketch follows this list).
  • An update to the e2e test setup to deploy the cluster with the new topology spread configuration.
  • Validation logic to ensure pods are scheduled according to the specified constraints.
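
For illustration, a minimal sketch of what the worker group's topology spread configuration could look like. The metadata name, group name, labelSelector, and image tag below are hypothetical placeholders, not the actual contents of the new YAML file, and the head group is omitted for brevity:

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: raycluster-topology-sc  # hypothetical name
    spec:
      # headGroupSpec omitted for brevity
      workerGroupSpecs:
        - groupName: worker-group  # hypothetical group name
          replicas: 2
          minReplicas: 2  # replicas == minReplicas, per the testing strategy below
          maxReplicas: 2
          template:
            spec:
              topologySpreadConstraints:
                - maxSkew: 1  # at most one more matching pod on any node than on another
                  topologyKey: kubernetes.io/hostname  # spread across individual nodes
                  whenUnsatisfiable: DoNotSchedule  # excess pods stay Pending, matching the described behavior
                  labelSelector:
                    matchLabels:
                      app: ray-worker  # hypothetical label for the worker pods
              containers:
                - name: ray-worker
                  image: rayproject/ray:2.9.0  # placeholder image tag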

Why are these changes needed?
Topology spread constraints are a Kubernetes feature that helps to distribute pods evenly across nodes to improve high availability and fault tolerance. This test is necessary to ensure that RayCluster configurations support these constraints properly, providing users with greater control over pod scheduling. It helps in scenarios where workload distribution across nodes is critical for resource balancing.

Related issue number
This PR addresses the need for testing topology spread constraints in the KubeRay setup, as requested in the related issue #2273.

Checks
I've made sure the tests are passing.
Testing Strategy:

  1. Cluster setup: Configure a k8s cluster with one control plane and two worker nodes. Ensure the control plane keeps its default taint (sketched after this list) so that regular workloads are not scheduled on it.
  2. RayCluster configuration: Define replicas=minReplicas=2 for the worker pods and set topologySpreadConstraints on kubernetes.io/hostname with maxSkew=1.
  3. Expected behavior: The cluster should create 5 pods but only allow the head pod and two worker pods to be scheduled, while the other pods remain in a Pending state due to the control-plane taint and the constraints applied through the TSC, i.e., assigning one more pod to either worker node would violate the maxSkew constraint.
  4. Examining kubectl describe pod on a pending pod shows that it cannot be scheduled because of the TSC.
    For example, describing the pending pods shows constraint messages such as:
    node(s) didn't match pod topology spread constraints (2x);
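
For context, the control-plane taint referenced in step 1 is the standard kubeadm default; on the node object it looks roughly like this:

    spec:
      taints:
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule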

Update the YAML file to not exceed the resources available for the test

Add script to validate the topology spread constraints

Add script to validate the topology spread constraints

Sets minReplicas to replicas to avoid pods killing themselves prematurely

Fix formatting issue

Add more visibility about pending pods

Adjust the expected running pod count to include the head pod

Check the hostnames for testing env

Add 2 workers to the created k8s cluster

Add visibility to pods

Add visibility to pods

Fix autoscaler sidecar not launching

Add more visibility

Add more visibility

Add cleanup of previous test pods

Cleanup the topology validation script

Cleanup the topology validation script

Move the topology test to avoid breaking the e2e test

Cleanup the topology constraint test cluster

Fix formatting issue

Update helm chart values and template

Fix helm chart lint issue

Fix formatting issue
@YoussefEssDS (Contributor, Author) commented:

@kevin85421 can you please review this PR?

kevin85421 self-assigned this Oct 24, 2024
@kevin85421 (Member) commented:

cc @MortalHappiness would you mind reviewing this PR?

@MortalHappiness (Member) left a comment:

I think this is more like a sample-YAML test, so you need to change the following files instead:

  • ./tests/framework/config/kind-config-buildkite.yml: Add more nodes here (a sketch follows this comment).
  • ray-operator/test/sampleyaml/raycluster_test.go: Create a new function named TestRayClusterTopologySC. You can reference the implementation of the TestRayCluster function in the same file.
  • Remember to check the number of nodes and the properties of the nodes first.
  • Also remember to sync your branch with the master branch.

Other related files:

  • .buildkite/test-sample-yamls.yml

cc @kevin85421 What do you think?
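
For reference, a minimal sketch of the node layout such a kind config could declare. This illustrates the kind v1alpha4 schema and is not the actual contents of kind-config-buildkite.yml:

    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
      - role: control-plane  # retains its NoSchedule taint in a multi-node kind cluster
      - role: worker
      - role: worker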

@@ -0,0 +1,110 @@
apiVersion: ray.io/v1
@MortalHappiness (Member) commented Oct 29, 2024:

Rename this file to ray-cluster.topology-spread-constraints.yaml to keep it consistent with the naming of other files.

Adds the testing Go func for TSC

Migrate testing of TSC from GHA to Buildkite

Migrate testing of TSC from GHA to Buildkite

Migrate testing of TSC from GHA to Buildkite

Fix formatting issue

Fix formatting issue

Fix formatting issue

Fix formatting issue
@YoussefEssDS (Contributor, Author) commented:

@MortalHappiness Thanks for the review. I addressed the suggestions you had. Please let me know if that's what you had in mind.

@kevin85421 (Member) left a comment:

Thank you for the PR!

Testing the behavior of topologySpreadConstraints is out of scope for KubeRay; KubeRay only ensures that the user-specified Pod spec is passed through to the underlying Ray Pods. Would you mind removing everything except the changes under helm-chart/ray-cluster?

Instead, can you add more details to the PR description about how you manually tested this PR?

@YoussefEssDS (Contributor, Author) commented:

Thanks @kevin85421. The requested changes are done.

@Superskyyy commented:

@kevin85421 Please kindly take a look, thanks!

@kevin85421 (Member) commented:

Thank you for the PR!

kevin85421 merged commit 36102a0 into ray-project:master on Nov 5, 2024 (29 checks passed)
@kevin85421 (Member) commented:

@YoussefEssDS @Superskyyy sorry for the delay. Feel free to ping me on Ray Slack next time!
