Add topology spread constraints test for RayCluster #2472
Commits:
- Update the YAML file to not exceed resources available for the test
- Add script to validate the topology spread constraints
- Add script to validate the topology spread constraints
- Sets minReplicas to replicas to avoid pods killing themselves prematurely
- Fix formatting issue
- Add more visibility about pending pods
- Adjust the expected running pod count to include the head pod
- Check the hostnames for testing env
- Add 2 workers to the created k8s cluster
- Add visibility to pods
- Add visibility to pods
- Fix autoscaler sidecar not launching
- Add more visibility
- Add more visibility
- Add cleanup of previous test pods
- Cleanup the topology validation script
- Cleanup the topology validation script
- Move the topology test to avoid breaking the e2e test
- Cleanup the topology constraint test cluster
- Fix formatting issue
- Update helm chart values and template
- Fix helm chart lint issue
- Fix formatting issue
@kevin85421 can you please review this PR?
cc @MortalHappiness would you mind reviewing this PR?
I think this is more like a kind of sample yaml test. So you need to change the following files instead:
- ./tests/framework/config/kind-config-buildkite.yml: Add more nodes here.
- ray-operator/test/sampleyaml/raycluster_test.go: Create a new function named TestRayClusterTopologySC. You can reference the implementation of the TestRayCluster function in the same file.
  - Remember to check the number of nodes and the property of nodes first.
- Also remember to sync your branch with the master branch.
Other related files:
- .buildkite/test-sample-yamls.yml
cc @kevin85421 What do you think?
@@ -0,0 +1,110 @@
apiVersion: ray.io/v1
Rename this file to ray-cluster.topology-spread-constraints.yaml to keep it consistent with the naming of other files.
Commits:
- Adds the testing Go func for TSC
- Migrate testing of TSP from GHA to buildkite
- Migrate testing of TSP from GHA to buildkite
- Migrate testing of TSP from GHA to buildkite
- Fix formatting issue
- Fix formatting issue
- Fix formatting issue
- Fix formatting issue
@MortalHappiness Thanks for the review. I addressed the suggestions you had. Please let me know if that's what you had in mind.
Thank you for the PR!
Testing the behavior of topologySpreadConstraints is out of scope for KubeRay. KubeRay only ensures that the user-specified Pod spec is passed to the underlying Ray Pods. Would you mind removing everything except the changes under helm-chart/ray-cluster?
Instead, can you write more details about how you manually tested this PR in the PR description?
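To illustrate the pass-through guarantee mentioned in this comment, here is a minimal sketch of what such a check could look like. It compares the constraints declared in a worker group's Pod template with those found on a created Pod. The function name is hypothetical, the inputs are assumed to be plain dicts (for example parsed from `kubectl get -o json`), and the sketch assumes the API server adds no defaulted fields to the constraints.

```python
def constraints_passed_through(worker_group_spec: dict, pod: dict) -> bool:
    """Return True if the Pod carries exactly the constraints declared
    in the worker group's Pod template (hypothetical helper)."""
    declared = (
        worker_group_spec.get("template", {})
        .get("spec", {})
        .get("topologySpreadConstraints", [])
    )
    actual = pod.get("spec", {}).get("topologySpreadConstraints", [])
    # Plain equality: order and every field must match what the user wrote.
    return declared == actual


# Illustrative constraint; values are assumptions, not taken from the PR.
example_constraint = {
    "maxSkew": 1,
    "topologyKey": "kubernetes.io/hostname",
    "whenUnsatisfiable": "DoNotSchedule",
}
example_group = {"template": {"spec": {"topologySpreadConstraints": [example_constraint]}}}
example_pod = {"spec": {"topologySpreadConstraints": [example_constraint]}}
```

This only checks that the spec was copied; whether the scheduler honors it is Kubernetes behavior, which is the part described here as out of scope.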
Thanks @kevin85421. The requested changes are done.
@kevin85421 Please kindly take a look, thanks!
Thank you for the PR!
@YoussefEssDS @Superskyyy sorry for the delay. Feel free to ping me on Ray Slack next time!
This PR adds a test to verify the functionality of topology spread constraints in a RayCluster setup. The test ensures that worker pods respect the topology spread constraints defined in the cluster's YAML file, particularly focusing on distributing pods across nodes while adhering to maxSkew and avoiding node overloading.
Thanks to @nemo9cby, @liuxsh9, @Superskyyy and @Bye-legumes for the help!
The modifications include:
- A new YAML file with topology spread constraints applied to the worker groups.
- An update to the e2e test setup to deploy the cluster with the new topology spread configuration.
- Validation logic to ensure pods are scheduled according to the specified constraints.
Why are these changes needed?
Topology spread constraints are a Kubernetes feature that helps to distribute pods evenly across nodes to improve high availability and fault tolerance. This test is necessary to ensure that RayCluster configurations support these constraints properly, providing users with greater control over pod scheduling. It helps in scenarios where workload distribution across nodes is critical for resource balancing.
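As a rough illustration of what maxSkew constrains, the sketch below computes the skew of a pod placement when each node is its own topology domain (an assumption corresponding to a topologyKey of kubernetes.io/hostname). The function names and node names are hypothetical, and this is not the validation script from this PR.

```python
from collections import Counter


def placement_skew(pod_nodes, all_nodes):
    """Difference between the most and least loaded nodes.

    pod_nodes: node name for each matching pod (one entry per pod).
    all_nodes: every eligible node, including nodes with zero pods.
    """
    counts = Counter(pod_nodes)
    per_node = [counts.get(node, 0) for node in all_nodes]
    return max(per_node) - min(per_node)


def satisfies_max_skew(pod_nodes, all_nodes, max_skew=1):
    """True if the placement stays within the allowed skew."""
    return placement_skew(pod_nodes, all_nodes) <= max_skew
```

For example, with three nodes and two pods on the first node, one on the second, and none on the third, the skew is 2, which would violate maxSkew: 1.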
Related issue number
This PR addresses the need for testing topology spread constraints in the KubeRay setup, as requested in the related issue #2273.
Checks
I've made sure the tests are passing.
Testing Strategy:
kubectl describe pod pending_pod shows that the pod cannot be scheduled because of TSC. For example, describing the pending pods shows constraint messages such as:
node(s) didn't match pod topology spread constraints (2x);
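The manual check above could be scripted along these lines. This is a minimal sketch that filters scheduler event messages by the phrase quoted in this description; the function names and pod names are hypothetical, and it assumes the messages have already been collected as strings (for example from the output of kubectl describe).

```python
# Phrase taken from the scheduler message quoted above.
TSC_MARKER = "didn't match pod topology spread constraints"


def blocked_by_tsc(event_message: str) -> bool:
    """True if a scheduling event message mentions the TSC mismatch."""
    return TSC_MARKER in event_message


def pods_blocked_by_tsc(events: dict) -> list:
    """Given {pod_name: event_message}, return the sorted names of pods
    whose message points at topology spread constraints."""
    return sorted(name for name, msg in events.items() if blocked_by_tsc(msg))
```

A substring match is used rather than a full parse because the surrounding text of the message (node counts, other reasons) varies between clusters.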