Add topology spread constraints test for RayCluster #3

Open
wants to merge 10 commits into
base: master

YoussefEssDS
Owner

@YoussefEssDS YoussefEssDS commented Oct 24, 2024

This PR adds a test that verifies topology spread constraints in a RayCluster setup. The test checks that worker pods respect the constraints defined in the cluster's YAML file: pods are distributed across nodes within the configured maxSkew, and no single node is overloaded.
Thanks to for the help!

The modifications include:

A new YAML file with topology spread constraints applied to the worker groups.
An update to the e2e test setup to deploy the cluster with the new topology spread configuration.
Validation logic to ensure pods are scheduled according to the specified constraints.
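
The validation step can be sketched roughly as follows. This is a minimal illustration, not the script added in this PR: the helper names `compute_skew` and `satisfies_max_skew` are hypothetical, and the pod-to-node mapping is assumed to have been collected beforehand (for example from the pod listing of the test cluster).

```python
from collections import Counter
from typing import Iterable, Mapping


def compute_skew(pod_nodes: Mapping[str, str], all_nodes: Iterable[str]) -> int:
    """Return the difference between the busiest and the emptiest node.

    pod_nodes maps a pod name to the node it runs on (hypothetical input).
    all_nodes lists every schedulable node, so that nodes hosting zero
    matching pods still count toward the minimum.
    """
    counts = Counter({node: 0 for node in all_nodes})
    counts.update(pod_nodes.values())
    if not counts:
        return 0
    return max(counts.values()) - min(counts.values())


def satisfies_max_skew(
    pod_nodes: Mapping[str, str], all_nodes: Iterable[str], max_skew: int = 1
) -> bool:
    """Check the observed pod distribution against a maxSkew limit."""
    return compute_skew(pod_nodes, all_nodes) <= max_skew
```

Passing the full node list matters: a node with no worker pods sets the minimum to zero, which is exactly the case a spread constraint is meant to catch.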
Why are these changes needed?

Topology spread constraints are a Kubernetes feature that helps to distribute pods evenly across nodes to improve high availability and fault tolerance. This test is necessary to ensure that RayCluster configurations support these constraints properly, providing users with greater control over pod scheduling. It helps in scenarios where workload distribution across nodes is critical for resource balancing.
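
For readers unfamiliar with the feature, a single constraint has roughly the shape below, written here as a Python dict that mirrors the Kubernetes pod-spec fields. The concrete values are assumptions chosen for illustration and are not taken from the YAML file in this PR: the `kubernetes.io/hostname` topology key spreads pods per node, `DoNotSchedule` leaves pods pending rather than violating the limit, and the `ray.io/node-type: worker` selector is assumed to match the worker pods. The function name `build_spread_constraint` is hypothetical.

```python
import json


def build_spread_constraint(max_skew: int = 1) -> dict:
    """Build one entry for a pod spec's topologySpreadConstraints list."""
    return {
        # Largest allowed difference in matching-pod count between domains.
        "maxSkew": max_skew,
        # Node label that defines a topology domain; hostname means per node.
        "topologyKey": "kubernetes.io/hostname",
        # Keep the pod pending instead of scheduling it in violation.
        "whenUnsatisfiable": "DoNotSchedule",
        # Which pods are counted; this label value is an assumption.
        "labelSelector": {"matchLabels": {"ray.io/node-type": "worker"}},
    }


if __name__ == "__main__":
    print(json.dumps(build_spread_constraint(), indent=2))
```

In a RayCluster manifest an entry like this would sit under the pod template spec of a worker group, next to the container definitions.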

Related issue number

This PR addresses the need for testing topology spread constraints in the KubeRay setup, as requested in the related issue ray-project#2273.

Checks
I've made sure the tests are passing.
Testing Strategy
Manual tests: Validated topology spread constraint behavior by observing the pod distribution in the test cluster and by inspecting the description of pending worker pods.

@YoussefEssDS YoussefEssDS force-pushed the add-topology-spread-constraints-test-v3 branch from 799e3d1 to 9fbb17a Compare October 24, 2024 20:21
Youssef Esseddiq added 6 commits October 24, 2024 16:22
Update the YAML file to not exceed resources available for the test

Add script to validate the topology spread constraints

Add script to validate the topology spread constraints

Sets minReplicas to replicas to avoid pods killing themselves prematurely

Fix formatting issue

Add more visibility about pending pods

Adjust the expected running pod count to include the head pod

Check the hostnames for testing env

Add 2 workers to the created k8s cluster

Add visibility to pods

Add visibility to pods

Fix autoscaler sidecar not launching

Add more visibility

Add more visibility

Add cleanup of previous test pods

Cleanup the topology validation script

Cleanup the topology validation script

Move the topology test to avoid breaking the e2e test

Cleanup the topology constraint test cluster

Fix formatting issue

Update helm chart values and template

Fix helm chart lint issue

Fix formatting issue
@YoussefEssDS YoussefEssDS force-pushed the add-topology-spread-constraints-test-v3 branch from ab24d23 to cfadbaa Compare October 29, 2024 19:09