Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayCluster] don't allow overriding ray.io/cluster label #2555

Merged
merged 1 commit into from
Nov 20, 2024

Conversation

andrewsykim
Copy link
Collaborator

Why are these changes needed?

Currently it is possible to specify the ray.io/cluster label in the Head or Worker pod template. If the label value contains an incorrect cluster name, it will result in KubeRay doing an infinite loop of creating Pods since the ray.io/cluster label is used as a label selector.

Similar to other labels like ray.io/group and ray.io/node-type, this PR ensures that overriding the ray.io/cluster is not possible.

Related issue number

N/A

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@andrewsykim
Copy link
Collaborator Author

Steps to reproduce this bug:

  1. Create a Ray cluster with an invalid Ray cluster label:
$ git diff
diff --git a/ray-operator/config/samples/ray-cluster.complete.yaml b/ray-operator/config/samples/ray-cluster.complete.yaml
index cce5c7a4..0cb5cf5d 100644
--- a/ray-operator/config/samples/ray-cluster.complete.yaml
+++ b/ray-operator/config/samples/ray-cluster.complete.yaml
@@ -26,7 +26,8 @@ spec:
       metadata:
         # Custom labels. NOTE: To avoid conflicts with KubeRay operator, do not define custom labels start with `raycluster`.
         # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
-        labels: {}
+        labels:
+          ray.io/cluster: foobar
  1. Create RayCluster and check Pods. Note that Head pod is stuck in infinite creation loop
$ kubectl apply -f config/samples/ray-cluster.complete.yaml
raycluster.ray.io/raycluster-complete created
$ kubectl get po
NAME                                           READY   STATUS     RESTARTS   AGE
kuberay-operator-84fb78dcfd-tlc5z              1/1     Running    0          3m44s
raycluster-complete-head-226px                 0/1     Pending    0          9s
raycluster-complete-head-22784                 0/1     Pending    0          24s
raycluster-complete-head-245p9                 0/1     Running    0          56s
raycluster-complete-head-24xr5                 0/1     Pending    0          45s
raycluster-complete-head-252tk                 0/1     Pending    0          39s
raycluster-complete-head-25cxq                 0/1     Pending    0          10s
raycluster-complete-head-25gh2                 0/1     Pending    0          40s
raycluster-complete-head-25qgc                 0/1     Pending    0          40s
raycluster-complete-head-25rth                 0/1     Pending    0          22s
raycluster-complete-head-27f5t                 0/1     Pending    0          22s
raycluster-complete-head-282pk                 0/1     Pending    0          29s
raycluster-complete-head-288sk                 0/1     Pending    0          26s

@andrewsykim
Copy link
Collaborator Author

@MortalHappiness @kevin85421 PTAL

@andrewsykim
Copy link
Collaborator Author

(adding unit tests)

@MortalHappiness
Copy link
Member

Not related to this PR, but I am wondering if this line is incorrect

if k == string(rayNodeType) {

Shouldn't it be

		if k == utils.RayNodeTypeLabelKey {

@MortalHappiness
Copy link
Member

Actually I found this function very weird:

  • It uses named return ret, but returns labels. Don't know why to return ret, but it modifies labels in-place.
  • The for-loop loops over ret, and it checks if v != groupName here

    But since v is equal to ret[k], it must always equals to groupName, so this line is useless.

@MortalHappiness
Copy link
Member

MortalHappiness commented Nov 19, 2024

Maybe this is what the original author intended to do? WDYT @andrewsykim

func labelPod(rayNodeType rayv1.RayNodeType, rayClusterName string, groupName string, labels map[string]string) map[string]string {
	ret := map[string]string{
		utils.RayNodeLabelKey:                   "yes",
		utils.RayClusterLabelKey:                rayClusterName,
		utils.RayNodeTypeLabelKey:               string(rayNodeType),
		utils.RayNodeGroupLabelKey:              groupName,
		utils.RayIDLabelKey:                     utils.CheckLabel(utils.GenerateIdentifier(rayClusterName, rayNodeType)),
		utils.KubernetesApplicationNameLabelKey: utils.ApplicationName,
		utils.KubernetesCreatedByLabelKey:       utils.ComponentName,
	}

	for k, v := range labels {
		if _, ok := ret[k]; !ok {
			ret[k] = v
		}
	}

	return ret
}

But I am not sure about the expected behavior of keys like RayIDLabelKey. Should we use the value from ret or from labels if it is defined?

@andrewsykim
Copy link
Collaborator Author

@MortalHappiness I noticed the same issues while adding tests, the existing code doesn't seem to be doing what it was intedned to do. I updated to fix the other labels, can you take a look?

@andrewsykim
Copy link
Collaborator Author

andrewsykim commented Nov 19, 2024

func labelPod(rayNodeType rayv1.RayNodeType, rayClusterName string, groupName string, labels map[string]string) map[string]string {
	ret := map[string]string{
		utils.RayNodeLabelKey:                   "yes",
		utils.RayClusterLabelKey:                rayClusterName,
		utils.RayNodeTypeLabelKey:               string(rayNodeType),
		utils.RayNodeGroupLabelKey:              groupName,
		utils.RayIDLabelKey:                     utils.CheckLabel(utils.GenerateIdentifier(rayClusterName, rayNodeType)),
		utils.KubernetesApplicationNameLabelKey: utils.ApplicationName,
		utils.KubernetesCreatedByLabelKey:       utils.ComponentName,
	}

	for k, v := range labels {
		if _, ok := ret[k]; !ok {
			ret[k] = v
		}
	}

	return ret
}

I didn't do this implementation because it seems like some labels are allowed to be overwritten, like utils.KubernetesApplicationNameLabelKey

@MortalHappiness
Copy link
Member

MortalHappiness commented Nov 19, 2024

Besides, the return value of this function should be changed from (ret map[string]string) to map[string]string because we don't use named return as I mentioned earlier.

@andrewsykim
Copy link
Collaborator Author

Besides, the return value of this function should be changed from (ret map[string]string) to map[string]string because we don't use named return as I mentioned earlier.

done

Copy link
Member

@MortalHappiness MortalHappiness left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@andrewsykim andrewsykim merged commit 8d60d61 into ray-project:master Nov 20, 2024
24 of 25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants