[Bug] RayService applications do not load on v1.3.0 #3133

Open
1 of 2 tasks
boyleconnor opened this issue Feb 28, 2025 · 3 comments
Labels: bug (Something isn't working), P0 (Critical issue that should be fixed ASAP)


Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I tried creating the sample RayService shown in the documentation:

$ helm list
NAME            	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART                 	APP VERSION
kuberay-operator	default  	6       	2025-02-28 09:02:58.070226418 -0800 PST	deployed	kuberay-operator-1.3.0
$ kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-service.sample.yaml
rayservice.ray.io/rayservice-sample created
$ kubectl get pods
NAME                                                          READY   STATUS      RESTARTS      AGE
kuberay-operator-5c7f84f8bc-lj8g5                             1/1     Running     0             24m
rayservice-sample-raycluster-qdpmg-head-tfg82                 1/1     Running     0             24m
rayservice-sample-raycluster-qdpmg-small-group-worker-mzmxl   0/1     Running     1 (23m ago)   30m

Yet when I navigate to the Ray dashboard, it tells me:

Serve not started. Please deploy a serve application first.

[Screenshot: Ray dashboard Serve page showing the "Serve not started" message]
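For what it's worth, the same can be checked from inside the head pod with the Ray Serve CLI (head pod name taken from the kubectl get pods output above; serve status either prints the application statuses or reports that nothing is deployed):

$ kubectl exec rayservice-sample-raycluster-qdpmg-head-tfg82 -- serve status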

Checking status via kubectl gives similar results:

$ kubectl get rayservice -o json | jq .items[].status
{
  "activeServiceStatus": {
    "rayClusterName": "rayservice-sample-raycluster-qdpmg",
    "rayClusterStatus": {
      "availableWorkerReplicas": 1,
      "desiredCPU": "2500m",
      "desiredGPU": "0",
      "desiredMemory": "4Gi",
      "desiredTPU": "0",
      "desiredWorkerReplicas": 1,
      "endpoints": {
        "client": "10001",
        "dashboard": "8265",
        "gcs-server": "6379",
        "metrics": "8080",
        "serve": "8000"
      },
      "head": {
        "podIP": "10.244.168.7",
        "serviceIP": "10.244.168.7"
      },
      "lastUpdateTime": "2025-02-28T17:33:19Z",
      "maxWorkerReplicas": 5,
      "minWorkerReplicas": 1,
      "observedGeneration": 1,
      "state": "ready"
    }
  },
  "lastUpdateTime": "2025-02-28T17:33:19Z",
  "observedGeneration": 1,
  "pendingServiceStatus": {
    "rayClusterStatus": {
      "desiredCPU": "0",
      "desiredGPU": "0",
      "desiredMemory": "0",
      "desiredTPU": "0",
      "head": {}
    }
  }
}
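For reference, the missing field can also be queried directly. On 1.3.0 this prints nothing, since applicationStatuses is absent from the status entirely:

$ kubectl get rayservice rayservice-sample -o jsonpath='{.status.activeServiceStatus.applicationStatuses}'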

Yet downgrading the KubeRay operator to 1.2.2 immediately fixes the problem:

$ helm upgrade kuberay-operator kuberay/kuberay-operator --version 1.2.2
Release "kuberay-operator" has been upgraded. Happy Helming!
NAME: kuberay-operator
LAST DEPLOYED: Fri Feb 28 09:36:42 2025
NAMESPACE: default
STATUS: deployed
REVISION: 9
TEST SUITE: None

[Screenshot: Ray dashboard Serve page now listing the running applications]

$ kubectl get rayservice -o json | jq .items[].status
{
  "activeServiceStatus": {
    "applicationStatuses": {
      "fruit_app": {
        "healthLastUpdateTime": "2025-02-28T17:37:11Z",
        "serveDeploymentStatuses": {
          "FruitMarket": {
            "healthLastUpdateTime": "2025-02-28T17:37:11Z",
            "status": "HEALTHY"
          },
          "MangoStand": {
            "healthLastUpdateTime": "2025-02-28T17:37:11Z",
            "status": "HEALTHY"
          },
          "OrangeStand": {
            "healthLastUpdateTime": "2025-02-28T17:37:11Z",
            "status": "HEALTHY"
          },
          "PearStand": {
            "healthLastUpdateTime": "2025-02-28T17:37:11Z",
            "status": "HEALTHY"
          }
        },
        "status": "RUNNING"
      },
      "math_app": {
        "healthLastUpdateTime": "2025-02-28T17:37:11Z",
        "serveDeploymentStatuses": {
          "Adder": {
            "healthLastUpdateTime": "2025-02-28T17:37:11Z",
            "status": "HEALTHY"
          },
          "Multiplier": {
            "healthLastUpdateTime": "2025-02-28T17:37:11Z",
            "status": "HEALTHY"
          },
          "Router": {
            "healthLastUpdateTime": "2025-02-28T17:37:11Z",
            "status": "HEALTHY"
          }
        },
        "status": "RUNNING"
      }
    },
    "rayClusterName": "rayservice-sample-raycluster-qdpmg",
    "rayClusterStatus": {
      "availableWorkerReplicas": 1,
      "desiredCPU": "2500m",
      "desiredGPU": "0",
      "desiredMemory": "4Gi",
      "desiredTPU": "0",
      "desiredWorkerReplicas": 1,
      "endpoints": {
        "client": "10001",
        "dashboard": "8265",
        "gcs-server": "6379",
        "metrics": "8080",
        "serve": "8000"
      },
      "head": {
        "podIP": "10.244.168.7",
        "serviceIP": "10.244.168.7"
      },
      "lastUpdateTime": "2025-02-28T17:37:11Z",
      "maxWorkerReplicas": 5,
      "minWorkerReplicas": 1,
      "observedGeneration": 1,
      "state": "ready"
    }
  },
  "lastUpdateTime": "2025-02-28T17:37:11Z",
  "numServeEndpoints": 2,
  "observedGeneration": 1,
  "pendingServiceStatus": {
    "rayClusterStatus": {
      "desiredCPU": "0",
      "desiredGPU": "0",
      "desiredMemory": "0",
      "desiredTPU": "0",
      "head": {}
    }
  },
  "serviceStatus": "Running"
}

Note that I tried this and got the same results on both my company's on-premises cluster and AWS EKS.

I don't think this is a problem specific to the example RayService manifest from the tutorial; we've encountered the same problem with our own manifests: no applications start up or even show up on the dashboard.

I also haven't seen any interesting messages from the KubeRay operator pod:

$ kubectl logs kuberay-operator-5dd6779f94-kvmsn | tail -n 5
{"level":"info","ts":"2025-02-28T17:43:09.297Z","logger":"controllers.RayService","msg":"reconcileServices","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"58fd40c2-1e2b-4fde-bd69-3d9fe8559f27","newSvc":{"namespace":"default","name":"rayservice-sample-head-svc"}}
{"level":"info","ts":"2025-02-28T17:43:09.297Z","logger":"controllers.RayService","msg":"RayCluster rayservice-sample-raycluster-qdpmg's headService has already exists, skip Update","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"58fd40c2-1e2b-4fde-bd69-3d9fe8559f27"}
{"level":"info","ts":"2025-02-28T17:43:09.298Z","logger":"controllers.RayService","msg":"reconcileServices","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"58fd40c2-1e2b-4fde-bd69-3d9fe8559f27","serviceType":"serveService","RayService name":"rayservice-sample","RayService namespace":"default"}
{"level":"info","ts":"2025-02-28T17:43:09.298Z","logger":"controllers.RayService","msg":"reconcileServices","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"58fd40c2-1e2b-4fde-bd69-3d9fe8559f27","newSvc":{"namespace":"default","name":"rayservice-sample-serve-svc"}}
{"level":"info","ts":"2025-02-28T17:43:09.298Z","logger":"controllers.RayService","msg":"RayCluster rayservice-sample-raycluster-qdpmg's serveService has already exists, skip Update","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"58fd40c2-1e2b-4fde-bd69-3d9fe8559f27"}

Reproduction script

$ kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-service.sample.yaml

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
boyleconnor added the bug (Something isn't working) and triage labels on Feb 28, 2025
boyleconnor changed the title from "[Bug] Sample RayService does not load on v1.3.0" to "[Bug] RayService applications do not load on v1.3.0" on Feb 28, 2025
kenchung285 (Contributor) commented Mar 4, 2025:

Could you try kubectl describe pod {YOUR_UNREADY_WORKER_POD_NAME} to see what events you get? The reason your worker Pod is not ready may be shown there.

boyleconnor (Author) commented:

This time (unlike last time, when I ran on our large on-premises cluster) I'm running on a small AWS EC2 instance, so I had to kubectl edit the RayService and change all CPU requests/limits to 500m.

Here are the outputs you requested, both for the worker pod and the head pod.

Output of kubectl describe pod on worker
$ kubectl describe pod rayservice-sample-raycluster-vpjz2-small-group-worker-swzmf
Name:             rayservice-sample-raycluster-vpjz2-small-group-worker-swzmf
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-192-168-152-38.us-west-2.compute.internal/192.168.152.38
Start Time:       Mon, 03 Mar 2025 16:47:58 -0800
Labels:           app.kubernetes.io/created-by=kuberay-operator
                  app.kubernetes.io/name=kuberay
                  ray.io/cluster=rayservice-sample-raycluster-vpjz2
                  ray.io/group=small-group
                  ray.io/identifier=rayservice-sample-raycluster-vpjz2-worker
                  ray.io/is-ray-node=yes
                  ray.io/node-type=worker
                  ray.io/serve=true
Annotations:      <none>
Status:           Running
IP:               192.168.130.130
IPs:
  IP:           192.168.130.130
Controlled By:  RayCluster/rayservice-sample-raycluster-vpjz2
Init Containers:
  wait-gcs-ready:
    Container ID:  containerd://7f880b964abf5fd6bd21ee4037aa3a100ad83549bb838428b8e733d56ef7249f
    Image:         rayproject/ray:2.41.0
    Image ID:      docker.io/rayproject/ray@sha256:d57d976a11ba6ef2e2eb6184e8c5523db4b47d1b86b1036d255359824c8d40a0
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -lc
      --
    Args:
      
                            SECONDS=0
                            while true; do
                              if (( SECONDS <= 120 )); then
                                if ray health-check --address rayservice-sample-raycluster-vpjz2-head-svc.default.svc.cluster.local:6379 > /dev/null 2>&1; then
                                  echo "GCS is ready."
                                  break
                                fi
                                echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."
                              else
                                if ray health-check --address rayservice-sample-raycluster-vpjz2-head-svc.default.svc.cluster.local:6379; then
                                  echo "GCS is ready. Any error messages above can be safely ignored."
                                  break
                                fi
                                echo "$SECONDS seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md."
                              fi
                              sleep 5
                            done
                          
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 03 Mar 2025 16:47:59 -0800
      Finished:     Mon, 03 Mar 2025 16:48:09 -0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  256Mi
    Environment:
      FQ_RAY_IP:  rayservice-sample-raycluster-vpjz2-head-svc.default.svc.cluster.local
      RAY_IP:     rayservice-sample-raycluster-vpjz2-head-svc
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7txdq (ro)
Containers:
  ray-worker:
    Container ID:  containerd://991cfc55bb667be29c715eda92fef168cbaa1df493abf3801d557e46ef23c3cf
    Image:         rayproject/ray:2.41.0
    Image ID:      docker.io/rayproject/ray@sha256:d57d976a11ba6ef2e2eb6184e8c5523db4b47d1b86b1036d255359824c8d40a0
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -lc
      --
    Args:
      ulimit -n 65536; ray start  --address=rayservice-sample-raycluster-vpjz2-head-svc.default.svc.cluster.local:6379  --block  --dashboard-agent-listen-port=52365  --memory=2147483648  --metrics-export-port=8080  --num-cpus=1 
    State:          Running
      Started:      Mon, 03 Mar 2025 16:48:11 -0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  2Gi
    Requests:
      cpu:      500m
      memory:   2Gi
    Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=2s period=5s #success=1 #failure=120
    Readiness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8000/-/healthz | grep success] delay=10s timeout=2s period=5s #success=1 #failure=1
    Environment:
      FQ_RAY_IP:                                rayservice-sample-raycluster-vpjz2-head-svc.default.svc.cluster.local
      RAY_IP:                                   rayservice-sample-raycluster-vpjz2-head-svc
      RAY_CLUSTER_NAME:                          (v1:metadata.labels['ray.io/cluster'])
      RAY_CLOUD_INSTANCE_ID:                    rayservice-sample-raycluster-vpjz2-small-group-worker-swzmf (v1:metadata.name)
      RAY_NODE_TYPE_NAME:                        (v1:metadata.labels['ray.io/group'])
      KUBERAY_GEN_RAY_START_CMD:                ray start  --address=rayservice-sample-raycluster-vpjz2-head-svc.default.svc.cluster.local:6379  --block  --dashboard-agent-listen-port=52365  --memory=2147483648  --metrics-export-port=8080  --num-cpus=1 
      RAY_PORT:                                 6379
      RAY_timeout_ms_task_wait_for_death_info:  0
      RAY_gcs_server_request_timeout_seconds:   5
      RAY_SERVE_KV_TIMEOUT_S:                   5
      RAY_ADDRESS:                              rayservice-sample-raycluster-vpjz2-head-svc.default.svc.cluster.local:6379
      RAY_USAGE_STATS_KUBERAY_IN_USE:           1
      RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE:      1
    Mounts:
      /dev/shm from shared-mem (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7txdq (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  shared-mem:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  2Gi
  kube-api-access-7txdq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  2m34s             default-scheduler  0/2 nodes are available: 2 Insufficient memory. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod.
  Normal   Scheduled         62s               default-scheduler  Successfully assigned default/rayservice-sample-raycluster-vpjz2-small-group-worker-swzmf to ip-192-168-152-38.us-west-2.compute.internal
  Normal   Pulled            61s               kubelet            Container image "rayproject/ray:2.41.0" already present on machine
  Normal   Created           61s               kubelet            Created container wait-gcs-ready
  Normal   Started           61s               kubelet            Started container wait-gcs-ready
  Normal   Pulled            50s               kubelet            Container image "rayproject/ray:2.41.0" already present on machine
  Normal   Created           50s               kubelet            Created container ray-worker
  Normal   Started           49s               kubelet            Started container ray-worker
  Warning  Unhealthy         1s (x8 over 36s)  kubelet            Readiness probe failed: success
Output of kubectl describe pod on head
$ kubectl describe pod rayservice-sample-raycluster-vpjz2-head-skqmp
Name:             rayservice-sample-raycluster-vpjz2-head-skqmp
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-192-168-104-213.us-west-2.compute.internal/192.168.104.213
Start Time:       Mon, 03 Mar 2025 16:46:26 -0800
Labels:           app.kubernetes.io/created-by=kuberay-operator
                  app.kubernetes.io/name=kuberay
                  ray.io/cluster=rayservice-sample-raycluster-vpjz2
                  ray.io/group=headgroup
                  ray.io/identifier=rayservice-sample-raycluster-vpjz2-head
                  ray.io/is-ray-node=yes
                  ray.io/node-type=head
                  ray.io/serve=false
Annotations:      ray.io/ft-enabled: false
Status:           Running
IP:               192.168.109.73
IPs:
  IP:           192.168.109.73
Controlled By:  RayCluster/rayservice-sample-raycluster-vpjz2
Containers:
  ray-head:
    Container ID:  containerd://930555674850a2298adefa2a775c501da59230acedb38f65e70082b50193e7d4
    Image:         rayproject/ray:2.41.0
    Image ID:      docker.io/rayproject/ray@sha256:d57d976a11ba6ef2e2eb6184e8c5523db4b47d1b86b1036d255359824c8d40a0
    Ports:         6379/TCP, 8265/TCP, 10001/TCP, 8000/TCP, 8080/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/bash
      -lc
      --
    Args:
      ulimit -n 65536; ray start --head  --block  --dashboard-agent-listen-port=52365  --dashboard-host=0.0.0.0  --memory=2147483648  --metrics-export-port=8080  --num-cpus=1 
    State:          Running
      Started:      Mon, 03 Mar 2025 16:46:27 -0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  2Gi
    Requests:
      cpu:      500m
      memory:   2Gi
    Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8265/api/gcs_healthz | grep success] delay=30s timeout=5s period=5s #success=1 #failure=120
    Readiness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8265/api/gcs_healthz | grep success] delay=10s timeout=5s period=5s #success=1 #failure=10
    Environment:
      RAY_CLUSTER_NAME:                          (v1:metadata.labels['ray.io/cluster'])
      RAY_CLOUD_INSTANCE_ID:                    rayservice-sample-raycluster-vpjz2-head-skqmp (v1:metadata.name)
      RAY_NODE_TYPE_NAME:                        (v1:metadata.labels['ray.io/group'])
      KUBERAY_GEN_RAY_START_CMD:                ray start --head  --block  --dashboard-agent-listen-port=52365  --dashboard-host=0.0.0.0  --memory=2147483648  --metrics-export-port=8080  --num-cpus=1 
      RAY_PORT:                                 6379
      RAY_timeout_ms_task_wait_for_death_info:  0
      RAY_gcs_server_request_timeout_seconds:   5
      RAY_SERVE_KV_TIMEOUT_S:                   5
      RAY_ADDRESS:                              127.0.0.1:6379
      RAY_USAGE_STATS_KUBERAY_IN_USE:           1
      RAY_USAGE_STATS_EXTRA_TAGS:               kuberay_version=v1.3.0;kuberay_crd=RayService
      RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE:      1
    Mounts:
      /dev/shm from shared-mem (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gwk7d (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  shared-mem:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  2Gi
  kube-api-access-gwk7d:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  6m21s                 default-scheduler  Successfully assigned default/rayservice-sample-raycluster-vpjz2-head-skqmp to ip-192-168-104-213.us-west-2.compute.internal
  Normal   Pulled     6m20s                 kubelet            Container image "rayproject/ray:2.41.0" already present on machine
  Normal   Created    6m20s                 kubelet            Created container ray-head
  Normal   Started    6m20s                 kubelet            Started container ray-head
  Warning  Unhealthy  6m5s (x2 over 6m10s)  kubelet            Readiness probe failed:

(Note that the head pod's event was specifically "Readiness probe failed:" with no following text, unlike the worker pod, which said "Readiness probe failed: success".)
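For reference, the failing half of the worker's check can be reproduced by hand with the same command the readiness probe runs (command taken from the pod spec above; the Serve proxy on port 8000 only answers once Serve has actually started):

$ kubectl exec rayservice-sample-raycluster-vpjz2-small-group-worker-swzmf -- \
    bash -c 'wget -T 10 -q -O- http://localhost:8000/-/healthz'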

kevin85421 added the P0 (Critical issue that should be fixed ASAP) label and removed the triage label on Mar 5, 2025
andrewsykim (Collaborator) commented:

@boyleconnor can you share how you upgraded KubeRay on the cluster?

You mentioned this step:

$ helm upgrade kuberay-operator kuberay/kuberay-operator --version 1.2.2
Release "kuberay-operator" has been upgraded. Happy Helming!
NAME: kuberay-operator
LAST DEPLOYED: Fri Feb 28 09:36:42 2025
NAMESPACE: default
STATUS: deployed
REVISION: 9
TEST SUITE: None

Note that you also need to upgrade the CRDs to match the version of the Helm chart you are using. See the upgrade guide for more details: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/upgrade-guide.html
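For example, the CRDs can be upgraded with a remote kustomize build before upgrading the chart (a sketch based on the linked guide; pin the ref to the chart version you're installing):

$ kubectl replace -k "github.com/ray-project/kuberay/ray-operator/config/crd?ref=v1.3.0"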
