
[Bug] Ray cannot schedule the workload even when there are enough resources #3151

Status: Open
Labels: bug (Something isn't working), question (Further information is requested), ray

Zheaoli opened this issue Mar 4, 2025 · 1 comment

Zheaoli commented Mar 4, 2025

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Here's the config:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: demo-recsys-voyager
spec:
  serveService:
    metadata:
      name: demo-recsys-voyager
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-name: demo-recsys-voyager
        service.beta.kubernetes.io/aws-load-balancer-scheme: internal
    spec:
      type: LoadBalancer
      ports:
        - port: 6379
          targetPort: 6379
          protocol: TCP
          name: gcs
        - port: 8265
          targetPort: 8265
          protocol: TCP
          name: dashboard
        - port: 10001
          targetPort: 10001
          protocol: TCP
          name: client
        - port: 8000
          targetPort: 8000
          protocol: TCP
          name: serve
  serveConfigV2: |
    applications:
      - name: demo_recsys_retrieval_voyager_index_service
        route_prefix: /v1/vectors
        import_path: demo_recsys.indexing.server:demo_recsys_indexing_service_entrypoint
        runtime_env: 
          working_dir: "s3://demo-recsys-use1/serving/releases/v1.2.0.zip"
          py_modules: ["s3://demo-recsys-use1/serving/releases/demo_recsys-1.2.0-py3-none-any.whl"]
          pip: [
            "click", "datasets==3.2.0", "torch==2.5.0", "torchrec==1.0.0", "fbgemm-gpu==1.0.0", "torchmetrics",
            "mmh3==5.0.1", "pandas","pydantic==2.10.4", "ray[serve]==2.43.0", "rbloom==1.5.2", "rich",
            "safetensors==0.4.5", "psycopg[binary,pool]==3.2.3", "pgvector==0.3.6", "redis>=5.2.1", 
            "boto3>=1.35.97", "pydantic-settings>=2.7.1", "ujson>=5.10.0", "aiohttp[speedups]>=3.11.11", 
            "pinecone==5.4.2", "voyager>=2.1.0", "duckdb>=1.2.0",
          ]
        deployments:
        - name: demoRetrievalVoyagerIndexServiceHandler
          num_replicas: 1
          ray_actor_options:
            num_cpus: 2
        - name: demoRetrievalVoyagerIndexServiceIngress
          num_replicas: 1
          ray_actor_options:
            num_cpus: 2
  rayClusterConfig:
    rayVersion: '2.43.0' # Should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    autoscalerOptions:
      idleTimeoutSeconds: 1
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.40.0-py312-cpu
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
            resources:
              limits:
                cpu: "2"
                memory: "8G"
              requests:
                cpu: "1"
                memory: "2G"
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
    # The pod replicas in this group are typed as workers.
    - replicas: 1
      minReplicas: 1
      maxReplicas: 4
      groupName: ray-serve-worker-group
      rayStartParams: {}
      # Pod template
      template:
        spec:
          initContainers:
          - name: test
            image: amazon/aws-cli
            command:
            - sh
            - -c
            - |
              echo "Hello, World!"
          containers:
          - name: ray-worker
            image: rayproject/ray:2.40.0-py312-cpu
            resources:
              limits:
                cpu: 12
                memory: "32G"
                # nvidia.com/gpu: 1
              requests:
                cpu: 6
                memory: "24G"
            volumeMounts:
              - mountPath: /tmp
                name: tmp-ray
                # nvidia.com/gpu: 1
          # Add the matching taint to the GPU nodes; these tolerations let the worker pods schedule onto them.
          tolerations:
            - key: "ray.io/node-type"
              operator: "Equal"
              value: "worker"
              effect: "NoSchedule"
          volumes:
            - name: tmp-ray
              persistentVolumeClaim:
                claimName: efs-pvc

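To see what the operator actually creates from this manifest, the RayService and its pods can be inspected directly (a minimal sketch, assuming the manifest above is saved as rayservice.yaml; the ray.io/node-type labels are the ones KubeRay sets on the pods it creates):

# Apply the manifest and check the RayService status reported by the operator
kubectl apply -f rayservice.yaml
kubectl describe rayservice demo-recsys-voyager

# List the head and worker pods the operator created for this service
kubectl get pods -l ray.io/node-type=head
kubectl get pods -l ray.io/node-type=worker
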
Reproduction script

Here's the exception:

WARNING 2025-03-04 01:04:33,560 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:05:03,580 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:05:33,665 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:06:03,678 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:06:33,747 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:07:03,847 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:07:33,949 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:08:04,031 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details
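
The warning itself suggests `ray status`; running it inside the head pod shows the resources and pending demands Ray sees (a minimal sketch, again assuming KubeRay's default ray.io/node-type=head pod label):

# Grab the head pod name, then print Ray's view of cluster resources and pending requests
HEAD_POD=$(kubectl get pods -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "$HEAD_POD" -- ray status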

Anything else

No response

Are you willing to submit a PR?

  • Yes, I am willing to submit a PR!
kevin85421 (Member) commented

Is it possible the Serve applications are spending time on the pip install? Can you try baking all the dependencies into the image and trying again? If the issue still exists, we can take a look.
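
For reference, baking the dependencies in could look like the sketch below. Assumptions: the base tag is copied from the worker spec above (note the manifest pins rayVersion 2.43.0 but uses a 2.40.0 image, so one version should be chosen consistently), the package list is taken from the serveConfigV2 pip section with ray[serve] omitted because the base image already ships Ray, and the registry path is hypothetical.

# Dockerfile
FROM rayproject/ray:2.40.0-py312-cpu

# Pre-install the application's dependencies so runtime_env no longer
# has to pip-install them every time a replica starts
RUN pip install --no-cache-dir \
    "click" "datasets==3.2.0" "torch==2.5.0" "torchrec==1.0.0" "fbgemm-gpu==1.0.0" "torchmetrics" \
    "mmh3==5.0.1" "pandas" "pydantic==2.10.4" "rbloom==1.5.2" "rich" \
    "safetensors==0.4.5" "psycopg[binary,pool]==3.2.3" "pgvector==0.3.6" "redis>=5.2.1" \
    "boto3>=1.35.97" "pydantic-settings>=2.7.1" "ujson>=5.10.0" "aiohttp[speedups]>=3.11.11" \
    "pinecone==5.4.2" "voyager>=2.1.0" "duckdb>=1.2.0"

After building and pushing the image, point both headGroupSpec and workerGroupSpecs at it and drop the pip list from runtime_env:

docker build -t <your-registry>/demo-recsys-ray:2.40.0 .
docker push <your-registry>/demo-recsys-ray:2.40.0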
