
[Bug] Ray cannot schedule the workload even when there are enough resources #3151

Status: Open
Labels: bug (Something isn't working), question (Further information is requested), ray

Zheaoli opened this issue Mar 4, 2025 · 1 comment

Zheaoli commented Mar 4, 2025

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Here's the config:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: demo-recsys-voyager
spec:
  serveService:
    metadata:
      name: demo-recsys-voyager
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-name: demo-recsys-voyager
        service.beta.kubernetes.io/aws-load-balancer-scheme: internal
    spec:
      type: LoadBalancer
      ports:
        - port: 6379
          targetPort: 6379
          protocol: TCP
          name: gcs
        - port: 8265
          targetPort: 8265
          protocol: TCP
          name: dashboard
        - port: 10001
          targetPort: 10001
          protocol: TCP
          name: client
        - port: 8000
          targetPort: 8000
          protocol: TCP
          name: serve
  serveConfigV2: |
    applications:
      - name: demo_recsys_retrieval_voyager_index_service
        route_prefix: /v1/vectors
        import_path: demo_recsys.indexing.server:demo_recsys_indexing_service_entrypoint
        runtime_env: 
          working_dir: "s3://demo-recsys-use1/serving/releases/v1.2.0.zip"
          py_modules: ["s3://demo-recsys-use1/serving/releases/demo_recsys-1.2.0-py3-none-any.whl"]
          pip: [
            "click", "datasets==3.2.0", "torch==2.5.0", "torchrec==1.0.0", "fbgemm-gpu==1.0.0", "torchmetrics",
            "mmh3==5.0.1", "pandas","pydantic==2.10.4", "ray[serve]==2.43.0", "rbloom==1.5.2", "rich",
            "safetensors==0.4.5", "psycopg[binary,pool]==3.2.3", "pgvector==0.3.6", "redis>=5.2.1", 
            "boto3>=1.35.97", "pydantic-settings>=2.7.1", "ujson>=5.10.0", "aiohttp[speedups]>=3.11.11", 
            "pinecone==5.4.2", "voyager>=2.1.0", "duckdb>=1.2.0",
          ]
        deployments:
        - name: demoRetrievalVoyagerIndexServiceHandler
          num_replicas: 1
          ray_actor_options:
            num_cpus: 2
        - name: demoRetrievalVoyagerIndexServiceIngress
          num_replicas: 1
          ray_actor_options:
            num_cpus: 2
  rayClusterConfig:
    rayVersion: '2.43.0' # Should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    autoscalerOptions:
      idleTimeoutSeconds: 1
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.40.0-py312-cpu
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
            resources:
              limits:
                cpu: "2"
                memory: "8G"
              requests:
                cpu: "1"
                memory: "2G"
          volumes:
            - name: ray-logs
              emptyDir: {}
    workerGroupSpecs:
    # The pod replicas in this group are typed as workers.
    - replicas: 1
      minReplicas: 1
      maxReplicas: 4
      groupName: ray-serve-worker-group
      rayStartParams: {}
      # Pod template
      template:
        spec:
          initContainers:
          - name: test
            image: amazon/aws-cli
            command:
            - sh
            - -c
            - |
              echo "Hello, World!"
          containers:
          - name: ray-worker
            image: rayproject/ray:2.40.0-py312-cpu
            resources:
              limits:
                cpu: 12
                memory: "32G"
                # nvidia.com/gpu: 1
              requests:
                cpu: 6
                memory: "24G"
            volumeMounts:
              - mountPath: /tmp
                name: tmp-ray
                # nvidia.com/gpu: 1
          # Add the matching taint to the GPU nodes; these tolerations let the worker pods schedule onto them.
          tolerations:
            - key: "ray.io/node-type"
              operator: "Equal"
              value: "worker"
              effect: "NoSchedule"
          volumes:
            - name: tmp-ray
              persistentVolumeClaim:
                claimName: efs-pvc

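To see what the operator actually creates from this manifest, the RayService and its pods can be inspected directly (a minimal sketch, assuming the manifest above is saved as rayservice.yaml; the ray.io/node-type labels are the ones KubeRay sets on the pods it creates):

# Apply the manifest and check the RayService status reported by the operator
kubectl apply -f rayservice.yaml
kubectl describe rayservice demo-recsys-voyager

# List the head and worker pods the operator created for this service
kubectl get pods -l ray.io/node-type=head
kubectl get pods -l ray.io/node-type=worker
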
Reproduction script

Here's the exception:

WARNING 2025-03-04 01:04:33,560 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:05:03,580 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:05:33,665 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:06:03,678 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:06:33,747 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:07:03,847 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:07:33,949 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details.
WARNING 2025-03-04 01:08:04,031 controller 306 -- Deployment 'demoRetrievalVoyagerIndexServiceHandler' in application 'demo_recsys_retrieval_voyager_index_service' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 2.0}, total resources available: {"CPU": 10.0}. Use `ray status` for more details
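
The warning itself suggests `ray status`; running it inside the head pod shows the resources and pending demands Ray sees (a minimal sketch, again assuming KubeRay's default ray.io/node-type=head pod label):

# Grab the head pod name, then print Ray's view of cluster resources and pending requests
HEAD_POD=$(kubectl get pods -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "$HEAD_POD" -- ray status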

Anything else

No response

Are you willing to submit a PR?

  • Yes, I am willing to submit a PR!
kevin85421 (Member) commented

Is it possible the Serve applications are spending time on the pip install? Can you try baking all the dependencies into the image and trying again? If the issue still exists, we can take a look.
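
For reference, baking the dependencies in could look like the sketch below. Assumptions: the base tag is copied from the worker spec above (note the manifest pins rayVersion 2.43.0 but uses a 2.40.0 image, so one version should be chosen consistently), the package list is taken from the serveConfigV2 pip section with ray[serve] omitted because the base image already ships Ray, and the registry path is hypothetical.

# Dockerfile
FROM rayproject/ray:2.40.0-py312-cpu

# Pre-install the application's dependencies so runtime_env no longer
# has to pip-install them every time a replica starts
RUN pip install --no-cache-dir \
    "click" "datasets==3.2.0" "torch==2.5.0" "torchrec==1.0.0" "fbgemm-gpu==1.0.0" "torchmetrics" \
    "mmh3==5.0.1" "pandas" "pydantic==2.10.4" "rbloom==1.5.2" "rich" \
    "safetensors==0.4.5" "psycopg[binary,pool]==3.2.3" "pgvector==0.3.6" "redis>=5.2.1" \
    "boto3>=1.35.97" "pydantic-settings>=2.7.1" "ujson>=5.10.0" "aiohttp[speedups]>=3.11.11" \
    "pinecone==5.4.2" "voyager>=2.1.0" "duckdb>=1.2.0"

After building and pushing the image, point both headGroupSpec and workerGroupSpecs at it and drop the pip list from runtime_env:

docker build -t <your-registry>/demo-recsys-ray:2.40.0 .
docker push <your-registry>/demo-recsys-ray:2.40.0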
