[BUG] Cannot submit a Ray job to an existing cluster #5877
Comments
I think the easiest thing to do is to rework the documentation to be correct for the current behavior. It looks like updating the module to submit work to an existing cluster would be non-trivial.
@gpgn Quick question: do you want to use KubeRay to submit a job, or use Flyte to create a pod that connects to your Ray cluster?
@pingsutw We have an existing Ray cluster on our Kubernetes cluster and wanted to try and submit a job to that via Flyte.
After spending quite a bit of time looking at the Ray plugin code, I can say that it absolutely does not cleanly support running a remote submitter that talks to an existing cluster. The plugin makes a lot of assumptions about head and worker nodes being configured and will try to submit Ray custom resources. This will need quite a bit of reworking.
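To make the distinction in the comments above concrete, here is a minimal sketch of the branch a reworked plugin would need: reuse a running cluster when only an address is given, otherwise create a new one. `RayTarget`, `choose_mode`, and `submit_to_existing` are hypothetical names for illustration and are not part of flytekit or the Flyte Ray plugin. The use of Ray's `JobSubmissionClient` against the dashboard URL (commonly port 8265) is an assumption about how a remote submitter could talk to the cluster, not a description of existing plugin code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RayTarget:
    """Hypothetical stand-in for the two RayJobConfig fields discussed here."""
    address: Optional[str] = None
    worker_node_config: List[dict] = field(default_factory=list)


def choose_mode(target: RayTarget) -> str:
    """Decide whether to reuse a running cluster or create a new one."""
    if target.worker_node_config:
        # Worker groups were requested: keep creating a fresh RayCluster.
        return "create_cluster"
    if target.address:
        # Only an address was given: talk to the cluster that already exists.
        return "submit_to_existing"
    raise ValueError("need either worker_node_config or address")


def submit_to_existing(dashboard_url: str, entrypoint: str) -> str:
    """Submit a shell entrypoint through Ray's Jobs API and return the job id."""
    # Lazy import so this sketch loads even where Ray is not installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url)  # assumed form: "http://<head-host>:8265"
    return client.submit_job(entrypoint=entrypoint)
```

With this shape, `choose_mode(RayTarget(address="http://<head-host>:8265"))` selects the existing-cluster path, while any non-empty `worker_node_config` keeps today's behavior of creating a RayCluster.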
Describe the bug
Asked to create an issue from this thread. The workaround is simple, but documenting the issue in any case.
I’m setting up the integration with Ray, and it seems to work nicely when creating a fresh RayCluster using `@task(task_config=RayJobConfig(worker_node_config=[WorkerNodeConfig(…)]))`. I can see the cluster starting, the job getting scheduled and distributed, and completing successfully.
I’m having trouble using an existing RayCluster (in the same cluster), though. From the docs here I read that I should be able to use `@task(task_config=RayJobConfig(address="<RAY_CLUSTER_ADDRESS>"))`.

However, when trying that, it seems `worker_node_config` is a required argument. I tried using an empty list instead, but it still tries to start a new RayCluster instead of using the existing one found at `address`.
The `address` works fine if I just run:

It looks like the `worker_node_config` argument has been required since the initial commit, and we can't seem to find code that submits a job without creating a new cluster. Not sure how the docs example has ever worked?

This seems to work as a simple workaround:
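One way such a workaround could look, sketched under assumptions rather than taken from this thread: skip `RayJobConfig` entirely and connect to the running cluster from inside an ordinary task with `ray.init(address=...)`. The helper names below are hypothetical, and the `ray://` Ray Client URL with its default client port 10001 is an assumption about how the existing cluster is exposed.

```python
def ray_client_address(host: str, port: int = 10001) -> str:
    """Build a Ray Client URL; 10001 is Ray's default client server port."""
    return f"ray://{host}:{port}"


def run_on_existing_cluster(address: str, x: int) -> int:
    """Connect to an already running Ray cluster and run one remote function."""
    # Lazy import so this sketch loads even where Ray is not installed.
    import ray

    # Connects to the existing cluster; no RayCluster resource is created.
    ray.init(address=address)
    try:
        @ray.remote
        def square(v: int) -> int:
            return v * v

        # Schedule the function on the cluster and wait for its result.
        return ray.get(square.remote(x))
    finally:
        # Always disconnect, even if the remote call fails.
        ray.shutdown()
```

In a Flyte workflow, the body of `run_on_existing_cluster` would sit inside a plain `@task` with no `task_config`, so the Ray plugin is never invoked. The task's container image would need Ray installed, and the cluster address would need to be reachable from the pod.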
Expected behavior
I'm not sure whether this is something Flyte wants to support (the docs suggest it does, but there is no code to do it). Either the docs could be updated to remove the option to submit to an existing cluster, or the example there could be made to work as written: instead of starting a new RayCluster, submit the job to the one existing at `address` when `worker_node_config` is omitted, as in the docs example above.
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?