Allow multiple node groups in the model cache CR #4134

Merged (4 commits) on Dec 19, 2024


greenmoon55
Contributor

@greenmoon55 greenmoon55 commented Dec 17, 2024

What this PR does / why we need it:
First step toward fixing #4126: this PR just makes the node group a list in the ModelCache resource.
Currently, one big issue is that if a model is cached on one type of GPU node, it cannot be deployed to other types of GPU nodes, because the inference service tries to use local models without knowing which node it is deployed on.
In the future, KServe might need to know where the model is cached to decide whether to use local models.
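
To make the change concrete, here is a minimal sketch of a spec that carries a list of node groups, plus the kind of lookup KServe might eventually use to decide whether a local model is available for a given node group. The type `ModelCacheSpec`, its fields, the helper `CachedOn`, and the node group names are all hypothetical and simplified for illustration; they are not the actual KServe API types or controller logic.

```go
package main

import "fmt"

// ModelCacheSpec is a simplified, hypothetical stand-in for the model cache
// resource spec. The real KServe type has more fields and may differ.
type ModelCacheSpec struct {
	SourceModelURI string
	// NodeGroups lists every node group the model should be cached on.
	// Before this change the resource carried a single node group.
	NodeGroups []string
}

// CachedOn reports whether the spec asks for the model to be cached on the
// given node group. This is an assumed helper, not part of KServe.
func CachedOn(spec ModelCacheSpec, nodeGroup string) bool {
	for _, g := range spec.NodeGroups {
		if g == nodeGroup {
			return true
		}
	}
	return false
}

func main() {
	spec := ModelCacheSpec{
		SourceModelURI: "s3://example-bucket/models/demo", // placeholder URI
		NodeGroups:     []string{"gpu-a100", "gpu-h100"},  // placeholder names
	}
	for _, ng := range []string{"gpu-a100", "gpu-t4"} {
		fmt.Printf("%s: use local model = %v\n", ng, CachedOn(spec, ng))
	}
}
```

With a list, one cache resource can cover several GPU node types, and a deployment landing on a node group outside the list can fall back to downloading the model instead of assuming a local copy exists.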

Which issue(s) this PR fixes:
Fixes #4126

@greenmoon55 greenmoon55 marked this pull request as ready for review December 17, 2024 23:30
Signed-off-by: Jin Dong <[email protected]>
Signed-off-by: Jin Dong <[email protected]>
Signed-off-by: Jin Dong <[email protected]>
@greenmoon55
Contributor Author

/rerun-all

@yuzisun
Member

yuzisun commented Dec 19, 2024

/lgtm
/approve

@github-actions github-actions bot added the lgtm label Dec 19, 2024
@yuzisun yuzisun merged commit 9b2fc4b into kserve:master Dec 19, 2024
63 checks passed
@greenmoon55 greenmoon55 deleted the multiple-nodegroups branch December 19, 2024 16:25
sivanantha321 added a commit to sivanantha321/kserve that referenced this pull request Dec 20, 2024
yuzisun pushed a commit to yuzisun/kserve that referenced this pull request Dec 22, 2024
* Allow multiple node groups in the model cache CR

Signed-off-by: Jin Dong <[email protected]>

* Fix test

---------

Signed-off-by: Jin Dong <[email protected]>
yuzisun pushed a commit to yuzisun/kserve that referenced this pull request Dec 22, 2024
* Allow multiple node groups in the model cache CR

Signed-off-by: Jin Dong <[email protected]>

* Fix test

---------

Signed-off-by: Jin Dong <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
yuzisun added a commit that referenced this pull request Dec 22, 2024
* Local Model Node CR (#3978)

* init CR

Signed-off-by: Gavin Li <[email protected]>

* make generate

Signed-off-by: Gavin Li <[email protected]>

* make manifests

Signed-off-by: Gavin Li <[email protected]>

* black format

Signed-off-by: Gavin Li <[email protected]>

* fix generated python code

Signed-off-by: Gavin Li <[email protected]>

* feedback

Signed-off-by: Gavin Li <[email protected]>

* more feedback

Signed-off-by: Gavin Li <[email protected]>

* black format

Signed-off-by: Gavin Li <[email protected]>

* make manifests

Signed-off-by: Gavin Li <[email protected]>

---------

Signed-off-by: Gavin Li <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

* Model cache controller and node agent  (#4089)

* LocalModelNode Daemonset Controller Skeleton (#4026)

* hello world controller

Signed-off-by: Gavin Li <[email protected]>

* go fmt

Signed-off-by: Gavin Li <[email protected]>

* daemonset

Signed-off-by: Gavin Li <[email protected]>

* Update Makefile

Co-authored-by: Jin Dong <[email protected]>
Signed-off-by: Gavin Li <[email protected]>

* make generate

Signed-off-by: Gavin Li <[email protected]>

* install LocalModelNode CRD

Signed-off-by: Gavin Li <[email protected]>

* feedback

Signed-off-by: Gavin Li <[email protected]>

* make manifests

Signed-off-by: Gavin Li <[email protected]>

* agent

Signed-off-by: Gavin Li <[email protected]>

Co-authored-by: Jin Dong <[email protected]>

* LocalModelController creates LocalModelNode resource for ready nodes (#4036)

* Manage localmodelNode

Signed-off-by: Jin Dong <[email protected]>

* Update patch

Signed-off-by: Jin Dong <[email protected]>

* Fix rbac

Signed-off-by: Jin Dong <[email protected]>

* Add a test to controller_test.go

Signed-off-by: Jin Dong <[email protected]>

* Update pkg/controller/v1alpha1/localmodel/controller.go

Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Jin Dong <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

* Delete from LocalModelNode when the localmodel is deleted (#4053)

* Delete model from LocalModelNode

Signed-off-by: Jin Dong <[email protected]>

* Cleanup code

Signed-off-by: Jin Dong <[email protected]>

* Cleanup code

Signed-off-by: Jin Dong <[email protected]>

* Fix lint

Signed-off-by: Jin Dong <[email protected]>

* Initializer node status map

Signed-off-by: Jin Dong <[email protected]>

* Address comments

Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Jin Dong <[email protected]>

* Update Model status from LocalModelNode status (#4056)

* Delete model from LocalModelNode

Signed-off-by: Jin Dong <[email protected]>

* Cleanup code

Signed-off-by: Jin Dong <[email protected]>

* Cleanup code

Signed-off-by: Jin Dong <[email protected]>

* Fix lint

Signed-off-by: Jin Dong <[email protected]>

* Initializer node status map

Signed-off-by: Jin Dong <[email protected]>

* Update status

Signed-off-by: Jin Dong <[email protected]>

* Update localmodel node status

Signed-off-by: Jin Dong <[email protected]>

* Remove job dependency from localmodel controller

Signed-off-by: Jin Dong <[email protected]>

* Remove some unused lines

Signed-off-by: Jin Dong <[email protected]>

* Add comments

Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Jin Dong <[email protected]>

* LocalModelNode Agent that creates download jobs and update statuses from jobs (#4075)

* download working

Signed-off-by: Gavin Li <[email protected]>

* delete working

Signed-off-by: Gavin Li <[email protected]>

* cleanup

Signed-off-by: Gavin Li <[email protected]>

* gofmt

Signed-off-by: Gavin Li <[email protected]>

* Delete model from LocalModelNode

Signed-off-by: Jin Dong <[email protected]>

* Cleanup code

Signed-off-by: Jin Dong <[email protected]>

* Fix lint

Signed-off-by: Jin Dong <[email protected]>

* Initializer node status map

Signed-off-by: Jin Dong <[email protected]>

* Update status

Signed-off-by: Jin Dong <[email protected]>

* Update localmodel node status

Signed-off-by: Jin Dong <[email protected]>

* Remove job dependency from localmodel controller

Signed-off-by: Jin Dong <[email protected]>

* Remove some unused lines

Signed-off-by: Jin Dong <[email protected]>

* Add comments

Signed-off-by: Jin Dong <[email protected]>

* Update manager

Signed-off-by: Jin Dong <[email protected]>

* Update rbac

Signed-off-by: Jin Dong <[email protected]>

* Add tests and temporarily remove delete models code

Signed-off-by: Jin Dong <[email protected]>

* Do not create download jobs if model is already downloaded

Signed-off-by: Jin Dong <[email protected]>

* remove misleading log line

Signed-off-by: Jin Dong <[email protected]>

* Clean up code a little bit

Signed-off-by: Jin Dong <[email protected]>

* Update configurations

Signed-off-by: Jin Dong <[email protected]>

* update test

Signed-off-by: Jin Dong <[email protected]>

* Use a fixed name for the download container

Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Gavin Li <[email protected]>
Signed-off-by: Jin Dong <[email protected]>
Co-authored-by: Gavin Li <[email protected]>

* Delete models from local disk when they are not in LocalModelNode spec (#4084)

* download working

Signed-off-by: Gavin Li <[email protected]>

* delete working

Signed-off-by: Gavin Li <[email protected]>

* Delete model from LocalModelNode

Signed-off-by: Jin Dong <[email protected]>

* Initializer node status map

Signed-off-by: Jin Dong <[email protected]>

* Update status

Signed-off-by: Jin Dong <[email protected]>

* Update localmodel node status

Signed-off-by: Jin Dong <[email protected]>

* Update manager

Signed-off-by: Jin Dong <[email protected]>

* Update rbac

Signed-off-by: Jin Dong <[email protected]>

* Add tests and temporarily remove delete models code

Signed-off-by: Jin Dong <[email protected]>

* Do not create download jobs if model is already downloaded

Signed-off-by: Jin Dong <[email protected]>

* Delete function

Signed-off-by: Jin Dong <[email protected]>

* Update configurations

Signed-off-by: Jin Dong <[email protected]>

* Add test and Fix deletion code

Signed-off-by: Jin Dong <[email protected]>

* Use a fixed name for the download container

Signed-off-by: Jin Dong <[email protected]>

* Remove deleted models from status and periodically trigger reconciliation

Signed-off-by: Jin Dong <[email protected]>

* Fix storagecontainer permissions and a minor change

Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Gavin Li <[email protected]>
Signed-off-by: Jin Dong <[email protected]>
Co-authored-by: Gavin Li <[email protected]>

---------

Signed-off-by: Jin Dong <[email protected]>
Signed-off-by: Gavin Li <[email protected]>
Co-authored-by: Gavin Li <[email protected]>
Co-authored-by: Jin Dong <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

* Update ClusterLocalModel to LocalModelCache (#4105)

* Update ClusterLocalModel to LocalModelCache

Signed-off-by: Dan Sun <[email protected]>

* Fix generation fmt

Signed-off-by: Dan Sun <[email protected]>

* black fmt

Signed-off-by: Dan Sun <[email protected]>

* Fix generated code

Signed-off-by: Dan Sun <[email protected]>

* Run go mod tidy

Signed-off-by: Dan Sun <[email protected]>

* Fix model status

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>

* Fix LocalModelCache controller reconciles deleted resource (#4106)

* Fix LocalModel controller reconciles deleted resource

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Rebase

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Fix path base routing e2e workflow

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

* Add namespace to localmodel and localmodelnode ServiceAccount helm chart (#4111)

add localmodelnode agent image

Signed-off-by: Rituraj Singh <[email protected]>
Co-authored-by: Rituraj Singh <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

* Detect missing models and redownload models (#4095)

* another squash

Signed-off-by: Jin Dong <[email protected]>

* Add JobTTLSecondsAfterFinished option

Signed-off-by: Jin Dong <[email protected]>

* Update config

Signed-off-by: Jin Dong <[email protected]>

* Use labels to filter jobs instead of deleting old jobs

Signed-off-by: Jin Dong <[email protected]>

* Add log in test

Signed-off-by: Jin Dong <[email protected]>

* Fix test and helm chart

Signed-off-by: Jin Dong <[email protected]>

* Create a separate file system utils file

Signed-off-by: Jin Dong <[email protected]>

* Add comments

Signed-off-by: Jin Dong <[email protected]>

* Fix status update bug

Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Jin Dong <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

* Allow multiple node groups in the model cache CR (#4134)

* Allow multiple node groups in the model cache CR

Signed-off-by: Jin Dong <[email protected]>

* Fix test

---------

Signed-off-by: Jin Dong <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

* Annotation to disable model cache (#4118)

Signed-off-by: Jin Dong <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

* Clean up jobs in model cache agent (#4140)

* Clean up jobs

Signed-off-by: Jin Dong <[email protected]>

* fix lint

Signed-off-by: Jin Dong <[email protected]>

* Fix lint

Signed-off-by: Jin Dong <[email protected]>

* Fix deletion propagation policy

Signed-off-by: Jin Dong <[email protected]>

* Update test

Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Jin Dong <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

* Ensure Model root folder exists (#4142)

Signed-off-by: Jin Dong <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

* Add NodeGroup Name Into PVC Name (#4141)

* Add NodeGroup Name Into PVC Name

Signed-off-by: Gavin Li <[email protected]>

* Add comment to fix multiple node group

Signed-off-by: Dan Sun <[email protected]>

* fix openvino dependency

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Gavin Li <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

* Make LocalModel Agent reconciliation frequency configurable (#4143)

* Make reconciliation configurable

Signed-off-by: Jin Dong <[email protected]>

* Fix codegen

Signed-off-by: Jin Dong <[email protected]>

* Remove a redundant space

Signed-off-by: Jin Dong <[email protected]>

* Rename config

Signed-off-by: Jin Dong <[email protected]>

* Fix lint

Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Jin Dong <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

* LocalModelCache Admission Webhook (#4102)

* init

Signed-off-by: Gavin Li <[email protected]>

* broken code

Signed-off-by: Gavin Li <[email protected]>

* register webhook

Signed-off-by: Gavin Li <[email protected]>

* rename + working

Signed-off-by: Gavin Li <[email protected]>

* pass in client

Signed-off-by: Gavin Li <[email protected]>

* check storageURI

Signed-off-by: Gavin Li <[email protected]>

---------

Signed-off-by: Gavin Li <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

* Fix isvc role localmodelcache permission (#4131)

* Fix localmodelcache permission for isvc

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Patch localmodelcache webhook for kubeflow overlay

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Gavin Li <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Signed-off-by: Jin Dong <[email protected]>
Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: Rituraj Singh <[email protected]>
Co-authored-by: Gavin Li <[email protected]>
Co-authored-by: Gavin Li <[email protected]>
Co-authored-by: Jin Dong <[email protected]>
Co-authored-by: Sivanantham <[email protected]>
Co-authored-by: Rituraj Singh <[email protected]>
Co-authored-by: Rituraj Singh <[email protected]>
sivanantha321 added a commit to sivanantha321/kserve that referenced this pull request Dec 24, 2024
bentohset pushed a commit to bentohset/kserve that referenced this pull request Dec 26, 2024
* Allow multiple node groups in the model cache CR

Signed-off-by: Jin Dong <[email protected]>

* Fix test

---------

Signed-off-by: Jin Dong <[email protected]>
Signed-off-by: bentohset <[email protected]>
sivanantha321 added a commit to sivanantha321/kserve that referenced this pull request Jan 6, 2025
yuzisun pushed a commit that referenced this pull request Jan 11, 2025
* Add client sdk for localmodelcache, localmodelnodegroup

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Add e2e test for modelcache

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Use docker driver and minikube tunnel

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Merge "Allow multiple node groups in the model cache CR (#4134)"

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Try mounting image dir

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Add local model agent to image scan

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Debug

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Create model root directory beforehand

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Restart kserve controller after patch

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Enable PVC direct mount in e2e test

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Reduce pv storage to 1GB

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Update modelcache test

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Update status-check to include modelcache logs

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
gavrissh added a commit to gavrissh/kserve that referenced this pull request Jan 17, 2025
Signed-off-by: Gavrish Prabhu <[email protected]>

fix formatting

Signed-off-by: Gavrish Prabhu <[email protected]>

add vllm to poetry

Signed-off-by: Gavrish Prabhu <[email protected]>

Add affinity and tolerations to localmodel daemonset (kserve#4173)

* Add affinity and tolerations to localmodel daemonset

Signed-off-by: Jin Dong <[email protected]>

* make generate

Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Jin Dong <[email protected]>

Fix s3 download PermanentRedirectError for legacy s3 endpoint (kserve#4157)

* sets virtual addressing style for legacy s3 endpoint

Signed-off-by: bentohset <[email protected]>

* add unit test

Signed-off-by: bentohset <[email protected]>

* fix formatting

Signed-off-by: bentohset <[email protected]>

* fix unit tests

Signed-off-by: bentohset <[email protected]>

---------

Signed-off-by: bentohset <[email protected]>
Co-authored-by: Lize Cai <[email protected]>

Make label and annotation propagation configurable (kserve#4030)

* Make label and annotation propagation configurable

chore: Make the DisallowedAnnotations and Labels configurable through
	ConfigMap so users can configure them quickly.

fixes kserve#3710

Signed-off-by: Spolti <[email protected]>

* generate boilerplate code

Signed-off-by: Spolti <[email protected]>

* Edgar's review changes

Signed-off-by: Spolti <[email protected]>

---------

Signed-off-by: Spolti <[email protected]>

Add ModelCache e2e test (kserve#4136)

* Add client sdk for localmodelcache, localmodelnodegroup

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Add e2e test for modelcache

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Use docker driver and minikube tunnel

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Merge "Allow multiple node groups in the model cache CR (kserve#4134)"

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Try mounting image dir

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Add local model agent to image scan

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Debug

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Create model root directory beforehand

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Restart kserve controller after patch

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Enable PVC direct mount in e2e test

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Reduce pv storage to 1GB

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Update modelcache test

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Update status-check to include modelcache logs

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

Update vllm to 0.6.6 (kserve#4176)

Signed-off-by: Rajat Vig <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

[bugfix] fix s3 storage download filename bug (kserve#4162)

* [bugfix] fix s3 storage download filename bug

- ensure correct path and file name preservation during s3 downloads in
  storage-initializer

Signed-off-by: Jaeyeon Kim <[email protected]>

* update lint

- fix format

Signed-off-by: Jaeyeon Kim <[email protected]>

* fix format

Signed-off-by: Jaeyeon Kim <[email protected]>

---------

Signed-off-by: Jaeyeon Kim <[email protected]>

update lint fix

Signed-off-by: Gavrish Prabhu <[email protected]>

update lint fix

Signed-off-by: Gavrish Prabhu <[email protected]>

update lint fix

Signed-off-by: Gavrish Prabhu <[email protected]>

openai model test

Signed-off-by: Gavrish Prabhu <[email protected]>
Development

Successfully merging this pull request may close these issues.

Support multiple LocalModel NodeGroup
2 participants