Allow multiple node groups in the model cache CR #4134
Merged
Signed-off-by: Jin Dong <[email protected]>
c070cc5 to 6b6a0d2
/rerun-all
/lgtm
sivanantha321 added a commit to sivanantha321/kserve that referenced this pull request on Dec 20, 2024
Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
yuzisun pushed a commit to yuzisun/kserve that referenced this pull request on Dec 22, 2024
* Allow multiple node groups in the model cache CR Signed-off-by: Jin Dong <[email protected]> * Fix test --------- Signed-off-by: Jin Dong <[email protected]>
yuzisun pushed a commit to yuzisun/kserve that referenced this pull request on Dec 22, 2024
* Allow multiple node groups in the model cache CR Signed-off-by: Jin Dong <[email protected]> * Fix test --------- Signed-off-by: Jin Dong <[email protected]> Signed-off-by: Dan Sun <[email protected]>
yuzisun added a commit that referenced this pull request on Dec 22, 2024
* Local Model Node CR (#3978) * init CR Signed-off-by: Gavin Li <[email protected]> * make generate Signed-off-by: Gavin Li <[email protected]> * make manifests Signed-off-by: Gavin Li <[email protected]> * black format Signed-off-by: Gavin Li <[email protected]> * fix generated python code Signed-off-by: Gavin Li <[email protected]> * feedback Signed-off-by: Gavin Li <[email protected]> * more feedback Signed-off-by: Gavin Li <[email protected]> * black format Signed-off-by: Gavin Li <[email protected]> * make manifests Signed-off-by: Gavin Li <[email protected]> --------- Signed-off-by: Gavin Li <[email protected]> Signed-off-by: Dan Sun <[email protected]> * Model cache controller and node agent (#4089) * LocalModelNode Daemonset Controller Skeleton (#4026) * hello world controller Signed-off-by: Gavin Li <[email protected]> * go fmt Signed-off-by: Gavin Li <[email protected]> * daemonset Signed-off-by: Gavin Li <[email protected]> * Update Makefile Co-authored-by: Jin Dong <[email protected]> Signed-off-by: Gavin Li <[email protected]> * make generate Signed-off-by: Gavin Li <[email protected]> * install LocalModelNode CRD Signed-off-by: Gavin Li <[email protected]> * feedback Signed-off-by: Gavin Li <[email protected]> * make manifests Signed-off-by: Gavin Li <[email protected]> * agent Signed-off-by: Gavin Li <[email protected]> Co-authored-by: Jin Dong <[email protected]> * LocalModelController creates LocalModelNode resource for ready nodes (#4036) * Manage localmodelNode Signed-off-by: Jin Dong <[email protected]> * Update patch Signed-off-by: Jin Dong <[email protected]> * Fix rbac Signed-off-by: Jin Dong <[email protected]> * Add a test to controller_test.go Signed-off-by: Jin Dong <[email protected]> * Update pkg/controller/v1alpha1/localmodel/controller.go Co-authored-by: Dan Sun <[email protected]> Signed-off-by: Jin Dong <[email protected]> --------- Signed-off-by: Jin Dong <[email protected]> Co-authored-by: Dan Sun <[email protected]> * Delete from 
LocalModelNode when the localmodel is deleted (#4053) * Delete model from LocalModelNode Signed-off-by: Jin Dong <[email protected]> * Cleanup code Signed-off-by: Jin Dong <[email protected]> * Cleanup code Signed-off-by: Jin Dong <[email protected]> * Fix lint Signed-off-by: Jin Dong <[email protected]> * Initializer node status map Signed-off-by: Jin Dong <[email protected]> * Address comments Signed-off-by: Jin Dong <[email protected]> --------- Signed-off-by: Jin Dong <[email protected]> * Update Model status from LocalModelNode status (#4056) * Delete model from LocalModelNode Signed-off-by: Jin Dong <[email protected]> * Cleanup code Signed-off-by: Jin Dong <[email protected]> * Cleanup code Signed-off-by: Jin Dong <[email protected]> * Fix lint Signed-off-by: Jin Dong <[email protected]> * Initializer node status map Signed-off-by: Jin Dong <[email protected]> * Update status Signed-off-by: Jin Dong <[email protected]> * Update localmodel node status Signed-off-by: Jin Dong <[email protected]> * Remove job dependency from localmodel controller Signed-off-by: Jin Dong <[email protected]> * Remove some unused lines Signed-off-by: Jin Dong <[email protected]> * Add comments Signed-off-by: Jin Dong <[email protected]> --------- Signed-off-by: Jin Dong <[email protected]> * LocalModelNode Agent that creates download jobs and update statuses from jobs (#4075) * download working Signed-off-by: Gavin Li <[email protected]> * delete working Signed-off-by: Gavin Li <[email protected]> * cleanup Signed-off-by: Gavin Li <[email protected]> * gofmt Signed-off-by: Gavin Li <[email protected]> * Delete model from LocalModelNode Signed-off-by: Jin Dong <[email protected]> * Cleanup code Signed-off-by: Jin Dong <[email protected]> * Fix lint Signed-off-by: Jin Dong <[email protected]> * Initializer node status map Signed-off-by: Jin Dong <[email protected]> * Update status Signed-off-by: Jin Dong <[email protected]> * Update localmodel node status Signed-off-by: Jin Dong 
<[email protected]> * Remove job dependency from localmodel controller Signed-off-by: Jin Dong <[email protected]> * Remove some unused lines Signed-off-by: Jin Dong <[email protected]> * Add comments Signed-off-by: Jin Dong <[email protected]> * Update manager Signed-off-by: Jin Dong <[email protected]> * Update rbac Signed-off-by: Jin Dong <[email protected]> * Add tests and temporarily remove delete models code Signed-off-by: Jin Dong <[email protected]> * Do not create download jobs if model is already downloaded Signed-off-by: Jin Dong <[email protected]> * remove mislieading log line Signed-off-by: Jin Dong <[email protected]> * Clean up code a little bit Signed-off-by: Jin Dong <[email protected]> * Update configurations Signed-off-by: Jin Dong <[email protected]> * update test Signed-off-by: Jin Dong <[email protected]> * Use a fixed name for the download container Signed-off-by: Jin Dong <[email protected]> --------- Signed-off-by: Gavin Li <[email protected]> Signed-off-by: Jin Dong <[email protected]> Co-authored-by: Gavin Li <[email protected]> * Delete models from local disk when they are not in LocalModelNode spec (#4084) * download working Signed-off-by: Gavin Li <[email protected]> * delete working Signed-off-by: Gavin Li <[email protected]> * Delete model from LocalModelNode Signed-off-by: Jin Dong <[email protected]> * Initializer node status map Signed-off-by: Jin Dong <[email protected]> * Update status Signed-off-by: Jin Dong <[email protected]> * Update localmodel node status Signed-off-by: Jin Dong <[email protected]> * Update manager Signed-off-by: Jin Dong <[email protected]> * Update rbac Signed-off-by: Jin Dong <[email protected]> * Add tests and temporarily remove delete models code Signed-off-by: Jin Dong <[email protected]> * Do not create download jobs if model is already downloaded Signed-off-by: Jin Dong <[email protected]> * Delete function Signed-off-by: Jin Dong <[email protected]> * Update configurations Signed-off-by: Jin Dong 
<[email protected]> * Add test and Fix deletion code Signed-off-by: Jin Dong <[email protected]> * Use a fixed name for the download container Signed-off-by: Jin Dong <[email protected]> * Remove deleted models from status and periodically trigger reconciliation Signed-off-by: Jin Dong <[email protected]> * Fix storagecontainer permissions and a minor change Signed-off-by: Jin Dong <[email protected]> --------- Signed-off-by: Gavin Li <[email protected]> Signed-off-by: Jin Dong <[email protected]> Co-authored-by: Gavin Li <[email protected]> --------- Signed-off-by: Jin Dong <[email protected]> Signed-off-by: Gavin Li <[email protected]> Co-authored-by: Gavin Li <[email protected]> Co-authored-by: Jin Dong <[email protected]> Signed-off-by: Dan Sun <[email protected]> * Update ClusterLocalModel to LocalModelCache (#4105) * Update ClusterLocalModel to LocalModelCache Signed-off-by: Dan Sun <[email protected]> * Fix generation fmt Signed-off-by: Dan Sun <[email protected]> * black fmt Signed-off-by: Dan Sun <[email protected]> * Fix generated code Signed-off-by: Dan Sun <[email protected]> * Run go mod tidy Signed-off-by: Dan Sun <[email protected]> * Fix model status Signed-off-by: Dan Sun <[email protected]> --------- Signed-off-by: Dan Sun <[email protected]> * Fix LocalModelCache controller reconciles deleted resource (#4106) * Fix LocalModel controller reconciles deleted resource Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Rebase Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Fix path base routing e2e workflow Signed-off-by: Sivanantham Chinnaiyan <[email protected]> --------- Signed-off-by: Sivanantham Chinnaiyan <[email protected]> Signed-off-by: Dan Sun <[email protected]> * Add namespace to localmodel and localmodelnode ServiceAccount helm chart (#4111) add localmodelnode agent image Signed-off-by: Rituraj Singh <[email protected]> Co-authored-by: Rituraj Singh <[email protected]> Signed-off-by: Dan Sun <[email protected]> * 
Detect missing models and redownload models (#4095) * another squash Signed-off-by: Jin Dong <[email protected]> * Add JobTTLSecondsAfterFinished option Signed-off-by: Jin Dong <[email protected]> * Update config Signed-off-by: Jin Dong <[email protected]> * Use labels to filter jobs instead of deleting old jobs Signed-off-by: Jin Dong <[email protected]> * Add log in test Signed-off-by: Jin Dong <[email protected]> * Fix test and helm chart Signed-off-by: Jin Dong <[email protected]> * Create a seperate file system utils file Signed-off-by: Jin Dong <[email protected]> * Add comments Signed-off-by: Jin Dong <[email protected]> * Fix status update bug Signed-off-by: Jin Dong <[email protected]> --------- Signed-off-by: Jin Dong <[email protected]> Signed-off-by: Dan Sun <[email protected]> * Allow multiple node groups in the model cache CR (#4134) * Allow multiple node groups in the model cache CR Signed-off-by: Jin Dong <[email protected]> * Fix test --------- Signed-off-by: Jin Dong <[email protected]> Signed-off-by: Dan Sun <[email protected]> * Annotation to disable model cache (#4118) Signed-off-by: Jin Dong <[email protected]> Signed-off-by: Dan Sun <[email protected]> * Clean up jobs in model cache agent (#4140) * Clean up jobs Signed-off-by: Jin Dong <[email protected]> * fix lint Signed-off-by: Jin Dong <[email protected]> * Fix lint Signed-off-by: Jin Dong <[email protected]> * Fix deletion propagation policy Signed-off-by: Jin Dong <[email protected]> * Update test Signed-off-by: Jin Dong <[email protected]> --------- Signed-off-by: Jin Dong <[email protected]> Signed-off-by: Dan Sun <[email protected]> * Ensure Model root folder exists (#4142) Signed-off-by: Jin Dong <[email protected]> Signed-off-by: Dan Sun <[email protected]> * Add NodeGroup Name Into PVC Name (#4141) * Add NodeGroup Name Into PVC Name Signed-off-by: Gavin Li <[email protected]> * Add comment to fix multiple node group Signed-off-by: Dan Sun <[email protected]> * fix openvino 
dependency Signed-off-by: Dan Sun <[email protected]> --------- Signed-off-by: Gavin Li <[email protected]> Signed-off-by: Dan Sun <[email protected]> Co-authored-by: Dan Sun <[email protected]> Signed-off-by: Dan Sun <[email protected]> * Make LocalModel Agent reconcilation frequency configurable (#4143) * Make reconcilation configurable Signed-off-by: Jin Dong <[email protected]> * Fix codegen Signed-off-by: Jin Dong <[email protected]> * Remove a redudant space Signed-off-by: Jin Dong <[email protected]> * Rename config Signed-off-by: Jin Dong <[email protected]> * Fix lint Signed-off-by: Jin Dong <[email protected]> --------- Signed-off-by: Jin Dong <[email protected]> Co-authored-by: Dan Sun <[email protected]> Signed-off-by: Dan Sun <[email protected]> * LocalModelCache Admission Webhook (#4102) * init Signed-off-by: Gavin Li <[email protected]> * broken code Signed-off-by: Gavin Li <[email protected]> * register webhook Signed-off-by: Gavin Li <[email protected]> * rename + working Signed-off-by: Gavin Li <[email protected]> * pass in client Signed-off-by: Gavin Li <[email protected]> * check storageURI Signed-off-by: Gavin Li <[email protected]> --------- Signed-off-by: Gavin Li <[email protected]> Signed-off-by: Dan Sun <[email protected]> * Fix isvc role localmodelcache permission (#4131) * Fix localmodelcache permission for isvc Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Patch localmodelcache webhook for kubeflow overlay Signed-off-by: Sivanantham Chinnaiyan <[email protected]> --------- Signed-off-by: Sivanantham Chinnaiyan <[email protected]> Signed-off-by: Dan Sun <[email protected]> --------- Signed-off-by: Gavin Li <[email protected]> Signed-off-by: Dan Sun <[email protected]> Signed-off-by: Jin Dong <[email protected]> Signed-off-by: Sivanantham Chinnaiyan <[email protected]> Signed-off-by: Rituraj Singh <[email protected]> Co-authored-by: Gavin Li <[email protected]> Co-authored-by: Gavin Li <[email protected]> Co-authored-by: Jin 
Dong <[email protected]> Co-authored-by: Sivanantham <[email protected]> Co-authored-by: Rituraj Singh <[email protected]> Co-authored-by: Rituraj Singh <[email protected]>
sivanantha321 added a commit to sivanantha321/kserve that referenced this pull request on Dec 24, 2024
Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
bentohset pushed a commit to bentohset/kserve that referenced this pull request on Dec 26, 2024
* Allow multiple node groups in the model cache CR Signed-off-by: Jin Dong <[email protected]> * Fix test --------- Signed-off-by: Jin Dong <[email protected]> Signed-off-by: bentohset <[email protected]>
sivanantha321 added a commit to sivanantha321/kserve that referenced this pull request on Jan 6, 2025
Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
yuzisun pushed a commit that referenced this pull request on Jan 11, 2025
* Add client sdk for localmodelcache, localmodelnodegroup Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Add e2e test for modelcache Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Use docker driver and minikube tunnel Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Merge "Allow multiple node groups in the model cache CR (#4134)" Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Try mounting image dir Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Add local model agent to image scan Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Debug Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Create model root directory beforehand Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Restart kserve controller after patch Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Enablepvc direct mount in e2e test Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Reduce pv storage to 1GB Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Update modelcache test Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Update status-check to include modelcache logs Signed-off-by: Sivanantham Chinnaiyan <[email protected]> --------- Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
gavrissh added a commit to gavrissh/kserve that referenced this pull request on Jan 17, 2025
Signed-off-by: Gavrish Prabhu <[email protected]> fix formatting Signed-off-by: Gavrish Prabhu <[email protected]> add vllm to poetry Signed-off-by: Gavrish Prabhu <[email protected]> Add affinity and tolerations to localmodel daemonset (kserve#4173) * Add affinity and tolerations to localmodel daemonset Signed-off-by: Jin Dong <[email protected]> * make generate Signed-off-by: Jin Dong <[email protected]> --------- Signed-off-by: Jin Dong <[email protected]> Fix s3 download PermanentRedirectError for legacy s3 endpoint (kserve#4157) * sets virtual addressing style for legacy s3 endpoint Signed-off-by: bentohset <[email protected]> * add unit test Signed-off-by: bentohset <[email protected]> * fix formatting Signed-off-by: bentohset <[email protected]> * fix unit tests Signed-off-by: bentohset <[email protected]> --------- Signed-off-by: bentohset <[email protected]> Co-authored-by: Lize Cai <[email protected]> Make label and annotation propagation configurable (kserve#4030) * Make label and annotation propagation configurable chore: Make the DisallaowedAnnotations and Labels configurable through ConfigMap so users can configured it quickly. 
fixes kserve#3710 Signed-off-by: Spolti <[email protected]> * generate boilerplate code Signed-off-by: Spolti <[email protected]> * Edgar's review changes Signed-off-by: Spolti <[email protected]> --------- Signed-off-by: Spolti <[email protected]> Add ModelCache e2e test (kserve#4136) * Add client sdk for localmodelcache, localmodelnodegroup Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Add e2e test for modelcache Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Use docker driver and minikube tunnel Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Merge "Allow multiple node groups in the model cache CR (kserve#4134)" Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Try mounting image dir Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Add local model agent to image scan Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Debug Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Create model root directory beforehand Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Restart kserve controller after patch Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Enablepvc direct mount in e2e test Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Reduce pv storage to 1GB Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Update modelcache test Signed-off-by: Sivanantham Chinnaiyan <[email protected]> * Update status-check to include modelcache logs Signed-off-by: Sivanantham Chinnaiyan <[email protected]> --------- Signed-off-by: Sivanantham Chinnaiyan <[email protected]> Update vllm to 0.6.6 (kserve#4176) Signed-off-by: Rajat Vig <[email protected]> Co-authored-by: Dan Sun <[email protected]> [bugfix] fix s3 storage download filename bug (kserve#4162) * [bugfix] fix s3 storage download filename bug - ensure correct path and file name preservation during s3 downloads in storage-initializer Signed-off-by: Jaeyeon Kim <[email protected]> * update lint - fix format 
Signed-off-by: Jaeyeon Kim <[email protected]> * fix format Signed-off-by: Jaeyeon Kim <[email protected]> --------- Signed-off-by: Jaeyeon Kim <[email protected]> update lint fix Signed-off-by: Gavrish Prabhu <[email protected]> update lint fix Signed-off-by: Gavrish Prabhu <[email protected]> update lint fix Signed-off-by: Gavrish Prabhu <[email protected]> openai model test Signed-off-by: Gavrish Prabhu <[email protected]>
What this PR does / why we need it:
First step to fix #4126: this PR just makes the node group a list in the ModelCache resource.
Currently, one big issue is that if a model is cached on one type of GPU node, it cannot be deployed to other types of GPU nodes, because the inference service would try to use the local model since it doesn't know which node it is deployed on.
In the future, KServe might need to know where the model is cached to decide whether to use local models.
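To make the problem above concrete, here is a minimal sketch of the idea, not the actual KServe API. The class ModelCacheSpec, its fields, the function can_use_local_model, the node group names, and the storage URI are all hypothetical names chosen for illustration. It assumes the cache resource carries a list of node groups and that the caller somehow knows which node group a deployment targets, which is exactly the information the description says KServe does not have yet.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelCacheSpec:
    """Hypothetical stand-in for the ModelCache spec; names are illustrative only."""

    source_model_uri: str
    # After this PR the node group is a list rather than a single value,
    # so one cached model can be associated with several kinds of nodes.
    node_groups: List[str] = field(default_factory=list)


def can_use_local_model(cache: ModelCacheSpec, target_node_group: str) -> bool:
    """Return True only if the model is cached for the node group being deployed to."""
    return target_node_group in cache.node_groups


# Example values are made up for this sketch.
cache = ModelCacheSpec(
    source_model_uri="s3://example-bucket/models/example-model",
    node_groups=["gpu-a100", "gpu-h100"],
)

print(can_use_local_model(cache, "gpu-h100"))
print(can_use_local_model(cache, "gpu-t4"))
```

With a single node group, a deployment to any other group would either be blocked or would wrongly assume a local copy exists. A list lets the same model be cached for several groups, and a membership check like the one above is one possible way a later change could decide whether to fall back to downloading the model.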
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4126
Re-running failed tests
/rerun-all - rerun all failed workflows.
/rerun-workflow <workflow name> - rerun a specific failed workflow. Only one workflow name can be specified. Multiple /rerun-workflow commands are allowed per comment.