Improve model adapter reliability and stability #257
ModelAdapter will take the response
@@ -296,6 +351,14 @@ func (r *ModelAdapterReconciler) DoReconcile(ctx context.Context, req ctrl.Reque
return ctrl.Result{}, r.clearModelAdapterInstanceList(ctx, instance)
}

if !utils.IsPodReady(selectedPod) || utils.IsPodTerminating(selectedPod) {
Should we have a wait period or a retry count? Right now, if pods are going through a rolling restart, the adapters will crash-loop along with them.
For example, in a rolling restart, when pod A restarts, all adapters are scheduled on pod B; then pod B restarts and the adapters move to pod C.
> Right now, if pods are going through a rolling restart, the adapters will crash-loop along with them.
Once a pod becomes terminating, the controller enqueues the adapter; at this point it just removes the LoRA from the instance list.
> Should we have a wait period or a retry count? For example, in a rolling restart, when pod A restarts, all adapters are scheduled on pod B; then pod B restarts and the adapters move to pod C.
This is the expected behavior with this change. Which operation do you mean the wait period or retry count for?
Overall looks good to me. Two things: 1) Should there be a wait time or retry count before moving an adapter to another pod when its current pod is not ready? 2) When a pod is deleted, the entire instance list is invalidated for that model adapter; should it invalidate only that pod rather than the entire list? Since the PR is pretty big, these can also be addressed separately.
Got your point now. If the adapter fails to find a target, it goes into pending status. When a new pod becomes ready, the controller enqueues all adapters; the logic is similar to the pod scheduler's.
Yeah, this is a good point. Right now it only supports a single LoRA replica; once #205 is supported, it will address this issue and only remove the deleted or expired instance from the list.
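For reference, removing only the deleted or expired instance could look roughly like the sketch below. It is an illustration rather than the planned implementation: `removeInstance` is a hypothetical helper, and it assumes the instance list is a plain slice of pod names.

```go
package main

// removeInstance returns a copy of the instance list without the given pod.
// Hypothetical helper: it assumes the list holds pod names as strings and
// leaves every other entry untouched and in its original order.
func removeInstance(instances []string, podName string) []string {
	kept := make([]string, 0, len(instances))
	for _, name := range instances {
		if name != podName {
			kept = append(kept, name)
		}
	}
	return kept
}
```

The helper builds a new slice instead of editing in place, so a caller that still holds the old list does not see it change underneath it.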
Talked with @varungup90 offline and created issue #258 to track the first comment.
* Standardize pod labels and filter out unrelated pods
* Enqueue the model adapter object from pod changes
* Remove the base model deletion bug
ModelAdapter will take the response
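The label-based filtering named in the commits above can be sketched as a plain function over a pod's labels. This is an illustration only: the label key `example.com/model-adapter` and the function `isRelevantPod` are hypothetical, and the actual key standardized in this PR is not shown in the conversation.

```go
package main

// adapterLabelKey is a hypothetical label key used only for this sketch.
// The label actually standardized by the PR is not shown in the conversation.
const adapterLabelKey = "example.com/model-adapter"

// isRelevantPod reports whether a pod with the given labels should be
// watched by the model adapter controller. Pods without the label, or with
// any value other than "true", are treated as unrelated and skipped.
func isRelevantPod(labels map[string]string) bool {
	return labels[adapterLabelKey] == "true"
}
```

In a controller-runtime based controller, a check like this is typically wrapped with `predicate.NewPredicateFuncs`, so that events for unrelated pods never reach the reconcile queue.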
Pull Request Description
predicates to filter out unrelated pods
Related Issues
Resolves:
Important: Before submitting, please complete the description above and review the checklist below.
Contribution Guidelines (Expand for Details)
We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:
Pull Request Title Format
Your PR title should start with one of these prefixes to indicate the nature of the change:
[Bug]: Corrections to existing functionality
[CI]: Changes to build process or CI pipeline
[Docs]: Updates or additions to documentation
[API]: Modifications to aibrix's API or interface
[CLI]: Changes or additions to the Command Line Interface
[Misc]: For changes not covered above (use sparingly)
Note: For changes spanning multiple categories, use multiple prefixes in order of importance.
Submission Checklist
By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.