Improve model adapter reliability and stability #257

Jeffwan · 2024-09-30T17:05:28Z

Pull Request Description

Standardize pod label names
Add predicates to filter out unrelated pods
Enqueue the model adapter object from pod changes
Fix model adapter bug after removing base model
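
As a rough illustration of the predicate and enqueue items above, the sketch below filters pod events by label and maps a relevant pod to the adapters that should be reconciled. It is not the controller code from this PR: the `Pod` struct is a simplified stand-in for the Kubernetes pod type, and the label key `model.aibrix.ai/name`, the `isRelevantPod` and `adaptersForPod` helpers, and the adapter index are all assumptions made for the example.

```go
package main

import "fmt"

// Pod is a simplified stand-in for the Kubernetes pod type; only the
// fields this sketch needs are modeled.
type Pod struct {
	Name   string
	Labels map[string]string
}

// modelNameLabel is a hypothetical label key used for illustration only.
const modelNameLabel = "model.aibrix.ai/name"

// isRelevantPod plays the role of a predicate: it reports whether a pod
// event should reach the reconciler at all.
func isRelevantPod(p Pod) bool {
	_, ok := p.Labels[modelNameLabel]
	return ok
}

// adaptersForPod maps a pod event to the adapter names to enqueue.
// adaptersByBaseModel is a hypothetical index from base model name to
// the adapters that depend on it.
func adaptersForPod(p Pod, adaptersByBaseModel map[string][]string) []string {
	if !isRelevantPod(p) {
		return nil // unrelated pod: nothing to enqueue
	}
	return adaptersByBaseModel[p.Labels[modelNameLabel]]
}

func main() {
	index := map[string][]string{"llama2-7b": {"lora-a", "lora-b"}}
	modelPod := Pod{Name: "llama-0", Labels: map[string]string{modelNameLabel: "llama2-7b"}}
	otherPod := Pod{Name: "redis-0", Labels: map[string]string{"app": "redis"}}

	fmt.Println(adaptersForPod(modelPod, index))
	fmt.Println(adaptersForPod(otherPod, index))
}
```

Filtering before the mapping step keeps events from unrelated pods out of the work queue, so only adapters whose base model pod changed are reconciled.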

Related Issues

Resolves:


ModelAdapter will take the response

config/gateway/kustomization.yaml

Makefile

varungup90 · 2024-09-30T17:47:18Z

pkg/controller/modeladapter/modeladapter_controller.go

@@ -296,6 +351,14 @@ func (r *ModelAdapterReconciler) DoReconcile(ctx context.Context, req ctrl.Reque
 				return ctrl.Result{}, r.clearModelAdapterInstanceList(ctx, instance)
 			}

+			if !utils.IsPodReady(selectedPod) || utils.IsPodTerminating(selectedPod) {


Should we have a wait period or retry count? Right now, if pods are going through a rolling restart, the adapters will crash-loop with them.

For example, during a rolling restart of pod A, all adapters are scheduled on pod B; then pod B restarts and the adapters move to pod C.

> Right now, if pods are going through a rolling restart, the adapters will crash-loop with them.

Once the pod becomes terminating, the controller enqueues the adapter; here it just removes the LoRA from the instance list.

> Should we have a wait period or retry count? For example, during a rolling restart of pod A, all adapters are scheduled on pod B; then pod B restarts and the adapters move to pod C.

This is expected behavior with this change. Which operation do you mean the wait period or retry count for?

varungup90 · 2024-09-30T18:00:18Z

Overall looks good to me. Two things: 1) Should there be a wait time or retry count before moving an adapter to another pod when its pod is not ready? 2) When a pod is deleted, the entire instance list is invalidated for that model adapter; should it invalidate only that pod or the entire list?

Since the PR is pretty big, these can also be addressed separately.

Jeffwan · 2024-09-30T18:09:46Z

> Overall looks good to me. Two things: 1) Should there be a wait time or retry count before moving an adapter to another pod when its pod is not ready?

Got your point now. If the adapter fails to find a target, it goes into pending status. When a new pod becomes ready, all adapters are enqueued; the logic is now similar to the pod scheduler's.

> When a pod is deleted, the entire instance list is invalidated for that model adapter; should it invalidate only that pod or the entire list?

Yeah, this is a good point. Right now it only supports a single LoRA replica. Once #205 is supported, it will address this issue and remove only the deleted or expired instance from the list.
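
For reference, removing only the affected entry could look roughly like the sketch below. This is not what the PR does today (it clears the whole list) and not the design planned for #205; `removeInstance` and the plain string slice of pod names are assumptions made for illustration.

```go
package main

import "fmt"

// removeInstance returns a copy of instances without podName, leaving the
// other entries in their original order. The input slice is not modified.
func removeInstance(instances []string, podName string) []string {
	kept := make([]string, 0, len(instances))
	for _, name := range instances {
		if name != podName {
			kept = append(kept, name)
		}
	}
	return kept
}

func main() {
	instances := []string{"pod-a", "pod-b", "pod-c"}
	fmt.Println(removeInstance(instances, "pod-b"))
}
```

Returning a fresh slice instead of editing in place avoids surprising any caller that still holds the original list.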

Jeffwan · 2024-09-30T18:21:04Z

Talked with @varungup90 offline; created issue #258 to track the first comment.

* Standarize pod labels and filter out unrelated pod
* Enqueue the model adapter object from pod changes
* Remove the base model deletion bug

ModelAdapter will take the response

Jeffwan added 3 commits September 29, 2024 10:44

Standarize pod labels and filter out unrelated pod

28603a0

Enqueue the model adapter object from pod changes

efd9fdf

Remove the base model deletion bug

763fead

ModelAdapter will take the response

This was referenced Sep 30, 2024

lora model not load successfully but the ModelAdapter status is running #235

Closed

Lora model will lost when LLM pod restart #236

Closed

Model adapter is unloaded after sometime #214

Closed

Jeffwan commented Sep 30, 2024

config/gateway/kustomization.yaml (resolved)

Jeffwan commented Sep 30, 2024

Makefile (resolved)

varungup90 reviewed Sep 30, 2024

varungup90 approved these changes Sep 30, 2024

Jeffwan mentioned this pull request Sep 30, 2024

Introduce wait time or retry before moving adapter to another pod if that pod is not ready #258

Open

Jeffwan merged commit b8bbe6d into main Sep 30, 2024
10 checks passed

Jeffwan deleted the jiaxin/bug-bash-improvement branch September 30, 2024 18:21
