[WIP] Add GPU Optimizer deployment and update configurations #480

nwangfw · 2024-12-04T16:35:28Z

Pull Request Description

In https://github.com/aibrix/aibrix/pull/430/files, most of the components were dockerized but not yet moved to a Kubernetes environment. This PR moves them under the config/default scope and makes sure they can be installed along with the other aibrix components.

Related Issues

Resolves: #459

Important: Before submitting, please complete the description above and review the checklist below.

Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

[Bug]: Corrections to existing functionality
[CI]: Changes to build process or CI pipeline
[Docs]: Updates or additions to documentation
[API]: Modifications to aibrix's API or interface
[CLI]: Changes or additions to the Command Line Interface
[Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

PR title includes appropriate prefix(es)
Changes are clearly explained in the PR description
New and existing tests pass successfully
Code adheres to project style and best practices
Documentation updated to reflect changes (if applicable)
Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

Jeffwan · 2024-12-04T17:26:57Z

config/gpu-optimizer/deployment.yaml

+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: pod-autoscaler


The name should be changed to something related to gpu-optimizer.

Jeffwan · 2024-12-04T17:28:08Z

config/gpu-optimizer/deployment.yaml

+kind: Role
+metadata:
+  namespace: aibrix-system
+  name: deployment-reader


Similarly, the names here should be updated.
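A minimal sketch of what the renamed RBAC objects might look like, for illustration only. The names `gpu-optimizer-sa`, `gpu-optimizer-deployment-reader`, and the binding name are hypothetical placeholders, and the read-only rule on Deployments is an assumption inferred from the original `deployment-reader` name, not taken from the PR diff.

```yaml
# Sketch only: every name below is a hypothetical placeholder.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-optimizer-sa
  namespace: aibrix-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-optimizer-deployment-reader
  namespace: aibrix-system
rules:
  # Assumed rule: read-only access to Deployments in this namespace.
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-optimizer-deployment-reader-binding
  namespace: aibrix-system
subjects:
  - kind: ServiceAccount
    name: gpu-optimizer-sa
    namespace: aibrix-system
roleRef:
  kind: Role
  name: gpu-optimizer-deployment-reader
  apiGroup: rbac.authorization.k8s.io
```

Prefixing each object with the component name keeps it from colliding with, or being mistaken for, the pod autoscaler's own RBAC objects.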

Jeffwan · 2024-12-04T17:29:42Z

development/simulator/deployment-a100.yaml

  maxReplicas: 10
  targetMetric: "avg_prompt_throughput_toks_per_s" # Ignore if metricsSources is configured
  metricsSources: 
-    - endpoint: gpu-optimizer.aibrix-system.svc.cluster.local:8080


These files are being refactored in #477. We can probably change this later.

Jeffwan · 2024-12-04T17:31:47Z

config/gpu-optimizer/deployment.yaml

@@ -0,0 +1,75 @@
+apiVersion: v1


Can we break this into more granular files?

deployment.yaml

rbac.yaml

service.yaml
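If the directory is assembled with Kustomize (an assumption based on the config/default layout mentioned in the description), the split could be tied together with a small kustomization file. The file path in the comment is hypothetical.

```yaml
# config/gpu-optimizer/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml   # Deployment for the gpu-optimizer container
  - rbac.yaml         # ServiceAccount, Role, RoleBinding
  - service.yaml      # Service exposing port 8080
```

With one resource kind per file, later review comments on RBAC or networking land on a smaller diff.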

Jeffwan · 2024-12-04T17:32:45Z

config/gpu-optimizer/deployment.yaml

+    - protocol: TCP
+      port: 8080
+      targetPort: 8080
+      nodePort: 30008


Does it need a nodePort?
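For reference, a sketch of the same Service without a nodePort, assuming the optimizer only needs to be reachable from inside the cluster (for example at gpu-optimizer.aibrix-system.svc.cluster.local:8080). The `app: gpu-optimizer` selector label is an assumption, not taken from the diff.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: gpu-optimizer
  namespace: aibrix-system
spec:
  type: ClusterIP            # in-cluster only; no nodePort is allocated
  selector:
    app: gpu-optimizer       # assumed pod label
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
```

A NodePort Service would only be needed if something outside the cluster has to reach the optimizer directly.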

Jeffwan · 2024-12-04T17:33:37Z

config/gpu-optimizer/deployment.yaml

+      containers:
+      - name: gpu-optimizer
+        image: aibrix/runtime:nightly
+        command: ["python", "-m", "aibrix.gpu_optimizer.app"]


If the server is down, what is the autoscaler's behavior? Have you tested this case?
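Independently of how the autoscaler reacts, the time the server spends down could be shortened with probes on the container. The fragment below is a sketch under the assumption that the app listens on port 8080 (taken from the Service's targetPort). It uses TCP probes because the source does not confirm that an HTTP health endpoint exists, and the timing values are illustrative.

```yaml
      containers:
      - name: gpu-optimizer
        image: aibrix/runtime:nightly
        command: ["python", "-m", "aibrix.gpu_optimizer.app"]
        ports:
        - containerPort: 8080      # assumed listening port
        readinessProbe:            # removes the pod from Service endpoints while unreachable
          tcpSocket:
            port: 8080
          periodSeconds: 10
        livenessProbe:             # restarts the container if the port stops accepting connections
          tcpSocket:
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
```

This does not answer what the autoscaler does when the metrics source is unavailable. That still needs a test.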

* Move huggingface_token to config.json. Add missing zscaler root CA to image for huggingface lib to download tokenizer model successfully.
* Remove huggingface token

Co-authored-by: Jingyuan Zhang

* adding timestamp and prompt in/output length to traces
* name fix; plotting script fix
* update README
* addressing comments
* addressing comments
* add sample workload
* add sample workload
* update file format
* update jsonl option

Co-authored-by: Le Xu


zhangjyr · 2024-12-05T23:39:47Z

@nwangfw I fixed the k8s access problem on branch issues/484_Controller_failed_to_fetch_metrics_from_MetricSource. I think we should merge the changes together.


zhangjyr · 2024-12-06T00:28:59Z

Sorry, it looks like this PR includes merged changes from main. Maybe I should start another PR for a clearer view.

Jeffwan · 2024-12-06T18:16:25Z

Since #494 has been merged, let's close this PR.

nwangfw requested review from zhangjyr and Jeffwan December 4, 2024 16:35

Jeffwan reviewed Dec 4, 2024


Add GPU Optimizer deployment and update configurations

90cd690

nwangfw force-pushed the gpu-optimizer-orchestration branch from c881f59 to 90cd690 Compare December 4, 2024 21:14

nwangfw linked an issue Dec 5, 2024 that may be closed by this pull request

Controller failed to fetch metrics from MetricSource #484

Closed

zhangjyr and others added 3 commits December 5, 2024 11:18

Fix k8s accessibility regard namespaces. GPU optimizer now monitor all namespaces with model label.

d2be10a

Merge branch 'gpu-optimizer-orchestration' into issues/484_Controller_failed_to_fetch_metrics_from_MetricSource

e544c12

# Conflicts: # development/simulator/deployment-a100.yaml # development/simulator/deployment-a40.yaml

zhangjyr mentioned this pull request Dec 6, 2024

[feat] Integrate deployment configurations and fix autoscaler/gpu optimizer connectivity #492

Merged

Jeffwan closed this Dec 6, 2024

zhangjyr mentioned this pull request Dec 6, 2024

[Feat] Integrate deployment configurations and fix autoscaler/gpu optimizer connectivity #500

Merged
