Automatically generated release for tag v0.2.0.
🚀 New Feature Highlights
- Distributed KV Cache: Implemented support for managing KV cache across multiple nodes, enhancing performance. (#583)
- Cost-Driven Heterogeneous Serving: Improved scheduling and inference strategies for mixed GPU environments, optimizing cost and resource utilization. (#371, #430, #509, #554, #598)
- Optimizer Based Autoscaling: Leverages offline profiles of the inference server to calculate the number of replicas. (#430, #500, #692, #508)
- Prefix Cache Aware Routing: Added support for routing decisions based on prefix cache hits, improving inference efficiency. (#641, #657)
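As a rough illustration of the prefix cache aware routing idea in the last bullet, the sketch below sends a request to the pod that already holds the longest matching prefix and falls back to a random pod when nothing matches. It is not the AIBrix implementation: the block size, the chained-hash scheme, and the names `prefix_block_hashes`, `count_prefix_hits`, and `route` are all assumptions made for this example.

```python
import hashlib
import random

BLOCK_SIZE = 4  # tokens per cache block; hypothetical value for illustration


def prefix_block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block, chaining in the previous hash so that
    every hash identifies the entire prefix up to that block."""
    hashes = []
    prev = b""
    usable = len(tokens) - len(tokens) % block_size  # ignore the partial tail
    for start in range(0, usable, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(digest)
        prev = digest
    return hashes


def count_prefix_hits(request_hashes, pod_cache):
    """Count leading blocks of the request that the pod has already seen."""
    hits = 0
    for h in request_hashes:
        if h not in pod_cache:
            break
        hits += 1
    return hits


def route(tokens, pod_caches, rng=random):
    """Pick the pod with the most prefix hits; fall back to a random pod.

    pod_caches maps a pod name to the set of block hashes it has served
    (a stand-in for a real prefix cache index).
    """
    request_hashes = prefix_block_hashes(tokens)
    best_pod, best_hits = None, 0
    for pod, cache in pod_caches.items():
        hits = count_prefix_hits(request_hashes, cache)
        if hits > best_hits:
            best_pod, best_hits = pod, hits
    if best_pod is None:
        best_pod = rng.choice(sorted(pod_caches))  # no pod has a matching prefix
    pod_caches[best_pod].update(request_hashes)  # record what the pod now holds
    return best_pod


if __name__ == "__main__":
    caches = {"pod-a": set(), "pod-b": set()}
    first = route([1, 2, 3, 4, 5, 6, 7, 8], caches)
    second = route([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], caches)
    print(first == second)
```

A request that shares a prefix with an earlier one lands on the same pod, which is what lets the inference engine reuse cached work for that prefix.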
📊 Feature Enhancements
- LoRA Scheduling Enhancements: Introduced multiple scheduling strategies, including bin packing, least latency, least throughput, and random. (#544)
- Gateway Enhancements: Enabled streaming in the Envoy gateway for more efficient request handling. (#377) Improved handling of wrong model names and cache inconsistency. (#542) Added a random-selection fallback for routing so that requests can still be allocated. (#445) Fixed cache store retrieval for least-kv-cache routing. (#639) Fixed a gateway startup failure caused by a missing Prometheus config. (#441)
- PodAutoscaler Scaling Improvements: Fixed scaling logic for edge cases such as zero pods, and added a more thorough KPA test case. (#508, #515)
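The four LoRA scheduling strategies named above can be pictured as interchangeable selection functions over per-pod statistics. The sketch below is only an illustration: the `PodStats` fields, the meaning given to bin packing (prefer the fullest pod that still has room), and all function names are assumptions for this example, not the scheduler added in #544.

```python
import random
from dataclasses import dataclass


@dataclass
class PodStats:
    """Hypothetical per-pod statistics used only for this sketch."""
    name: str
    loaded_adapters: int  # LoRA adapters currently loaded on the pod
    capacity: int         # maximum adapters the pod can hold
    latency_ms: float     # recent average request latency
    throughput: float     # recent throughput, e.g. tokens per second


def bin_pack(pods):
    """Assumed meaning: fill the fullest pod that still has free capacity."""
    candidates = [p for p in pods if p.loaded_adapters < p.capacity]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.loaded_adapters)


def least_latency(pods):
    """Pick the pod with the lowest recent latency."""
    return min(pods, key=lambda p: p.latency_ms) if pods else None


def least_throughput(pods):
    """Pick the pod currently doing the least work."""
    return min(pods, key=lambda p: p.throughput) if pods else None


def pick_random(pods, rng=random):
    """Pick any pod uniformly at random."""
    return rng.choice(pods) if pods else None


STRATEGIES = {
    "bin_pack": bin_pack,
    "least_latency": least_latency,
    "least_throughput": least_throughput,
    "random": pick_random,
}


def schedule(strategy, pods):
    """Dispatch to the named strategy; raises KeyError for unknown names."""
    return STRATEGIES[strategy](pods)


if __name__ == "__main__":
    pods = [
        PodStats("a", 1, 2, 30.0, 100.0),
        PodStats("b", 2, 2, 10.0, 50.0),
        PodStats("c", 0, 2, 20.0, 200.0),
    ]
    for name in STRATEGIES:
        chosen = schedule(name, pods)
        print(name, chosen.name if chosen else None)
```

Keeping each strategy as a plain function behind one lookup table makes it cheap to add or swap strategies without touching the caller.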
🛠 Infrastructure & CI/CD Upgrades
- Parallelized Build Tasks: Improved CI efficiency by running builds in parallel. (#398)
- CrashLoopBackOff Detection in CI: Added monitoring for pod failures in testing workflows. (#444)
- Improved GitHub Actions Cost Efficiency: Optimized triggers and removed unnecessary nightly builds. (#411, #422)
- Integration Tests for Core Components: Added integration tests for autoscalers, routing policies, and deployment configurations. (#616, #620)
What's Changed
- Add envoy gateway streaming support by @varungup90 in #377
- Add client traffic policy to increase per connection buffer size from 32kb to 256kb by @varungup90 in #395
- Misc: add support to metricsSources property of podautoscaler by @zhangjyr in #371
- [Misc] Update runtime server startup command in v0.1.0 by @brosoul in #396
- [CI] improve the ci efficiency by parallelizing the build tasks by @nwangfw in #398
- Fix the ticker interval by removing unnecessary ms by @Jeffwan in #415
- [Misc] Disable specific endpoints logs by @Jeffwan in #418
- [CI] Github Action trigger condition optimized for cost saving by @nwangfw in #411
- [Misc] Fix the mocked app role permission issue by @Jeffwan in #416
- [CI] Nightly tag removed for release branch by @nwangfw in #422
- Enable setting PodAutoscaler configuration via YAML labels by @kr11 in #409
- Update manifest to adopt v0.1.1 images by @Jeffwan in #429
- [Bug]: duplicated http in rest metrics fetcher (#408) by @zhangjyr in #421
- [MISC]: Improve Request Trace Granularity with Version Control by @zhangjyr in #431
- Support histogram metrics from engine in cache by @Jeffwan in #424
- Support fetching metrics from remote Prometheus server by @Jeffwan in #433
- [CI] Add python wheel to release artifact by @Jeffwan in #434
- Fix update cache pod issue and refactor updatePod handler by @Jeffwan in #439
- Extract common metrics structure to types and utils by @Jeffwan in #438
- Fix gateway startup issue due to missing prometheus config by @Jeffwan in #441
- [feat]: GPU Optimizer and Simulator development app by @zhangjyr in #430
- Add selectrandom fallback in routing and only scraping healthy pods by @Jeffwan in #445
- AIBrix Workload Generator / Scenario Simulator by @happyandslow in #428
- CrashLoopBackOff status detection in CI by @nwangfw in #444
- Support installing individual controllers from giant controller-manager by @nwangfw in #442
- Refactor Scaler: Resolve Issues with Metric Parameter Updates in Multiple KPAs by @kr11 in #437
- Support metrics multi labels for different models by @brosoul in #450
- Add health check api interface for runtime by @Jeffwan in #451
- Fix the service name override issue in rolebindings by @Jeffwan in #453
- Reorganize docs/development and docs/tutorial structure by @Jeffwan in #455
- Move tools to separate folders and update mocked app README.md by @Jeffwan in #457
- Fix multi models metric result in PromQL by @brosoul in #458
- Support Azure LLM trace in workload generator by @happyandslow in #462
- Fix autoscaler scalingstrategy switching logic by @nwangfw in #475
- Fix missing handle of PromQL scope is PodMetricScope by @brosoul in #479
- [Misc] Consolidate app and simulator by @zhangjyr in #477
- [Bug] Avoid including sensitive info in Dockerfile ENV by @zhangjyr in #487
- Refactor generator to generate time-based traces by @happyandslow in #478
- [CI] Update deploy workload script in installation test by @nwangfw in #499
- [Bug] handle metricKey creation with MetricsSources by @nwangfw in #498
- Adding Client for Workload Generator Workload File by @happyandslow in #501
- [Feat] Integrate deployment configurations and fix autoscaler/gpu optimizer connectivity by @zhangjyr in #500
- Fix some simulator format issue and add some TODOs by @Jeffwan in #505
- [Bug] Fix the way how podautoscaler handle 0 pods. by @zhangjyr in #508
- [Misc] Improve gpu optimizer debugging on podautoscaler. by @zhangjyr in #509
- Optimize kustomize overlay for volcano engine deployment by @Jeffwan in #512
- [perf] Refact tos downloader in Runtime by @brosoul in #510
- Refactor metric source for customized protocol, port and path by @kr11 in #511
- [Bug] Fixed the yaml of deployments in heterogenous GPU settings to make KPA scaling work as expected. by @zhangjyr in #513
- [Misc] Heterogeneous GPU Optimizer Logging Clean Up by @nwangfw in #514
- Fix KPA bug, and an elaborate KPA test case by @kr11 in #515
- Cut v0.2.0-rc.1 release by @Jeffwan in #516
- [Bug] Accumulated bug fix on controller manager, mock app configuration, and gpu optimizer. by @zhangjyr in #522
- [Misc] Reduced runtime's container image size by @nwangfw in #518
- clean memory scaler object when pa crd is deleted by @kr11 in #520
- Configure autoscaler http client to skip certificate check by @Jeffwan in #530
- [Doc] Update aibrix documentation by @Jeffwan in #533
- Refactor the gateway-plugin and metadata service manifests by @Jeffwan in #531
- Fix the GITHUB_WORKSPACE artifact sharing issue in release workflow by @Jeffwan in #532
- [Misc] Polish the benchmark scripts by @Jeffwan in #525
- Fix APA bugs in creation, add test and demo yaml by @kr11 in #536
- Add VKE IPv4 Testing Cluster Config by @nwangfw in #537
- Support for request length internal trace by @happyandslow in #538
- [Feat] Add download status into runtime downloader by @brosoul in #539
- [Feat] Add runtime model management api by @brosoul in #540
- [gateway] handle the wrong model name and cache inconsistency case by @Jeffwan in #542
- [Docs] fix: update the parameters instruction in readme by @scarlet25151 in #548
- add lora schedulers - bin pack, least latency, least throughput, random by @Aspirin96 in #544
- add request routers - least kv cache, least expected latency by @Aspirin96 in #543
- [Docs] heterogenous gpu docs added by @nwangfw in #545
- Fix race condition in cache by @varungup90 in #550
- Fix pod internal cache delete handling by @varungup90 in #552
- Handle terminating pod for request routing by @varungup90 in #549
- Support absolute path as lora adapter artifact path by @Jeffwan in #556
- Deadlock fix for cache by @varungup90 in #557
- Mock app log fix for missing metrics warning by @varungup90 in #564
- Add vllm graceful termination configuration by @nwangfw in #568
- Enhance dynamic lora adapter support for auth enabled scenario by @Jeffwan in #571
- Update pyproject.toml to support python 3.12 by @Jeffwan in #579
- [Docs ]Update ai runtime management api and downloader docs by @Jeffwan in #577
- Check the HPA ownerReference in request enqueue by @Jeffwan in #582
- Add request length for traces by @happyandslow in #569
- Support model registration flow using aibrix runtime api by @Jeffwan in #580
- Gateway plugin report total incoming requests and pending requests by @zhangjyr in #554
- Support distributed kv cache orchestration by @Jeffwan in #583
- Grant workflow action permission to write packages by @Jeffwan in #586
- Update routers to use GetPodModelMetric api and misc cleanup in metri… by @varungup90 in #590
- Update upload/download artifact github actions version to v4 by @varungup90 in #591
- Update version in aibrix/python to 0.2.0-rc.2 by @varungup90 in #594
- Update image names in sync-image script by @varungup90 in #595
- Update dependency chart for release pipeline by @varungup90 in #597
- Patch release for older vllm engine lora support in gateway plugins by @varungup90 in #599
- Update component names in staging deployment and readme for new relea… by @varungup90 in #605
- Fix the PodAutoscaler kind typo by @Jeffwan in #610
- Improve condition update and fix multiple endpoint ips issue by @Jeffwan in #609
- Check if model name is present in response from inference engine by @varungup90 in #611
- Update log level for few messages in PodAutoscaler by @varungup90 in #612
- [enhancement] GPU optimizer accumulated fix by @zhangjyr in #598
- Update manifest to use v0.2.0-rc.2 tag by @Jeffwan in #614
- Add framework to setup integration test by @varungup90 in #616
- [docs] Update lora model adapter docs by @Jeffwan in #618
- [docs] Update AI Engine Runtime and Fleet docs by @Jeffwan in #619
- [Doc] update quickstart tutorial and add example sending requests via gatew… by @nwangfw in #621
- [Doc] feature description for distributed kv cache by @DwyaneShi in #623
- WIP: Add docs gateway plugin by @varungup90 in #624
- [Docs] Update GPU Optimizer documentation by @zhangjyr in #622
- Add integration test to CI workflow by @varungup90 in #620
- [Docs] Updated autoscaling doc by @gangmuk in #625
- Filter active pods before metrics calculation by @Jeffwan in #629
- Fix some issues in the docs and polish contents by @Jeffwan in #630
- Ignore Jupyter notebooks for GitHub Linguist by @Jeffwan in #632
- [Docs] Improving the heterogenous-GPU feature doc by @nwangfw in #634
- [Doc] Fixed autoscaling doc by @gangmuk in #635
- Fix out of space error in running integration test github workflow by @varungup90 in #628
- Use AIBRIX_POD_METRIC_REFRESH_INTERVAL_MS=50 in base configs by @Jeffwan in #640
- Fix the least-kv-cache cache store retrieval by @Jeffwan in #639
- Add prefix cache aware routing by @varungup90 in #641
- [misc] Polish gateway code with better structure by @Jeffwan in #645
- Create AIBrix Single-Node Deployment on Lambda scripts by @Jeffwan in #659
- End-to-end benchmark pipeline for autoscalers and routing policies by @gangmuk in #650
- Clean up scripts under hack folder by @Jeffwan in #660
- Add a research section, update architecture and lambda guidance by @Jeffwan in #663
- Leverage literalinclude to keep only one code copy and move autoscaler configs to annotations by @Jeffwan in #665
- Updated scripts and fixed issues in benchmark/autoscaling by @gangmuk in #662
- Benchmark Generator Refactoring by @happyandslow in #655
- Add interface for prefix cache indexer by @varungup90 in #657
- Fix missing file to generator refactoring by @happyandslow in #670
- [Bug] GPU optimizer bug fix and document fix by @zhangjyr in #656
- Change error response to json and improve e2e stability by @Jeffwan in #669
- Use response buffer to address stream request issue by @Jeffwan in #679
- [docs] Polish feature examples and user guidances by @Jeffwan in #686
- Update version and tags to v0.2.0 by @Jeffwan in #687
- fix api scheme by @kerthcet in #674
- [docs] Polish distributed inference and kv cache examples by @Jeffwan in #691
- Improve lora autoscaling and kvcache examples by @Jeffwan in #697
- [Docs] Add optimizer-based autoscaling doc and examples by @nwangfw in #692
- Add cpu/memory resources for control plane components by @varungup90 in #702
- Update log config for sample deployments by @varungup90 in #704
- [Docs] Add feature description of dist kv cache in README by @DwyaneShi in #705
- [Docs] Update README.md by @Jeffwan in #706
- Add feature description for heterogeneous gpu inference feature by @nwangfw in #707
- Bump py version to 0.2.0.post1 by @Jeffwan in #708
- Fix wrong path for generated html by @kerthcet in #709
New Contributors
- @scarlet25151 made their first contribution in #548
- @Aspirin96 made their first contribution in #544
- @DwyaneShi made their first contribution in #623
- @gangmuk made their first contribution in #625
- @kerthcet made their first contribution in #674
Full Changelog: v0.1.0...v0.2.0