Prefix and load aware routing with radix tree kv cache #719

Merged
merged 1 commit into main on Feb 22, 2025

Conversation

gangmuk
Collaborator

@gangmuk gangmuk commented Feb 20, 2025

NOTE: PR #678 should be merged first. Will remove [WIP] in the PR title once PR #678 is merged.

This PR includes a new routing algorithm in the gateway. It considers prefix matching and load together when routing requests.
At a high level, the algorithm is:

  1. Find the longest matching prefix for the incoming request in the radix tree.
  2. Check whether the matched prefix covers a significant number of tokens. For now, the threshold for a significant prefix match is 50%; in other words, the match is significant if the matched token length is greater than 50% of the input token length.
  3. If the matched prefix is significant, do prefix-aware routing: route to the pod that has the longest prefix match.
  4. If the matched prefix is not significant, do more sophisticated load-aware routing. It has its own way of calculating load: it estimates the prefill latency and decode latency of each pod based on hardware, model, and token length. (A sketch of this decision logic follows the list.)
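
A minimal, self-contained sketch of this two-branch decision (illustrative only; the `pod` struct, field names, and load values below are hypothetical stand-ins, not the gateway's actual types):

```go
package main

import "fmt"

// Sketch of the two-branch routing decision described above.
// All identifiers here are illustrative, not the actual gateway code.
const prefixMatchThreshold = 0.5 // significant if match covers >50% of input tokens

type pod struct {
	name          string
	matchedPrefix int     // tokens of this request already cached on the pod
	load          float64 // estimated prefill+decode latency for this request
}

func route(inputLen int, pods []pod) pod {
	// Find the pod with the longest prefix match.
	best := pods[0]
	for _, p := range pods[1:] {
		if p.matchedPrefix > best.matchedPrefix {
			best = p
		}
	}
	if float64(best.matchedPrefix)/float64(inputLen) > prefixMatchThreshold {
		return best // prefix-aware branch: longest prefix match wins
	}
	// Load-aware branch: pick the pod with the lowest estimated latency.
	least := pods[0]
	for _, p := range pods[1:] {
		if p.load < least.load {
			least = p
		}
	}
	return least
}

func main() {
	pods := []pod{
		{name: "pod-a", matchedPrefix: 120, load: 0.9},
		{name: "pod-b", matchedPrefix: 30, load: 0.4},
	}
	fmt.Println(route(200, pods).name) // 120/200 = 60% > 50%, so pod-a
}
```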

This implementation is based on the existing prefix-aware routing work, Preble.

This new routing algorithm uses a radix tree as the KV cache data structure, unlike the current prefix-aware routing (prefix_cache.go), which uses a hash.
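
To illustrate why a radix tree fits here: it answers longest-prefix queries directly, which a hash-keyed cache cannot. A simplified, self-contained sketch of a token-sequence radix tree (not the actual tree.go implementation; the node layout and names are illustrative):

```go
package main

import "fmt"

type node struct {
	key      []int         // token sequence on the edge into this node
	children map[int]*node // children indexed by the first token of their key
}

func newNode(key []int) *node {
	return &node{key: key, children: map[int]*node{}}
}

// common returns the length of the shared prefix of a and b.
func common(a, b []int) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// insert adds tokens under n, splitting an edge when it only partially matches.
func (n *node) insert(tokens []int) {
	if len(tokens) == 0 {
		return
	}
	child, ok := n.children[tokens[0]]
	if !ok {
		n.children[tokens[0]] = newNode(tokens)
		return
	}
	c := common(child.key, tokens)
	if c < len(child.key) { // split the existing edge at the divergence point
		split := newNode(child.key[:c])
		child.key = child.key[c:]
		split.children[child.key[0]] = child
		n.children[tokens[0]] = split
		child = split
	}
	child.insert(tokens[c:])
}

// matchLen returns how many leading tokens of the request are already in the tree.
func (n *node) matchLen(tokens []int) int {
	if len(tokens) == 0 {
		return 0
	}
	child, ok := n.children[tokens[0]]
	if !ok {
		return 0
	}
	c := common(child.key, tokens)
	if c < len(child.key) {
		return c
	}
	return c + child.matchLen(tokens[c:])
}

func main() {
	root := newNode(nil)
	root.insert([]int{1, 2, 3, 4})
	root.insert([]int{1, 2, 9})
	fmt.Println(root.matchLen([]int{1, 2, 3, 7})) // prints 3
}
```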

Related Issues

Resolves: #682

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@Jeffwan
Collaborator

Jeffwan commented Feb 20, 2025

you can rebase the changes now

@gangmuk gangmuk force-pushed the gangmuk/prefix_and_load_aware_routing branch 5 times, most recently from 9032868 to 759377b on February 21, 2025 at 02:53
@gangmuk gangmuk requested review from varungup90 and Jeffwan and removed the request for varungup90 and Jeffwan on February 21, 2025 at 04:56
@gangmuk gangmuk changed the title from [WIP] prefix and load aware routing with radix tree kv cache to Prefix and load aware routing with radix tree kv cache on Feb 21, 2025
@gangmuk
Collaborator Author

gangmuk commented Feb 21, 2025

@varungup90 PTAL

@Jeffwan
Collaborator

Jeffwan commented Feb 21, 2025

@gangmuk please fix the linter and DCO failures.

@gangmuk gangmuk force-pushed the gangmuk/prefix_and_load_aware_routing branch 3 times, most recently from 73478cf to b4bd80c Compare February 22, 2025 00:18
@gangmuk
Collaborator Author

gangmuk commented Feb 22, 2025

@Jeffwan The failing file is test/e2e/model_adapter_test.go. It should be fixed in a different PR, and I just saw that it is already covered in another PR.

Can we get this PR merged by today, so that I can do some more work on top of it during the weekend? I want to avoid making another PR before this one is merged. This PR has been open for a while.

@Jeffwan
Collaborator

Jeffwan commented Feb 22, 2025

@gangmuk I already merged #703. Can you rebase onto main and fix the linter issue?

Collaborator

@Jeffwan Jeffwan left a comment


Overall looks good to me. I didn't check all the implementation details, especially the data structure part. We can merge and evaluate it later.

@@ -35,17 +37,21 @@ type TreeNode struct {
parent *TreeNode
value []int
key []int
refCounter []int
RefCounter []int
Collaborator


Do you need to expose such fields? I do see you have some GetXXX methods; we can unify the experience later.

Collaborator Author


Makes sense. Will make it consistent.

SeqLens []int
}

func mistral7BA6000LinearTime(numBatchedTokens int) float64 {
Collaborator


Did the Preble repo share any code to generate these magic numbers? If not, and our target model and GPU are not in their list, what's the workaround?

Let's document the limitations or assumptions here. You can do this later.

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they didn't; those numbers were all hardcoded. They use a linear regression model. I took the numbers from their code, and based on the hardware spec gap between the A6000 and the V100, I created the V100 version.

Sounds good. I will write more comments about this limitation in the code.

Collaborator


Just FYI, we can talk with the Preble authors and Prof. Yiying about this work if there are blockers.

Collaborator Author


@Jeffwan Yeah, I don't think there is a secret sauce (not 100% sure though). Let me run the benchmark and we can discuss with numbers; I will ping you once it is done.

They profiled the target model on an A6000 with varying input/output lengths, like we did in the GPU optimizer. I don't know the details of either profiling effort, though.
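
For context, a sketch of what such a fitted linear cost model looks like in Go. The function names mirror the one in the diff, but the coefficients and the A6000-to-V100 scaling factor below are placeholders, not the actual fitted values from Preble or this PR:

```go
package main

import "fmt"

// Sketch of the linear-regression cost model discussed above:
// prefillTime ~= a*numBatchedTokens + b, fitted offline per (model, GPU).
// The coefficients here are placeholders, NOT the actual fitted values.
func mistral7BA6000LinearTimeSketch(numBatchedTokens int) float64 {
	const a, b = 1.0e-4, 2.0e-2 // hypothetical slope and intercept
	return a*float64(numBatchedTokens) + b
}

// The V100 variant in this PR was derived by scaling the A6000 numbers by
// the hardware spec gap; the factor below is likewise a placeholder.
func mistral7BV100LinearTimeSketch(numBatchedTokens int) float64 {
	const a6000ToV100 = 2.0 // hypothetical slowdown factor
	return a6000ToV100 * mistral7BA6000LinearTimeSketch(numBatchedTokens)
}

func main() {
	fmt.Printf("est. prefill on V100 for 1024 tokens: %.3fs\n",
		mistral7BV100LinearTimeSketch(1024))
}
```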

)

const (
defaultDecodingLength = 45 // FIXME: decode length is hardcoded. Preble as well.
Collaborator


what's this for?

Collaborator Author


Again, this came from the Preble code; they used a hardcoded output length. I took it as-is for now. It needs to be more sophisticated; I will do that in a separate PR if that's okay.

Collaborator


Got you.
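
For context, a hedged sketch of how a hardcoded expected output length typically enters a decode-cost estimate (the formula and perTokenDecodeTime are assumptions for illustration, not this PR's actual code):

```go
package main

import "fmt"

const defaultDecodingLength = 45 // hardcoded expected output length, as in the diff

// Illustrative only: without knowing the real output length, the decode
// cost of a request is approximated as a fixed number of decode steps
// times an estimated per-token decode time.
func estimatedDecodeCost(perTokenDecodeTime float64) float64 {
	return float64(defaultDecodingLength) * perTokenDecodeTime
}

func main() {
	fmt.Printf("est. decode cost: %.2fs\n", estimatedDecodeCost(0.03))
}
```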

defaultDecodingLength = 45 // FIXME: decode length is hardcoded. Preble as well.
slidingWindowPeriod = 3 * time.Minute // NOTE: hardcoded
evictionLoopInterval = 1000 * time.Millisecond // NOTE: hardcoded
targetGPU = "V100" // A6000 // FIXME: make it configurable
Collaborator


We can have a discussion later on the limitations of the current implementation.

Collaborator Author


Yeah, sounds good.
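
One possible direction for that discussion, sketched below: read the target GPU from an environment variable with the current constant as the fallback. AIBRIX_TARGET_GPU is a hypothetical variable name, not one defined by this PR:

```go
package main

import (
	"fmt"
	"os"
)

// Sketch of lifting the hardcoded targetGPU constant into configuration.
// The environment variable name is hypothetical.
func targetGPU() string {
	if gpu := os.Getenv("AIBRIX_TARGET_GPU"); gpu != "" {
		return gpu
	}
	return "V100" // current hardcoded default from the diff above
}

func main() {
	fmt.Println("routing cost model target GPU:", targetGPU())
}
```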

- Initial implementation of radix tree-based cache with prefix matching
- Added routing strategy in gateway for prefix-cache-and-load
- Updated tree.go implementation (GPU -> Pod)
- Implemented sophisticated prefill time cost computation for V100
- Added attention quadratic cost calculation
- Fixed bugs in SplitNode and evictNode functionality
- Added proper ModelToPods mapping propagation
- Support for dynamic pod changes
- Optimized longest prefix matching logic
- Updated package paths and cleaned up deprecated functions

Signed-off-by: Gangmuk Lim <[email protected]>
@gangmuk gangmuk force-pushed the gangmuk/prefix_and_load_aware_routing branch from b4bd80c to 62d416f on February 22, 2025 at 06:08
@Jeffwan
Collaborator

Jeffwan commented Feb 22, 2025

/lgtm

@Jeffwan Jeffwan merged commit d350ca5 into main Feb 22, 2025
11 checks passed
@Jeffwan Jeffwan deleted the gangmuk/prefix_and_load_aware_routing branch February 22, 2025 06:29
@gangmuk
Collaborator Author

gangmuk commented Feb 22, 2025

@Jeffwan I will make another PR that addresses the feedback you gave.

@Jeffwan
Collaborator

Jeffwan commented Feb 22, 2025

@gangmuk Great!
