Prefix and load aware routing with radix tree kv cache #719

Merged
merged 1 commit into main on Feb 22, 2025

Conversation

gangmuk
Collaborator

@gangmuk gangmuk commented Feb 20, 2025

NOTE: PR #678 should be merged first. Will remove [WIP] in the PR title once PR #678 is merged.

This PR includes a new routing algorithm in the gateway. It considers prefix matching and load together when routing requests.
At a high level, the algorithm is:

  1. Find the longest matching prefix for the incoming request in the radix tree.
  2. Check whether the matched prefix covers a significant number of tokens. For now, the threshold for a significant prefix match is 50%; in other words, the match is significant if the matched token length is greater than 50% of the input token length.
  3. If the matched prefix is significant, do prefix-aware routing: route to the pod that has the longest prefix match.
  4. If the matched prefix is not significant, do more sophisticated load-aware routing. It has its own way of calculating load: it estimates the prefill latency and decode latency of each pod based on hardware, model, and token length. (A sketch of this decision logic follows the list.)
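
A minimal, self-contained sketch of this two-branch decision (illustrative only; the `pod` struct, field names, and load values below are hypothetical stand-ins, not the gateway's actual types):

```go
package main

import "fmt"

// Sketch of the two-branch routing decision described above.
// All identifiers here are illustrative, not the actual gateway code.
const prefixMatchThreshold = 0.5 // significant if match covers >50% of input tokens

type pod struct {
	name          string
	matchedPrefix int     // tokens of this request already cached on the pod
	load          float64 // estimated prefill+decode latency for this request
}

func route(inputLen int, pods []pod) pod {
	// Find the pod with the longest prefix match.
	best := pods[0]
	for _, p := range pods[1:] {
		if p.matchedPrefix > best.matchedPrefix {
			best = p
		}
	}
	if float64(best.matchedPrefix)/float64(inputLen) > prefixMatchThreshold {
		return best // prefix-aware branch: longest prefix match wins
	}
	// Load-aware branch: pick the pod with the lowest estimated latency.
	least := pods[0]
	for _, p := range pods[1:] {
		if p.load < least.load {
			least = p
		}
	}
	return least
}

func main() {
	pods := []pod{
		{name: "pod-a", matchedPrefix: 120, load: 0.9},
		{name: "pod-b", matchedPrefix: 30, load: 0.4},
	}
	fmt.Println(route(200, pods).name) // 120/200 = 60% > 50%, so pod-a
}
```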

This implementation is based on the existing prefix-aware routing work, Preble.

This new routing algorithm uses a radix tree as the KV cache data structure, unlike the current prefix-aware routing (prefix_cache.go), which uses a hash.
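
To illustrate why a radix tree fits here: it answers longest-prefix queries directly, which a hash-keyed cache cannot. A simplified, self-contained sketch of a token-sequence radix tree (not the actual tree.go implementation; the node layout and names are illustrative):

```go
package main

import "fmt"

type node struct {
	key      []int         // token sequence on the edge into this node
	children map[int]*node // children indexed by the first token of their key
}

func newNode(key []int) *node {
	return &node{key: key, children: map[int]*node{}}
}

// common returns the length of the shared prefix of a and b.
func common(a, b []int) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// insert adds tokens under n, splitting an edge when it only partially matches.
func (n *node) insert(tokens []int) {
	if len(tokens) == 0 {
		return
	}
	child, ok := n.children[tokens[0]]
	if !ok {
		n.children[tokens[0]] = newNode(tokens)
		return
	}
	c := common(child.key, tokens)
	if c < len(child.key) { // split the existing edge at the divergence point
		split := newNode(child.key[:c])
		child.key = child.key[c:]
		split.children[child.key[0]] = child
		n.children[tokens[0]] = split
		child = split
	}
	child.insert(tokens[c:])
}

// matchLen returns how many leading tokens of the request are already in the tree.
func (n *node) matchLen(tokens []int) int {
	if len(tokens) == 0 {
		return 0
	}
	child, ok := n.children[tokens[0]]
	if !ok {
		return 0
	}
	c := common(child.key, tokens)
	if c < len(child.key) {
		return c
	}
	return c + child.matchLen(tokens[c:])
}

func main() {
	root := newNode(nil)
	root.insert([]int{1, 2, 3, 4})
	root.insert([]int{1, 2, 9})
	fmt.Println(root.matchLen([]int{1, 2, 3, 7})) // prints 3
}
```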

Related Issues

Resolves: #682

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@Jeffwan
Collaborator

Jeffwan commented Feb 20, 2025

you can rebase the changes now

@gangmuk gangmuk force-pushed the gangmuk/prefix_and_load_aware_routing branch 5 times, most recently from 9032868 to 759377b on February 21, 2025 at 02:53
@gangmuk gangmuk requested review from varungup90 and Jeffwan and removed the request for varungup90 and Jeffwan on February 21, 2025 at 04:56
@gangmuk gangmuk changed the title from [WIP] prefix and load aware routing with radix tree kv cache to Prefix and load aware routing with radix tree kv cache on Feb 21, 2025
@gangmuk
Collaborator Author

gangmuk commented Feb 21, 2025

@varungup90 PTAL

@Jeffwan
Collaborator

Jeffwan commented Feb 21, 2025

@gangmuk please fix the linter and DCO failures.

@gangmuk gangmuk force-pushed the gangmuk/prefix_and_load_aware_routing branch 3 times, most recently from 73478cf to b4bd80c Compare February 22, 2025 00:18
@gangmuk
Collaborator Author

gangmuk commented Feb 22, 2025

@Jeffwan The failing file is test/e2e/model_adapter_test.go. It should be fixed in a different PR, and I just saw that it is already covered in another PR.

Can we get this PR merged by today, so that I can do some more work on top of it during the weekend? I want to avoid making another PR before this one is merged. This PR has been open for a while.

@Jeffwan
Collaborator

Jeffwan commented Feb 22, 2025

@gangmuk I already merged #703. Can you rebase onto main and fix the linter issue?

Collaborator

@Jeffwan Jeffwan left a comment


Overall looks good to me. I didn't check all the implementation details, especially the data structure part. We can merge and evaluate it later.

@@ -35,17 +37,21 @@ type TreeNode struct {
parent *TreeNode
value []int
key []int
refCounter []int
RefCounter []int
Collaborator


Do you need to expose such fields? I do see you have some GetXXX methods; we can unify the experience later.

Collaborator Author


Makes sense. Will make it consistent.

SeqLens []int
}

func mistral7BA6000LinearTime(numBatchedTokens int) float64 {
Collaborator


Did the Preble repo share any code to generate these magic numbers? If not, and our target model and GPU are not in their list, what's the workaround?

Let's document the limitations or assumptions here. You can do this later.

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they didn't; those numbers were all hardcoded. They use a linear regression model. I took the numbers from their code, and based on the hardware spec gap between the A6000 and the V100, I created the V100 version.

Sounds good. I will write more comments about this limitation in the code.

Collaborator


Just FYI, we can talk with the Preble authors and Prof. Yiying about this work if there are blockers.

Collaborator Author


@Jeffwan Yeah, I don't think there is a secret sauce (not 100% sure though). Let me run the benchmark and we can discuss with numbers; I will ping you once it is done.

They profiled the target model on an A6000 with varying input/output lengths, like we did in the GPU optimizer. I don't know the details of either profiling effort, though.
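
For context, a sketch of what such a fitted linear cost model looks like in Go. The function names mirror the one in the diff, but the coefficients and the A6000-to-V100 scaling factor below are placeholders, not the actual fitted values from Preble or this PR:

```go
package main

import "fmt"

// Sketch of the linear-regression cost model discussed above:
// prefillTime ~= a*numBatchedTokens + b, fitted offline per (model, GPU).
// The coefficients here are placeholders, NOT the actual fitted values.
func mistral7BA6000LinearTimeSketch(numBatchedTokens int) float64 {
	const a, b = 1.0e-4, 2.0e-2 // hypothetical slope and intercept
	return a*float64(numBatchedTokens) + b
}

// The V100 variant in this PR was derived by scaling the A6000 numbers by
// the hardware spec gap; the factor below is likewise a placeholder.
func mistral7BV100LinearTimeSketch(numBatchedTokens int) float64 {
	const a6000ToV100 = 2.0 // hypothetical slowdown factor
	return a6000ToV100 * mistral7BA6000LinearTimeSketch(numBatchedTokens)
}

func main() {
	fmt.Printf("est. prefill on V100 for 1024 tokens: %.3fs\n",
		mistral7BV100LinearTimeSketch(1024))
}
```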

)

const (
defaultDecodingLength = 45 // FIXME: decode length is hardcoded. Preble as well.
Collaborator


what's this for?

Collaborator Author


Again, this came from the Preble code; they used a hardcoded output length. I took it as-is for now. It needs to be more sophisticated; I will do that in a separate PR if that's okay.

Collaborator


Got you.
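
For context, a hedged sketch of how a hardcoded expected output length typically enters a decode-cost estimate (the formula and perTokenDecodeTime are assumptions for illustration, not this PR's actual code):

```go
package main

import "fmt"

const defaultDecodingLength = 45 // hardcoded expected output length, as in the diff

// Illustrative only: without knowing the real output length, the decode
// cost of a request is approximated as a fixed number of decode steps
// times an estimated per-token decode time.
func estimatedDecodeCost(perTokenDecodeTime float64) float64 {
	return float64(defaultDecodingLength) * perTokenDecodeTime
}

func main() {
	fmt.Printf("est. decode cost: %.2fs\n", estimatedDecodeCost(0.03))
}
```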

defaultDecodingLength = 45 // FIXME: decode length is hardcoded. Preble as well.
slidingWindowPeriod = 3 * time.Minute // NOTE: hardcoded
evictionLoopInterval = 1000 * time.Millisecond // NOTE: hardcoded
targetGPU = "V100" // A6000 // FIXME: make it configurable
Collaborator


We can have a discussion later on the limitations of the current implementation.

Collaborator Author


Yeah, sounds good.
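
One possible direction for that discussion, sketched below: read the target GPU from an environment variable with the current constant as the fallback. AIBRIX_TARGET_GPU is a hypothetical variable name, not one defined by this PR:

```go
package main

import (
	"fmt"
	"os"
)

// Sketch of lifting the hardcoded targetGPU constant into configuration.
// The environment variable name is hypothetical.
func targetGPU() string {
	if gpu := os.Getenv("AIBRIX_TARGET_GPU"); gpu != "" {
		return gpu
	}
	return "V100" // current hardcoded default from the diff above
}

func main() {
	fmt.Println("routing cost model target GPU:", targetGPU())
}
```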

- Initial implementation of radix tree-based cache with prefix matching
- Added routing strategy in gateway for prefix-cache-and-load
- Updated tree.go implementation (GPU -> Pod)
- Implemented sophisticated prefill time cost computation for V100
- Added attention quadratic cost calculation
- Fixed bugs in SplitNode and evictNode functionality
- Added proper ModelToPods mapping propagation
- Support for dynamic pod changes
- Optimized longest prefix matching logic
- Updated package paths and cleaned up deprecated functions

Signed-off-by: Gangmuk Lim <[email protected]>
@gangmuk gangmuk force-pushed the gangmuk/prefix_and_load_aware_routing branch from b4bd80c to 62d416f on February 22, 2025 at 06:08
@Jeffwan
Collaborator

Jeffwan commented Feb 22, 2025

/lgtm

@Jeffwan Jeffwan merged commit d350ca5 into main Feb 22, 2025
11 checks passed
@Jeffwan Jeffwan deleted the gangmuk/prefix_and_load_aware_routing branch February 22, 2025 06:29
@gangmuk
Collaborator Author

gangmuk commented Feb 22, 2025

@Jeffwan I will make another PR that addresses the feedback you gave.

@Jeffwan
Collaborator

Jeffwan commented Feb 22, 2025

@gangmuk Great!
