Prefix and load aware routing with radix tree kv cache #719
Conversation
you can rebase the changes now
@varungup90 PTAL
@gangmuk please fix the linter and DCO failures.
@Jeffwan The failing file is
Can we get this PR merged by today, so that I can do more work on top of it over the weekend? I want to avoid opening another PR before this one is merged. This PR has been open for a while.
overall looks good to me. I didn't check all implementation details, especially the data structure part. We can merge and evaluate it later
@@ -35,17 +37,21 @@ type TreeNode struct {
 	parent *TreeNode
 	value  []int
 	key    []int
-	refCounter []int
+	RefCounter []int
do you need to expose these fields? I see you already have some GetXXX methods; we can unify the accessor experience later
makes sense. Will make it consistent
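A minimal sketch of the pattern the reviewer is suggesting: keep the field unexported and read it through an accessor, so there is one unified access path. The names mirror the struct under discussion, but the copy-on-read accessor shown here is a hypothetical implementation choice, not the PR's actual code.

```go
package main

import "fmt"

// TreeNode keeps refCounter unexported; external packages go through the accessor.
type TreeNode struct {
	refCounter []int
}

// GetRefCounter returns a copy of the slice so callers cannot mutate internal state.
func (n *TreeNode) GetRefCounter() []int {
	out := make([]int, len(n.refCounter))
	copy(out, n.refCounter)
	return out
}

func main() {
	n := &TreeNode{refCounter: []int{1, 0, 2}}
	fmt.Println(n.GetRefCounter()) // prints [1 0 2]
}
```

Returning a copy trades a small allocation for the guarantee that no caller can corrupt the tree's reference counts.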
 	SeqLens []int
 }
func mistral7BA6000LinearTime(numBatchedTokens int) float64 {
Did the Preble repo share code to generate these magic numbers? If not, and our target model and GPU are not on their list, what's the workaround?
Let's document the limitations or assumptions here. You can do this later
no, they didn't; those numbers were all hardcoded. They use a linear regression model. I took the numbers from their code and, based on the hardware spec gap between the A6000 and the V100, created the V100 version.
sounds good. I will write more comments about this limitation in the code.
just FYI, we can talk with the Preble authors and Prof. Yiying about this work if there are blockers.
@Jeffwan yeah, I don't think there is a secret sauce (not 100% sure though). Let me run the benchmark and we can discuss with numbers; will ping you once it is done.
They profiled the target model on an A6000 with varying input/output lengths, like we did in the GPU optimizer. I don't know the details of either profiling, though.
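To make the "magic numbers" discussion concrete, here is a sketch of what such a per-(model, GPU) linear cost model looks like: prefill time modeled as t(n) = a*n + b, with coefficients fit offline by linear regression on profiled runs. The coefficients and the A6000-to-V100 slowdown ratio below are placeholders, NOT the values from Preble or from this PR; each (model, GPU) pair needs its own profiled fit.

```go
package main

import "fmt"

// Placeholder regression coefficients for illustration only.
const (
	coefA6000Slope     = 1.2e-4 // seconds per batched token (placeholder)
	coefA6000Intercept = 5.0e-3 // fixed per-batch overhead in seconds (placeholder)
)

// mistral7BA6000LinearTime estimates prefill time as a linear function of
// the number of batched tokens, matching the shape of the function in the PR.
func mistral7BA6000LinearTime(numBatchedTokens int) float64 {
	return coefA6000Slope*float64(numBatchedTokens) + coefA6000Intercept
}

// mistral7BV100LinearTime scales the A6000 fit by a single throughput ratio,
// as the PR author describes doing; this is a rough approximation, since real
// hardware gaps are not a constant factor across batch sizes.
func mistral7BV100LinearTime(numBatchedTokens int) float64 {
	const a6000ToV100Slowdown = 2.0 // placeholder hardware-gap ratio
	return mistral7BA6000LinearTime(numBatchedTokens) * a6000ToV100Slowdown
}

func main() {
	fmt.Printf("A6000 est. for 1000 tokens: %.4fs\n", mistral7BA6000LinearTime(1000))
	fmt.Printf("V100  est. for 1000 tokens: %.4fs\n", mistral7BV100LinearTime(1000))
}
```

The limitation the reviewer raises falls out directly: without a profiling run for the target (model, GPU), there is no principled way to obtain the slope and intercept, and scaling by a single ratio is only a stopgap.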
)

const (
	defaultDecodingLength = 45 // FIXME: decode length is hardcoded. Preble as well.
what's this for?
again, this came from the Preble code; they used a hardcoded output length. I took it as-is for now. It needs to be more sophisticated; I will address it in a separate PR, if that's okay
Got you.
	defaultDecodingLength = 45                      // FIXME: decode length is hardcoded. Preble as well.
	slidingWindowPeriod   = 3 * time.Minute         // NOTE: hardcoded
	evictionLoopInterval  = 1000 * time.Millisecond // NOTE: hardcoded
	targetGPU             = "V100"                  // A6000 // FIXME: make it configurable
We can have a discussion later on the limitation of current implementation
yeah sounds good.
- Initial implementation of radix tree-based cache with prefix matching
- Added routing strategy in gateway for prefix-cache-and-load
- Updated tree.go implementation (GPU -> Pod)
- Implemented sophisticated prefill time cost computation for V100
- Added attention quadratic cost calculation
- Fixed bugs in SplitNode and evictNode functionality
- Added proper ModelToPods mapping propagation
- Support for dynamic pod changes
- Optimized longest prefix matching logic
- Updated package paths and cleaned up deprecated functions

Signed-off-by: Gangmuk Lim <[email protected]>
/lgtm
@Jeffwan I will make another PR that resolves the feedback you gave.
@gangmuk Great!
NOTE: PR #678 should be merged first. Will remove [WIP] in the PR title once PR #678 is merged.
This PR adds a new routing algorithm to the gateway. It considers prefix matches and load together when routing requests.
At a high level, the algorithm is
This implementation is based on Preble, the existing prefix-aware routing work.
Unlike the current prefix-aware routing (prefix_cache.go), which uses a hash, this new routing algorithm uses a radix tree as the KV-cache data structure.
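To illustrate the "prefix and load together" idea, here is a minimal pod-selection sketch. The PR stores cached prefixes in a radix tree; for brevity this sketch keeps each pod's cached token sequences in a flat slice and scans for the longest common prefix. All names, the `loadWeight` constant, and the scoring formula are illustrative assumptions, not the PR's actual tree.go logic.

```go
package main

import "fmt"

type pod struct {
	name    string
	cached  [][]int // token sequences whose KV cache this pod holds
	pending int     // in-flight requests, a crude load signal
}

// commonPrefixLen returns the number of leading tokens shared by a and b.
func commonPrefixLen(a, b []int) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// selectPod scores each pod as (matched prefix tokens) minus a load penalty,
// so a long cache hit can be outweighed by a heavily loaded pod.
func selectPod(pods []pod, tokens []int) string {
	const loadWeight = 4.0 // prefix tokens one pending request "costs" (placeholder)
	best, bestScore := "", float64(-1<<30)
	for _, p := range pods {
		match := 0
		for _, seq := range p.cached {
			if l := commonPrefixLen(seq, tokens); l > match {
				match = l
			}
		}
		score := float64(match) - loadWeight*float64(p.pending)
		if score > bestScore {
			best, bestScore = p.name, score
		}
	}
	return best
}

func main() {
	pods := []pod{
		{name: "pod-a", cached: [][]int{{1, 2, 3, 4, 5}}, pending: 3},
		{name: "pod-b", cached: [][]int{{1, 2}}, pending: 0},
	}
	// pod-a matches 5 tokens but is loaded (5 - 12 = -7);
	// pod-b matches only 2 but is idle (2 - 0 = 2), so pod-b wins.
	fmt.Println(selectPod(pods, []int{1, 2, 3, 4, 5, 6})) // prints pod-b
}
```

The radix tree in the PR replaces the linear scan over `cached`: it makes the longest-prefix lookup proportional to the match length rather than to the total number of cached sequences.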
Related Issues
Resolves: #682
Important: Before submitting, please complete the description above and review the checklist below.
Contribution Guidelines (Expand for Details)
We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:
Pull Request Title Format
Your PR title should start with one of these prefixes to indicate the nature of the change:
[Bug]: Corrections to existing functionality
[CI]: Changes to build process or CI pipeline
[Docs]: Updates or additions to documentation
[API]: Modifications to aibrix's API or interface
[CLI]: Changes or additions to the Command Line Interface
[Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.
Submission Checklist
By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.