Adding some initial docs content and diagrams #129

Open · wants to merge 3 commits into base: main
Binary file added site-src/images/request-flow.png
Binary file added site-src/images/resource-model.png
82 changes: 81 additions & 1 deletion site-src/index.md
@@ -1,3 +1,83 @@
# Introduction

TODO
Gateway API Inference Extension is an official Kubernetes project focused on
extending [Gateway API](https://gateway-api.sigs.k8s.io/) with inference
specific routing extensions.

The overall resource model focuses on two new inference-focused
[personas](/concepts/roles-and-personas) and corresponding resources that
> Reviewer comment (Contributor): Note that roles-and-personas.md is in a TODO state.

they are expected to manage:

<!-- Source: https://docs.google.com/presentation/d/11HEYCgFi-aya7FS91JvAfllHiIlvfgcp7qpi_Azjk4E/edit#slide=id.g292839eca6d_1_0 -->
<img src="/images/resource-model.png" alt="Gateway API Inference Extension Resource Model" class="center" width="550" />

## API Resources

### InferencePool

### InferenceModel
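
Both resources are still being defined. As a purely illustrative sketch (the
field names and API group here are assumptions, not the final API), the two
resources might relate like this:

```yaml
# Illustrative only: field names and apiVersion are assumptions, not final.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: llm-pool
spec:
  selector:
    app: my-model-server        # endpoints running a model server framework
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot            # the model name clients request
  poolRef:
    name: llm-pool              # the InferencePool that serves this model
```

In this sketch, the platform-level persona manages the shared InferencePool,
while the workload-level persona manages InferenceModel resources that
reference it.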

## Composable Layers

This project aims to develop an ecosystem of implementations that are fully
> Reviewer comment (Contributor): What do we mean by "ecosystem of implementations"? This project aims to "define specifications to enable a compatible ecosystem for extending the Gateway API with custom endpoint selection algorithms"?
>
> Reviewer comment (Contributor): +1 on ^ description from @ahg-g

compatible with each other. There are three distinct layers of components that
are relevant to this project:

### Gateway API Implementations

Gateway API has [more than 25
implementations](https://gateway-api.sigs.k8s.io/implementations/). As this
extension pattern stabilizes, we expect a wide set of these implementations to support
> Reviewer comment (Contributor): Not clear what pattern this refers to. Perhaps we should say the word "pattern" in the Composable Layers section intro so that the reader can connect the dots.

this project.

### Endpoint Selection Extension

As part of this project, we're building an initial reference extension that is
focused on routing to LoRA workloads. Over time, we hope to see a wide variety
> Reviewer comment (Contributor): I prefer not to give the impression that this is centered on LoRA; in reality, LoRA is just one of multiple criteria for selection.

of extensions emerge that follow this pattern and provide a wide range of
choices.

### Model Server Frameworks

This project will work closely with model server frameworks to establish a
shared standard for interacting with these extensions, particularly focused on
metrics and observability so extensions will be able to make informed routing
decisions. The project is currently focused on integrations with
[vLLM](https://github.com/vllm-project/vllm) and
[Triton](https://github.com/triton-inference-server/server), and will be open to
other integrations as they are requested.
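
To make the metrics interaction concrete, here is a rough Python sketch of
parsing Prometheus text-format metrics such as queue length and KV-cache
utilization. The metric names mirror vLLM-style gauges but are assumptions for
illustration here, not a settled protocol:

```python
# Hypothetical sketch: parsing the Prometheus text-format metrics an
# endpoint selection extension might consume. The metric names mirror
# vLLM-style gauges but are illustrative, not a settled protocol.

SAMPLE_METRICS = """\
# HELP vllm:num_requests_waiting Number of queued requests.
vllm:num_requests_waiting 4
# HELP vllm:gpu_cache_usage_perc KV-cache utilization (0-1).
vllm:gpu_cache_usage_perc 0.72
"""

def parse_gauges(text):
    """Parse simple Prometheus text-format gauge lines into a dict."""
    gauges = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        gauges[name] = float(value)
    return gauges

metrics = parse_gauges(SAMPLE_METRICS)
```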
> Reviewer comment (Contributor): @liu-cong will create a page for the model server protocol. @liu-cong we can start with the narrowest set of requirements: kv-cache, active adapters, and queue length; the metric type of each for both Prometheus and ORCA formats.


## Request Flow

To illustrate how this all comes together, it may be helpful to walk through a
sample request.

1. The first step involves the Gateway selecting the correct InferencePool
(set of endpoints running a model server framework) or Service to route to. This
logic is based on the existing Gateway and HTTPRoute APIs, and will be familiar
to any Gateway API users or implementers.

2. If the request should be routed to an InferencePool, the Gateway will forward
the request information to the endpoint selection extension for that pool.

3. The extension will fetch metrics from the InferencePool endpoints and
determine which of them can best achieve the configured objectives. Note that
this kind of metrics probing may happen asynchronously, depending on the
extension.

4. The extension will instruct the Gateway which endpoint the request should be
routed to.
> Reviewer comment (Contributor): s/which endpoint should be routed to./which endpoint the request should be routed to./

5. The Gateway will route the request to the desired endpoint.

<img src="/images/request-flow.png" alt="Gateway API Inference Extension Request Flow" class="center" />
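
The selection in steps 3 and 4 can be sketched roughly as follows. This is a
hypothetical illustration of the kind of scoring an extension might do; the
metric names and weighting are assumptions, not the reference implementation:

```python
# Hypothetical sketch of steps 3-4: score each InferencePool endpoint by
# its reported metrics and tell the Gateway to route to the least loaded.
# The metric names and weighting are illustrative assumptions.

def pick_endpoint(endpoints):
    """endpoints: address -> {"queue_len": int, "kv_cache_util": float}."""
    def load(metrics):
        # Simple combined load score: prefer short queues and low
        # KV-cache pressure.
        return metrics["queue_len"] + metrics["kv_cache_util"]
    return min(endpoints, key=lambda addr: load(endpoints[addr]))

pool = {
    "10.0.0.1:8000": {"queue_len": 5, "kv_cache_util": 0.9},
    "10.0.0.2:8000": {"queue_len": 1, "kv_cache_util": 0.4},
}
best = pick_endpoint(pool)  # the Gateway then routes the request here
```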


## Who is working on Gateway API Inference Extension?

This project is being driven by
[WG-Serving](https://github.com/kubernetes/community/tree/master/wg-serving) and
[SIG-Network](https://github.com/kubernetes/community/tree/master/sig-network)
to improve and standardize routing to inference workloads in Kubernetes. Check
out the [implementations reference](implementations.md) to see the latest
projects & products that support this project. If you are interested in
contributing to or building an implementation using Gateway API, then don’t
hesitate to [get involved!](/contributing)
1 change: 0 additions & 1 deletion site-src/stylesheets/extra.css
@@ -39,5 +39,4 @@
img.center {
display: block;
margin: 20px auto;
width: 550px;
}