Commit

Merge pull request #103 from huggingface/feat/autoscaling-pending-req
update according to new autoscaling strategy
ErikKaum authored Nov 5, 2024
2 parents bde4837 + 3e7cf7f commit d5c1a81
Showing 1 changed file with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions docs/source/autoscaling.mdx
@@ -12,6 +12,14 @@ The autoscaling process is triggered based on the accelerator's utilization metr

It's important to note that the scaling up process takes place every minute and scaling down takes place every 2 minutes. This frequency ensures a balance between responsiveness and stability of the autoscaling system, with a stabilization of 300 seconds once scaled down.

### Scaling based on pending requests (beta feature)

You can change the scaling criteria to be based on pending requests instead of utilization metrics. This is currently an experimental feature, so it may change, and we don't recommend using it for production workloads.

- pending requests are requests that have not yet received an HTTP status, meaning they include in-flight requests and requests currently being processed.
- if there are more than 1.5 pending requests per replica in the past 20 seconds, it triggers an autoscaling event and adds a replica to your deployment.
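
As a rough illustration of the rule above, the check could be sketched as follows. This is not the service's implementation: the function name `should_scale_up`, the sampling of pending-request counts, and the choice to average the samples over the 20-second window are all assumptions made for the example. Only the 1.5 threshold and the 20-second window come from the text.

```python
# Hypothetical sketch of the pending-requests scaling rule.
# Only the two constants below are taken from the documentation;
# everything else (names, averaging strategy) is assumed for illustration.

PENDING_PER_REPLICA_THRESHOLD = 1.5  # from the docs
WINDOW_SECONDS = 20                  # from the docs


def should_scale_up(pending_samples, replicas,
                    threshold=PENDING_PER_REPLICA_THRESHOLD):
    """Return True if the rule would add a replica.

    pending_samples: pending-request counts observed during the last
        WINDOW_SECONDS (assumed to be sampled at regular intervals).
    replicas: number of replicas currently running.
    """
    if replicas <= 0:
        raise ValueError("replicas must be a positive integer")
    if not pending_samples:
        # No observations in the window, so nothing to act on.
        return False
    # Assumption: the window is summarised by a simple mean.
    avg_pending = sum(pending_samples) / len(pending_samples)
    return avg_pending / replicas > threshold
```

For example, with 2 replicas and a steady 4 pending requests, the ratio is 2.0 per replica and the check returns `True`; with a steady 3 pending requests the ratio is exactly 1.5, which is not "more than 1.5", so it returns `False`.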


## Considerations for Effective Autoscaling

While autoscaling offers convenient resource management, certain considerations should be kept in mind to ensure its effectiveness:
