Merge pull request #103 from huggingface/feat/autoscaling-pending-req

update according to new autoscaling strategy
huggingface · Nov 5, 2024 · d5c1a81
2 parents bde4837 + 3e7cf7f
commit d5c1a81
Showing 1 changed file with 8 additions and 0 deletions.
diff --git a/docs/source/autoscaling.mdx b/docs/source/autoscaling.mdx
@@ -12,6 +12,14 @@ The autoscaling process is triggered based on the accelerator's utilization metr
 
 It's important to note that the scaling up process takes place every minute and scaling down takes place every 2 minutes. This frequency ensures a balance between responsiveness and stability of the autoscaling system, with a stabilization of 300 seconds once scaled down. 
 
+### Scaling based on pending requests (beta feature)
+
+You can switch the scaling criterion from utilization metrics to pending requests. This feature is currently experimental, so it may change, and we don't recommend using it for production workloads.
+
+- Pending requests are requests that have not yet received an HTTP status; they include both in-flight requests and requests currently being processed.
+- If there are more than 1.5 pending requests per replica in the past 20 seconds, an autoscaling event is triggered and a replica is added to your deployment.
+
+
 ## Considerations for Effective Autoscaling
 
 While autoscaling offers convenient resource management, certain considerations should be kept in mind to ensure its effectiveness:
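
The pending-requests rule added in this diff can be sketched as a small scale-up check. This is a minimal illustration, not the service's actual implementation: the 1.5 threshold and the 20-second window come from the added documentation, while the function name `should_scale_up`, the idea of sampling pending counts, and averaging those samples over the window are assumptions (the documentation does not say how the window is aggregated).

```python
from typing import Sequence

# Values taken from the added documentation.
PENDING_PER_REPLICA_THRESHOLD = 1.5
WINDOW_SECONDS = 20


def should_scale_up(
    pending_samples: Sequence[int],
    replicas: int,
    threshold: float = PENDING_PER_REPLICA_THRESHOLD,
) -> bool:
    """Return True if the pending-requests rule would add a replica.

    pending_samples: hypothetical counts of pending requests observed
        during the last WINDOW_SECONDS seconds.
    replicas: current number of replicas (must be positive).

    Assumption: the window is aggregated with a simple mean.
    """
    if replicas <= 0:
        raise ValueError("replicas must be a positive integer")
    if not pending_samples:
        # No observations in the window: nothing to react to.
        return False
    mean_pending = sum(pending_samples) / len(pending_samples)
    # The rule is strict: "more than" 1.5 pending requests per replica.
    return mean_pending / replicas > threshold
```

For example, with 2 replicas and a steady 4 pending requests over the window, the ratio is 2.0 per replica, which exceeds 1.5, so the check returns `True`. With a steady 3 pending requests the ratio is exactly 1.5, which does not exceed the threshold, so it returns `False`.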