Commit

target latency docs
sauyon committed Mar 2, 2023
1 parent 765c4b8 commit 8b72567
Showing 2 changed files with 33 additions and 7 deletions.
10 changes: 7 additions & 3 deletions docs/source/concepts/runner.rst
@@ -292,15 +292,17 @@ Runner Definition
# below are also configurable via config file:
# default configs:
max_batch_size=.. # default max batch size will be applied to all run methods, unless overridden in the runnable_method_configs
max_latency_ms=.. # default max latency will be applied to all run methods, unless overridden in the runnable_method_configs
# default configs, which will be applied to all run methods, unless overridden for a specific method:
max_batch_size=..
max_latency_ms=..
target_latency_ms=..
runnable_method_configs=[
{
method_name="predict",
max_batch_size=..,
max_latency_ms=..,
target_latency_ms=..,
}
],
)
@@ -333,6 +335,7 @@ To explicitly disable or control adaptive batching behaviors at runtime, configu
enabled: true
max_batch_size: 100
max_latency_ms: 500
target_latency_ms: 50
.. tab-item:: Individual Runner
:sync: individual_runner
@@ -346,6 +349,7 @@ To explicitly disable or control adaptive batching behaviors at runtime, configu
enabled: true
max_batch_size: 100
max_latency_ms: 500
target_latency_ms: 50
Resource Allocation
^^^^^^^^^^^^^^^^^^^
30 changes: 26 additions & 4 deletions docs/source/guides/batching.rst
@@ -52,18 +52,39 @@ In addition to declaring a model as batchable, batch dimensions can also be config
Configuring Batching
--------------------

If a model supports batching, adaptive batching is enabled by default. To explicitly disable or control adaptive batching behaviors at runtime, configuration can be specified under the ``batching`` key.
Additionally, there are two configurations for customizing batching behaviors, `max_batch_size` and `max_latency_ms`.
If a model supports batching, adaptive batching is enabled by default. To explicitly disable or
control adaptive batching behaviors at runtime, configuration can be specified under the
``batching`` key. Additionally, there are three configuration keys for customizing batching
behaviors, ``max_batch_size``, ``max_latency_ms``, and ``target_latency_ms``.

Max Batch Size
^^^^^^^^^^^^^^

Configured through the ``max_batch_size`` key, max batch size represents the maximum size a batch can reach before releasing for inferencing. Max batch size should be set based on the capacity of the available system resources, e.g. memory or GPU memory.
Configured through the ``max_batch_size`` key, max batch size represents the maximum size a batch
can reach before being released for inferencing. Max batch size should be set based on the capacity
of the available system resources, e.g. memory or GPU memory.
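As a back-of-the-envelope sketch of sizing against GPU memory (all figures below are illustrative assumptions, not BentoML defaults or measurements from any particular model):

```python
# Illustrative capacity estimate; the numbers are assumptions chosen
# to show the arithmetic, not real measurements.
gpu_memory_mb = 16_000    # total GPU memory
model_memory_mb = 4_000   # memory held by model weights and buffers
per_sample_mb = 50        # peak activation memory per batched input

# Reserve headroom for the model itself, then divide the remainder
# by the per-sample cost to bound the batch size.
max_batch_size = (gpu_memory_mb - model_memory_mb) // per_sample_mb
print(max_batch_size)  # 240
```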

Max Latency
^^^^^^^^^^^

Configured through the ``max_latency_ms`` key, max latency represents the maximum latency in milliseconds that a batch should wait before releasing for inferencing. Max latency should be set based on the service level objective (SLO) of the inference requests.
Configured through the ``max_latency_ms`` key, max latency represents the maximum latency in
milliseconds that the scheduler will attempt to uphold by cancelling requests when it thinks the
runner server is incapable of servicing that request in time. Max latency should be set based on the
service level objective (SLO) of the inference requests.
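The cancellation rule described above can be sketched roughly as follows (a simplified illustration with hypothetical names, not BentoML's actual scheduler logic):

```python
def should_cancel(elapsed_ms: float, predicted_exec_ms: float,
                  max_latency_ms: float) -> bool:
    """Cancel a request when the time it has already spent queued, plus
    the predicted execution time, would exceed the max latency budget."""
    return elapsed_ms + predicted_exec_ms > max_latency_ms

# With max_latency_ms: 500, a request that has waited 350ms and whose
# batch is predicted to take 200ms is cancelled (350 + 200 > 500).
print(should_cancel(350, 200, 500))  # True
print(should_cancel(100, 200, 500))  # False
```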

Target Latency
^^^^^^^^^^^^^^

Configured through the ``target_latency_ms`` key, target latency represents the latency that the
request scheduler will try to meet if possible; that is, if target latency is set to 1000ms and
the scheduler thinks a batch will take 200ms to execute, it will wait around 800ms for
additional requests to arrive. Note that this can be set to 0 to disable waiting entirely.

If unset or -1, the scheduler will intelligently choose the wait time based on the historical wait
time for previous batches.

Target latency should be set based on how long you wish for requests to be held before they are
executed.
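The waiting behavior described above can be sketched as a small decision function (a simplified illustration under the assumptions stated in this section; the function and parameter names are hypothetical, not BentoML's implementation):

```python
def batch_wait_ms(target_latency_ms, predicted_exec_ms, historical_wait_ms):
    """Choose how long to wait for more requests before dispatching a batch."""
    if target_latency_ms == 0:
        return 0  # waiting disabled: dispatch immediately
    if target_latency_ms is None or target_latency_ms == -1:
        # Unset: fall back to the historical wait time of previous batches.
        return historical_wait_ms
    # Otherwise, wait for whatever budget remains after the predicted
    # execution time, never a negative amount.
    return max(0, target_latency_ms - predicted_exec_ms)

print(batch_wait_ms(1000, 200, 120))  # 800
print(batch_wait_ms(0, 200, 120))     # 0
print(batch_wait_ms(-1, 200, 120))    # 120
```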

.. code-block:: yaml
:caption: ⚙️ `configuration.yml`
@@ -74,6 +95,7 @@ Configured through the ``max_latency_ms`` key, max latency represents the maximu
enabled: true
max_batch_size: 100
max_latency_ms: 500
target_latency_ms: ~
Monitoring
----------
