Commit

target latency docs
sauyon committed Mar 2, 2023
1 parent 765c4b8 commit 8b72567
Showing 2 changed files with 33 additions and 7 deletions.
10 changes: 7 additions & 3 deletions docs/source/concepts/runner.rst
@@ -292,15 +292,17 @@ Runner Definition
# below are also configurable via config file:
# default configs:
max_batch_size=.. # default max batch size will be applied to all run methods, unless overridden in the runnable_method_configs
max_latency_ms=.. # default max latency will be applied to all run methods, unless overridden in the runnable_method_configs
# default configs, which will be applied to all run methods, unless overridden for a specific method:
max_batch_size=..
max_latency_ms=..
target_latency_ms=..
runnable_method_configs=[
{
method_name="predict",
max_batch_size=..,
max_latency_ms=..,
target_latency_ms=..,
}
],
)
@@ -333,6 +335,7 @@ To explicitly disable or control adaptive batching behaviors at runtime, configu
enabled: true
max_batch_size: 100
max_latency_ms: 500
target_latency_ms: 50
.. tab-item:: Individual Runner
:sync: individual_runner
@@ -346,6 +349,7 @@ To explicitly disable or control adaptive batching behaviors at runtime, configu
enabled: true
max_batch_size: 100
max_latency_ms: 500
target_latency_ms: 50
Resource Allocation
^^^^^^^^^^^^^^^^^^^
30 changes: 26 additions & 4 deletions docs/source/guides/batching.rst
@@ -52,18 +52,39 @@ In addition to declaring a model as batchable, batch dimensions can also be config
Configuring Batching
--------------------

If a model supports batching, adaptive batching is enabled by default. To explicitly disable or control adaptive batching behaviors at runtime, configuration can be specified under the ``batching`` key.
Additionally, there are two configurations for customizing batching behaviors, `max_batch_size` and `max_latency_ms`.
If a model supports batching, adaptive batching is enabled by default. To explicitly disable or
control adaptive batching behaviors at runtime, configuration can be specified under the
``batching`` key. Additionally, there are three configuration keys for customizing batching
behaviors, ``max_batch_size``, ``max_latency_ms``, and ``target_latency_ms``.

Max Batch Size
^^^^^^^^^^^^^^

Configured through the ``max_batch_size`` key, max batch size represents the maximum size a batch can reach before releasing for inferencing. Max batch size should be set based on the capacity of the available system resources, e.g. memory or GPU memory.
Configured through the ``max_batch_size`` key, max batch size represents the maximum size a batch
can reach before being released for inferencing. Max batch size should be set based on the capacity
of the available system resources, e.g. memory or GPU memory.
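As a back-of-the-envelope sketch of sizing against GPU memory (all figures below are illustrative assumptions, not BentoML defaults or measurements from any particular model):

```python
# Illustrative capacity estimate; the numbers are assumptions chosen
# to show the arithmetic, not real measurements.
gpu_memory_mb = 16_000    # total GPU memory
model_memory_mb = 4_000   # memory held by model weights and buffers
per_sample_mb = 50        # peak activation memory per batched input

# Reserve headroom for the model itself, then divide the remainder
# by the per-sample cost to bound the batch size.
max_batch_size = (gpu_memory_mb - model_memory_mb) // per_sample_mb
print(max_batch_size)  # 240
```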

Max Latency
^^^^^^^^^^^

Configured through the ``max_latency_ms`` key, max latency represents the maximum latency in milliseconds that a batch should wait before releasing for inferencing. Max latency should be set based on the service level objective (SLO) of the inference requests.
Configured through the ``max_latency_ms`` key, max latency represents the maximum latency in
milliseconds that the scheduler will attempt to uphold by cancelling requests when it thinks the
runner server is incapable of servicing that request in time. Max latency should be set based on the
service level objective (SLO) of the inference requests.
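The cancellation rule described above can be sketched roughly as follows (a simplified illustration with hypothetical names, not BentoML's actual scheduler logic):

```python
def should_cancel(elapsed_ms: float, predicted_exec_ms: float,
                  max_latency_ms: float) -> bool:
    """Cancel a request when the time it has already spent queued, plus
    the predicted execution time, would exceed the max latency budget."""
    return elapsed_ms + predicted_exec_ms > max_latency_ms

# With max_latency_ms: 500, a request that has waited 350ms and whose
# batch is predicted to take 200ms is cancelled (350 + 200 > 500).
print(should_cancel(350, 200, 500))  # True
print(should_cancel(100, 200, 500))  # False
```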

Target Latency
^^^^^^^^^^^^^^

Configured through the ``target_latency_ms`` key, target latency represents the latency that the
request scheduler will try to meet if possible; that is, if target latency is set to 1000ms and
the scheduler thinks a batch will take 200ms to execute, it will wait around 800ms for
additional requests to arrive. Note that this can be set to 0 to disable waiting entirely.

If unset or -1, the scheduler will intelligently choose the wait time based on the historical wait
time for previous batches.

Target latency should be set based on how long you wish for requests to be held before they are
executed.
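The waiting behavior described above can be sketched as a small decision function (a simplified illustration under the assumptions stated in this section; the function and parameter names are hypothetical, not BentoML's implementation):

```python
def batch_wait_ms(target_latency_ms, predicted_exec_ms, historical_wait_ms):
    """Choose how long to wait for more requests before dispatching a batch."""
    if target_latency_ms == 0:
        return 0  # waiting disabled: dispatch immediately
    if target_latency_ms is None or target_latency_ms == -1:
        # Unset: fall back to the historical wait time of previous batches.
        return historical_wait_ms
    # Otherwise, wait for whatever budget remains after the predicted
    # execution time, never a negative amount.
    return max(0, target_latency_ms - predicted_exec_ms)

print(batch_wait_ms(1000, 200, 120))  # 800
print(batch_wait_ms(0, 200, 120))     # 0
print(batch_wait_ms(-1, 200, 120))    # 120
```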

.. code-block:: yaml
:caption: ⚙️ `configuration.yml`
@@ -74,6 +95,7 @@ Configured through the ``max_latency_ms`` key, max latency represents the maximu
enabled: true
max_batch_size: 100
max_latency_ms: 500
target_latency_ms: ~
Monitoring
----------
