Parallel processing of runners #4380
DenisStefanAndrei asked this question in Q&A (unanswered)
Hello. I am running the following service:
```python
import asyncio
import logging
from time import time

import bentoml
from bentoml.io import JSON, Text
from openllm import LLM

# Forward BentoML logs to stderr for easier debugging
ch = logging.StreamHandler()
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
ch.setFormatter(formatter)
bentoml_logger = logging.getLogger("bentoml")
bentoml_logger.addHandler(ch)
bentoml_logger.setLevel(logging.DEBUG)


class LLMRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu",)
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self, model):
        self.llm = LLM(
            model,
            backend="vllm",
            quantize="awq",
            max_model_len=3000,
            gpu_memory_utilization=0.5,
        )

    # Runnable method exposed so the API below can call .generate.async_run();
    # assumes openllm's async LLM.generate API
    @bentoml.Runnable.method(batchable=False)
    async def generate(self, prompt: str):
        return await self.llm.generate(prompt)


mistral_awq_instruct = bentoml.Runner(
    LLMRunnable,
    name="my_runner_1",
    runnable_init_params={
        "model": "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    },
)
mistral_2 = bentoml.Runner(
    LLMRunnable,
    name="mistral_runner_2",
    runnable_init_params={"model": "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"},
)

svc = bentoml.Service("mistral_service", runners=[mistral_awq_instruct, mistral_2])


@svc.api(input=Text(), output=JSON())
async def generate(prompt: str) -> dict:
    t1 = time()
    print("generate endpoint")
    output = await mistral_awq_instruct.generate.async_run(prompt)
    print("execution time is: ", time() - t1)
    return output
```
I make 4 parallel requests to try to use the 2 runners available in the service, but only one of them is being used.
How can I take advantage of both runners in this service to get better inference time?
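For reference, the parallel requests are issued roughly like the sketch below. This is a minimal example, not my exact client code; the endpoint URL and port 3000 are just the default BentoML serving address.

```python
# Minimal sketch of the 4 parallel requests (client side).
# The URL/port are assumptions based on the default `bentoml serve` setup.
import asyncio

import httpx


async def call_generate(client: httpx.AsyncClient, prompt: str):
    # POST the raw text prompt to the /generate endpoint defined by the service
    response = await client.post("http://localhost:3000/generate", content=prompt)
    return response.json()


async def main():
    prompts = ["prompt 1", "prompt 2", "prompt 3", "prompt 4"]
    async with httpx.AsyncClient(timeout=None) as client:
        # Fire all 4 requests concurrently
        results = await asyncio.gather(*(call_generate(client, p) for p in prompts))
    print(results)


asyncio.run(main())
```

Even with the requests sent concurrently like this, only `my_runner_1` ever receives work, since the endpoint always calls `mistral_awq_instruct`.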