Replies: 1 comment
You can use the asynchronous version of the OpenAI client and send your requests concurrently; the server will then batch the in-flight requests on its own.
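Something along these lines (untested sketch; the endpoint, model name, and prompts are placeholders):

```python
import asyncio
from openai import AsyncOpenAI

# vLLM's OpenAI-compatible endpoint; the API key is unused unless you set one.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(prompt: str) -> str:
    resp = await client.completions.create(
        model="your-model-name",  # placeholder: whatever you passed to --model
        prompt=prompt,
        max_tokens=128,
    )
    return resp.choices[0].text

async def main(prompts: list[str]) -> list[str]:
    # All requests are in flight at once, so the server's scheduler can
    # batch them (up to max_num_seqs) instead of running one at a time.
    return await asyncio.gather(*(one_request(p) for p in prompts))

if __name__ == "__main__":
    print(asyncio.run(main(["Hello,", "The capital of France is"])))
```

With all requests submitted at once, the server can schedule them together up to max_num_seqs, rather than serving a single sequence at a time.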
-
I've tried both offline batch inference and server inference. With the same dataset and the same model, server inference is more than twice as slow as offline batch inference.
My guess is that the main reason is that offline inference runs requests in batches, since the default value of max_num_seqs is 256. (Please correct me if my understanding is wrong.)
If I change the offline inference to feed prompts one by one, it also becomes very slow.
However, I don't know how to get server inference to run in a batched way as well.
I'm also wondering whether there are other settings that could be slowing the server down. If there are, please let me know, I'd be really grateful!
The offline batch inference script is like this:
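(Simplified to the essentials; the model path and prompts are placeholders.)

```python
from vllm import LLM, SamplingParams

# max_num_seqs defaults to 256, so generate() runs the prompts in large batches.
llm = LLM(model="your-model-name")  # placeholder model path
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

prompts = ["Hello,", "The capital of France is"]  # the whole dataset in practice
outputs = llm.generate(prompts, sampling_params)  # single call over all prompts

for output in outputs:
    print(output.outputs[0].text)
```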
The server script:
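Essentially the stock OpenAI-compatible server at default settings, roughly `python -m vllm.entrypoints.openai.api_server --model your-model-name` (model path is a placeholder; no extra flags shown here).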
And the client script:
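(Simplified; the requests go out one at a time with the synchronous client. Endpoint and model name are placeholders.)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompts = ["Hello,", "The capital of France is"]  # the whole dataset in practice
results = []
for prompt in prompts:
    # One blocking request per prompt, so the server almost never has
    # more than one sequence to batch together.
    resp = client.completions.create(
        model="your-model-name",  # placeholder
        prompt=prompt,
        max_tokens=128,
    )
    results.append(resp.choices[0].text)
```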