a surprise for me
#13904
Replies: 1 comment
-
"and the corresponding speed for 100 users in parallel reaches 25tokens/s." Do you mean you see 25 * 100 as the overall throughput for 100 users? if it's true, it make sense. Usually we should expect a pretty linear relationship between throughput for small batch sizes. but when batch size increase to larger size, throughput will eventually stop increasing as the memory bandwith or compute flops will be saturated. if you meant you see 25 token/s for the 100 users in total. then there must be something wrong with your setup. |
-
This is my first time using the vllm framework. I tried it with multiple users, and the results shocked me! vllm runs the Qwen 14B model on 2 * 2080Ti. The response speed for a single user is 80 tokens/s, and the corresponding speed for 100 users in parallel reaches 25 tokens/s. I find this hard to believe. I always assumed generation speed would degrade linearly with the number of parallel users, but the result is nothing like that. What caused it?
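
  For concreteness, the back-of-the-envelope arithmetic behind the surprise, assuming the 25 tokens/s figure is per user (as the reply above suggests):

  ```python
  single_user = 80          # tokens/s with one request at a time
  per_user_at_100 = 25      # tokens/s per user with 100 concurrent users

  # Naive "linear" expectation: the single-stream 80 tokens/s split 100 ways.
  expected_per_user = single_user / 100   # 0.8 tokens/s each
  # What batching actually delivers in aggregate:
  aggregate = per_user_at_100 * 100       # 2500 tokens/s overall
  print(expected_per_user, aggregate, aggregate / single_user)  # 0.8 2500 31.25
  ```

  Continuous batching decodes many sequences in each forward pass, so aggregate throughput climbs far above the single-stream rate until memory bandwidth or compute saturates.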