a surprise for me
#13904
Replies: 1 comment
-
"and the corresponding speed for 100 users in parallel reaches 25tokens/s." Do you mean you see 25 * 100 as the overall throughput for 100 users? if it's true, it make sense. Usually we should expect a pretty linear relationship between throughput for small batch sizes. but when batch size increase to larger size, throughput will eventually stop increasing as the memory bandwith or compute flops will be saturated. if you meant you see 25 token/s for the 100 users in total. then there must be something wrong with your setup. |
-
This is my first time using the vllm framework. I tried it with multiple users, and the results shocked me! vllm runs the Qwen 14B model on 2 * 2080Ti. The response speed for a single user is 80 tokens/s, and the corresponding speed for 100 users in parallel reaches 25 tokens/s. I find this hard to believe. I always assumed generation speed would degrade linearly with the number of parallel users, but the result is nothing like that. What caused it?
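
  For concreteness, the back-of-the-envelope arithmetic behind the surprise, assuming the 25 tokens/s figure is per user (as the reply above suggests):

  ```python
  single_user = 80          # tokens/s with one request at a time
  per_user_at_100 = 25      # tokens/s per user with 100 concurrent users

  # Naive "linear" expectation: the single-stream 80 tokens/s split 100 ways.
  expected_per_user = single_user / 100   # 0.8 tokens/s each
  # What batching actually delivers in aggregate:
  aggregate = per_user_at_100 * 100       # 2500 tokens/s overall
  print(expected_per_user, aggregate, aggregate / single_user)  # 0.8 2500 31.25
  ```

  Continuous batching decodes many sequences in each forward pass, so aggregate throughput climbs far above the single-stream rate until memory bandwidth or compute saturates.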