Different gpu-memory-utilization, same inference time #13542
zhang95-honey asked this question in Q&A (unanswered)
Replies: 1 comment
-
I think inference time does not depend much on your GPU memory, but rather on the number of cores and the clock speed. Inference time depends on GPU memory only when your card's memory is limited and you choose to offload to the CPU (which is not the case here). Also, I think the memory freed by the smaller KV cache simply remains unused on your card. I am not really sure though, so please let me know if I'm wrong.
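A minimal back-of-the-envelope sketch of that budget, using the numbers quoted in the question below (96 GiB card, ~23 GiB after loading the model); treat it as illustrative, since as far as I understand vLLM's actual KV-cache allocation is a bit smaller because it also subtracts the peak activation memory measured during its profiling run:

```python
# Rough KV-cache budget arithmetic for the H20 numbers quoted in the question.
# Illustrative only; vLLM's profiling run also reserves activation memory,
# so the real KV-cache figure it reports comes out somewhat smaller.
TOTAL_GIB = 96.0     # H20 card memory
WEIGHTS_GIB = 23.0   # peak usage after loading the model, as reported

for util in (0.3, 0.9):
    budget = TOTAL_GIB * util        # memory vLLM is allowed to use in total
    kv_cache = budget - WEIGHTS_GIB  # what is left over for the KV cache
    unused = TOTAL_GIB - budget      # memory vLLM never allocates at all
    print(f"util={util}: budget={budget:.1f} GiB, "
          f"kv_cache~{kv_cache:.1f} GiB, left untouched~{unused:.1f} GiB")
```

So the "missing" KV cache memory is simply never requested from the card in the first place; a larger cache changes how many tokens can be held concurrently, not how fast each token is computed.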
-
Qwen2-VL-7B performs vLLM inference on an NVIDIA H20 (96 GiB). After loading the model, the peak memory occupation is about 23 GiB. With gpu_memory_utilization=0.3, the KV cache size is 96 GiB * 0.3 - 23 GiB ≈ 5 GiB; with gpu_memory_utilization=0.9, the KV cache accounts for about 50 GiB.
The question is: as gpu_memory_utilization decreases, the available KV cache memory decreases, so where does the reduced KV cache memory go, and why is the inference time basically the same at gpu_memory_utilization=0.9 and 0.3?
If anyone understands this issue, please reply.
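A minimal sketch of how one could time the two settings side by side; the checkpoint name, prompts, and batch size below are placeholders (not from the original report), and the script is meant to be run once per gpu_memory_utilization value:

```python
# Hypothetical timing sketch: run once per setting, e.g.
#   python bench.py 0.3
#   python bench.py 0.9
# Model name, prompts, and batch size are placeholders; adjust for your setup.
import sys
import time

from vllm import LLM, SamplingParams

gpu_mem_util = float(sys.argv[1]) if len(sys.argv) > 1 else 0.9

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    gpu_memory_utilization=gpu_mem_util,  # only changes how much memory is reserved for the KV cache
)

prompts = ["Describe the concept of a KV cache in one paragraph."] * 32
params = SamplingParams(temperature=0.0, max_tokens=256)

# Warm-up request to keep first-run overhead out of the measurement.
llm.generate(prompts[:1], params)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"gpu_memory_utilization={gpu_mem_util}: {elapsed:.2f}s, "
      f"{total_tokens / elapsed:.1f} generated tokens/s")
```

With a batch this small, neither setting comes close to filling its KV cache, so the measured times should come out about the same, matching the observation in the question.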