Different gpu-memory-utilization, same inference time #13542
zhang95-honey asked this question in Q&A (unanswered)
Replies: 1 comment
-
I think inference time does not depend much on your GPU memory, but rather on the number of cores and the clock speed. Inference time depends on GPU memory only when your card's memory is limited and you choose to offload to the CPU (which is not the case here). Also, I think the memory freed by the smaller KV cache simply remains unused on your card. I am not really sure though, so please let me know if I'm wrong.
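A minimal back-of-the-envelope sketch of that budget, using the numbers quoted in the question below (96 GiB card, ~23 GiB after loading the model); treat it as illustrative, since as far as I understand vLLM's actual KV-cache allocation is a bit smaller because it also subtracts the peak activation memory measured during its profiling run:

```python
# Rough KV-cache budget arithmetic for the H20 numbers quoted in the question.
# Illustrative only; vLLM's profiling run also reserves activation memory,
# so the real KV-cache figure it reports comes out somewhat smaller.
TOTAL_GIB = 96.0     # H20 card memory
WEIGHTS_GIB = 23.0   # peak usage after loading the model, as reported

for util in (0.3, 0.9):
    budget = TOTAL_GIB * util        # memory vLLM is allowed to use in total
    kv_cache = budget - WEIGHTS_GIB  # what is left over for the KV cache
    unused = TOTAL_GIB - budget      # memory vLLM never allocates at all
    print(f"util={util}: budget={budget:.1f} GiB, "
          f"kv_cache~{kv_cache:.1f} GiB, left untouched~{unused:.1f} GiB")
```

So the "missing" KV cache memory is simply never requested from the card in the first place; a larger cache changes how many tokens can be held concurrently, not how fast each token is computed.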
-
Qwen2-VL-7B performs vLLM inference on an NVIDIA H20 (96 GiB). After loading the model, the peak memory occupation is about 23 GiB. With gpu_memory_utilization=0.3, the KV cache size is 96 GiB * 0.3 - 23 GiB ≈ 5 GiB; with gpu_memory_utilization=0.9, the KV cache accounts for about 50 GiB.
The question is: as gpu_memory_utilization decreases, the available KV cache memory decreases, so where does the reduced KV cache memory go, and why is the inference time basically the same at gpu_memory_utilization=0.9 and 0.3?
If anyone understands this issue, please reply.
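A minimal sketch of how one could time the two settings side by side; the checkpoint name, prompts, and batch size below are placeholders (not from the original report), and the script is meant to be run once per gpu_memory_utilization value:

```python
# Hypothetical timing sketch: run once per setting, e.g.
#   python bench.py 0.3
#   python bench.py 0.9
# Model name, prompts, and batch size are placeholders; adjust for your setup.
import sys
import time

from vllm import LLM, SamplingParams

gpu_mem_util = float(sys.argv[1]) if len(sys.argv) > 1 else 0.9

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    gpu_memory_utilization=gpu_mem_util,  # only changes how much memory is reserved for the KV cache
)

prompts = ["Describe the concept of a KV cache in one paragraph."] * 32
params = SamplingParams(temperature=0.0, max_tokens=256)

# Warm-up request to keep first-run overhead out of the measurement.
llm.generate(prompts[:1], params)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"gpu_memory_utilization={gpu_mem_util}: {elapsed:.2f}s, "
      f"{total_tokens / elapsed:.1f} generated tokens/s")
```

With a batch this small, neither setting comes close to filling its KV cache, so the measured times should come out about the same, matching the observation in the question.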