I wanted to test speculative decoding with Llama 3.3 70B and benchmarked some (concurrent) requests. Tokens/s dropped 30-50% with speculative decoding enabled, so I wanted to ask whether I misconfigured something or whether there is some other reason for this. I know this feature is stated as "not optimized yet". The token acceptance rate was around 0.71-0.74.
I ran vLLM in a Docker container with the 0.7.2 image on Ubuntu with an A100 80GB GPU.
Docker compose file
services:
  vllm-openai:
    container_name: vllm
    volumes:
      - /path/to/model/dir:/root/.cache/huggingface
    environment:
      HUGGING_FACE_HUB_TOKEN: "<token>"
      VLLM_ATTENTION_BACKEND: FLASHINFER
    ports:
      - 8080:8000
    ipc: host
    image: vllm/vllm-openai:latest
    command:
      --model kosbu/Llama-3.3-70B-Instruct-AWQ
      --gpu-memory-utilization 0.8
      --disable-frontend-multiprocessing
      --max-model-len 32768
      --max-num-seqs 256
      --kv-cache-dtype fp8_e4m3
      --speculative-model joshmiller656/Llama3.2-3B-Instruct-AWQ-INT4
      --num-speculative-tokens 5
      # comment out the last two lines to run without speculative decoding
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
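As a quick check that the server is up before benchmarking, something like this against the OpenAI-compatible endpoint works (just a sketch; the model name matches the compose file above, the prompt and sampling parameters are placeholders, not the benchmark workload):

# Minimal smoke test against the OpenAI-compatible endpoint mapped to port 8080.
# The prompt and sampling parameters are placeholders, not the benchmark workload.
import requests

resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "model": "kosbu/Llama-3.3-70B-Instruct-AWQ",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])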
I ran the benchmark script from vLLM.
With speculative decoding enabled:
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 151.35
Total input tokens: 102400
Total generated tokens: 10321
Request throughput (req/s): 0.66
Output token throughput (tok/s): 68.19
Total Token throughput (tok/s): 744.77
---------------Time to First Token----------------
Mean TTFT (ms): 62817.71
Median TTFT (ms): 60896.71
P99 TTFT (ms): 106286.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 2169.56
Median TPOT (ms): 652.06
P99 TPOT (ms): 27469.27
---------------Inter-token Latency----------------
Mean ITL (ms): 2298.34
Median ITL (ms): 1253.04
P99 ITL (ms): 25970.76
==================================================
Without speculative decoding:
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 110.69
Total input tokens: 102400
Total generated tokens: 10563
Request throughput (req/s): 0.90
Output token throughput (tok/s): 95.43
Total Token throughput (tok/s): 1020.56
---------------Time to First Token----------------
Mean TTFT (ms): 72295.02
Median TTFT (ms): 85641.29
P99 TTFT (ms): 94882.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1427.52
Median TPOT (ms): 368.67
P99 TPOT (ms): 32033.71
---------------Inter-token Latency----------------
Mean ITL (ms): 334.75
Median ITL (ms): 100.72
P99 ITL (ms): 202.66
==================================================
Every metric except TTFT is strictly worse. Is there an explanation for this? Is it because of the AWQ model quants, or the fp8 KV cache? I briefly tested with the normal KV cache and the default attention backend (FlashAttention), but it doesn't seem to make a difference. The inter-token latency is also pretty bad. Could it be that the acceptance rate of ~0.72 is too low, so it would be faster if the main model just did the prediction itself?
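For a rough sense of whether ~0.72 is enough: under the usual i.i.d.-acceptance approximation from the speculative decoding literature, the expected number of tokens emitted per target-model verification step is (1 - a^(k+1)) / (1 - a) for per-token acceptance rate a and k draft tokens. A back-of-the-envelope sketch (this ignores scheduling and batching overhead, which is exactly what hurts under concurrent load):

# Expected tokens emitted per target-model verification step under the
# standard i.i.d.-acceptance approximation: (1 - a**(k + 1)) / (1 - a),
# for per-token acceptance rate a and k speculative (draft) tokens.
# This ignores scheduler/batching overhead, which matters under load.
def expected_tokens_per_step(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

for k in (1, 2, 3, 5):
    print(f"k={k}: {expected_tokens_per_step(0.72, k):.2f} tokens/step")
# k=5 at a=0.72 gives ~3.1 tokens per step. Each step costs 5 draft-model
# passes plus one (wider) 70B verification pass, so it only pays off if that
# is cheaper than ~3 ordinary decode steps. At batch size 1 decoding is
# memory-bound and this is easy; with many concurrent requests the GPU is
# already well utilized, so the extra draft/verification compute can lower
# total throughput instead.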
Or is speculative decoding just not suited for concurrent requests? I also ran the benchmark with the added parameters --max-concurrency 10 --request-rate 4 (4 requests per second come in, and the engine works on at most 10 at a time) to reduce the load from too many simultaneous requests, but I got roughly the same numbers.
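(Side note, in case it helps anyone reproduce the numbers: the acceptance rate can be read off vLLM's Prometheus /metrics endpoint between runs. A small sketch, assuming the speculative-decoding counters are exported under names containing "spec_decode"; exact metric names may differ by version.)

# Dump speculative-decoding counters from vLLM's Prometheus endpoint.
# Metric names are assumed to contain "spec_decode"; adjust the filter if
# your vLLM version exports them under different names.
import requests

metrics = requests.get("http://localhost:8080/metrics", timeout=10).text
for line in metrics.splitlines():
    if "spec_decode" in line and not line.startswith("#"):
        print(line)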
I got the idea from this site: https://olof.tech/llama-3-3-on-vllm-with-speculative-decoding/
There, the author used an fp8 model for both the target and the draft on an H100, but I don't have the VRAM for that, and the A100 also doesn't support native fp8. The article reported a 2x speedup, which I am far away from. It did mention, though, that the speedup and acceptance rate were lower with a 4-bit GPTQ draft model.
Any ideas or input are greatly appreciated.