I wanted to test speculative decoding with Llama 3.3 70B and benchmarked some (concurrent) requests. Tokens/s dropped 30-50% with speculative decoding enabled, so I wanted to ask whether I misconfigured something or whether there is some other reason for this. I know this feature is stated as "not optimized yet". The token acceptance rate was around 0.71-0.74.
I ran vLLM in a Docker container with the 0.7.2 image on Ubuntu with an A100 80GB GPU.
Docker compose file
services:
  vllm-openai:
    container_name: vllm
    volumes:
      - /path/to/model/dir:/root/.cache/huggingface
    environment:
      HUGGING_FACE_HUB_TOKEN: "<token>"
      VLLM_ATTENTION_BACKEND: FLASHINFER
    ports:
      - 8080:8000
    ipc: host
    image: vllm/vllm-openai:latest
    command:
      --model kosbu/Llama-3.3-70B-Instruct-AWQ
      --gpu-memory-utilization 0.8
      --disable-frontend-multiprocessing
      --max-model-len 32768
      --max-num-seqs 256
      --kv-cache-dtype fp8_e4m3
      --speculative-model joshmiller656/Llama3.2-3B-Instruct-AWQ-INT4
      --num-speculative-tokens 5
      # comment out the last two lines to run without speculative decoding
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
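As a quick check that the server is up before benchmarking, something like this against the OpenAI-compatible endpoint works (just a sketch; the model name matches the compose file above, the prompt and sampling parameters are placeholders, not the benchmark workload):

# Minimal smoke test against the OpenAI-compatible endpoint mapped to port 8080.
# The prompt and sampling parameters are placeholders, not the benchmark workload.
import requests

resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "model": "kosbu/Llama-3.3-70B-Instruct-AWQ",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])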
I ran the benchmark script from vLLM.
With speculative decoding enabled:
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 151.35
Total input tokens: 102400
Total generated tokens: 10321
Request throughput (req/s): 0.66
Output token throughput (tok/s): 68.19
Total Token throughput (tok/s): 744.77
---------------Time to First Token----------------
Mean TTFT (ms): 62817.71
Median TTFT (ms): 60896.71
P99 TTFT (ms): 106286.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 2169.56
Median TPOT (ms): 652.06
P99 TPOT (ms): 27469.27
---------------Inter-token Latency----------------
Mean ITL (ms): 2298.34
Median ITL (ms): 1253.04
P99 ITL (ms): 25970.76
==================================================
Without speculative decoding:
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 110.69
Total input tokens: 102400
Total generated tokens: 10563
Request throughput (req/s): 0.90
Output token throughput (tok/s): 95.43
Total Token throughput (tok/s): 1020.56
---------------Time to First Token----------------
Mean TTFT (ms): 72295.02
Median TTFT (ms): 85641.29
P99 TTFT (ms): 94882.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1427.52
Median TPOT (ms): 368.67
P99 TPOT (ms): 32033.71
---------------Inter-token Latency----------------
Mean ITL (ms): 334.75
Median ITL (ms): 100.72
P99 ITL (ms): 202.66
==================================================
Every metric except TTFT is strictly worse. Is there an explanation for this? Is it because of the AWQ model quants, or the fp8 KV cache? I briefly tested with the normal KV cache and the default attention backend (FlashAttention), but it doesn't seem to make a difference. The inter-token latency is also pretty bad. Could it be that the acceptance rate of ~0.72 is too low, so it would be faster if the main model just did the prediction itself?
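For a rough sense of whether ~0.72 is enough: under the usual i.i.d.-acceptance approximation from the speculative decoding literature, the expected number of tokens emitted per target-model verification step is (1 - a^(k+1)) / (1 - a) for per-token acceptance rate a and k draft tokens. A back-of-the-envelope sketch (this ignores scheduling and batching overhead, which is exactly what hurts under concurrent load):

# Expected tokens emitted per target-model verification step under the
# standard i.i.d.-acceptance approximation: (1 - a**(k + 1)) / (1 - a),
# for per-token acceptance rate a and k speculative (draft) tokens.
# This ignores scheduler/batching overhead, which matters under load.
def expected_tokens_per_step(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

for k in (1, 2, 3, 5):
    print(f"k={k}: {expected_tokens_per_step(0.72, k):.2f} tokens/step")
# k=5 at a=0.72 gives ~3.1 tokens per step. Each step costs 5 draft-model
# passes plus one (wider) 70B verification pass, so it only pays off if that
# is cheaper than ~3 ordinary decode steps. At batch size 1 decoding is
# memory-bound and this is easy; with many concurrent requests the GPU is
# already well utilized, so the extra draft/verification compute can lower
# total throughput instead.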
Or is speculative decoding just not suited for concurrent requests? I also ran the benchmark with the added parameters --max-concurrency 10 --request-rate 4 (4 requests per second come in, and the engine works on at most 10 at a time) to reduce the load from too many simultaneous requests, but I got roughly the same numbers.
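(Side note, in case it helps anyone reproduce the numbers: the acceptance rate can be read off vLLM's Prometheus /metrics endpoint between runs. A small sketch, assuming the speculative-decoding counters are exported under names containing "spec_decode"; exact metric names may differ by version.)

# Dump speculative-decoding counters from vLLM's Prometheus endpoint.
# Metric names are assumed to contain "spec_decode"; adjust the filter if
# your vLLM version exports them under different names.
import requests

metrics = requests.get("http://localhost:8080/metrics", timeout=10).text
for line in metrics.splitlines():
    if "spec_decode" in line and not line.startswith("#"):
        print(line)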
I got the idea from this site: https://olof.tech/llama-3-3-on-vllm-with-speculative-decoding/
There, the author used an fp8 model for both the target and the draft on an H100, but I don't have the VRAM for that, and the A100 also doesn't support native fp8. The article reported a 2x speedup, which I am far away from. It did mention, though, that the speedup and acceptance rate were lower with a 4-bit GPTQ draft model.
Any ideas or input are greatly appreciated.