Support Qwen VL visual projector's 4-bit quantization, not only fp16 #11408
Closed · samkoesnadi started this conversation in Ideas · Replies: 2 comments
-
No interest?
-
In case anyone needs it: #11644
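For anyone following that link: the PR adds a small CLI for quantizing the projector GGUF. The binary name and type code below are assumptions from memory of that PR, not verified here, so check #11644 itself before relying on them.

```sh
# Hypothetical invocation; verify the actual binary name and type codes in #11644.
# The trailing 2 is assumed to select q4_0, following ggml's file-type numbering.
./llama-llava-clip-quantize-cli mmproj-model-f16.gguf mmproj-model-q4_0.gguf 2
```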
-
I have been experimenting with llama.cpp's VLM support, mainly developing with the Qwen2-VL model for my project.
The text decoder runs very fast on my machine because I use 4-bit quantization. The visual projector, however, only supports two precisions, fp16 and fp32.
How difficult would it be for us to implement 4-bit quantization for the visual projector?
As a reference, I run the model on a Redmi Note 13 Pro. The visual projector runs at around 3 tokens per second, whereas the text decoder runs at 7 tokens/s, which is quite a margin. Since performance is crucial on mobile, having the visual projector quantized would be very helpful.
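For context on what "4-bit quantization" would mean here: ggml stores Q4_0 weights in blocks of 32 values, each block carrying one scale plus two 4-bit quants packed per byte, and the projector's tensors could in principle use the same layout the text decoder already does. Below is a minimal sketch of that block scheme; the struct and function names are illustrative, not ggml's actual symbols.

```cpp
// Minimal sketch of Q4_0-style block quantization (the 4-bit scheme the
// text decoder already uses). Names are illustrative, not ggml's symbols.
#include <algorithm>
#include <cmath>
#include <cstdint>

struct BlockQ4 {
    float   scale;   // ggml stores this as fp16; float kept here for brevity
    uint8_t qs[16];  // 32 x 4-bit quants, two packed per byte
};

// Quantize one block of 32 floats; zero maps to quant value 8.
BlockQ4 quantize_block_q4(const float *x) {
    // The value with the largest magnitude (sign included) sets the scale.
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < 32; ++i) {
        if (std::fabs(x[i]) > amax) { amax = std::fabs(x[i]); max = x[i]; }
    }
    const float d  = max / -8.0f;                 // per-block scale
    const float id = d != 0.0f ? 1.0f / d : 0.0f; // inverse scale

    BlockQ4 out{};
    out.scale = d;
    for (int j = 0; j < 16; ++j) {
        // First half of the block goes in the low nibbles, second half
        // in the high nibbles, mirroring ggml's q4_0 layout.
        const int q0 = std::min(15, (int)(x[j]      * id + 8.5f));
        const int q1 = std::min(15, (int)(x[j + 16] * id + 8.5f));
        out.qs[j] = (uint8_t)(q0 | (q1 << 4));
    }
    return out;
}
```

Dequantization is just `scale * (q - 8)`, so quantized projector GEMMs could presumably run through the same kernels the decoder uses; the main work would be letting the CLIP loading path accept quantized tensor types instead of only fp16/fp32.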