[Feature] Will multi-modal models support W8A8 quantization in the future? #2496
Comments
There is a pull request #2308 handling this.
Thanks, we will try it later and provide feedback promptly if any issues arise.
Also, I'd like to ask whether TurboMind plans to support the W8A8 feature for VLMs (vision-language models) in the future?
TurboMind is only responsible for the LLM; the vision model in lmdeploy runs through PyTorch.
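For context, this split shows up directly in the pipeline API: the backend config only selects the engine for the language model, while the vision encoder runs in PyTorch either way. A minimal sketch using lmdeploy's documented `pipeline` and `load_image` helpers; the model name and image URL here are just illustrative placeholders:

```python
from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image

# The engine config governs the LLM part only; the vision tower of a
# VLM runs in PyTorch regardless of which backend is selected here.
pipe = pipeline('OpenGVLab/InternVL2-26B',
                backend_config=PytorchEngineConfig())

image = load_image('https://example.com/demo.jpg')  # placeholder URL
print(pipe(('Describe this image.', image)))
```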
Excuse me, I misspoke. What I actually wanted to ask is whether TurboMind has any plans to support W8A8, because according to the documentation (Supported Models), TurboMind doesn't currently support it.
Yes, there is a plan for TurboMind to support W8A8.
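For reference, the W8A8 path that exists today targets LLMs on the PyTorch engine. A sketch of that workflow, assuming the `lmdeploy lite smooth_quant` CLI from the docs (the model name is only an example):

```python
# Quantize an LLM to W8A8 with SmoothQuant (CLI, per the lmdeploy docs):
#   lmdeploy lite smooth_quant internlm/internlm2-chat-7b \
#       --work-dir ./internlm2-chat-7b-w8a8
#
# Then serve the quantized weights with the PyTorch engine. TurboMind
# would only become an option here once its planned W8A8 support lands.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline('./internlm2-chat-7b-w8a8',
                backend_config=PytorchEngineConfig())
print(pipe(['Hello, who are you?']))
```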
Motivation
Our production model (InternVL2-26B) outputs very few tokens (1-2) after prompt optimization, so inference consists almost entirely of the prefill stage. We therefore hope to use W8A8 quantization to speed up inference. However, we found that lmdeploy does not support W8A8 inference for multi-modal models: #2042
Could you please explain why W8A8 quantization is not supported for multi-modal models? Is it due to model accuracy concerns?
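To make the motivation concrete: prefill is compute-bound, and W8A8 quantizes both weights and activations to int8 so the dominant GEMMs can run on int8 units at roughly twice fp16 throughput. A minimal sketch of the numerics (symmetric per-tensor quantization, not lmdeploy's actual implementation):

```python
import torch

def quantize_sym_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: map the float range onto int8
    # with a single scale, as in SmoothQuant-style W8A8 schemes.
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_matmul(a_q, a_scale, w_q, w_scale):
    # Both operands are int8; accumulate in int32 and apply one float
    # rescale at the end. On real hardware this maps onto int8 tensor
    # cores, which is where the prefill speedup comes from.
    acc = a_q.to(torch.int32) @ w_q.to(torch.int32).T
    return acc.float() * (a_scale * w_scale)

a = torch.randn(16, 512)    # activations for a 16-token prefill step
w = torch.randn(256, 512)   # a weight matrix
a_q, a_s = quantize_sym_int8(a)
w_q, w_s = quantize_sym_int8(w)
err = (w8a8_matmul(a_q, a_s, w_q, w_s) - a @ w.T).abs().max()
print(f"max abs error vs fp32: {err.item():.4f}")
```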
Related resources
No response
Additional context
No response