
[Feature] Will multi-modal models support W8A8 quantization in the future? #2496

MenglingD opened this issue Sep 23, 2024 · 6 comments

@MenglingD

Motivation

Our production model (InternVL2-26B) outputs very few tokens (1-2) after prompt optimization, so inference consists almost entirely of the prefill stage. We therefore hope to use W8A8 quantization to speed up inference. However, we found that lmdeploy does not support W8A8 inference for multi-modal models: #2042
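For context, the workflow we hoped to use looks roughly like this (a sketch only: the checkpoint path and image URL are placeholders, and it assumes a W8A8 checkpoint produced by lmdeploy's documented `lmdeploy lite smooth_quant` command, which runs on the PyTorch engine):

```python
from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image

# Placeholder path for a W8A8 checkpoint, e.g. one produced by:
#   lmdeploy lite smooth_quant OpenGVLab/InternVL2-26B --work-dir ./internvl2-26b-w8a8
pipe = pipeline('./internvl2-26b-w8a8', backend_config=PytorchEngineConfig())

# Our prompts yield only 1-2 output tokens, so latency is dominated by prefill.
image = load_image('https://example.com/sample.jpg')  # placeholder URL
print(pipe(('Does this image contain a defect? Answer yes or no.', image)))
```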

Could you please explain why W8A8 quantization is not supported for multi-modal models? Is it due to model accuracy concerns?

Related resources

No response

Additional context

No response

MenglingD changed the title from "[Feature] Will multi-modal models support W8A8 quantization in the future? Also, could you please explain why none of the current multi-modal models support W8A8 quantization?" to "[Feature] Will multi-modal models support W8A8 quantization in the future?" on Sep 23, 2024
MenglingD closed this as not planned on Sep 23, 2024
MenglingD reopened this on Sep 23, 2024
@AllentDan
Collaborator

There is a pull request #2308 handling this.

@MenglingD
Author

There is a pull request #2308 handling this.

Thanks, we will try it later and report back promptly if any issues arise.

@MenglingD
Author

There is a pull request #2308 handling this.

Also, I'd like to ask: are there plans for TurboMind to support W8A8 for VLMs (Vision-Language Models) in the future?

@AllentDan
Collaborator

TurboMind is only responsible for the LLM part. The vision model in lmdeploy uses PyTorch.
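To illustrate (a minimal sketch; the model name is only an example), backend_config selects the engine for the language model, while the vision encoder runs in PyTorch either way:

```python
from lmdeploy import pipeline, PytorchEngineConfig, TurbomindEngineConfig

# backend_config chooses the engine for the LLM part only; the vision
# encoder is executed with PyTorch in both cases. Model name is an example.
pipe = pipeline('OpenGVLab/InternVL2-26B',
                backend_config=TurbomindEngineConfig())   # LLM on TurboMind
# pipe = pipeline('OpenGVLab/InternVL2-26B',
#                 backend_config=PytorchEngineConfig())   # LLM on the PyTorch engine
```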

@MenglingD
Author

TurboMind is only responsible for the LLM part. The vision model in lmdeploy uses PyTorch.

Sorry, I misspoke. What I actually wanted to ask is whether TurboMind has any plans to support W8A8, since according to the documentation ("Supported Models"), TurboMind does not currently support it.

@AllentDan
Collaborator

Yes, there is a plan for TurboMind to support W8A8.
