[Feature] Will multi-modal models support W8A8 quantization in the future? #2496
Comments
There is a pull request #2308 handling this.
Thanks, we will try it later and provide feedback promptly if any issues arise.
Also, I'd like to ask whether TurboMind plans to support the W8A8 feature for VLMs (vision-language models) in the future?
TurboMind is only responsible for the LLM; the vision model in lmdeploy runs through PyTorch.
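For context, this split shows up directly in the pipeline API: the backend config only selects the engine for the language model, while the vision encoder runs in PyTorch either way. A minimal sketch using lmdeploy's documented `pipeline` and `load_image` helpers; the model name and image URL here are just illustrative placeholders:

```python
from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.vl import load_image

# The engine config governs the LLM part only; the vision tower of a
# VLM runs in PyTorch regardless of which backend is selected here.
pipe = pipeline('OpenGVLab/InternVL2-26B',
                backend_config=PytorchEngineConfig())

image = load_image('https://example.com/demo.jpg')  # placeholder URL
print(pipe(('Describe this image.', image)))
```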
Excuse me, I misspoke. What I actually wanted to ask is whether TurboMind has any plans to support W8A8, because according to the documentation (Supported Models), TurboMind doesn't currently support it.
Yes, there is a plan for TurboMind to support W8A8.
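For reference, the W8A8 path that exists today targets LLMs on the PyTorch engine. A sketch of that workflow, assuming the `lmdeploy lite smooth_quant` CLI from the docs (the model name is only an example):

```python
# Quantize an LLM to W8A8 with SmoothQuant (CLI, per the lmdeploy docs):
#   lmdeploy lite smooth_quant internlm/internlm2-chat-7b \
#       --work-dir ./internlm2-chat-7b-w8a8
#
# Then serve the quantized weights with the PyTorch engine. TurboMind
# would only become an option here once its planned W8A8 support lands.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline('./internlm2-chat-7b-w8a8',
                backend_config=PytorchEngineConfig())
print(pipe(['Hello, who are you?']))
```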
Motivation
Our production model (InternVL2-26B) outputs very few tokens (1-2) after prompt optimization, so inference consists almost entirely of the prefill stage. We therefore hope to use W8A8 quantization to speed up inference. However, we found that lmdeploy does not support W8A8 inference for multi-modal models: #2042
Could you please explain why W8A8 quantization is not supported for multi-modal models? Is it due to model accuracy concerns?
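To make the motivation concrete: prefill is compute-bound, and W8A8 quantizes both weights and activations to int8 so the dominant GEMMs can run on int8 units at roughly twice fp16 throughput. A minimal sketch of the numerics (symmetric per-tensor quantization, not lmdeploy's actual implementation):

```python
import torch

def quantize_sym_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: map the float range onto int8
    # with a single scale, as in SmoothQuant-style W8A8 schemes.
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_matmul(a_q, a_scale, w_q, w_scale):
    # Both operands are int8; accumulate in int32 and apply one float
    # rescale at the end. On real hardware this maps onto int8 tensor
    # cores, which is where the prefill speedup comes from.
    acc = a_q.to(torch.int32) @ w_q.to(torch.int32).T
    return acc.float() * (a_scale * w_scale)

a = torch.randn(16, 512)    # activations for a 16-token prefill step
w = torch.randn(256, 512)   # a weight matrix
a_q, a_s = quantize_sym_int8(a)
w_q, w_s = quantize_sym_int8(w)
err = (w8a8_matmul(a_q, a_s, w_q, w_s) - a @ w.T).abs().max()
print(f"max abs error vs fp32: {err.item():.4f}")
```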
Related resources
No response
Additional context
No response