Understanding Matmul Kernel Support for BF16 and INT8 Quantization in llama.cpp on x86 CPUs #11734
Replies: 1 comment
-
There is already a lot of optimisation for MatMul (though some gains may still be left on certain hardware...). If you want more insight into the internals, you can look at https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-cpu/llamafile/sgemm.cpp ; it is one of the optimisations, and more exist elsewhere.

For BF16 it is optimized for Zen 4, reaching roughly 90% of the theoretical maximum on that CPU in most cases, I think. I have some ideas for further gains, but they are hard to code; there may also be more to gain on Zen 5 with its higher core count (my personal feeling... 🤞).

Intel INT8 has more optimisation here: https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-cpu/amx/mmq.cpp

There is also some optimisation via BLAS here: https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-blas/ggml-blas.cpp but I do not think it brings any gain on x86 over what is in "sgemm".
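To give a feel for the kind of code in that path, here is a rough sketch (not the actual sgemm.cpp code) of a BF16 dot product built on the AVX512-BF16 instruction that the Zen 4 path relies on. The function name and loop structure are mine; it assumes GCC or Clang with -mavx512bf16 and a length that is a multiple of 32.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Rough sketch only: dot product of two BF16 vectors (stored as raw uint16_t)
// using AVX512-BF16. Assumes n is a multiple of 32.
float dot_bf16(const uint16_t *a, const uint16_t *b, size_t n) {
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        // Reinterpret 32 packed bf16 values as a __m512bh vector.
        const __m512bh va = (__m512bh)_mm512_loadu_si512((const void *)(a + i));
        const __m512bh vb = (__m512bh)_mm512_loadu_si512((const void *)(b + i));
        // vdpbf16ps: multiply bf16 pairs and accumulate into fp32 lanes.
        acc = _mm512_dpbf16_ps(acc, va, vb);
    }
    return _mm512_reduce_add_ps(acc);
}
```

The real kernels tile the matrices and keep many accumulators live at once, which is where most of the ~90% of peak comes from; the snippet above only shows the core instruction.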
-
Hello everyone,
I'm currently working with llama.cpp on x86 CPUs and have some questions regarding the support and implementation of matrix multiplication (matmul) kernels for various data types and quantization schemes. I hope to get some insights from the community or contributors to better understand these aspects.
For k-bit quantization (e.g., the 4-bit and 5-bit formats):
Does llama.cpp internally represent these quantized weights using INT8? (A simplified sketch of the block layout as I understand it is included after this list.)
During computation, are the quantized weights dequantized to a higher precision (e.g., FP16 or FP32) and then requantized, or are operations performed directly on the quantized representations?
Which specific matmul kernels are utilized for computations involving k-bit quantization?
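To make the first question concrete, here is a simplified sketch of how I understand a 4-bit block format such as Q4_0 to be laid out. The struct and function names are mine, and the real format stores the per-block scale as fp16 rather than float; this is only an illustration, not the actual ggml code.

```cpp
#include <cstdint>

// Simplified sketch of a 4-bit block format similar to ggml's Q4_0.
// The real block stores the scale as fp16 (ggml_half); it is widened to
// float here just to keep the example self-contained.
constexpr int QK4_0 = 32;                 // weights per block

struct block_q4_0_sketch {
    float   d;                            // per-block scale
    uint8_t qs[QK4_0 / 2];                // 32 weights packed as 4-bit nibbles
};

// Dequantize one block to fp32: each nibble is an unsigned 4-bit value with
// an implicit offset of 8, so the reconstructed weight is d * (q - 8).
void dequantize_block_q4_0(const block_q4_0_sketch &blk, float *out) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = (blk.qs[j] & 0x0F) - 8;    // low nibble -> first half
        const int hi = (blk.qs[j] >> 4)   - 8;    // high nibble -> second half
        out[j]             = lo * blk.d;
        out[j + QK4_0 / 2] = hi * blk.d;
    }
}
```

If I read it correctly, the offset of 8 keeps the stored nibbles unsigned while still allowing negative weights after dequantization.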
I appreciate any clarification or insights you can provide on these topics. Understanding the underlying implementations will greatly help in optimizing models and leveraging llama.cpp effectively on x86 hardware.
Thank you!