Understanding Matmul Kernel Support for BF16 and INT8 Quantization in llama.cpp on x86 CPUs #11734
Replies: 1 comment
-
There is already a lot of optimisation for MatMul (though some gains may still be left on certain hardware...). If you want more insight into the internals, you can look at https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-cpu/llamafile/sgemm.cpp ; it is one of the optimisations, and more exist elsewhere.

For BF16 it is optimized for Zen 4, reaching roughly 90% of the theoretical maximum on that CPU in most cases, I think. I have some ideas for further gains, but they are hard to code; there may also be more to gain on Zen 5 with its higher core count (my personal feeling... 🤞).

Intel INT8 has more optimisation here: https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-cpu/amx/mmq.cpp

There is also some optimisation via BLAS here: https://github.com/ggerganov/llama.cpp/blob/master/ggml/src/ggml-blas/ggml-blas.cpp but I do not think it brings any gain on x86 over what is in "sgemm".
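To give a feel for the kind of code in that path, here is a rough sketch (not the actual sgemm.cpp code) of a BF16 dot product built on the AVX512-BF16 instruction that the Zen 4 path relies on. The function name and loop structure are mine; it assumes GCC or Clang with -mavx512bf16 and a length that is a multiple of 32.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Rough sketch only: dot product of two BF16 vectors (stored as raw uint16_t)
// using AVX512-BF16. Assumes n is a multiple of 32.
float dot_bf16(const uint16_t *a, const uint16_t *b, size_t n) {
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        // Reinterpret 32 packed bf16 values as a __m512bh vector.
        const __m512bh va = (__m512bh)_mm512_loadu_si512((const void *)(a + i));
        const __m512bh vb = (__m512bh)_mm512_loadu_si512((const void *)(b + i));
        // vdpbf16ps: multiply bf16 pairs and accumulate into fp32 lanes.
        acc = _mm512_dpbf16_ps(acc, va, vb);
    }
    return _mm512_reduce_add_ps(acc);
}
```

The real kernels tile the matrices and keep many accumulators live at once, which is where most of the ~90% of peak comes from; the snippet above only shows the core instruction.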
-
Hello everyone,
I'm currently working with llama.cpp on x86 CPUs and have some questions regarding the support and implementation of matrix multiplication (matmul) kernels for various data types and quantization schemes. I hope to get some insights from the community or contributors to better understand these aspects.
For k-bit quantization (e.g., the 4-bit and 5-bit formats):
Does llama.cpp internally represent these quantized weights using INT8? (A simplified sketch of the block layout as I understand it is included after this list.)
During computation, are the quantized weights dequantized to a higher precision (e.g., FP16 or FP32) and then requantized, or are operations performed directly on the quantized representations?
Which specific matmul kernels are utilized for computations involving k-bit quantization?
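To make the first question concrete, here is a simplified sketch of how I understand a 4-bit block format such as Q4_0 to be laid out. The struct and function names are mine, and the real format stores the per-block scale as fp16 rather than float; this is only an illustration, not the actual ggml code.

```cpp
#include <cstdint>

// Simplified sketch of a 4-bit block format similar to ggml's Q4_0.
// The real block stores the scale as fp16 (ggml_half); it is widened to
// float here just to keep the example self-contained.
constexpr int QK4_0 = 32;                 // weights per block

struct block_q4_0_sketch {
    float   d;                            // per-block scale
    uint8_t qs[QK4_0 / 2];                // 32 weights packed as 4-bit nibbles
};

// Dequantize one block to fp32: each nibble is an unsigned 4-bit value with
// an implicit offset of 8, so the reconstructed weight is d * (q - 8).
void dequantize_block_q4_0(const block_q4_0_sketch &blk, float *out) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = (blk.qs[j] & 0x0F) - 8;    // low nibble -> first half
        const int hi = (blk.qs[j] >> 4)   - 8;    // high nibble -> second half
        out[j]             = lo * blk.d;
        out[j + QK4_0 / 2] = hi * blk.d;
    }
}
```

If I read it correctly, the offset of 8 keeps the stored nibbles unsigned while still allowing negative weights after dequantization.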
I appreciate any clarification or insights you can provide on these topics. Understanding the underlying implementations will greatly help in optimizing models and leveraging llama.cpp effectively on x86 hardware.
Thank you!