vulkan: matmul dequantization improvements #12015
base: master
Conversation
I did a quick run on RTX 4070 using the KHR_coopmat path (GGML_VK_DISABLE_COOPMAT2=1). Perf is about neutral on average, maybe down a tiny bit?
The backend tests all passed.
Interesting. Let's wait for some more results.
Here are my results on an Nvidia RTX 3090, an AMD Radeon Pro VII, and an Intel A770 (benchmark tables omitted).
Looks like it's mostly perf-neutral on AMD and Intel, probably since they are compute-limited. Some minor improvements. But on the RTX 3090 it makes a much larger difference. Looks good for legacy and k-quants, but the iq quants seem to regress. Any ideas?
// q8_0 dequantization: one 16-bit scale load plus two 16-bit quant loads per 4 values,
// unpacked into signed bytes and scaled
const float d = float(data_a_packed16[ib].d);
const uint v0 = uint(data_a_packed16[ib].qs[2*iqs]);
const uint v1 = uint(data_a_packed16[ib].qs[2*iqs + 1]);
const vec4 v = vec4(int8_t(v0 & 0xFF), int8_t(v0 >> 8), int8_t(v1 & 0xFF), int8_t(v1 >> 8)) * d;
Why not just use data_a_packed32 here? Then you can directly get the vec4 from an unpack8.
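For illustration, a minimal sketch of that alternative, assuming a hypothetical data_a_packed32 view of the q8_0 blocks and the unpack8 builtin from GL_EXT_shader_explicit_arithmetic_types (this is not the code in the PR):

// hypothetical 32-bit packed variant: one 32-bit load per 4 quants,
// split into signed bytes with unpack8
const float d = float(data_a_packed32[ib].d);
const i8vec4 q = unpack8(int32_t(data_a_packed32[ib].qs[iqs]));
const vec4 v = vec4(q) * d;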
Since each block has a 16-bit delta, each q8_0 block takes up 34 bytes. That's not divisible by 4 bytes, so we'd end up with an unaligned 32-bit load that's slower than a 16-bit one (I think I tried that a long time ago when I did the inference optimizations).
Maybe there's a way to repack the blocks and stuff an extra 16 bits at the end to make it 36 bytes, but that would use more memory and it sounds like a lot of work.
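For reference, a sketch of the q8_0 block layout behind the packed16 view used above (the exact declaration in the shader headers may differ):

struct block_q8_0_packed16 {
    float16_t d;         // 2-byte scale (delta)
    int16_t qs[32 / 2];  // 32 signed 8-bit quants stored as 16 x 16-bit words
};
// 2 + 32 = 34 bytes per block, which is not a multiple of 4, so a 32-bit
// view of consecutive blocks cannot keep every load 4-byte aligned.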
You could repack the tensor such that the quants and the scales are stored separately; then it would be an aligned 32-byte load, plus a 2-byte load for the scale.
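A minimal sketch of what such a repacked layout could look like on the shader side, with hypothetical bindings and names (assumes the usual 8-bit/16-bit storage extensions; not something this PR implements):

// quants and scales split into separate buffers so each block's 32 quant
// bytes stay 4-byte aligned, independent of the 2-byte scales
layout (binding = 0) readonly buffer Q { i8vec4 qs_repacked[]; };   // 8 x i8vec4 = 32 bytes per block
layout (binding = 1) readonly buffer D { float16_t d_repacked[]; }; // one fp16 scale per block

// dequantizing 4 values is then one aligned 32-bit load plus a scale load:
// const vec4 v = vec4(qs_repacked[ib * 8 + iqs]) * float(d_repacked[ib]);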
This basically makes the mul_mm shaders load and dequantize 4 or 8 values at a time, similar to how it's done in mat_vec (old quants only).
Results on my RX 470 (PR vs. master benchmark tables omitted):
I'm only seeing a small improvement as most of the GPU time is spent doing the actual multiplication, and I think we'll see better results on something that supports coopmat.