Is llama.cpp "aware" of dynamic quant layer sizes when loading models? #12046

justinjja · 2025-02-24T05:14:42Z

Loading the Dynamic R1 Q2-XL GGUF with the default split (no -ts), I get this VRAM usage per GPU:

| 0 N/A N/A 3968 C ./bin/llama-server 14006MiB |
| 1 N/A N/A 3968 C ./bin/llama-server 19158MiB |
| 2 N/A N/A 3968 C ./bin/llama-server 19158MiB |
| 3 N/A N/A 3968 C ./bin/llama-server 19158MiB |
| 4 N/A N/A 3968 C ./bin/llama-server 19158MiB |
| 5 N/A N/A 3968 C ./bin/llama-server 19158MiB |
| 6 N/A N/A 3968 C ./bin/llama-server 22866MiB |
| 7 N/A N/A 3968 C ./bin/llama-server 19158MiB |
| 8 N/A N/A 3968 C ./bin/llama-server 19158MiB |
| 9 N/A N/A 3968 C ./bin/llama-server 19158MiB |
| 10 N/A N/A 3968 C ./bin/llama-server 19158MiB |
| 11 N/A N/A 3968 C ./bin/llama-server 16188MiB |

Given that the average layer is ~3400 MB, it seems like a more even spread should be possible.
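
To put a number on the imbalance, here is a quick sketch that only does arithmetic on the per-GPU figures listed above. The variable names are mine and nothing here reflects how llama.cpp itself assigns layers.

```python
# Per-GPU usage in MiB with the default split, copied from the listing above.
default_mib = [14006, 19158, 19158, 19158, 19158, 19158,
               22866, 19158, 19158, 19158, 19158, 16188]

total = sum(default_mib)                      # memory used across all 12 GPUs
mean = total / len(default_mib)               # what a perfectly even split would give
spread = max(default_mib) - min(default_mib)  # gap between fullest and emptiest GPU

print(f"total:  {total} MiB")
print(f"mean:   {mean:.0f} MiB per GPU")
print(f"spread: {spread} MiB (max - min)")

# Show how far each GPU sits from the even-split mean.
for gpu, used in enumerate(default_mib):
    print(f"GPU {gpu:2d}: {used - mean:+8.0f} MiB vs mean")
```

The spread between GPU 6 and GPU 0 is larger than two average layers by the ~3400 MB estimate, which is why a more even layout looks reachable.
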
Experimenting with -ts, I get a better distribution using -ts 6,5,5,5,5,5,5,5,5,5,5,5:

| 0 N/A N/A 4327 C ./bin/llama-server 17714MiB |
| 1 N/A N/A 4327 C ./bin/llama-server 19158MiB |
| 2 N/A N/A 4327 C ./bin/llama-server 19158MiB |
| 3 N/A N/A 4327 C ./bin/llama-server 19158MiB |
| 4 N/A N/A 4327 C ./bin/llama-server 19158MiB |
| 5 N/A N/A 4327 C ./bin/llama-server 19158MiB |
| 6 N/A N/A 4327 C ./bin/llama-server 19158MiB |
| 7 N/A N/A 4327 C ./bin/llama-server 19158MiB |
| 8 N/A N/A 4327 C ./bin/llama-server 19158MiB |
| 9 N/A N/A 4327 C ./bin/llama-server 19158MiB |
| 10 N/A N/A 4327 C ./bin/llama-server 19158MiB |
| 11 N/A N/A 4327 C ./bin/llama-server 16188MiB |
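
For comparison, a small sketch of what the -ts string amounts to and what changed between the two runs. It assumes the -ts values are treated as relative proportions (the llama.cpp help text describes them as a comma-separated list of proportions); `split_fractions` is a hypothetical helper written for this post, not llama.cpp code.

```python
def split_fractions(ts: str) -> list[float]:
    """Normalize a comma-separated -ts string into per-GPU fractions (hypothetical helper)."""
    parts = [float(p) for p in ts.split(",")]
    total = sum(parts)
    return [p / total for p in parts]

# Per-GPU usage in MiB, copied from the two listings in this post.
default_mib = [14006, 19158, 19158, 19158, 19158, 19158,
               22866, 19158, 19158, 19158, 19158, 16188]
tuned_mib = [17714] + [19158] * 10 + [16188]

fractions = split_fractions("6,5,5,5,5,5,5,5,5,5,5,5")
print("requested fractions:", [round(f, 4) for f in fractions])

# Per-GPU change going from the default split to the tuned one.
moved = [t - d for d, t in zip(default_mib, tuned_mib)]
for gpu, delta in enumerate(moved):
    if delta:
        print(f"GPU {gpu}: {delta:+d} MiB")

spread_default = max(default_mib) - min(default_mib)
spread_tuned = max(tuned_mib) - min(tuned_mib)
print(f"spread: {spread_default} MiB -> {spread_tuned} MiB")
```

The total is identical in both runs. The only change is 3708 MiB moving from GPU 6 to GPU 0, which is roughly the size of a single layer by the estimate above.
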

Replies: 0 comments