I recently read the paper Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (https://huggingface.co/papers/2502.05171), and it got me thinking about how this could work with llama.cpp. The idea of dynamically scaling compute at inference time by iterating in latent space instead of generating extra tokens seems pretty interesting, especially for local LLM setups (rough sketch of the idea below).

Would something like this be feasible with GGUF models in the future? I’m curious...
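For concreteness, here is a minimal PyTorch-style sketch of the recurrent-depth pattern as I understand it from the paper: a prelude embeds the prompt once, a core block is then applied to a latent state for a variable number of iterations, and a coda unembeds the final state into logits. Every name and layer choice below is my own illustration, not the paper's actual architecture and not anything llama.cpp exposes today.

```python
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    """Toy prelude -> (recurrent core) -> coda language model."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Prelude: runs once per prompt to produce the embedded input e.
        self.prelude = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Adapter: merges the current latent state s with the embedded input e.
        self.adapter = nn.Linear(2 * d_model, d_model)
        # Core: the weight-shared block that gets iterated at test time.
        self.core = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Coda: maps the final latent state back to vocabulary logits.
        self.coda = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, num_iterations=8):
        e = self.prelude(self.embed(tokens))  # fixed-cost "read" pass
        s = torch.randn_like(e)               # random initial latent state
        for _ in range(num_iterations):       # the test-time compute knob
            s = self.core(self.adapter(torch.cat([s, e], dim=-1)))
        return self.coda(s)                   # next-token logits; no extra tokens emitted
```

The key point is that `num_iterations` is chosen at inference time, per query or even per token, trading latency for depth of latent "reasoning" without emitting any chain-of-thought tokens. For GGUF/llama.cpp this would presumably require the compute graph to re-run a weight-shared block a runtime-configurable number of times.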
Replies: 2 comments

- I think there should be some quantized models too!
- That sounds interesting.