Fast(-er) way to let whisper.cpp & llama.cpp take turns in CUDA GPU usage? #12061
Unanswered
RhinoDevel asked this question in Q&A
Replies: 2 comments · 1 reply
- I think he's looking for something like PyTorch, where you can manually transfer data (including models) across devices. This is what we used at the office for our model memory manager: models are moved to the CPU when possible and there is enough free RAM, and moved back to the GPU when they need to be used.
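For reference, the PyTorch idiom is just `model.to("cpu")` / `model.to("cuda")`; the same idea through the libtorch C++ API looks roughly like the sketch below (the `torch::nn::Linear` module is only a stand-in for a real model):

```cpp
#include <torch/torch.h>

int main() {
    // Stand-in for a real model.
    torch::nn::Linear model(4096, 4096);

    model->to(torch::kCUDA);  // weights now occupy GPU memory
    // ... run inference on the GPU ...

    model->to(torch::kCPU);   // park the weights in host RAM
                              // (PyTorch's caching allocator may still hold the freed blocks)
    // ... another workload, e.g. whisper.cpp, can use the GPU here ...

    model->to(torch::kCUDA);  // bring the weights back for the next turn
    return 0;
}
```

As far as I can tell, ggml/llama.cpp do not expose an equivalent one-call way to move an already-loaded model between backends, which is what is being asked for here.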
- Have you checked the talk-llama example in whisper.cpp?
-
On a Linux computer with a CUDA GPU, I'd like whisper.cpp and llama.cpp to take turns using the GPU for inference.
A kind of clumsy way to do this, which is what I am doing right now (see the sketch below):
=> Back to step (1).
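In code, one turn of that cycle looks roughly like the following (a sketch against the C APIs of whisper.cpp and llama.cpp; exact function names drift between versions, and audio capture, prompting and sampling are omitted):

```cpp
#include "llama.h"
#include "whisper.h"

// One "turn": transcribe with whisper.cpp, then answer with llama.cpp,
// fully (re)initializing each library from its model file on disk every time.
void one_turn(const char * whisper_model_path, const char * llama_model_path) {
    // Speech-to-text: bring whisper.cpp up on the GPU, run it, free it again.
    struct whisper_context * wctx = whisper_init_from_file_with_params(
        whisper_model_path, whisper_context_default_params());
    // ... whisper_full(wctx, ...) on the captured audio ...
    whisper_free(wctx);                               // its CUDA buffers are released

    // Text generation: the slow part, a full reload of the model from disk (step 4).
    llama_model * model = llama_load_model_from_file(
        llama_model_path, llama_model_default_params());
    llama_context * lctx = llama_new_context_with_model(
        model, llama_context_default_params());
    // ... decode the transcription and sample the reply ...
    llama_free(lctx);
    llama_free_model(model);                          // release the GPU again (step 8)
    // => back to step (1)
}
```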
It would be nice if the reloading of the model from disk and the complete reinitialization of llama.cpp (step 4) could be made faster, e.g. by copying the data from the CUDA device to RAM (as a backup) before releasing the GPU (step 8) and restoring that data back to the CUDA device from RAM, instead of doing a full reinitialization from the file on disk (step 4).
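Conceptually, that backup/restore is nothing more than the following at the CUDA-runtime level (the helper names are made up for illustration; the real weights live in buffers managed by ggml's CUDA backend, not in raw pointers I own, and error checks are omitted):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Copy a device buffer into pinned host RAM and free the device copy.
void * park_to_host(void * d_buf, size_t n_bytes) {
    void * h_backup = nullptr;
    cudaMallocHost(&h_backup, n_bytes);                           // pinned RAM -> fast copies
    cudaMemcpy(h_backup, d_buf, n_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);                                              // GPU memory is free now
    return h_backup;
}

// Re-allocate on the device and restore the backup, instead of reloading from disk.
void * restore_to_device(void * h_backup, size_t n_bytes) {
    void * d_buf = nullptr;
    cudaMalloc(&d_buf, n_bytes);
    cudaMemcpy(d_buf, h_backup, n_bytes, cudaMemcpyHostToDevice);
    cudaFreeHost(h_backup);
    return d_buf;
}
```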
After investigating the source code, I am under the impression that there is no straightforward solution for this and that you would probably need to modify the llama.cpp source code, and maybe use the same ggml instance for whisper.cpp and llama.cpp, but it may be that I missed something(?).
Thanks in advance.