Fast(-er) way to let whisper.cpp & llama.cpp take turns in CUDA GPU usage? #12061
Unanswered
RhinoDevel asked this question in Q&A
Replies: 2 comments · 1 reply
- I think he's looking for something like PyTorch, where you can manually transfer data (including models) across devices. This is what we used at the office for our model memory manager: models are moved to the CPU when possible and there is enough free RAM, and moved back to the GPU when they need to be used.
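For reference, the PyTorch idiom is just `model.to("cpu")` / `model.to("cuda")`; the same idea through the libtorch C++ API looks roughly like the sketch below (the `torch::nn::Linear` module is only a stand-in for a real model):

```cpp
#include <torch/torch.h>

int main() {
    // Stand-in for a real model.
    torch::nn::Linear model(4096, 4096);

    model->to(torch::kCUDA);  // weights now occupy GPU memory
    // ... run inference on the GPU ...

    model->to(torch::kCPU);   // park the weights in host RAM
                              // (PyTorch's caching allocator may still hold the freed blocks)
    // ... another workload, e.g. whisper.cpp, can use the GPU here ...

    model->to(torch::kCUDA);  // bring the weights back for the next turn
    return 0;
}
```

As far as I can tell, ggml/llama.cpp do not expose an equivalent one-call way to move an already-loaded model between backends, which is what is being asked for here.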
- Have you checked the talk-llama example in whisper.cpp?
-
On a Linux computer with a CUDA GPU, I'd like whisper.cpp and llama.cpp to take turns using the GPU for inference.
A kind of clumsy way to do this, which is what I am doing right now (see the sketch below):
=> Back to step (1).
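In code, one turn of that cycle looks roughly like the following (a sketch against the C APIs of whisper.cpp and llama.cpp; exact function names drift between versions, and audio capture, prompting and sampling are omitted):

```cpp
#include "llama.h"
#include "whisper.h"

// One "turn": transcribe with whisper.cpp, then answer with llama.cpp,
// fully (re)initializing each library from its model file on disk every time.
void one_turn(const char * whisper_model_path, const char * llama_model_path) {
    // Speech-to-text: bring whisper.cpp up on the GPU, run it, free it again.
    struct whisper_context * wctx = whisper_init_from_file_with_params(
        whisper_model_path, whisper_context_default_params());
    // ... whisper_full(wctx, ...) on the captured audio ...
    whisper_free(wctx);                               // its CUDA buffers are released

    // Text generation: the slow part, a full reload of the model from disk (step 4).
    llama_model * model = llama_load_model_from_file(
        llama_model_path, llama_model_default_params());
    llama_context * lctx = llama_new_context_with_model(
        model, llama_context_default_params());
    // ... decode the transcription and sample the reply ...
    llama_free(lctx);
    llama_free_model(model);                          // release the GPU again (step 8)
    // => back to step (1)
}
```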
It would be nice if the reloading of the model from disk and the complete reinitialization of llama.cpp (step 4) could be made faster, e.g. by copying the data from the CUDA device to RAM (as a backup) before releasing the GPU (step 8) and restoring that data back to the CUDA device from RAM, instead of doing a full reinitialization from the file on disk (step 4).
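Conceptually, that backup/restore is nothing more than the following at the CUDA-runtime level (the helper names are made up for illustration; the real weights live in buffers managed by ggml's CUDA backend, not in raw pointers I own, and error checks are omitted):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Copy a device buffer into pinned host RAM and free the device copy.
void * park_to_host(void * d_buf, size_t n_bytes) {
    void * h_backup = nullptr;
    cudaMallocHost(&h_backup, n_bytes);                           // pinned RAM -> fast copies
    cudaMemcpy(h_backup, d_buf, n_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);                                              // GPU memory is free now
    return h_backup;
}

// Re-allocate on the device and restore the backup, instead of reloading from disk.
void * restore_to_device(void * h_backup, size_t n_bytes) {
    void * d_buf = nullptr;
    cudaMalloc(&d_buf, n_bytes);
    cudaMemcpy(d_buf, h_backup, n_bytes, cudaMemcpyHostToDevice);
    cudaFreeHost(h_backup);
    return d_buf;
}
```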
After investigating the source code, I am under the impression that there is no straightforward solution for this and that you would probably need to modify the llama.cpp source code, and maybe use the same ggml instance for whisper.cpp and llama.cpp, but it may be that I missed something(?).
Thanks in advance.