-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Partial Loading PR5: Dynamic cache ram/vram limits (#7509)
## Summary This PR enables RAM/VRAM cache size limits to be determined dynamically based on availability. **Config Changes** This PR modifies the app configs in the following ways: - A new `device_working_mem_gb` config was added. This is the amount of non-model working memory to keep available on the execution device (i.e. GPU) when using dynamic cache limits. It default to 3GB. - The `ram` and `vram` configs now default to `None`. If these configs are set, they will take precedence over the dynamic limits. **Note: Some users may have previously overriden the `ram` and `vram` values in their `invokeai.yaml`. They will need to remove these configs to enable the new dynamic limit feature.** **Working Memory** In addition to the new `device_working_mem_gb` config described above, memory-intensive operations can estimate the amount of working memory that they will need and request it from the model cache. This is currently applied to the VAE decoding step for all models. In the future, we may apply this to other operations as we work out which ops tend to exceed the default working memory reservation. **Mitigations for #7513 This PR includes some mitigations for the issue described in #7513. Without these mitigations, it would occur with higher frequency when dynamic RAM limits are used and the RAM is close to maxed-out. ## Limitations / Future Work - Only _models_ can be offloaded to RAM to conserve VRAM. I.e. if VAE decoding requires more working VRAM than available, the best we can do is keep the full model on the CPU, but we will still hit an OOM error. In the future, we could detect this ahead of time and switch to running inference on the CPU for those ops. - There is often a non-negligible amount of VRAM 'reserved' by the torch CUDA allocator, but not used by any allocated tensors. We may be able to tune the torch CUDA allocator to work better for our use case. Reference: https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf - There may be some ops that require high working memory that haven't been updated to request extra memory yet. We will update these as we uncover them. - If a model is 'locked' in VRAM, it won't be partially unloaded if a later model load requests extra working memory. This should be uncommon, but I can think of cases where it would matter. ## Related Issues / Discussions - #7492 - #7494 - #7500 - #7505 ## QA Instructions Run a variety of models near the cache limits to ensure that model switching works properly for the following configurations: - [x] CUDA, `enable_partial_loading=true`, all other configs default (i.e. dynamic memory limits) - [x] CUDA, `enable_partial_loading=true`, CPU and CUDA memory reserved in another process so there is limited RAM/VRAM remaining, all other configs default (i.e. dynamic memory limits) - [x] CUDA, `enable_partial_loading=false`, all other configs default (i.e. dynamic memory limits) - [x] CUDA, ram/vram limits set (these should take precedence over the dynamic limits) - [x] MPS, all other default (i.e. dynamic memory limits) - [x] CPU, all other default (i.e. dynamic memory limits) ## Merge Plan - [x] Merge #7505 first and change target branch to main ## Checklist - [x] _The PR has a short but descriptive title, suitable for a changelog_ - [x] _Tests added / updated (if applicable)_ - [x] _Documentation added / updated (if applicable)_ - [ ] _Updated `What's New` copy (if doing a release after this PR)_
- Loading branch information
Showing
20 changed files
with
314 additions
and
298 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.