
Add pytorch_cuda_alloc_conf config to tune VRAM memory allocation #7673

Open · wants to merge 6 commits into base: ryan/tidy-entry

Conversation

@RyanJDick (Collaborator):

Summary

This PR adds a pytorch_cuda_alloc_conf config flag to control the behavior of the PyTorch CUDA memory allocator.

  • pytorch_cuda_alloc_conf defaults to None, preserving the current behavior.
  • The configuration options are explained here: https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf. Tuning this configuration can reduce peak reserved VRAM and improve performance.
  • Setting pytorch_cuda_alloc_conf: "backend:cudaMallocAsync" in invokeai.yaml is expected to work well on many systems, and is a good first step for anyone looking to tune this config. (We may make this the default in the future.) A sketch of how this setting takes effect is shown below the list.
  • The optimal configuration appears to depend on a number of factors such as device version, VRAM, CUDA version, etc. For now, users will have to experiment with this config to see whether it helps or hurts on their systems. In most cases, I expect it to help.
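
For readers curious how the option maps onto PyTorch: the sketch below assumes the config field is applied via the standard PYTORCH_CUDA_ALLOC_CONF environment variable (the mechanism described in the PyTorch docs linked above). The exact wiring inside InvokeAI may differ; this is illustrative only.

```python
import os

# Illustrative only: PyTorch reads the PYTORCH_CUDA_ALLOC_CONF environment variable,
# so setting `pytorch_cuda_alloc_conf: "backend:cudaMallocAsync"` in invokeai.yaml is
# assumed to be equivalent to exporting the variable before CUDA is initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

import torch  # torch import / CUDA initialization must happen after the env var is set

if torch.cuda.is_available():
    # Reports which allocator backend is active ("native" or "cudaMallocAsync").
    print(torch.cuda.get_allocator_backend())
```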

Memory Tests

VAE decode memory usage comparison:

- SDXL, fp16, 1024x1024:
  - `cudaMallocAsync`: allocated=2593 MB, reserved=3200 MB
  - `native`:          allocated=2595 MB, reserved=4418 MB

- SDXL, fp32, 1024x1024:
  - `cudaMallocAsync`: allocated=3982 MB, reserved=5536 MB
  - `native`:          allocated=3982 MB, reserved=7276 MB

- SDXL, fp32, 1536x1536:
  - `cudaMallocAsync`: allocated=8643 MB, reserved=12032 MB
  - `native`:          allocated=8643 MB, reserved=15900 MB
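
The allocated/reserved figures above correspond to PyTorch's peak memory counters. A minimal sketch of how such numbers can be collected (not necessarily the exact harness used for these tests):

```python
import torch

# Clear the peak-memory counters before the workload under test.
torch.cuda.reset_peak_memory_stats()

# ... run the workload to be measured here (e.g. a VAE decode) ...

# Peak memory handed out to tensors vs. peak memory reserved by the allocator.
allocated_mb = torch.cuda.max_memory_allocated() / 2**20
reserved_mb = torch.cuda.max_memory_reserved() / 2**20
print(f"allocated={allocated_mb:.0f} MB, reserved={reserved_mb:.0f} MB")
```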

Related Issues / Discussions

N/A

QA Instructions

  • Performance tests with pytorch_cuda_alloc_conf unset.
  • Performance tests with pytorch_cuda_alloc_conf: "backend:cudaMallocAsync".

Merge Plan

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@github-actions bot added the python (PRs that change python files), services (PRs that change app services), and python-tests (PRs that change python tests) labels on Feb 24, 2025
… config field that allows full customization of the CUDA allocator.
@github-actions bot added the docs (PRs that change docs) label on Feb 24, 2025
@RyanJDick marked this pull request as ready for review on February 24, 2025 at 20:57
@hipsterusername (Member):

As confirmation, I presume this does not play nicely on AMD?

@RyanJDick (Collaborator, Author):

> As confirmation, I presume this does not play nicely on AMD?

I haven't tested on AMD, but I would not expect the recommended config of backend:cudaMallocAsync to work there. That being said, the native allocator configs documented here might work with AMD (I don't have a way to test this, and couldn't find it clearly documented anywhere). We'd need someone to test whether they do and experiment to find a good recommendation.

Labels: DO NOT MERGE, docs (PRs that change docs), python (PRs that change python files), python-tests (PRs that change python tests), services (PRs that change app services)