Add pytorch_cuda_alloc_conf config to tune VRAM memory allocation #7673

Open · wants to merge 6 commits into base: `ryan/tidy-entry`

11 changes: 11 additions & 0 deletions docs/features/low-vram.md
@@ -31,6 +31,7 @@ It is possible to fine-tune the settings for best performance or if you still ge
Low-VRAM mode involves 4 features, each of which can be configured or fine-tuned:

- Partial model loading (`enable_partial_loading`)
- PyTorch CUDA allocator config (`pytorch_cuda_alloc_conf`)
- Dynamic RAM and VRAM cache sizes (`max_cache_ram_gb`, `max_cache_vram_gb`)
- Working memory (`device_working_mem_gb`)
- Keeping a RAM weight copy (`keep_ram_copy_of_weights`)
@@ -51,6 +52,16 @@ As described above, you can enable partial model loading by adding this line to
enable_partial_loading: true
```

### PyTorch CUDA allocator config

The behavior of the PyTorch CUDA memory allocator can be configured using the `pytorch_cuda_alloc_conf` setting. Tuning the allocator configuration can help reduce peak reserved VRAM. The optimal configuration depends on many factors (device type, amount of VRAM, CUDA driver version, etc.), but switching from PyTorch's native allocator to CUDA's built-in asynchronous allocator works well on many systems. To try this, add the following line to your `invokeai.yaml` file:

```yaml
pytorch_cuda_alloc_conf: "backend:cudaMallocAsync"
```

A more complete explanation of the available configuration options can be found in the [PyTorch documentation](https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf).
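
If you prefer to stay on PyTorch's native allocator, its individual options can also be tuned through the same setting. The option names below come from the PyTorch documentation linked above; the values are purely illustrative (a sketch, not a recommendation) and the best values depend on your system:

```yaml
pytorch_cuda_alloc_conf: "backend:native,max_split_size_mb:512,garbage_collection_threshold:0.8"
```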

### Dynamic RAM and VRAM cache sizes

Loading models from disk is slow and can be a major bottleneck for performance. Invoke uses two model caches - RAM and VRAM - to reduce loading from disk to a minimum.
22 changes: 15 additions & 7 deletions invokeai/app/run_app.py
@@ -1,13 +1,7 @@
import uvicorn

from invokeai.app.services.config.config_default import get_config
from invokeai.app.util.startup_utils import (
apply_monkeypatches,
check_cudnn,
enable_dev_reload,
find_open_port,
register_mime_types,
)
from invokeai.app.util.torch_cuda_allocator import configure_torch_cuda_allocator
from invokeai.backend.util.logging import InvokeAILogger
from invokeai.frontend.cli.arg_parser import InvokeAIArgs

@@ -31,6 +25,20 @@ def run_app() -> None:

logger = InvokeAILogger.get_logger(config=app_config)

# Configure the torch CUDA memory allocator.
# NOTE: It is important that this happens before torch is imported.
if app_config.pytorch_cuda_alloc_conf:
configure_torch_cuda_allocator(app_config.pytorch_cuda_alloc_conf, logger)

# Import from startup_utils here to avoid importing torch before configure_torch_cuda_allocator() is called.
from invokeai.app.util.startup_utils import (
apply_monkeypatches,
check_cudnn,
enable_dev_reload,
find_open_port,
register_mime_types,
)

# Find an open port, and modify the config accordingly.
orig_config_port = app_config.port
app_config.port = find_open_port(app_config.port)
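
The ordering above matters: any module imported before the allocator is configured that itself pulls in `torch` would lock in the default allocator settings, which is why the `startup_utils` import is deferred. A minimal sketch of the underlying mechanism (assuming, per the comments in this diff, that the environment variable must be in place before `torch` is first imported):

```python
import os

# Set the allocator configuration before anything imports torch. A module-level
# `import torch` anywhere above this line would defeat the setting.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

import torch  # noqa: E402  (deliberately imported only after the env var is set)

if torch.cuda.is_available():
    # Expected to report "cudaMallocAsync" when the configuration took effect.
    print(torch.cuda.get_allocator_backend())
```
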
4 changes: 4 additions & 0 deletions invokeai/app/services/config/config_default.py
@@ -91,6 +91,7 @@ class InvokeAIAppConfig(BaseSettings):
ram: DEPRECATED: This setting is no longer used. It has been replaced by `max_cache_ram_gb`, but most users will not need to use this config since automatic cache size limits should work well in most cases. This config setting will be removed once the new model cache behavior is stable.
vram: DEPRECATED: This setting is no longer used. It has been replaced by `max_cache_vram_gb`, but most users will not need to use this config since automatic cache size limits should work well in most cases. This config setting will be removed once the new model cache behavior is stable.
lazy_offload: DEPRECATED: This setting is no longer used. Lazy-offloading is enabled by default. This config setting will be removed once the new model cache behavior is stable.
pytorch_cuda_alloc_conf: Configure the Torch CUDA memory allocator. This will impact peak reserved VRAM usage and performance. Setting to "backend:cudaMallocAsync" works well on many systems. The optimal configuration is highly dependent on the system configuration (device type, VRAM, CUDA driver version, etc.), so must be tuned experimentally.
device: Preferred execution device. `auto` will choose the device depending on the hardware platform and the installed torch capabilities.<br>Valid values: `auto`, `cpu`, `cuda`, `cuda:1`, `mps`
precision: Floating point precision. `float16` will consume half the memory of `float32` but produce slightly lower-quality images. The `auto` setting will guess the proper precision based on your video card and operating system.<br>Valid values: `auto`, `float16`, `bfloat16`, `float32`
sequential_guidance: Whether to calculate guidance in serial instead of in parallel, lowering memory requirements.
@@ -169,6 +170,9 @@ class InvokeAIAppConfig(BaseSettings):
vram: Optional[float] = Field(default=None, ge=0, description="DEPRECATED: This setting is no longer used. It has been replaced by `max_cache_vram_gb`, but most users will not need to use this config since automatic cache size limits should work well in most cases. This config setting will be removed once the new model cache behavior is stable.")
lazy_offload: bool = Field(default=True, description="DEPRECATED: This setting is no longer used. Lazy-offloading is enabled by default. This config setting will be removed once the new model cache behavior is stable.")

# PyTorch Memory Allocator
pytorch_cuda_alloc_conf: Optional[str] = Field(default=None, description="Configure the Torch CUDA memory allocator. This will impact peak reserved VRAM usage and performance. Setting to \"backend:cudaMallocAsync\" works well on many systems. The optimal configuration is highly dependent on the system configuration (device type, VRAM, CUDA driver version, etc.), so must be tuned experimentally.")

# DEVICE
device: DEVICE = Field(default="auto", description="Preferred execution device. `auto` will choose the device depending on the hardware platform and the installed torch capabilities.")
precision: PRECISION = Field(default="auto", description="Floating point precision. `float16` will consume half the memory of `float32` but produce slightly lower-quality images. The `auto` setting will guess the proper precision based on your video card and operating system.")
42 changes: 42 additions & 0 deletions invokeai/app/util/torch_cuda_allocator.py
@@ -0,0 +1,42 @@
import logging
import os


def configure_torch_cuda_allocator(pytorch_cuda_alloc_conf: str, logger: logging.Logger | None = None):
"""Configure the PyTorch CUDA memory allocator. See
https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf for supported
configurations.
"""

# Raise if the PYTORCH_CUDA_ALLOC_CONF environment variable is already set.
prev_cuda_alloc_conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", None)
if prev_cuda_alloc_conf is not None:
raise RuntimeError(
f"Attempted to configure the PyTorch CUDA memory allocator, but PYTORCH_CUDA_ALLOC_CONF is already set to "
f"'{prev_cuda_alloc_conf}'."
)

# Configure the PyTorch CUDA memory allocator.
# NOTE: It is important that this happens before torch is imported.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = pytorch_cuda_alloc_conf

import torch

# Relevant docs: https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
if not torch.cuda.is_available():
raise RuntimeError(
"Attempted to configure the PyTorch CUDA memory allocator, but no CUDA devices are available."
)

# Verify that the torch allocator was properly configured.
allocator_backend = torch.cuda.get_allocator_backend()
expected_backend = "cudaMallocAsync" if "cudaMallocAsync" in pytorch_cuda_alloc_conf else "native"
if allocator_backend != expected_backend:
raise RuntimeError(
f"Failed to configure the PyTorch CUDA memory allocator. Expected backend: '{expected_backend}', but got "
f"'{allocator_backend}'. Verify that 1) the pytorch_cuda_alloc_conf is set correctly, and 2) that torch is "
"not imported before calling configure_torch_cuda_allocator()."
)

if logger is not None:
logger.info(f"PyTorch CUDA memory allocator: {torch.cuda.get_allocator_backend()}")
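
For reference, a condensed sketch of the intended call site (this mirrors the change to `run_app.py` shown above; it is not additional API):

```python
import logging

from invokeai.app.services.config.config_default import get_config
from invokeai.app.util.torch_cuda_allocator import configure_torch_cuda_allocator

logger = logging.getLogger(__name__)
app_config = get_config()

# Only configure the allocator when the user has opted in via `pytorch_cuda_alloc_conf`.
# This must run before any torch-dependent module is imported; the helper raises if
# the resulting backend does not match the requested configuration.
if app_config.pytorch_cuda_alloc_conf:
    configure_torch_cuda_allocator(app_config.pytorch_cuda_alloc_conf, logger)
```
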
13 changes: 13 additions & 0 deletions tests/app/util/test_torch_cuda_allocator.py
@@ -0,0 +1,13 @@
import pytest
import torch

from invokeai.app.util.torch_cuda_allocator import configure_torch_cuda_allocator


@pytest.mark.skipif(not torch.cuda.is_available(), reason="Requires CUDA device.")
def test_configure_torch_cuda_allocator_raises_if_torch_is_already_imported():
"""Test that configure_torch_cuda_allocator() raises a RuntimeError if torch is already imported."""
import torch # noqa: F401

with pytest.raises(RuntimeError, match="Failed to configure the PyTorch CUDA memory allocator."):
configure_torch_cuda_allocator("backend:cudaMallocAsync")
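
A possible companion test (not part of this PR) could exercise the "PYTORCH_CUDA_ALLOC_CONF already set" guard, which does not require a CUDA device because the helper raises before importing torch. A sketch using pytest's standard `monkeypatch` fixture:

```python
def test_configure_torch_cuda_allocator_raises_if_env_var_already_set(monkeypatch):
    """Hypothetical test: the helper refuses to overwrite an existing PYTORCH_CUDA_ALLOC_CONF."""
    monkeypatch.setenv("PYTORCH_CUDA_ALLOC_CONF", "backend:native")
    with pytest.raises(RuntimeError, match="PYTORCH_CUDA_ALLOC_CONF is already set"):
        configure_torch_cuda_allocator("backend:cudaMallocAsync")
```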