Commit
Showing 12 changed files with 265 additions and 104 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,7 @@ | ||
name = "AMDGPU" | ||
uuid = "21141c5a-9bdb-4563-92ae-f87d6854732e" | ||
authors = ["Julian P Samaroo <[email protected]>", "Valentin Churavy <[email protected]>", "Anton Smirnov <[email protected]>"] | ||
version = "1.1.3" | ||
version = "1.1.4" | ||
|
||
[deps] | ||
AbstractFFTs = "621f4979-c628-5d54-868e-fcf4e3e8185c" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# Caching Memory Allocator | ||
|
||
Julia uses garbage collection (GC) for automatic memory management. | ||
However, the GC does not know about other memory spaces, | ||
so it sees no difference between a 1 KiB GPU allocation and a 1 GiB one | ||
and does not free GPU memory in time. | ||
|
||
This leads to situations where all of the GPU memory is used, | ||
even though your algorithm only requires a fraction of it. | ||
|
||
The current mechanism for dealing with OOM (Out-Of-Memory) errors during allocations | ||
is to manually trigger GC and retry the allocation, repeating this over several rounds, | ||
each more aggressive than the previous one. | ||
|
||
However, manually triggering GC is very expensive, since it requires scanning | ||
all Julia objects, not just ROCArrays, so the actual memory freeing takes a | ||
fraction of GC time: | ||
![](./assets/gc-vram-breakdown.png) | ||
|
||
In the image above, the red region is a call to GC and the green region is | ||
where GPU memory is actually being freed. | ||
|
||
--- | ||
|
||
To help with memory management, we can use a caching memory allocator. | ||
It is useful in scenarios where we execute the same function multiple times | ||
with the same memory allocation pattern. | ||
One such example is training DL models, where, given the model and its parameters, | ||
we compute the loss, the gradients of the loss w.r.t. the parameters, and perform an in-place parameter update. | ||
In this case, every iteration performs the same operations and memory allocations, | ||
and with a caching allocator we can efficiently re-use the allocations without returning | ||
the memory to the OS. | ||
|
||
## Example | ||
|
||
We have a for-loop, where each iteration requires 2 GiB of VRAM. | ||
We create a caching allocator with the name `:loop` and pass a function to | ||
execute. | ||
The first iteration will allocate, but subsequent ones won't. | ||
|
||
```julia | ||
using AMDGPU | ||
|
||
function main() | ||
    n = 1024^2 * 256 | ||
    for i in 1:1000 | ||
        AMDGPU.with_caching_allocator(:loop, n) do n | ||
            # 2 GiB in total: 1 GiB for `rand` and 1 GiB for the result of `sin.` | ||
            sin.(AMDGPU.rand(Float32, n)) | ||
            return | ||
        end | ||
    end | ||
end | ||
``` | ||
|
||
We mark an explicit region of code in which memory is re-used, | ||
rather than extending caching to the whole program, because we cannot rely on GC | ||
to tell us when the memory is no longer used (it is too slow for that), | ||
so we create such a region manually. | ||
|
||
You can free all memory held by an allocator by invalidating it by name | ||
with [`AMDGPU.invalidate_caching_allocator!`](@ref). | ||
If you want some region of code within [`AMDGPU.with_caching_allocator`](@ref) | ||
to execute without relying on the cache, use [`AMDGPU.with_no_caching`](@ref). | ||
|
||
||Without Caching Allocator|With Caching Allocator| | ||
|:---:|:---:|:---:| | ||
|VRAM Usage|![](./assets/without-caching-allocator.png)|![](./assets/with-caching-allocator.png)| | ||
|Execution time (seconds)|`12.865149`|`0.020943`| | ||
|
||
## API | ||
|
||
```@docs | ||
AMDGPU.with_caching_allocator | ||
AMDGPU.with_no_caching | ||
AMDGPU.invalidate_caching_allocator! | ||
``` |
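The documentation above describes pools of "busy" and "free" arrays that are recycled between calls. The following CPU-only toy model is a minimal sketch of that idea, not the package's implementation: `ToyCache`, `toy_alloc!`, `toy_with_cache` and `N_REAL_ALLOCS` are hypothetical names, and plain `Array`s stand in for GPU arrays so the sketch runs without a GPU.

```julia
# Toy CPU-only model of a caching allocator; every name here is hypothetical.
struct ToyCache
    free::Dict{UInt64, Vector{Array}}  # hash((T, dims)) => arrays ready for re-use
    busy::Dict{UInt64, Vector{Array}}  # hash((T, dims)) => arrays handed out in the current region
end
ToyCache() = ToyCache(Dict{UInt64, Vector{Array}}(), Dict{UInt64, Vector{Array}}())

const N_REAL_ALLOCS = Ref(0)  # counts cache misses, i.e. real allocations

# Return a cached array of matching element type and size, or allocate on a miss.
function toy_alloc!(cache::ToyCache, ::Type{T}, dims::Dims) where T
    uid = hash((T, dims))
    free_pool = get!(() -> Array[], cache.free, uid)
    busy_pool = get!(() -> Array[], cache.busy, uid)
    x = if isempty(free_pool)
        N_REAL_ALLOCS[] += 1
        Array{T}(undef, dims)
    else
        pop!(free_pool)
    end
    push!(busy_pool, x)
    return x
end

# Run `f`, then mark everything it allocated as free for the next call.
function toy_with_cache(f, cache::ToyCache)
    res = f()
    for (uid, busy_pool) in cache.busy
        append!(get!(() -> Array[], cache.free, uid), busy_pool)
        empty!(busy_pool)
    end
    return res
end

cache = ToyCache()
for i in 1:5
    toy_with_cache(cache) do
        a = toy_alloc!(cache, Float32, (1024,))
        b = toy_alloc!(cache, Float32, (1024,))
        return
    end
end
println(N_REAL_ALLOCS[])  # prints 2: only the first iteration allocates
```

Keying the pools by element type and dimensions means a cached array is only re-used for a request of exactly the same shape, which is why the approach suits loops with a repeating allocation pattern.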
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,151 @@ | ||
# NOTE: EXPERIMENTAL API. | ||
|
||
struct CacheAllocator | ||
lock::ReentrantLock | ||
busy::Dict{UInt64, Vector{ROCArray}} # hash((T, dims)) => ROCArray[] | ||
free::Dict{UInt64, Vector{ROCArray}} | ||
end | ||
|
||
CacheAllocator() = CacheAllocator( | ||
ReentrantLock(), | ||
Dict{UInt64, Vector{ROCArray}}(), | ||
Dict{UInt64, Vector{ROCArray}}(), | ||
) | ||
|
||
const CACHE_ALLOCS::LockedObject{Dict{Symbol, CacheAllocator}} = | ||
LockedObject(Dict{Symbol, CacheAllocator}()) | ||
|
||
function cache_allocator!(cache_name::Symbol) | ||
allocs = CACHE_ALLOCS.payload | ||
alloc = get(allocs, cache_name, nothing) | ||
alloc ≡ nothing || return alloc | ||
|
||
return Base.@lock CACHE_ALLOCS.lock begin | ||
allocs[cache_name] = CacheAllocator() | ||
end | ||
end | ||
|
||
function get_free_pool(alloc::CacheAllocator, uid) | ||
free_pool = get(alloc.free, uid, nothing) | ||
if free_pool ≡ nothing | ||
free_pool = Base.@lock alloc.lock alloc.free[uid] = ROCArray[] | ||
end | ||
return free_pool | ||
end | ||
|
||
function get_busy_pool(alloc::CacheAllocator, uid) | ||
busy_pool = get(alloc.busy, uid, nothing) | ||
if busy_pool ≡ nothing | ||
busy_pool = Base.@lock alloc.lock alloc.busy[uid] = ROCArray[] | ||
end | ||
return busy_pool | ||
end | ||
|
||
function alloc!( | ||
alloc::CacheAllocator, ::Type{Mem.HIPBuffer}, ::Type{T}, dims::Dims{N}, | ||
)::Maybe{ROCArray{T, N, Mem.HIPBuffer}} where {T, N} | ||
uid = hash((T, dims)) | ||
free_pool = get_free_pool(alloc, uid) | ||
isempty(free_pool) && return nothing | ||
|
||
# @info "Cache hit" | ||
busy_pool = get_busy_pool(alloc, uid) | ||
x = pop!(free_pool) | ||
# Array was manually freed via `unsafe_free!`. | ||
x.buf.freed && return nothing | ||
|
||
push!(busy_pool, x) | ||
return x | ||
end | ||
|
||
# Mark `x` array as busy, used during cache misses to add new allocations. | ||
function add_busy!(alloc::CacheAllocator, x::ROCArray{T}) where T | ||
uid = hash((T, size(x))) | ||
busy_pool = get_busy_pool(alloc, uid) | ||
Base.@lock alloc.lock push!(busy_pool, x) | ||
return | ||
end | ||
|
||
function free_busy!(alloc::CacheAllocator) | ||
for uid in keys(alloc.busy) | ||
free_pool = get_free_pool(alloc, uid) | ||
busy_pool = get_busy_pool(alloc, uid) | ||
isempty(busy_pool) && continue | ||
|
||
Base.@lock alloc.lock begin | ||
append!(free_pool, busy_pool) | ||
empty!(busy_pool) | ||
end | ||
end | ||
end | ||
|
||
# Public API. | ||
|
||
""" | ||
with_caching_allocator(f, alloc_name::Symbol, args...) | ||
Execute function `f` with arguments `args...` using | ||
caching allocator given by its name `alloc_name`. | ||
All GPU memory allocations will attempt to hit this cache | ||
before doing actual allocation (in case of cache miss). | ||
After executing `f`, all "busy" memory within the allocator is marked as free, | ||
so it can be re-used with the next call. | ||
# Returns | ||
Result of the `f` function. | ||
""" | ||
function with_caching_allocator(f, alloc_name::Symbol, args...) | ||
alloc = cache_allocator!(alloc_name) | ||
# Enable usage of cache allocator during allocations. | ||
cache_alloc_name!(alloc_name) | ||
res = f(args...) | ||
# Mark all allocations during `f` as free to re-use and disable allocator. | ||
free_busy!(alloc) | ||
cache_alloc_name!(:none) | ||
return res | ||
end | ||
|
||
""" | ||
with_no_caching(f) | ||
Execute function `f`, but avoid hitting any caching allocator. | ||
This is useful to call from within [`with_caching_allocator`](@ref), | ||
so that the memory is independent from it. | ||
# Returns | ||
Result of the `f` function. | ||
""" | ||
function with_no_caching(f) | ||
alloc_name = cache_alloc_name() | ||
cache_alloc_name!(:none) | ||
res = f() | ||
cache_alloc_name!(alloc_name) | ||
return res | ||
end | ||
|
||
""" | ||
invalidate_caching_allocator!(alloc_name::Symbol) | ||
Free all memory held by the caching allocator given by its name `alloc_name`. | ||
""" | ||
function invalidate_caching_allocator!(alloc_name::Symbol) | ||
alloc = cache_allocator!(alloc_name) | ||
alloc ≡ nothing && return | ||
|
||
Base.@lock alloc.lock begin | ||
for (_, pool) in alloc.free | ||
map(AMDGPU.unsafe_free!, pool) | ||
end | ||
# TODO if other threads use the same allocator, signal that it is invalidated somehow? | ||
# TODO error if pool is in use, i.e. non empty `busy`? | ||
for (_, pool) in alloc.busy | ||
map(AMDGPU.unsafe_free!, pool) | ||
end | ||
empty!(alloc.busy) | ||
empty!(alloc.free) | ||
end | ||
return | ||
end |
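In the diff above, `with_caching_allocator` and `with_no_caching` restore the cache name only after `f` returns, so those lines are not reached if `f` throws. One possible hardening is a `try`/`finally` block. The sketch below shows the pattern on a toy stand-in: `ACTIVE_CACHE` and `toy_with_cache_name` are hypothetical names, and since the real `cache_alloc_name!` is not shown in this diff, nothing is assumed about how it stores its state.

```julia
# Hypothetical stand-in for the active cache name; not the package's actual state.
const ACTIVE_CACHE = Ref(:none)

# Run `f` with `name` active, restoring the previous name even if `f` throws.
function toy_with_cache_name(f, name::Symbol)
    previous = ACTIVE_CACHE[]
    ACTIVE_CACHE[] = name
    try
        return f()
    finally
        ACTIVE_CACHE[] = previous
    end
end

# Normal completion restores the previous name.
toy_with_cache_name(() -> nothing, :loop)

# A throwing body also restores it.
try
    toy_with_cache_name(:loop) do
        error("simulated OOM")
    end
catch err
    println("caught: ", err.msg)
end
println(ACTIVE_CACHE[])
```

Restoring the previous name, rather than resetting to `:none`, also keeps nested regions consistent; whether that matches the intended semantics of the package is a design question this sketch does not settle.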
fba207f
@JuliaRegistrator register
fba207f
Registration pull request created: JuliaRegistries/General/120794
Tip: Release Notes
Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via: