GPU: Make Vulkan transfer buffers dedicated allocs #12256
Conversation
What sort of difference is there in the performance cost of creating or destroying one?
vkAllocateMemory is an expensive kernel call, but we already recommend designing your renderer to cache resources. Churning resources would be bad either way.
Given that transfer data isn't usually persistent, would caching the transfer buffers end up basically being a reimplementation of the sort of heap that SDL_GPU does for other types of resources? If I imagine putting this directly into my own code, that's probably how it would end up, with a bunch of extra work needed on my side, unless I'm missing something easy. Whereas with my native Vulkan code that uses VMA, I don't have to worry about that and can just make throwaway transfer buffers for uploading/downloading texture or buffer data to and from the GPU.
EDIT: After skimming some VMA code, it looks like they do have a block system for allocations, much like we do. The main difference, I guess, is that VMA only defragments when you ask it to. Since we're targeting a Metal-esque level of complexity, I don't think it's reasonable to force the user to reason about memory compaction.
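(For reference, a rough sketch of what asking VMA to defragment looks like with the VMA 3.x pass API; the GPU copies between moves are elided and error handling is minimal.)

```c
#include "vk_mem_alloc.h"

/* Rough sketch: VMA only compacts memory when the application runs
 * an explicit defragmentation pass loop like this one. */
static void DefragmentAll(VmaAllocator allocator)
{
    VmaDefragmentationInfo defragInfo = {0}; /* default algorithm, all memory */
    VmaDefragmentationContext ctx;
    if (vmaBeginDefragmentation(allocator, &defragInfo, &ctx) != VK_SUCCESS) {
        return;
    }

    for (;;) {
        VmaDefragmentationPassMoveInfo pass;
        VkResult res = vmaBeginDefragmentationPass(allocator, ctx, &pass);
        if (res == VK_SUCCESS) {
            break; /* nothing left to move */
        }
        /* res == VK_INCOMPLETE: record and submit copies from
         * pass.pMoves[i].srcAllocation to pass.pMoves[i].dstTmpAllocation,
         * wait for them to finish, then end the pass. */
        res = vmaEndDefragmentationPass(allocator, ctx, &pass);
        if (res == VK_SUCCESS) {
            break;
        }
    }

    vmaEndDefragmentation(allocator, ctx, NULL);
}
```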
Yeah, I completely agree. I just wonder if this change will create that need for users, because they'd have to make their own memory allocation scheme to deal with the cost of transfer buffers.
Let's consider the use cases.
I'm not really seeing where it would be necessary or less troublesome to be churning transfer buffers constantly. It's good practice in general not to touch the driver anyway.
So one of the things I have in my app is multiple texture atlases for font glyphs (one per font+size combination). Those atlases get populated with rasterized glyphs on demand at runtime whenever a needed glyph is missing from the atlas. The cost of uploading on demand is expected to be fairly minimal, low enough that stutters aren't going to be a real problem. So far I have not needed any higher-level transfer buffer memory pool for that (especially since I use VMA with my Vulkan backend). If the practical cost of making SDL_GPU transfer buffers use dedicated memory is low enough, then I still won't need one if I use SDL_GPU. This sort of thing is why I'm interested in real measurements rather than theoretical numbers.

I also have recent experience with small dedicated allocations in D3D12 on Windows being very slow, enough that the engine doing that had performance issues compared to D3D11 just because of those allocations. It's always good to measure this sort of thing, in my opinion.
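(For concreteness, the throwaway staging path I mean is roughly the sketch below, using VMA. The function name and parameters are placeholders, and the copy command recording/submission is elided.)

```c
#include "vk_mem_alloc.h"
#include <string.h>

/* Minimal sketch: a throwaway, host-visible staging buffer created with VMA,
 * filled, used for one upload, then destroyed. */
static void UploadOnce(VmaAllocator allocator, const void *data, VkDeviceSize size)
{
    VkBufferCreateInfo bufferInfo = {
        .sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
        .size = size,
        .usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT
    };
    VmaAllocationCreateInfo allocInfo = {
        .usage = VMA_MEMORY_USAGE_AUTO,
        .flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
                 VMA_ALLOCATION_CREATE_MAPPED_BIT
    };

    VkBuffer staging;
    VmaAllocation allocation;
    VmaAllocationInfo info;
    if (vmaCreateBuffer(allocator, &bufferInfo, &allocInfo,
                        &staging, &allocation, &info) != VK_SUCCESS) {
        return;
    }

    memcpy(info.pMappedData, data, (size_t)size);

    /* ... record vkCmdCopyBuffer / vkCmdCopyBufferToImage, submit, wait ... */

    vmaDestroyBuffer(allocator, staging, allocation);
}
```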
Yeah, I'll be running the FNA3D traces tomorrow... we cap our cycles but we also have a separate path for abnormally large transfers, so I could see it getting weird with a big enough test. Will try to confirm soon!
Our Vulkan memory allocation strategy is to make large heap allocations (64MB or larger) and then slice out sections of them for resource binding. This means we have to defragment regularly, or else holes will appear in the allocations.
The defrag process creates a new buffer on an available unfragmented allocation and enqueues a copy command to preserve the data. We do the copy for GPU buffers, but we skip it for transfer buffers, which causes an issue where persistently-used upload transfer data is corrupted. An easy solution is to do what we already did for download transfer buffers and make them dedicated allocations, meaning that each one lives alone on its own heap allocation.
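To illustrate what a dedicated allocation means here: each transfer buffer gets bound to its own vkAllocateMemory result instead of being sub-allocated from a shared block. A simplified sketch, not necessarily the exact code in this patch (memory type selection and error handling are omitted):

```c
#include <vulkan/vulkan.h>

/* Simplified sketch: bind a transfer buffer to its own device memory
 * allocation instead of sub-allocating from a shared block. */
static VkDeviceMemory AllocateDedicated(VkDevice device, VkBuffer buffer,
                                        uint32_t memoryTypeIndex)
{
    VkMemoryRequirements reqs;
    vkGetBufferMemoryRequirements(device, buffer, &reqs);

    VkMemoryDedicatedAllocateInfo dedicated = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_DEDICATED_ALLOCATE_INFO,
        .buffer = buffer
    };
    VkMemoryAllocateInfo allocInfo = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .pNext = &dedicated,
        .allocationSize = reqs.size,
        .memoryTypeIndex = memoryTypeIndex
    };

    VkDeviceMemory memory = VK_NULL_HANDLE;
    if (vkAllocateMemory(device, &allocInfo, NULL, &memory) != VK_SUCCESS) {
        return VK_NULL_HANDLE;
    }
    vkBindBufferMemory(device, buffer, memory, 0);
    return memory;
}
```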
The only drawback is the maxMemoryAllocationCount limit. The most common driver limit here is 4096; honestly, if you have that many transfer buffers, you definitely need to redesign your renderer. The alternative would be to preserve transfer contents in the defrag process, but since transfer data does not need to be persistent in the typical case, I think that would just be a waste of processing time.
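For reference, the limit can be queried from the physical device properties; a small sketch (4096 is the spec's required minimum for this limit, not a guaranteed value):

```c
#include <vulkan/vulkan.h>
#include <stdio.h>

/* Small sketch: query how many simultaneous vkAllocateMemory allocations
 * the driver allows. The spec-required minimum is 4096. */
static uint32_t QueryMaxAllocations(VkPhysicalDevice physicalDevice)
{
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(physicalDevice, &props);
    printf("maxMemoryAllocationCount: %u\n",
           props.limits.maxMemoryAllocationCount);
    return props.limits.maxMemoryAllocationCount;
}
```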
Resolves #12242