GPU: Make Vulkan transfer buffers dedicated allocs #12256

Open · wants to merge 2 commits into main
Conversation

@thatcosmonaut (Collaborator) commented Feb 10, 2025

Our Vulkan memory allocation strategy is to make large heap allocations (64MB or larger) and then slice out sections of them for resource binding. This means we have to defragment regularly, or else holes will accumulate in the allocations.

The defrag process creates a new buffer on an available unfragmented allocation and enqueues a copy command to preserve the data. We do this copy for GPU buffers, but we skip it for transfer buffers, which causes persistently-used upload transfer data to be corrupted. An easy fix is to do what we already did for download transfer buffers and make upload transfer buffers dedicated allocations, meaning each one lives alone on its own heap allocation.
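
For illustration, a dedicated allocation in raw Vulkan looks roughly like the sketch below. This is not the actual SDL code; the helper name and `host_visible_type_index` are placeholders, and error handling is omitted.

```c
#include <vulkan/vulkan.h>

/* Sketch only: back one transfer buffer with its own VkDeviceMemory
 * ("dedicated allocation") instead of sub-allocating it from a shared
 * 64MB+ block, so defrag never has to move its contents. */
static VkBuffer create_dedicated_transfer_buffer(VkDevice device,
                                                 VkDeviceSize size,
                                                 uint32_t host_visible_type_index,
                                                 VkDeviceMemory *out_memory)
{
    VkBufferCreateInfo buffer_info = {
        .sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
        .size = size,
        .usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT,
        .sharingMode = VK_SHARING_MODE_EXCLUSIVE
    };
    VkBuffer buffer;
    vkCreateBuffer(device, &buffer_info, NULL, &buffer);

    VkMemoryRequirements reqs;
    vkGetBufferMemoryRequirements(device, buffer, &reqs);

    /* Core in Vulkan 1.1: tells the driver this allocation backs exactly
     * one resource. */
    VkMemoryDedicatedAllocateInfo dedicated_info = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_DEDICATED_ALLOCATE_INFO,
        .buffer = buffer
    };
    VkMemoryAllocateInfo alloc_info = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .pNext = &dedicated_info,
        .allocationSize = reqs.size,
        .memoryTypeIndex = host_visible_type_index /* chosen elsewhere */
    };
    vkAllocateMemory(device, &alloc_info, NULL, out_memory); /* one kernel call per buffer */
    vkBindBufferMemory(device, buffer, *out_memory, 0);
    return buffer;
}
```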

The only drawback is the maxMemoryAllocationCount limit; the most common driver limit here (and the spec-required minimum) is 4096. Honestly, if you have that many live transfer buffers, you definitely need to redesign your renderer.
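
As a point of reference, the limit can be read from the device properties (a sketch; `physical_device` is assumed to exist):

```c
#include <stdio.h>
#include <vulkan/vulkan.h>

/* Sketch: each dedicated transfer buffer consumes one slot against this
 * limit. The Vulkan spec guarantees at least 4096. */
static uint32_t query_max_allocations(VkPhysicalDevice physical_device)
{
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(physical_device, &props);
    printf("maxMemoryAllocationCount: %u\n", props.limits.maxMemoryAllocationCount);
    return props.limits.maxMemoryAllocationCount;
}
```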

The alternative would be to preserve transfer contents in the defrag process. Since in the typical case transfer data does not need to be persistent, I think this would just be a waste of processing time.

Resolves #12242 (GPU transfer buffer contains data from another buffer after defrag)

@slime73 (Contributor) commented Feb 10, 2025

What sort of difference does this change make to the performance cost of creating or destroying a transfer buffer?

@thatcosmonaut (Collaborator, Author)

vkAllocateMemory is an expensive kernel call, but we already recommend designing your renderer to cache resources. Churning resources would be bad either way.

@slime73 (Contributor) commented Feb 10, 2025

Given that transfer data isn't usually persistent, would caching transfer buffers end up being basically a reimplementation of the sort of heap allocator that SDL_GPU already uses for other resource types?

I'm just imagining putting this directly into my own code, and it would probably end up requiring a bunch of extra work on my side, unless I'm missing something easy. Whereas with my native Vulkan code that uses VMA, I don't have to worry about that and can just make throwaway transfer buffers for uploading or downloading texture and buffer data to and from the GPU.

@thatcosmonaut (Collaborator, Author) commented Feb 10, 2025

If you're just using vmaCreateBuffer all that's doing under the hood is calling vkAllocateMemory. You would have to use a more complex allocation strategy to avoid that.

EDIT: After skimming some VMA code it looks like they do have a block system for allocations, much like we do. The main difference I guess is that VMA only defragments when you ask it to. Since we're targeting a Metal-esque level of complexity I don't think it's reasonable to force the user to reason about memory compaction.
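
For comparison, a throwaway staging buffer with VMA looks roughly like this (a sketch, not VMA internals; `allocator` is assumed to exist and error handling is omitted). By default VMA sub-allocates from its own memory blocks, much like our heap strategy; the flag noted in the comment opts a buffer into a dedicated allocation instead, which is what this PR does for transfer buffers.

```c
#include "vk_mem_alloc.h"

/* Sketch: VMA sub-allocates from its own memory blocks by default. */
static VkBuffer create_staging_buffer(VmaAllocator allocator, VkDeviceSize size,
                                      VmaAllocation *out_allocation)
{
    VkBufferCreateInfo buffer_info = {
        .sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
        .size = size,
        .usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT
    };
    VmaAllocationCreateInfo alloc_info = {
        .flags = VMA_ALLOCATION_CREATE_HOST_ACCESS_SEQUENTIAL_WRITE_BIT |
                 VMA_ALLOCATION_CREATE_MAPPED_BIT,
        /* add VMA_ALLOCATION_CREATE_DEDICATED_MEMORY_BIT above to force one
           vkAllocateMemory per buffer instead of pooling */
        .usage = VMA_MEMORY_USAGE_AUTO
    };
    VkBuffer buffer;
    vmaCreateBuffer(allocator, &buffer_info, &alloc_info, &buffer,
                    out_allocation, NULL);
    return buffer;
}
```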

@slime73 (Contributor) commented Feb 10, 2025

> Since we're targeting a Metal-esque level of complexity I don't think it's reasonable to force the user to reason about memory compaction.

Yeah, I completely agree. I just wonder if this change will create that need for users because they'd have to make their own memory allocation scheme to deal with the cost of transfer buffers.

@thatcosmonaut (Collaborator, Author) commented Feb 10, 2025

Let's consider the use cases.

  1. You have to do some uploads every frame, like in the case of a sprite batcher. In this case you should just hang on to a transfer buffer object that manages the uploads; when you map the transfer buffer, set the cycle parameter to true to avoid having to manually pool transfers (see the sketch after this list).

  2. You are loading content at the start of your game. In this case you would create a transfer buffer, perform your uploads, and tear it down. This is heavy anyway so the extra overhead of creating a transfer buffer doesn't really matter.

  3. You are reading back data every frame. In this case you would have one persistent transfer buffer per frame in flight.

  4. You are reading back data sporadically, like saving a screenshot. In this case you would hold on to a persistent transfer buffer that's big enough to contain the screenshot buffer. You could also just set up and tear down a transfer buffer on a thread since presumably you would want the screenshot saving process to be asynchronous.
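
Here's a minimal sketch of case 1 using the SDL_GPU API. Names like `sprite_buffer`, `sprite_data`, and the 64KB size are illustrative, and error handling is omitted:

```c
#include <SDL3/SDL.h>

#define SPRITE_DATA_SIZE (64 * 1024)

/* One transfer buffer created at startup and reused every frame; no
 * transfer buffers are created or destroyed at runtime. */
static SDL_GPUTransferBuffer *transfer;

void init_sprite_upload(SDL_GPUDevice *device)
{
    SDL_GPUTransferBufferCreateInfo info = {
        .usage = SDL_GPU_TRANSFERBUFFERUSAGE_UPLOAD,
        .size = SPRITE_DATA_SIZE
    };
    transfer = SDL_CreateGPUTransferBuffer(device, &info);
}

void upload_sprites(SDL_GPUDevice *device, SDL_GPUBuffer *sprite_buffer,
                    const void *sprite_data)
{
    /* cycle = true hands back fresh memory if last frame's copy is still
     * in flight, so the same object can be mapped every frame. */
    void *mapped = SDL_MapGPUTransferBuffer(device, transfer, true);
    SDL_memcpy(mapped, sprite_data, SPRITE_DATA_SIZE);
    SDL_UnmapGPUTransferBuffer(device, transfer);

    SDL_GPUCommandBuffer *cmd = SDL_AcquireGPUCommandBuffer(device);
    SDL_GPUCopyPass *copy_pass = SDL_BeginGPUCopyPass(cmd);
    SDL_GPUTransferBufferLocation src = { .transfer_buffer = transfer, .offset = 0 };
    SDL_GPUBufferRegion dst = { .buffer = sprite_buffer, .offset = 0, .size = SPRITE_DATA_SIZE };
    SDL_UploadToGPUBuffer(copy_pass, &src, &dst, false);
    SDL_EndGPUCopyPass(copy_pass);
    SDL_SubmitGPUCommandBuffer(cmd);
}
```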

I'm not really seeing where it would be necessary or less troublesome to be churning transfer buffers constantly. It's good practice in general not to touch the driver anyway.

@slime73 (Contributor) commented Feb 11, 2025

So one of the things I have in my app is multiple texture atlases for font glyphs (one per font+size combination). Those atlases get populated with rasterized glyphs on demand at runtime, whenever a needed glyph is missing from the atlas. The cost of uploading on demand is expected to be fairly minimal, low enough that stutters aren't going to be a real problem.

So far I have not needed any higher-level transfer buffer memory pool for that (especially since I use VMA with my Vulkan backend). If the practical cost of making SDL_GPU transfer buffers use dedicated memory is low enough, then I still won't need one if I use SDL_GPU. This sort of thing is why I'm interested in real measurements rather than theoretical numbers.

I also have recent experience with small dedicated allocations in D3D12 on Windows being very slow, enough that the engine doing that had performance issues compared to D3D11 purely because of those allocations. It's always good to measure this sort of thing, in my opinion.

@flibitijibibo (Collaborator)

Yeah, I'll be running the FNA3D traces tomorrow... we cap our cycles but we also have a separate path for abnormally large transfers, so I could see it getting weird with a big enough test. Will try to confirm soon!
