autopar/offloading etc #158
Comments
I'm interested in this! I can see auto-vectorization being a clear winner in many situations (especially with the appropriate ISA), but it had never occurred to me to enable auto-parallelization as well system-wide. For the uninitiated: auto-vectorization uses the available SIMD lanes on your processor to run multiple iterations of your loops in parallel, while auto-parallelization attempts to create multiple threads that correspond to different iterations of your loops. They are not mutually exclusive -- see any modern OpenCL CPU implementation, where both are used. The notion of auto-offload is interesting as well. AFAIK, auto-vectorization is already performed by GCC at the optimization levels supported by this overlay, but auto-parallelization and auto-offload are not. It could be interesting to add a flag to |
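For illustration only, a minimal sketch of the distinction on a loop with no loop-carried dependency; the file name and compile lines below are my own, not from the overlay:

```c
/* saxpy.c -- hypothetical example, not taken from this thread.
 * Auto-vectorization packs several iterations into one SIMD instruction:
 *   gcc -O3 -ftree-vectorize -c saxpy.c
 * Auto-parallelization splits the iteration space across threads (libgomp):
 *   gcc -O3 -ftree-parallelize-loops=4 -c saxpy.c
 * The two can combine: each thread then runs its chunk of iterations with SIMD.
 */
#include <stddef.h>

void saxpy(float a, const float * restrict x, float * restrict y, size_t n)
{
    /* No loop-carried dependency, so GCC is free to vectorize and/or
     * parallelize this loop when its analysis deems it worthwhile. */
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```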
@InBetweenNames Presumably you're using the proprietary driver for your 1080 Ti, given it isn't reclocked by Nouveau AFAIK. In that case, you should be able to test offloading quite easily. Just remember to set eclass-overrides in repos.conf to use the toolchain.eclass in gentoo-gpu. Then you should be able to just set the offload-nvptx USE flag when emerging sys-devel/gcc (7 or 8). I would test with something supporting OpenMP or OpenACC first before rebuilding @world with autopar! ;-) By the way, enabling autopar system-wide doesn't have much of a negative consequence in the case where no advantage is gained: since Gentoo sets --as-needed during linking, libgomp and libpthread get dropped if they aren't used. This has caused fewer issues than another system-wide experiment I've had some success with, namely linking everything with jemalloc! I do need to try AutoFDO... |
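For anyone following along, the configuration side looks roughly like this (a sketch only; it assumes the gentoo-gpu overlay is already added, and the file names and section layout may differ on your system):

```
# /etc/portage/repos.conf/gentoo.conf (sketch)
[gentoo]
# Use the modified toolchain.eclass from the gentoo-gpu overlay
eclass-overrides = gentoo-gpu

# /etc/portage/package.use/gcc (sketch)
sys-devel/gcc offload-nvptx
```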
I'll give it a shot for sure, after I investigate the AutoFDO stuff more. Browsing the GCC documentation, I found this:
I'm a bit concerned about the above, as it doesn't check whether it's profitable to parallelize the loops; it just does it if the loop has no loop-carried dependencies. Does the cost function kick in somewhere else? I'm hesitant to enable any options that don't have a cost function associated with them. Part of the reason I enable Graphite auto-vectorization is that it does do cost analysis on the code generation. |
I thought that at first, and I believe it is part of the reason why very few have used it. As described in the link I provided above: "You can trigger it by 2 flags -floop-parallelize-all -ftree-parallelize-loops=4. Both of them are needed, the first flag will trigger Graphite pass to mark loops that can be parallel and the second flag will trigger the code generation part." So the GCC documentation is misleading: -floop-parallelize-all doesn't parallelize loops at all, it merely marks loops that can be parallelized; the code generation happens in a later pass. Here's a full quote from the link with a few comments:
This means all code that is successfully parallelized needs to be linked against libgomp; I do this by using the -fopenmp compiler flag. As I mentioned, -Wl,--as-needed drops libgomp where no parallelization takes place.
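Putting that together, an autopar compile/link line looks something like this (a sketch based on the flags discussed above; prog.c is a placeholder and the thread count is just an example):

```
# Graphite (-floop-parallelize-all) marks parallelizable loops,
# -ftree-parallelize-loops=4 generates the threaded code, -fopenmp pulls in
# libgomp at link time, and -Wl,--as-needed drops libgomp again for binaries
# where nothing was actually parallelized.
gcc -O3 -floop-parallelize-all -ftree-parallelize-loops=4 -fopenmp \
    -Wl,--as-needed prog.c -o prog
```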
This was a long time ago; I think the GCC documentation might date back to this, though I haven't checked.
I'm sure there are still plenty of cases where the overhead is greater than the gain, but the gain can be significant:
I'm certain that if more use were made of autopar, it would get worked on more. It's getting ever more useful with multi-core and HSA. |
I agree the benefits should be very significant when the optimization is applied in the right places, but this paper states that these kinds of heuristics have yet to be developed. Granted, that was from 2011. I'm about to start reading the GCC sources to see if any work has been done on this aspect since then, as it is critical. From the paper:
From what you quoted, it sounds like it needs profile information to determine if it should be applied? In either case, what you quoted suggests that the cost function is in |
I don't think the cost model is the main issue with autopar currently. I did a little experimenting with nbench: it generally decides not to parallelize the tests, but where it did, the reported timings were multiplied by the number of threads. I added support for POSIX clock_gettime(), which fixed the timer calculation, but I see no gain at all, just slightly slower results as the number of threads rises, which is weird. It's as if it's still only counting the iterations of one thread, which is slowed by the overhead. Maybe the iteration count becomes thread-local and doesn't get incremented correctly? I'm going to look at the assembly and maybe file a new GCC bug. I've been reading a little of the GCC Bugzilla and it seems fixes/improvements are planned for gcc-9. Perhaps an idea to wait until then? :-) |
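For reference, the kind of timer I mean is roughly this (a sketch of the approach, not the actual nbench patch):

```c
/* Monotonic wall-clock timer via POSIX clock_gettime(); unlike a CPU-time
 * based timer, it doesn't scale with the number of threads doing work. */
#include <time.h>

static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Usage sketch:
 *   double t0 = wall_seconds();
 *   run_benchmark_iteration();
 *   double elapsed = wall_seconds() - t0;
 * (On older glibc this needs -lrt at link time.)
 */
```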
Yeah, I'm in favour of leaving this issue open for more discussion/insight for sure, as with the right application this could be a big win. Please keep us updated on what you find! :) Also, feel free to post relevant GCC bugzilla threads here. We have a few power users who are very interested in this stuff. |
I'm missing the obvious: nbench is measuring the accumulated time taken to perform the operation, not the work done. Presumably this means nbench will also give the wrong result for vectorized code. I'm going to see if I can modify it to give meaningful results with modern compilers without compromising the benchmark too much. |
In case I was unclear, what I mean is that autopar is working correctly, but the code explicitly measures the time taken to perform the test and accumulates that measurement for each run, whether parallel or not. So in the end it measures per-thread throughput, which you would expect to decline as additional threads are added, if only because the CPU runs at a higher clock under lighter load. |
Update: in case anybody wants to try this with AMD offloading, the AMDGPU GCC backend isn't there yet. I was overestimating the amount of progress that had been made, and it doesn't yet support being used as an offload accelerator. Supported NVidia devices and Intel MIC devices should work, though. |
In addition to LTO/Graphite, I also build with auto-parallelisation where possible. I've converted my own custom flag-management hacks over to gentooLTO, including reworking my auto-prelink portage hook script into a portage-bashrc-mv drop-in.
In the gentoo-gpu overlay I'm maintaining a modified toolchain.eclass with offload support, which allows building offload-gcc-* plugin ebuilds so that OpenMP/OpenACC can utilise compute hardware such as GPUs and Intel MIC devices. Recently AMD announced a plugin for ROCm. This could potentially dovetail quite nicely with autopar.
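To give an idea of what the offload plugins enable, here is a minimal OpenMP target sketch of my own (the compile line assumes an nvptx offload compiler is installed; without a usable device the target region falls back to the host):

```c
/* vecadd-offload.c -- minimal OpenMP offload sketch.
 * Build, e.g.:
 *   gcc -O2 -fopenmp -foffload=nvptx-none vecadd-offload.c -o vecadd
 */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Map the arrays to the accelerator and distribute the loop there. */
    #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
    for (int i = 0; i < N; i++)
        y[i] += x[i];

    printf("y[0] = %f\n", y[0]);   /* expect 3.0 */
    return 0;
}
```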
Sadly, I don't have the hardware to test offloading. It requires a relatively recent NVidia GPU (Fermi and earlier, at least, are not supported), while ROCm requires either a PCIe port with v3 atomics or a Vega on PCIe v2; I have a POLARIS10 on PCIe v2!
If anybody daring enough could test emerging a few packages with autopar+offloading enabled that would be pretty cool and give me an incentive to keep up the maintenance. Any takers?