ARM64 - sorry to open this as an issue #738

Open

sm-moshi opened this issue May 2, 2021 · 6 comments

sm-moshi commented May 2, 2021

Sorry for opening this here, but I didn't know how else to address it.

Does it make sense to use LTO for an ARM64 Raspberry Pi on Gentoo?

InBetweenNames (Owner) commented

Sure, but you might consider cross-compiling instead of building on the RPi itself. You can use distcc to do that fairly easily.
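A minimal sketch of that setup on Gentoo (the hostname and job counts are placeholders; for true cross-compiling, each helper additionally needs an aarch64 cross toolchain, e.g. built with crossdev):

# On every machine involved: install distcc
emerge --ask sys-devel/distcc

# On the Pi: list the helper hosts and their job limits
/usr/bin/distcc-config --set-hosts "localhost buildhost/8"

# /etc/portage/make.conf on the Pi: route compiles through distcc
FEATURES="distcc"
MAKEOPTS="-j10"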

AnonymousRetard commented May 3, 2021

Unfortunately, I don't think you get any speedup when using distcc with LTO. It seems like the linking step runs locally, and it probably needs to; otherwise, I would assume every computer in the distcc cluster would need exactly the same libraries.

With LTO, most of the compilation time is spent in the linking step.
A pull request to distcc has recently been accepted that stops it from even trying to distribute LTO work:
distcc/distcc#413

Personally, I had been wondering for a long time why distcc wasn't helping my systems anymore, and this is probably why. Most of my packages actually compile a bit slower and distribute very badly when using an older version of distcc without that PR merged. Once a new distcc version with this PR is released on Gentoo, build times should at least not increase, but they won't have any chance of improving either, since distcc won't even try to distribute that work.

Since I have LTO enabled system-wide except when building the kernel, the kernel is the only thing where I see big speed improvements from distcc.

This is unfortunately a pretty sad situation, because I think it's on smaller and embedded systems that LTO matters even more: it can increase performance and always results in smaller (or at least equally sized) binaries. So yes, it does make a lot of sense to enable it on a Raspberry Pi, except that it increases compilation time by a lot and you cannot distribute the work. What you would have to do instead is compile the packages with LTO on a stronger machine, in a build environment targeting the Raspberry Pi, and then distribute the binaries to the Pi (a sketch follows below). Either that, or accept that building all the packages on the RPi itself will take a really long time.
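A rough sketch of that cross-build workflow (the target tuple, package set, and binhost URL are illustrative; profile and USE details are omitted):

# On the strong build host: create an aarch64 cross toolchain
emerge --ask sys-devel/crossdev
crossdev --target aarch64-unknown-linux-gnu

# Build binary packages inside the cross root; crossdev installs a
# ${target}-emerge wrapper, and the LTO flags go in the cross root's make.conf
aarch64-unknown-linux-gnu-emerge --buildpkg @world

# On the Pi: pull the prebuilt binaries instead of compiling
# (in /etc/portage/make.conf, with a placeholder URL)
PORTAGE_BINHOST="http://buildhost.local/packages"
emerge --getbinpkg --usepkg @world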

Also, for smaller systems I suggest you go with -O2 or -Os instead of -O3, because -O3 tends to increase binary sizes by quite a lot, often without actually increasing performance. In my embedded development experience I have sometimes seen -Os give both the best performance and the smallest binaries (the latter being the goal of -Os, while the former is supposed to be the goal of -O3), perhaps because the smaller code fits more easily into the small caches available on such processors. As you can see here, some of the more advanced optimization flags enabled by this overlay sometimes decrease performance.
Dropping all the extra advanced optimizations and running only:
-march=native -O2 -flto
should consistently give the best performance and binary sizes across the board. But -march=native only works if you are compiling locally on the Pi.
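As a make.conf sketch (the -mcpu value for cross builds is my assumption for an RPi 4 / Cortex-A72; adjust for your board):

# /etc/portage/make.conf, when building locally on the Pi
COMMON_FLAGS="-march=native -O2 -flto"
CFLAGS="${COMMON_FLAGS}"
CXXFLAGS="${COMMON_FLAGS}"

# When cross-compiling, -march=native would describe the build host instead,
# so pin the target CPU explicitly, e.g. for an RPi 4 (Cortex-A72):
# COMMON_FLAGS="-mcpu=cortex-a72 -O2 -flto"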

Apart from taking less space (which you might not have much of on a smaller system), smaller binaries also start up faster, which is often the dominant cost for programs with short runtimes. This is especially true if the backing storage is slow (like an SD card) and there isn't much RAM available to act as a filesystem cache.

shelterx commented May 18, 2021

Yes, the linking is done locally, but it also depends on the source code; bigger packages certainly compile faster using distcc.
llvm went down from ~1 h 30 min to ~40 min.
The local machine is an i5 laptop; the distcc "server" is an i7 2600K.

EDIT:
Also, -O3 is probably not needed (as @AnonymousRetard wrote); it can make some things faster but other things slower, unless the program is specifically written with -O3 optimization in mind. So there's really no gain in using -O3 as the default; it tends to even out in the end anyway.

AnonymousRetard commented May 18, 2021

@shelterx Are you sure you actually built llvm with -flto, though?
Because of issue #619, -flto is actually stripped from the llvm package in this overlay:
/etc/portage/package.cflags/lto.conf:sys-devel/llvm *FLAGS-=-flto* # Issue #619 temporarily disabled for now due to build errors
This means that, to have actually built llvm with LTO, you would have to have built it before this change was added, not be using lto.conf from this overlay, or have modified it yourself.

My output from "emerge --info llvm" also confirms that CFLAGS and CXXFLAGS don't have -flto present.
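For anyone who wants to check their own system: the flags a package was actually built with are recorded in the VDB (the llvm atom below is just an example):

# Flags recorded at build time for the installed package
cat /var/db/pkg/sys-devel/llvm-*/CFLAGS

# Or ask portage directly
emerge --info sys-devel/llvm | grep -E '^(CFLAGS|CXXFLAGS)='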

When -flto is enabled, I think most of the code optimization is skipped in the compile step and instead done during the linking step. This is why the linking step takes so much longer, and why the distcc helpers can't really help much with the individual compilation steps either. Sending the source code out and the results back over the network likely just slows the whole process down.
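A minimal illustration with GCC (file names made up): the per-file compile steps mostly stream intermediate representation into the object files, and the expensive whole-program optimization only happens at link time:

gcc -c -flto -O2 foo.c               # fast; foo.o carries GIMPLE bytecode
gcc -c -flto -O2 bar.c               # these per-file jobs are what distcc could farm out
gcc -flto -O2 foo.o bar.o -o app     # slow and strictly local: the real optimization happens here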

@shelterx

@AnonymousRetard Oops, you are correct, and that would explain why I don't see some things getting passed to the distcc server.
However, qtcore is compiled with -flto=auto and it's still faster. I agree it doesn't help THAT much, but overall I think you gain more than you lose.
no distcc:
2021-04-13T15:19:03 >>> dev-qt/qtcore: 7′16″
distcc:
2021-05-11T15:14:36 >>> dev-qt/qtcore: 5′39″

here's another example:
no distcc:
2021-05-03T12:11:13 >>> kde-apps/kate: 3′35″
distcc:
2021-05-14T10:37:13 >>> kde-apps/kate: 2′23″

@AnonymousRetard

@shelterx This is quite interesting. I might run some tests of my own on these packages later. I have a weak 4-core AMD system used as a server and a strong 16-core 5950X. I don't have specific examples, since it's been a long time since I last tried this, but I remember being very disappointed in distcc performance and actually seeing slowdowns from it on quite a few packages. Very few jobs were being distributed to the 5950X, and the majority of the time when building packages was spent compiling locally. These issues completely disappeared when building without -flto, but I decided that I'd rather build things locally with LTO than try to speed up the jobs with distcc.

This discussion should perhaps continue somewhere else, though; the distcc issue tracker is probably a better place: https://github.com/distcc/distcc

As I mentioned in my original reply, a PR has been merged that makes distcc stop trying to distribute -flto jobs entirely, so we'll have to raise an issue there if we want to change that behavior in a future release. Perhaps distribution helps in some cases but not others, but I'm not sure whether that can be detected automatically.
