
CFLAGS Performance Evaluation #431

Open
wolfwood opened this issue Oct 27, 2019 · 11 comments

@wolfwood (Contributor) commented Oct 27, 2019

I got hung up in my work on #288 on two issues: what are representative benchmarks, and how do I evaluate the flag selections that iRace outputs against some baseline to decide whether they're actually 'improving' things?

I decided to take a break from the flag-learning experiments and simply run phoronix-test-suite tests after rebuilding my system with various CFLAGS configurations. The results are unfortunately a bit hard to interpret, but you can see them uploaded here:
my results

I got an ebuild for PTS from @bobwya's overlay and made a package set of my own for various test dependencies (still haven't gotten the tensorflow stuff to build right, though):
phoronix-set.txt — copy it to /etc/portage/sets/phoronix and run emerge -av @phoronix to pull in all the dependencies the phoronix tests need on Gentoo.
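
A minimal sketch of that wiring (the atoms shown are only illustrative; the real list is the attached file):

```sh
# Sketch only -- the actual atom list is the attached phoronix-set.txt.
mkdir -p /etc/portage/sets
cp phoronix-set.txt /etc/portage/sets/phoronix
# The set file is a plain list of package atoms, one per line, e.g.:
#   dev-lang/php
#   dev-db/mongodb
emerge -av @phoronix
```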

I started out trying to test with a virtual suite of all valid Linux tests, which was estimated at a month of runtime. I cut this down to about a 7-day test set, then got sick and tired of watching it test every single combination of resolution and track on SuperTuxKart for days on end and ejected all graphical testing completely (yes, I regret this a tad).

I ended up with a set that runs in a bit under a day for me. You can try it at home as 1910276-HV-LTOIZE49411.
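
If I have the PTS workflow right, you can replay that uploaded result as a comparison baseline with something like:

```sh
# Run the same test selection and compare against the uploaded result
# (assumes the result ID is public on OpenBenchmarking.org).
phoronix-test-suite benchmark 1910276-HV-LTOIZE49411
```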

[Skipping GPU testing but keeping a lot of disk tests is not ideal, as I'd expect compile flags to have less effect on storage workloads than on compute- or GPU-bound ones; but the results aren't flat in all cases, and I'm having trouble telling whether the disk testing is just very noisy or actually meaningful data.]

I then wrote a tool to force-install and run tests with my current *FLAGS (taken from emerge --info) and MPI environment variables, which I called pts (make sure to add FCFLAGS="${CFLAGS}" and FFLAGS="${CFLAGS}" to your make.conf, because some tests are Fortran):
pts.txt
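
The Fortran bit boils down to a couple of make.conf lines like these (the CFLAGS value itself is just a placeholder for whatever configuration is under test):

```sh
# /etc/portage/make.conf (excerpt) -- CFLAGS value is only an example
CFLAGS="-march=native -O2 -ftree-vectorize -pipe"
CXXFLAGS="${CFLAGS}"
FCFLAGS="${CFLAGS}"   # some PTS tests are Fortran
FFLAGS="${CFLAGS}"
```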

I then wrote a wrapper script that takes a password, runs sed on my make.conf (this overwrites your CFLAGS! comment them out first to save a copy), re-emerges my system, and then runs my pts script:
eval.sh.txt
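
In outline it does something like this (a rough sketch only; argument handling and paths differ in the attached script):

```sh
#!/bin/bash
# Rough outline of the wrapper; the real version is eval.sh.txt.
# WARNING: this overwrites CFLAGS in /etc/portage/make.conf.
set -e
NEW_CFLAGS="$1"                      # e.g. "-march=native -O3 -flto=8 -pipe"
read -r -s -p "sudo password: " PASS; echo

# Swap in the CFLAGS under test.
echo "$PASS" | sudo -S sed -i "s|^CFLAGS=.*|CFLAGS=\"${NEW_CFLAGS}\"|" /etc/portage/make.conf

# Rebuild everything with the new flags, then run the pts wrapper.
echo "$PASS" | sudo -S emerge -e @world
./pts
```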

Finally I wrote a wrapper to try all combinations of O2/O3, LTO and GRAPHITE
harness.pl.txt
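
The gist of it is the loop below (flag strings are illustrative; the real harness is the attached Perl script):

```sh
# Sketch of the combination loop; see harness.pl.txt for the real harness.
for OPT in "-O2 -ftree-vectorize" "-O3"; do
  for LTO in "" "-flto=8 -fuse-linker-plugin"; do
    for GRAPHITE in "" "-fgraphite-identity -floop-nest-optimize"; do
      ./eval.sh "-march=native ${OPT} ${LTO} ${GRAPHITE} -pipe"
    done
  done
done
```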

I think the next steps in this work are to move to the newest 9.x line of PTS, further remove tests that don't give meaningful results, and then try again with ${DEVIRTLTO} ${IPAPTA} ${SEMINTERPOS} and -falign-functions=24/32/64/xxx.
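
For reference, if I'm remembering the ltoize definitions right (double-check the ltoize config before relying on this), a candidate line for that round would expand to roughly:

```sh
# Expansions below are from memory; verify against the ltoize configs.
#   DEVIRTLTO   = -fdevirtualize-at-ltrans
#   IPAPTA      = -fipa-pta
#   SEMINTERPOS = -fno-semantic-interposition
CFLAGS="-march=native -O3 ${DEVIRTLTO} ${IPAPTA} ${SEMINTERPOS} -falign-functions=32 -pipe"
```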

Mostly just throwing this up there to see if anyone else can make any sense of the PTS graphs / is interested in running this on their own hardware.

@wolfwood (Contributor Author)

N.B. -O2 actually means -O2 -ftree-vectorize, since this is what iRace keeps picking over -O3.

Also, here is a script for identifying which benchmarks take the most time, by parsing your ~/.phoronix-test-suite/phoronix-test-suite-benchmark.log file:
log.pl.txt

These are my worst offenders:

pts/pts-self-test-1.0.4 565
pts/llvm-test-suite-1.0.0 587
pts/system-libxml2-1.0.3 621
pts/ramspeed-1.4.2 645
pts/compress-lzma-1.3.1 649
pts/hackbench-1.0.0 677
pts/parboil-1.2.1 719
pts/redis-1.1.0 832
pts/tinymembench-1.0.2 833
pts/radiance-1.0.0 859
pts/compress-rar-1.1.0 881
pts/apache-siege-1.0.4 882
pts/mbw-1.0.0 1169
pts/iozone-1.9.5 1173
pts/cachebench-1.1.2 1180
pts/blogbench-1.1.0 1222
pts/graphics-magick-1.8.0 1389
pts/byte-1.2.1 1513
pts/hint-1.0.2 2237
pts/numpy-1.0.5 2258
pts/fftw-1.2.0 3497
pts/tiobench-1.3.1 4253
pts/cpp-perf-bench-1.0.0 4584
pts/schbench-1.0.0 7878
pts/blender-1.4.1 8019
pts/dbench-1.0.0 13072
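
Assuming those per-test totals are seconds and get saved as name/number pairs (say, in times.txt), a quick way to see which tests dominate the runtime:

```sh
# times.txt holds "test seconds" pairs as printed above (format assumed).
sort -k2 -nr times.txt | head -n 15                            # slowest tests first
awk '{t += $2} END {printf "total: %.1f hours\n", t/3600}' times.txt
```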

@elsandosgrande (Contributor)

Next week might be an online week for my school, so I might have enough time to look into this, but only benchmarking my current system.

@barolo commented Nov 17, 2019

@wolfwood Sooo..... LTO GRAPHITE is both worst and best, am I reading it right?

@wolfwood (Contributor Author) commented Nov 17, 2019

That is one way to read it, certainly.

The part that framing misses is that in some cases it's in last place by 0.02%, and sometimes it's faster by double-digit percentages.

The harmonic means attempt to account for that but mostly say it's a wash.

I think winnowing down the tests to remove the noisiest ones should make the harmonic means (and the number of wins/losses) less arbitrary.

Also removing tests that show little variation across results would magnify the variation in the remaining ones, but those are also good canaries for detecting regressions.

Not exactly sure how to analyze the raw data and make those kinds of determinations, though.
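
One crude option, assuming the raw numbers can be pulled out of PTS into a flat test,config,result CSV (that export step and format are assumptions on my part), would be to compute each test's relative spread across configurations and drop the ones that barely move:

```sh
# results.csv rows: test,config,result  (assumed format -- export however you like)
# Prints each test's coefficient of variation; near-zero means the flags barely matter.
awk -F, '
  { sum[$1] += $3; sumsq[$1] += $3 * $3; n[$1]++ }
  END {
    for (t in sum) {
      mean = sum[t] / n[t]
      var  = sumsq[t] / n[t] - mean * mean
      if (var < 0) var = 0              # guard against rounding
      cv = (mean != 0) ? sqrt(var) / mean : 0
      printf "%-40s cv=%.3f\n", t, cv
    }
  }' results.csv | sort -t= -k2 -n
```

This doesn't separate run-to-run noise from genuine flag effects, but it would at least surface the flat tests and the wildly swinging ones.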

@jiblime (Contributor) commented Nov 17, 2019

Would it be possible to create a prefix instead of rebuilding the system, and run the tests from there, or would the host system influence the prefix during testing?

If you do get a chance to post a lighter set pared down to what you think is essential, or with the benchmark tests separated out, I'd really appreciate it.

@wolfwood (Contributor Author)

I don't have experience using prefixes, but my phoronix set plus PTS itself should be all you need.

For me the testing takes much longer than the world rebuild, but I guess you're right that it's not all necessary.

@jiblime (Contributor) commented Nov 19, 2019

Right. My main concern is that my processor would likely take 3+ more days than your beefy CPU for the full bench, and I'm sure there are others who'd be much more willing to do the benchmark if it were time-feasible. The issue with that is that if the tests aren't comprehensive enough, the results may end up being pointless.

For me the testing takes much longer than the world rebuild, but I guess you're right that it's not all necessary.

I wouldn't doubt it, since a lot of tests are repeated to measure deviation. The problem is that for some systems a full world rebuild would take >24 hours just by itself. Perhaps it would be ideal to set up a stage 3 tarball with the bare essentials on a separate partition, to avoid a full @world rebuild of all the fun stuff that benchmarking doesn't use.
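
Roughly along these lines (device and tarball names are placeholders):

```sh
# Rough sketch: a dedicated benchmarking root instead of touching the host @world.
mount /dev/sdXN /mnt/bench
tar xpf stage3-amd64-*.tar.xz -C /mnt/bench --xattrs-include='*.*' --numeric-owner
# chroot in, set the CFLAGS under test in its make.conf, and emerge only
# the @phoronix set plus phoronix-test-suite there.
```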

If it wouldn't be too much work, can you explain why such tests were chosen? What about these particular tests did you think made them important to include?

Edit: Sorry, I know I am asking a lot of simple questions trying to figure this out.

@wolfwood (Contributor Author)

@jiblime I started with every phoronix test that supported Linux and would install, then dropped the ones that depended on X (somewhat arbitrarily, as I said), and then dropped a few that took a very long time. This is not a curated set of tests; rather, I'm looking for help in producing one.

@wolfwood (Contributor Author)

Also, I'm trying out setting up a prefix. If nothing else, I'll feel a lot better not having to install php and mongodb in my host system :P

@jiblime (Contributor) commented Nov 19, 2019

Good to know! I'll come up with my own (small) set and hopefully someone here can chime in as to what needs to be fixed for coverage and consistency.

About the prefix: you can do a manual bash bootstrap, but it's needless since the level 2 (IIRC) rebuilds the whole system anyway, so bootstrap-prefix.sh would be preferred (a rough invocation sketch follows the notes below). It also works with zsh, but I'd set your shell to bash to prevent any odd occurrences; for me, it said using an 'obscure shell' caused the prefix to fail completely at the very end. I just ran it again and it was fine.

Notes about setting up the prefix:

  • Installing ltoize won't work because the prefix misunderstands the symlinks; I suggest copying them with cp -L into your prefix's /etc/portage directory.
  • Executing candomultilib=no ./bootstrap-prefix.sh probably makes it no-multilib if you want a faster prefix to test around with (I forget, because I tarballed mine and put it somewhere). But I think testing 32-bit libraries would be equally important because of how the code can be optimized differently.
  • I inject USE flags such as graphite at the later stages so I won't basically have to redo the stage 3 system rebuild; note that adding any flags at stage 1/2 will probably cause it to fail. I sometimes add EMERGE_DEFAULT_OPTS to pass emerge arguments to the bootstrap script; the most common things I add are --ask --verbose, or --resume if I was messing around with it.
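
Very roughly, the invocation I mean (candomultilib and EMERGE_DEFAULT_OPTS as described above; the rest is from the Prefix docs as best I remember, so treat this as a sketch):

```sh
# Sketch of the bootstrap-prefix.sh run described in the notes above.
export SHELL=/bin/bash                  # avoid the 'obscure shell' failure
chmod +x bootstrap-prefix.sh            # fetched from the Gentoo Prefix project
candomultilib=no EMERGE_DEFAULT_OPTS="--verbose" ./bootstrap-prefix.sh
# It will ask where to put the prefix; afterwards, copy (not symlink) the
# ltoize portage config into it:
cp -rL /etc/portage/. "$HOME/gentoo-prefix/etc/portage/"
```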
