
CFLAGS Performance Evaluation #431

Open
wolfwood opened this issue Oct 27, 2019 · 11 comments

@wolfwood (Contributor) commented Oct 27, 2019

I got hung up in my work on #288 on two issues: what are representative benchmarks, and how do I evaluate the flag selections that iRace outputs against some baseline to decide whether they're actually 'improving' things?

I decided to take a break from the flag-learning experiments and simply run phoronix-test-suite tests after rebuilding my system with various CFLAGS configurations. The results are unfortunately a bit hard to interpret, but you can see them uploaded here:
my results

I got an ebuild for PTS from @bobwya's overlay and made a package set of my own for various test dependencies (still haven't gotten the tensorflow stuff to build right, though):
phoronix-set.txt — copy it to /etc/portage/sets/phoronix and run emerge -av @phoronix to pull in all the dependencies the phoronix tests need on Gentoo.
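
A minimal sketch of that wiring (the atoms shown are only illustrative; the real list is the attached file):

```sh
# Sketch only -- the actual atom list is the attached phoronix-set.txt.
mkdir -p /etc/portage/sets
cp phoronix-set.txt /etc/portage/sets/phoronix
# The set file is a plain list of package atoms, one per line, e.g.:
#   dev-lang/php
#   dev-db/mongodb
emerge -av @phoronix
```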

I started out trying to test with a virtual suite of all valid Linux tests, which was estimated at a month of runtime. I cut this down to about a 7-day test set, then got sick and tired of watching it test every single combination of resolution and track on SuperTuxKart for days on end and ejected all graphical testing completely (yes, I regret this a tad).

I ended up with a set that runs in a bit under a day for me. You can try it at home as 1910276-HV-LTOIZE49411.
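
If I have the PTS workflow right, you can replay that uploaded result as a comparison baseline with something like:

```sh
# Run the same test selection and compare against the uploaded result
# (assumes the result ID is public on OpenBenchmarking.org).
phoronix-test-suite benchmark 1910276-HV-LTOIZE49411
```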

[Skipping GPU testing but keeping a lot of disk tests is not ideal, as I'd expect compile flags to have less effect on storage workloads than on compute- or GPU-bound ones; but the results aren't flat in all cases, and I'm having trouble telling whether the disk testing is just very noisy or actually meaningful data.]

I then wrote a tool to force-install and run tests with my current *FLAGS (taken from emerge --info) and MPI environment variables, which I called pts (make sure to add FCFLAGS="${CFLAGS}" and FFLAGS="${CFLAGS}" to your make.conf, because some tests are Fortran):
pts.txt
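
The Fortran bit boils down to a couple of make.conf lines like these (the CFLAGS value itself is just a placeholder for whatever configuration is under test):

```sh
# /etc/portage/make.conf (excerpt) -- CFLAGS value is only an example
CFLAGS="-march=native -O2 -ftree-vectorize -pipe"
CXXFLAGS="${CFLAGS}"
FCFLAGS="${CFLAGS}"   # some PTS tests are Fortran
FFLAGS="${CFLAGS}"
```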

I then wrote a wrapper script that takes a password, runs sed on my make.conf (this overwrites your CFLAGS! comment them out first to save a copy), re-emerges my system, and then runs my pts script:
eval.sh.txt
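
In outline it does something like this (a rough sketch only; argument handling and paths differ in the attached script):

```sh
#!/bin/bash
# Rough outline of the wrapper; the real version is eval.sh.txt.
# WARNING: this overwrites CFLAGS in /etc/portage/make.conf.
set -e
NEW_CFLAGS="$1"                      # e.g. "-march=native -O3 -flto=8 -pipe"
read -r -s -p "sudo password: " PASS; echo

# Swap in the CFLAGS under test.
echo "$PASS" | sudo -S sed -i "s|^CFLAGS=.*|CFLAGS=\"${NEW_CFLAGS}\"|" /etc/portage/make.conf

# Rebuild everything with the new flags, then run the pts wrapper.
echo "$PASS" | sudo -S emerge -e @world
./pts
```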

Finally I wrote a wrapper to try all combinations of O2/O3, LTO and GRAPHITE
harness.pl.txt
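
The gist of it is the loop below (flag strings are illustrative; the real harness is the attached Perl script):

```sh
# Sketch of the combination loop; see harness.pl.txt for the real harness.
for OPT in "-O2 -ftree-vectorize" "-O3"; do
  for LTO in "" "-flto=8 -fuse-linker-plugin"; do
    for GRAPHITE in "" "-fgraphite-identity -floop-nest-optimize"; do
      ./eval.sh "-march=native ${OPT} ${LTO} ${GRAPHITE} -pipe"
    done
  done
done
```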

I think the next steps in this work are to move to the newest 9.x line of PTS, further remove tests that don't give meaningful results, and then try again with ${DEVIRTLTO} ${IPAPTA} ${SEMINTERPOS} and -falign-functions=24/32/64/xxx.
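
For reference, if I'm remembering the ltoize definitions right (double-check the ltoize config before relying on this), a candidate line for that round would expand to roughly:

```sh
# Expansions below are from memory; verify against the ltoize configs.
#   DEVIRTLTO   = -fdevirtualize-at-ltrans
#   IPAPTA      = -fipa-pta
#   SEMINTERPOS = -fno-semantic-interposition
CFLAGS="-march=native -O3 ${DEVIRTLTO} ${IPAPTA} ${SEMINTERPOS} -falign-functions=32 -pipe"
```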

Mostly just throwing this up there to see if anyone else can make any sense of the PTS graphs / is interested in running this on their own hardware.

@wolfwood (Contributor Author)

N.B. -O2 actually means -O2 -ftree-vectorize, since this is what iRace keeps picking over -O3.

Also, here is a script for identifying which benchmarks take the most time, by parsing your ~/.phoronix-test-suite/phoronix-test-suite-benchmark.log file:
log.pl.txt

These are my worst offenders:

pts/pts-self-test-1.0.4 565
pts/llvm-test-suite-1.0.0 587
pts/system-libxml2-1.0.3 621
pts/ramspeed-1.4.2 645
pts/compress-lzma-1.3.1 649
pts/hackbench-1.0.0 677
pts/parboil-1.2.1 719
pts/redis-1.1.0 832
pts/tinymembench-1.0.2 833
pts/radiance-1.0.0 859
pts/compress-rar-1.1.0 881
pts/apache-siege-1.0.4 882
pts/mbw-1.0.0 1169
pts/iozone-1.9.5 1173
pts/cachebench-1.1.2 1180
pts/blogbench-1.1.0 1222
pts/graphics-magick-1.8.0 1389
pts/byte-1.2.1 1513
pts/hint-1.0.2 2237
pts/numpy-1.0.5 2258
pts/fftw-1.2.0 3497
pts/tiobench-1.3.1 4253
pts/cpp-perf-bench-1.0.0 4584
pts/schbench-1.0.0 7878
pts/blender-1.4.1 8019
pts/dbench-1.0.0 13072
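
Assuming those per-test totals are seconds and get saved as name/number pairs (say, in times.txt), a quick way to see which tests dominate the runtime:

```sh
# times.txt holds "test seconds" pairs as printed above (format assumed).
sort -k2 -nr times.txt | head -n 15                            # slowest tests first
awk '{t += $2} END {printf "total: %.1f hours\n", t/3600}' times.txt
```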

@elsandosgrande (Contributor)

Next week might be an online week for my school, so I might have enough time to look into this, but only benchmarking my current system.

@barolo commented Nov 17, 2019

@wolfwood Sooo..... LTO GRAPHITE is both worst and best, am I reading it right?

@wolfwood (Contributor Author) commented Nov 17, 2019

That is one way to read it, certainly.

The part that framing misses is that in some cases it's in last place by 0.02%, and sometimes it's faster by double-digit percentages.

The harmonic means attempt to account for that but mostly say it's a wash.

I think winnowing down the tests to remove the noisiest ones should make the harmonic means (and the number of wins/losses) less arbitrary.

Also removing tests that show little variation across results would magnify the variation in the remaining ones, but those are also good canaries for detecting regressions.

Not exactly sure how to analyze the raw data and make those kinds of determinations, though.
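
One crude option, assuming the raw numbers can be pulled out of PTS into a flat test,config,result CSV (that export step and format are assumptions on my part), would be to compute each test's relative spread across configurations and drop the ones that barely move:

```sh
# results.csv rows: test,config,result  (assumed format -- export however you like)
# Prints each test's coefficient of variation; near-zero means the flags barely matter.
awk -F, '
  { sum[$1] += $3; sumsq[$1] += $3 * $3; n[$1]++ }
  END {
    for (t in sum) {
      mean = sum[t] / n[t]
      var  = sumsq[t] / n[t] - mean * mean
      if (var < 0) var = 0              # guard against rounding
      cv = (mean != 0) ? sqrt(var) / mean : 0
      printf "%-40s cv=%.3f\n", t, cv
    }
  }' results.csv | sort -t= -k2 -n
```

This doesn't separate run-to-run noise from genuine flag effects, but it would at least surface the flat tests and the wildly swinging ones.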

@jiblime (Contributor) commented Nov 17, 2019

Would it be possible to create a prefix instead of rebuilding the system, and run the tests from there, or would the host system influence the prefix during testing?

If you do get a chance to post a lighter set pared down to what you think is essential, or with the benchmark tests separated out, I'd really appreciate it.

@wolfwood (Contributor Author)

I don't have experience using prefixes, but my phoronix set plus PTS itself should be all you need.

For me the testing takes much longer than the world rebuild, but I guess you're right that it's not all necessary.

@jiblime (Contributor) commented Nov 19, 2019

Right. My main concern is that my processor would likely take 3+ more days than your beefy CPU for the full bench, and I'm sure there are others who'd be much more willing to do the benchmark if it were time-feasible. The issue with that is that if the tests aren't comprehensive enough, the results may end up being pointless.

For me the testing takes much longer than the world rebuild, but I guess you're right that it's not all necessary.

I wouldn't doubt it, since a lot of tests are repeated to measure deviation. The problem is that for some systems a full world rebuild would take >24 hours just by itself. Perhaps it would be ideal to set up a stage 3 tarball with the bare essentials on a separate partition, to avoid a full @world rebuild of all the fun stuff that benchmarking doesn't use.
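
Roughly along these lines (device and tarball names are placeholders):

```sh
# Rough sketch: a dedicated benchmarking root instead of touching the host @world.
mount /dev/sdXN /mnt/bench
tar xpf stage3-amd64-*.tar.xz -C /mnt/bench --xattrs-include='*.*' --numeric-owner
# chroot in, set the CFLAGS under test in its make.conf, and emerge only
# the @phoronix set plus phoronix-test-suite there.
```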

If it wouldn't be too much work, can you explain why such tests were chosen? What about these particular tests did you think made them important to include?

Edit: Sorry, I know I am asking a lot of simple questions trying to figure this out.

@wolfwood (Contributor Author)

@jiblime I started with every phoronix test that supported Linux and would install, then dropped the ones that depended on X (somewhat arbitrarily, as I said), and then dropped a few that took a very long time. This is not a curated set of tests; rather, I'm looking for help in producing one.

@wolfwood (Contributor Author)

Also, I'm trying out setting up a prefix. If nothing else, I'll feel a lot better not having to install php and mongodb in my host system :P

@jiblime (Contributor) commented Nov 19, 2019

Good to know! I'll come up with my own (small) set and hopefully someone here can chime in as to what needs to be fixed for coverage and consistency.

About the prefix: you can do a manual bash bootstrap, but it's needless since the level 2 (IIRC) rebuilds the whole system anyway, so bootstrap-prefix.sh would be preferred (a rough invocation sketch follows the notes below). It also works with zsh, but I'd set your shell to bash to prevent any odd occurrences; for me, it said using an 'obscure shell' caused the prefix to fail completely at the very end. I just ran it again and it was fine.

Notes about setting up the prefix:

  • Installing ltoize won't work because the prefix misunderstands the symlinks; I suggest copying them with cp -L into your prefix's /etc/portage directory.
  • Executing candomultilib=no ./bootstrap-prefix.sh probably makes it no-multilib if you want a faster prefix to test around with (I forget, because I tarballed mine and put it somewhere). But I think testing 32-bit libraries would be equally important because of how the code can be optimized differently.
  • I inject USE flags such as graphite at the later stages so I won't basically have to redo the stage 3 system rebuild; note that adding any flags at stage 1/2 will probably cause it to fail. I sometimes add EMERGE_DEFAULT_OPTS to pass emerge arguments to the bootstrap script; the most common things I add are --ask --verbose, or --resume if I was messing around with it.
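
Very roughly, the invocation I mean (candomultilib and EMERGE_DEFAULT_OPTS as described above; the rest is from the Prefix docs as best I remember, so treat this as a sketch):

```sh
# Sketch of the bootstrap-prefix.sh run described in the notes above.
export SHELL=/bin/bash                  # avoid the 'obscure shell' failure
chmod +x bootstrap-prefix.sh            # fetched from the Gentoo Prefix project
candomultilib=no EMERGE_DEFAULT_OPTS="--verbose" ./bootstrap-prefix.sh
# It will ask where to put the prefix; afterwards, copy (not symlink) the
# ltoize portage config into it:
cp -rL /etc/portage/. "$HOME/gentoo-prefix/etc/portage/"
```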
