perf stat recipes
Miscellaneous notes on `perf`, and especially `perf stat`, and related functions that count or sample based on hardware performance counters.
You should probably avoid the built-in hardware events like `L1-dcache-load-misses`, since they have questionable definitions.
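If you do use a symbolic event, you can at least check what it maps to on your hardware. A sketch, assuming a reasonably recent perf (on many versions a doubled `-v` dumps the resolved `perf_event_attr`; the exact output format varies):

```bash
# Dump the perf_event_attr that the symbolic alias resolves to on this
# machine (the type/config fields show the underlying raw event);
# `true` is just a dummy workload so perf has something to measure.
perf stat -vv -e L1-dcache-load-misses true 2>&1 | grep -A 12 'perf_event_attr'
```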
Undocumented `event=0x48 umask=0x4` seems to count "fill buffer allocations", i.e., the number of times a fill buffer was allocated for any type of request (demand load, store, prefetch, L1 dirty line writeback, etc.). Works on SKL and SKX but not CNL. More details at RWT.
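A sketch of counting it with a raw event descriptor (`./your_program` is a placeholder for your workload, and the `name=` field is just an arbitrary label):

```bash
# Count fill buffer allocations via the undocumented event=0x48, umask=0x4
# (SKL/SKX only, per the note above).
perf stat -e cpu/event=0x48,umask=0x4,name=fb_allocations/ ./your_program
```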
The `L2_RQSTS` events are good, but the description in the manual is confusing and incorrect in places. For Skylake and derived archs (and probably Haswell), the functionality offered is fairly simple: every completed request has two attributes: its origin (where the request came from) and its result (did the request hit in the cache). There are 5 possible origins and 3 possible results, and the umask filters for any specified combination of result AND origin.
The 3 possible results are encoded in bits 5-7 of the umask (the 3 most significant bits) as follows:
umask bit | result type | notes |
---|---|---|
0x80 | L2 hit M-state | The prior state of the line was M |
0x40 | L2 hit E/S-state | The prior state of the line was E/S¹ |
0x20 | L2 miss | The line was not in L2 |
The 5 possible origins of L2 requests are encoded in bits 0-4 (the least significant 5 bits) as follows:
umask bit | origin | notes |
---|---|---|
0x01 | Demand read requests | Demand read requests originating from the core (does not include SW prefetch) |
0x02 | RFO requests | RFOs originating from stores in the core, such as "blind stores" without a read first, and software RFO prefetches like `prefetchw` |
0x04 | Instruction reads | Reads originating from misses in the L1I cache |
0x08 | L1 prefetch requests | Requests originating from the L1 HW prefetcher or software load prefetch requests (but not store SW prefetches like `prefetchw`, and possibly also the NPP²) |
0x10 | L2 HW prefetcher | Requests originating from within the L2 HW prefetcher itself |
The masks can be combined in any way, and all requests that match any of the selected origins and any of the selected results will be counted. This means that you always need to include at least one origin bit and one result bit, or else the count is always zero.
For example, to count just demand data loads that miss, use `0x01 | 0x20 == 0x21`. To count all misses of any type, use `0x20 | 0x1F`. To count all RFO requests regardless of the result, use `0xE0 | 0x02`, and so on.
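A sketch of the corresponding `perf stat` invocations using raw event descriptors (the event code `0x24` for `L2_RQSTS` comes from the table further down; `./your_program` is a placeholder and the `name=` labels are arbitrary):

```bash
# Demand data loads that miss L2: origin 0x01 | result 0x20 = umask 0x21
perf stat -e cpu/event=0x24,umask=0x21,name=demand_rd_miss/ ./your_program

# All misses of any origin: result 0x20 | origins 0x1F = umask 0x3F
perf stat -e cpu/event=0x24,umask=0x3f,name=all_l2_miss/ ./your_program

# All RFO requests regardless of result: results 0xE0 | origin 0x02 = umask 0xE2
perf stat -e cpu/event=0x24,umask=0xe2,name=all_rfo/ ./your_program
```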
One might expect that any demand store that misses in L1 will generate either an `L2_RQSTS.RFO_MISS` or an `L2_RQSTS.RFO_HIT` event. However, you can easily write a simple loop that shows a much lower count. The issue seems to be that stores which miss in L1 but hit an outstanding L2 prefetch request in the L2 don't count in either event. They only count in events that include the `0x10` umask bit, which counts requests originating from the L2 prefetcher. For a long stream of stores that tend to hit prefetcher-originated requests, you'll usually see the event with umask `0xF2` (i.e., "`ALL_RFO` including L2 PF") have 2x the number of missed stores: one event is generated by the L2 PF that starts fetching the line, and a second by the store request that hits the request in progress.
Originally reported on Stack Overflow, with more details in this answer.
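A sketch of how one might observe this (`./store_stream` is a placeholder for any long run of sequential stores over a buffer much larger than L2, and the event names are labels I chose):

```bash
# rfo_hit (0xC2) + rfo_miss (0x22) will typically sum to much less than one
# event per missed store line, because stores that hit a pending L2 prefetch
# count in neither; rfo_incl_l2pf (0xF2) instead counts roughly twice per
# missed line (once for the L2 PF request, once for the store that hits it).
perf stat \
  -e cpu/event=0x24,umask=0xc2,name=rfo_hit/ \
  -e cpu/event=0x24,umask=0x22,name=rfo_miss/ \
  -e cpu/event=0x24,umask=0xf2,name=rfo_incl_l2pf/ \
  ./store_stream
```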
Events from the manual
**This section is left here for historical purposes.** Intel have fixed the umasks on the events marked "Wrong" below in their newest event JSON file downloads on 01.org. Those files aren't versioned, so I can't tell when the update happened, but it may have been around mid-March based on this change.
Now that we know how that all works, we can observe that the events in the manual are simply various combinations of the above bits, some with misleading or incomplete documentation. I'll reproduce the "named" events here with comments if anything is confusing. Note that the table uses the notation `E1H` for `0xE1`, unlike the rest of this document.
event | umask | SDM name (with L2_RQSTS prefix) | SDM description | Actual |
---|---|---|---|---|
24H | 21H | DEMAND_DATA_RD_MISS | Demand Data Read requests that missed L2, no rejects. | Correct |
24H | 22H | RFO_MISS | RFO requests that missed L2. | Correct. Note that this includes SW RFO prefetches like `prefetchw`. Note that it excludes demand RFOs that hit a pending L2 prefetch request as described above. |
24H | 24H | CODE_RD_MISS | L2 cache misses when fetching instructions. | Correct |
24H | 27H | ALL_DEMAND_MISS | Demand requests that missed L2. | Correct for some definition of "demand". This event is simply the sum of the three events above, so it includes instruction fetch/prefetch misses. |
24H | 38H | PF_MISS | Requests from the L1/L2/L3 hardware prefetchers or load software prefetches that miss L2 cache. | Seems correct, except for the mention of the L3 HW prefetcher; the implicit exclusion of store SW prefetches (like `prefetchw`) is correct |
24H | 3FH | MISS | All requests that missed L2. | Correct |
24H | 41H | DEMAND_DATA_RD_HIT | Demand Data Read requests that hit L2 cache. | Wrong, because it only includes hits where the line was in S/E state, not M. Should be umask=C1H. This is corrected in the March 2019 01.org downloads. |
24H | 42H | RFO_HIT | RFO requests that hit L2 cache. | Wrong for the same reason as above; should be C2H. This is corrected in the newest 01.org event downloads. |
24H | 44H | CODE_RD_HIT | L2 cache hits when fetching instructions. | More or less correct - in principle this excludes M lines as above, but that event is vanishingly small in most applications. This is corrected to include M lines in the March 2019 01.org downloads. |
24H | D8H | PF_HIT | Prefetches that hit L2. | Oddly, this one is correct as it includes M lines |
24H | E1H | ALL_DEMAND_DATA_RD | All demand data read requests to L2. | Correct |
24H | E2H | ALL_RFO | All L RFO requests to L2. | Correct, although I don't know what the L in "L RFO" is. Note that it excludes demand RFOs that hit a pending L2 prefetch request as described above. |
24H | E4H | ALL_CODE_RD | All L2 code requests. | Correct (note this one does include M-state code lines) |
24H | E7H | ALL_DEMAND_REFERENCES | All demand requests to L2. | Correct, as long as your definition of "demand" includes code fetches, and SW RFO prefetches |
24H | F8H | ALL_PF | All requests from the L1/L2/L3 hardware prefetchers or load software prefetches. | Correct |
24H | EFH | REFERENCES | All requests to L2. | All requests would probably be umask=FFH - this event specifically excludes 10H, which is L2 prefetch requests, so it does not include requests originating within the L2 prefetcher itself: perhaps better named "all incoming" requests or something like that. |
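If you do want the L2-prefetcher-originated requests included, one option is to compare the documented umask with the full mask yourself. A sketch (`./your_program` is a placeholder; the names are labels I chose):

```bash
# REFERENCES as documented (umask 0xEF) excludes the 0x10 L2-PF origin;
# the full mask 0xFF also counts requests generated by the L2 prefetcher
# itself, so the difference between the two counts is the L2-PF traffic.
perf stat \
  -e cpu/event=0x24,umask=0xef,name=l2_references/ \
  -e cpu/event=0x24,umask=0xff,name=l2_all_incl_l2pf/ \
  ./your_program
```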
AWS EC2 counters are described in this tweet.
Falk (Gamozo Labs) style event capture: https://github.com/b-shi/PMC-PMI
¹ I haven't actually jumped through the hoops to carefully test that both the E and S states are covered, just that non-M lines fall into this category - but it seems very likely that would be the case. It can be quite hard to actually ensure you have a line in the E state (as opposed to S), since the cache may decide to bring it in either state depending on opaque heuristics.
² NPP is the next-page prefetcher, about which little information is available (some limited and empirical observations can be found here). Even when all other prefetchers are disabled, I observe one umask=0x80 event for every accessed page, so perhaps the NPP makes some type of request to the L2 which is flagged in this category.