Using PAPI on Intel Processors
On Aug 31, 2016, at 6:42 PM, Stephane Eranian [email protected] wrote:
Hi Phil,
On Tue, Aug 30, 2016 at 3:31 AM, Philip Mucci [email protected] wrote:
Hi folks,
In some of my work, I frequently run into folks having problems with native Intel events. As most of you know, I like to harp on people to basically ignore preset events these days, because they foster misunderstanding: hardware just isn't the same as when PAPI was first written, and the presets hide that behind a general abstraction. However, native events aren't a panacea either… one still often needs to RT(f)M in order to fully understand what one is seeing. To save you that hassle, I'm providing this bit of info... as I find reading Intel docs right up there with having to read that Ayn Rand or Joel Osteen novel that crazy friends give you.
This is a message to a client who has had issues on HSX aka Haswell-EP and JKT aka Sandy Bridge EP processors, most of which is in common with much of the E5 processor line. I suppose this should be turned into a FAQ entry on the PAPI page, but that depends on your comments, which are most welcome.
Note that the native events below are in Intel 'parlance', with the '.' qualifier. libpfm now accepts these fully, thanks to the work of my good friend and perf-dude extraordinaire, Mr. Stephane Eranian of Google.
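For the concrete mechanics, here is a minimal sketch of requesting one of these dotted native names through PAPI. It assumes PAPI 5.x (which provides PAPI_add_named_event) and trims error handling to the basics; the event name is just one example from the lists below:

```c
/* Minimal sketch: count a native Intel event by its libpfm-style
 * dotted name.  Assumes PAPI 5.x; error handling kept minimal. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int es = PAPI_NULL;
    long long count;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&es) != PAPI_OK)
        exit(1);
    /* Intel-style dotted native event name, resolved through libpfm. */
    if (PAPI_add_named_event(es, "mem_load_uops_retired.llc_miss") != PAPI_OK)
        exit(1);

    PAPI_start(es);
    /* ... region of interest ... */
    PAPI_stop(es, &count);
    printf("mem_load_uops_retired.llc_miss: %lld\n", count);
    return 0;
}
```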
Regards,
Phil
Below is my list of events that I suspect are not mapped correctly. These events remain consistently screwy for all of the applications that I've looked at so far, so it's not real application behavior.
SandyBridge (SNBEP aka JKT):
mem_load_uops_llc_miss_retired.remote_dram
mem_load_uops_retired.l1_hit
mem_load_uops_retired.l2_hit
mem_load_uops_retired.llc_hit
mem_load_uops_retired.llc_miss
mem_load_uops_llc_hit_retired.xsnp_hit
mem_load_uops_llc_hit_retired.xsnp_hitm
mem_load_uops_llc_hit_retired.xsnp_miss
All of the above have errata on the E5 processor. The errata are BT241 (undercounts) and BT243 (unreliable/corruption). The former is a hardware bug; the latter is a byproduct of hyperthreading.
See pages 82-83 of http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-family-spec-update.pdf
There is a workaround for BT241, but it increases L3 and main-memory latencies. It also requires permissions that most regular users don't have, i.e. writing bits to /dev/cpu_dma_latency and /sys/pci, as one must tweak some bits in MSRs and in PCI configuration space.
Yes, this is the late GO (Global Observability) bug. The kernel does not do anything about this one, simply because the tradeoff is severe given the performance loss of the workaround. It is left to each user to decide whether they can tolerate the slowdown while measuring.
Workarounds exist in the pmu-tools latego.py script. https://github.com/andikleen/pmu-tools
Make sure you disable them after you count; otherwise you are hosing your machine's performance!
$ latego.py enable mem_load_uops_retired.llc_miss
... do PAPI stuff ...
$ latego.py disable mem_load_uops_retired.llc_miss
For hyperthreading, one can reduce the problem by making sure the per-thread affinity mask contains only one of the two threads on each core. Use numactl or taskset ahead of time, and make sure you understand the mappings: HT siblings are usually the high-order processor numbers (see the pinning sketch below). But the problem is still there… the only foolproof way is to disable HT in the BIOS…
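For illustration, here is a minimal sketch of doing that pinning from inside the measuring program rather than via taskset. The CPU number 0 is an assumption; look up the real sibling map on your machine, e.g. in /sys/devices/system/cpu/cpu0/topology/thread_siblings_list:

```c
/* Sketch: restrict the measuring thread to one hardware thread so the
 * HT sibling of that core never runs the counting code.  The CPU id
 * (0) is illustrative; check your machine's topology first. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                     /* only one HT per core */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) { /* 0 = this thread */
        perror("sched_setaffinity");
        return 1;
    }
    /* ... start PAPI counters and run the region of interest here ... */
    return 0;
}
```

The shell equivalent is simply `taskset -c 0 ./your_app` (where `your_app` is a placeholder for your binary).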
Due to this erratum, the Local Memory Read / Load Retired PerfMon events listed below may undercount.
MEM_LOAD_UOPS_RETIRED.LLC_HIT
MEM_LOAD_UOPS_RETIRED.LLC_MISS*
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE
MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM*
MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM*
MEM_TRANS_RETIRED.LOAD_LATENCY*
The undercount of these events can be partially resolved (but not eliminated) by setting MSR_PEBS_NUM_ALT.PEBS Accuracy Enable (MSR 39CH; bit 0) to 1. When using the events marked with an asterisk, set the Direct-to-core disable field (Bus 1; Device 14; Function 0; Offset 84; bit 1) to 1 for Local memory reads and (Bus 1; Device 8; Function 0; Offset 80; bit 1) to 1 and (Bus 1; Device 9; Function 0; Offset 80; bit 1) to 1 for Remote memory reads. The improved accuracy comes at the cost of a reduction in performance; this workaround generally should not be used during normal operation.
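As a hedged illustration of the MSR half of that workaround (the part that latego.py automates), here is a sketch of setting the PEBS Accuracy Enable bit through the Linux msr driver. It assumes root privileges, a loaded msr driver (`modprobe msr`), and that CPU 0 is a representative target; the PCI config-space writes for the asterisked events are omitted. In practice, just use latego.py:

```c
/* Sketch: set bit 0 of MSR 0x39C (PEBS Accuracy Enable) via the msr
 * driver.  Needs root and 'modprobe msr'; CPU 0 is illustrative. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    uint64_t val;
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }
    /* With the msr driver, the MSR address is used as the file offset. */
    if (pread(fd, &val, sizeof(val), 0x39c) != sizeof(val)) { perror("pread"); return 1; }
    val |= 1ULL;                           /* bit 0: PEBS Accuracy Enable */
    if (pwrite(fd, &val, sizeof(val), 0x39c) != sizeof(val)) { perror("pwrite"); return 1; }
    close(fd);
    return 0;
}
```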
When operating with SMT enabled, a memory at-retirement performance monitoring event (from the list below) may be dropped or may increment an enabled event on the corresponding counter with the same number on the physical core's other thread rather than the thread experiencing the event. Processors with SMT disabled in BIOS are not affected by this erratum.
The list of affected memory at-retirement events is as follows:
MEM_UOP_RETIRED.LOADS
MEM_UOP_RETIRED.STORES
MEM_UOP_RETIRED.LOCK
MEM_UOP_RETIRED.SPLIT
MEM_UOP_RETIRED.STLB_MISS
MEM_LOAD_UOPS_RETIRED.HIT_LFB
MEM_LOAD_UOPS_RETIRED.L1_HIT
MEM_LOAD_UOPS_RETIRED.L2_HIT
MEM_LOAD_UOPS_RETIRED.LLC_HIT
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE
MEM_LOAD_UOPS_RETIRED.LLC_MISS
MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM
MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM
MEM_LOAD_UOPS_RETIRED.L2_MISS
Yes, this is the infamous HT bug causing cross-HT counter corruption. If any of these events is measured on counter X in one hyperthread, then counter X on the sibling hyperthread may get corrupted. For this problem, we developed a kernel workaround, which was accepted into the Linux 4.1 kernel. There will be a presentation on this work at SC16. The workaround avoids the corruption on the sibling counter, but it does not correct the leak from the corrupting counter. As far as I know, this workaround may have been backported by Red Hat and other distros to older kernels.
fp_comp_ops_exe.sse_scalar_single
fp_comp_ops_exe.sse_packed_single
As far as these go, there are no known issues with them from Intel, AFAICT. If Mr. Bandwidth, aka the famous John McCalpin, is lurking here, he might have something to add. Some dated microbenchmarks seem to validate their counting: https://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops
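A minimal sketch of that kind of validation microbenchmark, assuming PAPI 5.x and the dotted event name above. The loop bound and multiplier are arbitrary; build with vectorization disabled (e.g. -O1) so the loop stays scalar, and note that these events count executed uops, so speculative replays can inflate the total:

```c
/* Sketch: validate fp_comp_ops_exe.sse_scalar_single by running a
 * known number of scalar SSE multiplies and comparing the count. */
#include <stdio.h>
#include <papi.h>

#define N 100000000L

int main(void)
{
    int es = PAPI_NULL;
    long long count;
    volatile float x = 1.0f;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    if (PAPI_create_eventset(&es) != PAPI_OK ||
        PAPI_add_named_event(es, "fp_comp_ops_exe.sse_scalar_single") != PAPI_OK)
        return 1;

    PAPI_start(es);
    for (long i = 0; i < N; i++)
        x = x * 1.000001f;               /* one scalar SSE multiply */
    PAPI_stop(es, &count);

    printf("expected ~%ld, measured %lld\n", N, count);
    return 0;
}
```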
I believe the FLOPS events were fixed in Broadwell and clearly documented in their event files here.
Haswell (HSX):
cycle_activity.cycles_l1d_pending
cycle_activity.stalls_l1d_pending
For these events, there is likely a bug in the released kernel that schedules them on the wrong counter. See https://github.com/andikleen/pmu-tools/issues/18
Yes, and it was fixed in Linux 4.0.
mem_load_uops_l3_hit_retired.xsnp_hit
mem_load_uops_l3_hit_retired.xsnp_hitm
mem_load_uops_l3_hit_retired.xsnp_miss
mem_load_uops_l3_miss_retired.remote_dram
mem_load_uops_l3_miss_retired.remote_fwd
mem_load_uops_l3_miss_retired.remote_hitm
Here again, there are two errata, HSM26 (this time no workaround) and HSM30 (hyperthreading). http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-mobile-specification-update.pdf
Reproduced here below:
Certain Local Memory Read / Load Retired PerfMon Events May Undercount
Due to this erratum, the Local Memory Read / Load Retired PerfMon events listed below may undercount.
MEM_LOAD_UOPS_RETIRED.L3_HIT (Event D1H Umask 04H)
MEM_LOAD_UOPS_RETIRED.L3_MISS (Event D1H Umask 20H)
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS (Event D2H Umask 01H)
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT (Event D2H Umask 02H)
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM (Event D2H Umask 04H)
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE (Event D2H Umask 08H)
MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM (Event D3H Umask 01H)
MEM_TRANS_RETIRED.LOAD_LATENCY (Event CDH Umask 01H)
PAGE_WALKER_LOADS.DTLB_L3 (Event BCH Umask 14H)
PAGE_WALKER_LOADS.ITLB_L3 (Event BCH Umask 24H)
PAGE_WALKER_LOADS.DTLB_Memory (Event BCH Umask 18H)
PAGE_WALKER_LOADS.ITLB_Memory (Event BCH Umask 28H)
The affected events may undercount, resulting in inaccurate memory profiles. Intel has observed undercounts by as much as 40%.
Performance Monitor Counters May Produce Incorrect Results
When operating with SMT enabled, a memory at-retirement performance monitoring event (from the list below) may be dropped or may increment an enabled event on the corresponding counter with the same number on the physical core's other thread rather than the thread experiencing the event. Processors with SMT disabled in BIOS are not affected by this erratum.
The list of affected memory at-retirement events is as follows:
MEM_UOP_RETIRED.LOADS
MEM_UOP_RETIRED.STORES
MEM_UOP_RETIRED.LOCK
MEM_UOP_RETIRED.SPLIT
MEM_UOP_RETIRED.STLB_MISS
MEM_LOAD_UOPS_RETIRED.HIT_LFB
MEM_LOAD_UOPS_RETIRED.L1_HIT
MEM_LOAD_UOPS_RETIRED.L2_HIT
MEM_LOAD_UOPS_RETIRED.L3_HIT
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE
MEM_LOAD_UOPS_RETIRED.L3_MISS
MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM
MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM
MEM_LOAD_UOPS_RETIRED.L2_MISS
Due to this erratum, certain performance monitoring events will produce unreliable results during hyper-threaded operation.
Fixed by the kernel workaround in 4.1.
uops_issued_single_mul
This event name is missing a period; it's called uops_issued.single_mul. It is very likely subject to a kernel scheduling bug. I don't know if anyone's ever tested this event, and the Intel documentation does not clarify what packed means here, or whether it applies to x87, SSE, or AVX. So its usefulness is TBD.
This event is not marked with any constraints in the official event table. Are you saying it always counts to 0?
Hope this helps. Not sure I can post on the PAPI mailing list. If not, please forward to this list. Thanks.
Ptools-perfapi mailing list
[email protected]
http://lists.eecs.utk.edu/mailman/listinfo/ptools-perfapi