Intel 64 and IA-32 Architectures Performance Monitoring Events [pdf]

(software.intel.com)

48 points | by ingve 2323 days ago

3 comments

strstr 2323 days ago
Perf counters are super useful. On Linux, the perf tool (and the perf event API) makes these usable: https://perf.wiki.kernel.org/index.php/Main_Page
The counters vary per Intel CPU, though the most useful ones are universal (e.g. cycle counts). AMD has similar counters.
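As a rough illustration of the "universal" counters, here is a minimal sketch that builds a `perf stat` command line for cycles and instructions and parses its CSV output (`-x,`). The helper names (`perf_stat_argv`, `parse_perf_csv`, `ipc`) are hypothetical, and the parser assumes the default field order (value, unit, event name, ...) with no interval or per-CPU columns; event names in real output may carry suffixes such as `:u`.

```python
import shlex


def perf_stat_argv(command, events=("cycles", "instructions")):
    """Build an argv list that runs `command` under `perf stat` in CSV mode."""
    return ["perf", "stat", "-x,", "-e", ",".join(events), "--"] + shlex.split(command)


def parse_perf_csv(text):
    """Map event name -> count from `perf stat -x,` output.

    Assumption: fields are value, unit, event, ... (no -I or per-CPU prefix).
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if line.startswith("#") or len(fields) < 3:
            continue  # skip comments and lines that are not counter rows
        value, event = fields[0], fields[2]
        # perf prints e.g. "<not supported>" when it has no value for an event.
        counts[event] = int(value) if value.isdigit() else None
    return counts


def ipc(counts):
    """Instructions per cycle, or None if either counter is missing."""
    cycles = counts.get("cycles")
    instructions = counts.get("instructions")
    if not cycles or not instructions:
        return None
    return instructions / cycles
```

To use it, one would run the argv with something like `subprocess.run(argv, stderr=subprocess.PIPE, text=True)` and pass the captured stderr to `parse_perf_csv`, since `perf stat` writes its counts to stderr by default.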
- soulbadguy 2323 days ago
  ocperf is a wrapper around perf provided by someone at Intel. On first run, it downloads a list of counters specific to the detected CPU, which is pretty cool: https://github.com/andikleen/pmu-tools
- lallysingh 2322 days ago
  Excellent timing. I'd just updated my performance tool (ppt). https://github.com/lally/libmet
  It includes really easy-to-use performance counter support.
CalChris 2323 days ago
If you're a low-level hacker, a reading knowledge of these events is useful, but really we should be using VTune as our PME tool. Still, it's possible that a particular event may shed light on a particular piece of code, in which case you can use an API like PCM.
https://github.com/opcm/pcm
- kev009 2323 days ago
  VTune is first class, but most people will be using perf on Linux or pmcstat on FreeBSD, so you do need to cross-reference a doc like this occasionally when you want to probe a new counter to look for bottlenecks.
  PCM is also quite nice for monitoring what an entire system is doing in terms of memory bandwidth, NUMA link traffic, and other package-level concerns, but it doesn't give any kernel- or application-level tracing like the other tools.
grandmczeb 2323 days ago
Open question to other commenters: are there hardware performance counters/features that you would like to see implemented but currently aren’t?
- _chris_ 2323 days ago
  As a RISC-V core implementer, I'm super interested in answers to this question. Some of the things I've pondered are ways to figure out 1) which branch I am constantly mispredicting and 2) which load is constantly cache-missing. I'm not sure of the best way to expose that to the programmer, particularly in a way that's cheap for most cores.
  - strstr 2323 days ago
    1) Modern LBR might solve this. LWN has a summary (though I've only skimmed this): https://lwn.net/Articles/680996/
    2) Not sure for this, though I can think of some crappy hacks:
    --A) Timed LBR mentioned in that LWN article (somewhat indirect, but might get the job done)
    --B) Use perf counter overflow interrupts (for cache misses) and set the perf counter's initial value high, so that it overflows after a small number of events (which should let you sample the cache miss locations). This can only tell you if a particular load is making up a large fraction of your overall cache misses (which is probably not super useful).
    Edit: Forgot about PEBS, which is really what you want for 2).
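    The overflow trick in B) can be sketched as a small simulation. This is not real PMU code: the helper names are hypothetical, and the 48-bit counter width is an assumption (a common width for general-purpose counters; the actual width is reported by the CPU). It only shows the arithmetic of preloading a counter so that it wraps after `period` events, and why the resulting samples mostly reflect whichever loads dominate the miss stream.

```python
from collections import Counter


def preload_value(period, width_bits=48):
    """Initial counter value so the counter overflows after `period` events.

    width_bits=48 is an assumed counter width, not a universal constant.
    """
    if not 0 < period < (1 << width_bits):
        raise ValueError("period must be positive and fit in the counter")
    return (1 << width_bits) - period


def sample_on_overflow(miss_ips, period, width_bits=48):
    """Simulate overflow sampling over a stream of cache-miss IPs.

    Each element of miss_ips is the instruction pointer of one miss event.
    The IP of the event that wraps the counter is recorded as a sample,
    then the counter is re-armed with the preload value.
    """
    limit = 1 << width_bits
    counter = preload_value(period, width_bits)
    samples = Counter()
    for ip in miss_ips:
        counter += 1
        if counter == limit:  # counter wrapped: this is the "overflow interrupt"
            samples[ip] += 1
            counter = preload_value(period, width_bits)  # re-arm for next sample
    return samples
```

    Feeding this a trace in which one load IP accounts for most misses yields a sample histogram dominated by that IP, which matches the limitation noted in B). On real hardware the IP seen by the interrupt handler can also lag the instruction that actually missed, which is part of what PEBS is meant to address.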
- lallysingh 2322 days ago
  Unless there's information available now, I'd love to know more about CPU port utilization. Can I determine how to reorder my instructions for better scheduling?
- wyldfire 2323 days ago
  Are the uncore features well represented with perf counters? I've been out of the loop for a while but that was one area that was challenging to investigate back in the day.
- lallysingh 2322 days ago
  Something like the LBR for cache misses. I'd love to know which IP/address values caused an L3 miss.