Intel 64 and IA-32 Architectures Performance Monitoring Events [pdf]

(software.intel.com)

48 points | by ingve 2323 days ago

3 comments

  • strstr 2323 days ago
    Perf counters are super useful. On Linux, the perf tool (and the perf event API) makes them usable: https://perf.wiki.kernel.org/index.php/Main_Page

    The counters vary per Intel CPU, though the most useful ones are universal (e.g. cycle counts). AMD has similar counters.

  • CalChris 2323 days ago
    If you're a low-level hacker, a reading knowledge of these events is useful, but really we should be using VTune as our PME tool. Still, a particular event may shed light on a particular piece of code, and you can read it through an API like PCM.

    https://github.com/opcm/pcm

    • kev009 2323 days ago
      VTune is first class, but most people will be using perf on Linux or pmcstat on FreeBSD, so you do need to cross-reference a doc like this occasionally when you want to probe a new counter to look for bottlenecks.

      PCM is also quite nice for monitoring what an entire system is doing in terms of memory bandwidth, NUMA link traffic, and other package-level concerns, but it doesn't give any kernel- or application-level tracing like the other tools.

  • grandmczeb 2323 days ago
    Open question to other commenters: are there hardware performance counters/features that you would like to see implemented but currently aren’t?
    • _chris_ 2323 days ago
      As a RISC-V core implementer, I'm super interested in answers to this question. Some of the things I've pondered are ways to figure out 1) which branch I am constantly mispredicting and 2) which load is constantly cache-missing. I'm not sure of the best way to expose that to the programmer, particularly in a way that's cheap for most cores.
      • strstr 2323 days ago
        1) Modern LBR might solve this. LWN has a summary (though I've only skimmed this): https://lwn.net/Articles/680996/

        2) Not sure for this, though I can think of some crappy hacks:

        --A) Timed LBR mentioned in that LWN article (somewhat indirect, but might get the job done)

        --B) Use perf counter overflow interrupts (for cache misses) and set the counter's initial value close to its maximum, so that it overflows after a chosen number of misses (which should let you sample the cache miss locations). This can only tell you if a particular load is making up a large fraction of your overall cache misses (which is probably not super useful).

        Edit: Forgot about PEBS, which is really what you want for 2).
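
A rough sketch of the overflow-sampling idea in option B, as a pure simulation rather than real PMU code. The 48-bit counter width, the function names, and the miss stream are all assumptions made for illustration. Real counter widths are CPU-specific, and a real overflow interrupt may report an instruction pointer a few instructions past the one that missed, which is the gap PEBS is meant to close.

```python
from collections import Counter

# Assumption: 48-bit counter. The real width depends on the CPU.
COUNTER_WIDTH_BITS = 48


def preload_value(period, width_bits=COUNTER_WIDTH_BITS):
    """Initial counter value that makes the counter overflow after `period` events."""
    if not 0 < period < (1 << width_bits):
        raise ValueError("period must be positive and fit in the counter")
    return (1 << width_bits) - period


def sample_misses(miss_ips, period, width_bits=COUNTER_WIDTH_BITS):
    """Simulate overflow sampling over a stream of cache-miss instruction pointers.

    The IP of the miss that wraps the counter is recorded, then the counter
    is re-armed with the same preload value.
    """
    wrap = 1 << width_bits
    counter = preload_value(period, width_bits)
    samples = Counter()
    for ip in miss_ips:
        counter += 1
        if counter == wrap:  # overflow: this is where the interrupt would fire
            samples[ip] += 1
            counter = preload_value(period, width_bits)
    return samples


# Hypothetical miss stream: the load at 0x401000 misses nine times
# for every miss of the load at 0x402000.
stream = ([0x401000] * 9 + [0x402000]) * 1000
hot = sample_misses(stream, period=7)

for ip, count in hot.most_common():
    print(f"{ip:#x}: {count} samples")
```

The sample counts only approximate each load's share of all misses, which matches the caveat in option B: a load shows up clearly only if it accounts for a large fraction of the total.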

    • lallysingh 2322 days ago
      Unless there's information available now, I'd love to know more about CPU port utilization. Can I determine how to reorder my instructions for better scheduling?
    • wyldfire 2323 days ago
      Are the uncore features well represented with perf counters? I've been out of the loop for a while but that was one area that was challenging to investigate back in the day.
    • lallysingh 2322 days ago
      Something like the LBR for cache misses. I'd love to know which IP/address values caused an L3 miss.