Hardware performance counter

Last updated August 08, 2023

In computers, hardware performance counters (HPC),^[1] or hardware counters are a set of special-purpose registers built into modern microprocessors to store the counts of hardware-related activities within computer systems. Advanced users often rely on those counters to conduct low-level performance analysis or tuning.

Implementations

The number of available hardware counters in a processor is limited while each CPU model might have a lot of different events that a developer might like to measure. Each counter can be programmed with the index of an event type to be monitored, like a L1 cache miss or a branch misprediction.

One of the first processors to implement such counter and an associated instruction RDPMC to access it was the Intel Pentium, but they were not documented until Terje Mathisen wrote an article about reverse engineering them in Byte July 1994.^[2]

The following table shows some examples of CPUs and the number of available hardware counters:

Processor	available HW counters
UltraSparc II	2
Pentium III	2
ARM11	2
AMD Athlon	4
IA-64	4
ARM Cortex-A5	2^[3]
ARM Cortex-A8	4
ARM Cortex-A9 MPCore	6
POWER4	8
Pentium 4	18

Versus software techniques

Compared to software profilers, hardware counters provide low-overhead access to a wealth of detailed performance information related to CPU's functional units, caches and main memory etc. Another benefit of using them is that no source code modifications are needed in general. However, the types and meanings of hardware counters vary from one kind of architecture to another due to the variation in hardware organizations.

There can be difficulties correlating the low level performance metrics back to source code. The limited number of registers to store the counters often force users to conduct multiple measurements to collect all desired performance metrics.

Instruction based sampling

Output of an IBS profile from CodeAnalyst.

Modern superscalar processors schedule and execute multiple instructions out-of-order at one time. These "in-flight" instructions can retire at any time, depending on memory access, hits in cache, stalls in the pipeline and many other factors. This can cause performance counter events to be attributed to the wrong instructions, making precise performance analysis difficult or impossible.

AMD introduced methods to mitigate some of these drawbacks. For example, the Opteron processors have implemented ^[4] in 2007 a technique known as Instruction Based Sampling (or IBS). AMD's implementation of IBS provides hardware counters for both fetch sampling (the front of the superscalar pipeline) and op sampling (the back of the pipeline). This results in discrete performance data associating retired instructions with the "parent" AMD64 instruction.

Related Research Articles

A central processing unit (CPU)—also called a central processor or main processor—is the most important processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, logic, controlling, and input/output (I/O) operations. This role contrasts with that of external components, such as main memory and I/O circuitry, and specialized coprocessors such as graphics processing units (GPUs).

A complex instruction set computer is a computer architecture in which single instructions can execute several low-level operations or are capable of multi-step operations or addressing modes within single instructions. The term was retroactively coined in contrast to reduced instruction set computer (RISC) and has therefore become something of an umbrella term for everything that is not RISC, where the typical differentiating characteristic is that most RISC designs use uniform instruction length for almost all instructions, and employ strictly separate load and store instructions.

The Intel 486, officially named i486 and also known as 80486, is a microprocessor. It is a higher-performance follow-up to the Intel 386. The i486 was introduced in 1989. It represents the fourth generation of binary compatible CPUs following the 8086 of 1978, the Intel 80286 of 1982, and 1985's i386.

The Pentium is a fifth generation, 32-bit x86 microprocessor that was introduced by Intel on March 22, 1993, as the very first CPU in the Pentium brand. It was instruction set compatible with the 80486 but was a new and very different microarchitecture design from previous iterations. The P5 Pentium was the first superscalar x86 microarchitecture and the world's first superscalar microprocessor to be in mass production—meaning it generally executes at least 2 instructions per clock mainly because of a design-first dual integer pipeline design previously thought impossible to implement on a CISC microarchitecture. Additional features include a faster floating-point unit, wider data bus, separate code and data caches, and many other techniques and features to enhance performance and support security, encryption, and multiprocessing, for workstations and servers when compared to the next best previous industry standard processor implementation before it, the Intel 80486.

x86 is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor end in "86", including the 80186, 80286, 80386 and 80486 processors. Colloquially, their names were "186", "286", "386" and "486".

<span class="mw-page-title-main">Superscalar processor</span> CPU that implements instruction-level parallelism within a single processor

A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.

Hyper-threading is Intel's proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations performed on x86 microprocessors. It was introduced on Xeon server processors in February 2002 and on Pentium 4 desktop processors in November 2002. Since then, Intel has included this technology in Itanium, Atom, and Core 'i' Series CPUs, among others.

Cyrix Corporation was a microprocessor developer that was founded in 1988 in Richardson, Texas, as a specialist supplier of floating point units for 286 and 386 microprocessors. The company was founded by Tom Brightman and Jerry Rogers.

In computer architecture, instructions per cycle (IPC), commonly called instructions per clock is one aspect of a processor's performance: the average number of instructions executed for each clock cycle. It is the multiplicative inverse of cycles per instruction.

In computing, the clock rate or clock speed typically refers to the frequency at which the clock generator of a processor can generate pulses, which are used to synchronize the operations of its components, and is used as an indicator of the processor's speed. It is measured in the SI unit of frequency hertz (Hz).

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern processor architectures.

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch will go before this is known definitively. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high performance in many modern pipelined microprocessor architectures such as x86.

A processor register is a quickly accessible location available to a computer's processor. Registers usually consist of a small amount of fast storage, although some registers have specific hardware functions, and may be read-only or write-only. In computer architecture, registers are typically addressed by mechanisms other than main memory, but may in some cases be assigned a memory address e.g. DEC PDP-10, ICT 1900.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.

In computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as µarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

<span class="mw-page-title-main">History of general-purpose CPUs</span> History of processors used in general purpose computers

The history of general-purpose CPUs is a continuation of the earlier history of computing hardware.

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

<span class="mw-page-title-main">AMD CodeAnalyst</span>

AMD CodeAnalyst is a GUI-based code profiler for x86 and x86-64-based machines. CodeAnalyst has similar look and feel on both Linux and Microsoft Windows platforms. CodeAnalyst uses specific hardware profiling techniques which are designed to work with AMD processors, as well as a timer-based profiling technique which does not require specific hardware support; this allows a subset of profiling features to work on non-AMD processors, such as Intel processors.

The Alpha 21264 is a Digital Equipment Corporation RISC microprocessor launched on 19 October 1998. The 21264 implemented the Alpha instruction set architecture (ISA).

Meltdown is one of the two original transient execution CPU vulnerabilities. Meltdown affects Intel x86 microprocessors, IBM POWER processors, and some ARM-based microprocessors. It allows a rogue process to read all memory, even when it is not authorized to do so.

References

↑ Malone, Corey; Zahran, Mohamed; Karri, Ramesh (2011). "Are hardware performance counters a cost effective way for integrity checking of programs" (PDF). Proceedings of the sixth ACM workshop on Scalable trusted computing. pp. 71–76. doi:10.1145/2046582.2046596. ISBN 9781450310017. S2CID 16409864 . Retrieved 17 November 2022.
↑ "Pentium Secrets". Gamedev.net. Retrieved 2012-02-14.
↑ "Documentation – Arm Developer". developer.arm.com.
↑ "Instruction-Based Sampling: A New Performance Analysis Technique for AMD Family 10h Processors" (PDF). AMD . Retrieved 2015-10-16.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Malone, Corey; Zahran, Mohamed; Karri, Ramesh (2011). "Are hardware performance counters a cost effective way for integrity checking of programs" (PDF). Proceedings of the sixth ACM workshop on Scalable trusted computing. pp. 71–76. doi:10.1145/2046582.2046596. ISBN 9781450310017. S2CID 16409864 . Retrieved 17 November 2022.

[2] "Pentium Secrets". Gamedev.net. Retrieved 2012-02-14.

[armcortexinfo-3] "Documentation – Arm Developer". developer.arm.com.

[4] "Instruction-Based Sampling: A New Performance Analysis Technique for AMD Family 10h Processors" (PDF). AMD . Retrieved 2015-10-16.

[1]

[2]

[3]

[4]