Roofline model

Figure: An example of a Roofline model in its basic form. The curve consists of two platform-specific performance ceilings: the processor's peak performance and a ceiling derived from the memory bandwidth. Both axes are in logarithmic scale.

The Roofline model is an intuitive visual performance model used to provide performance estimates of a given compute kernel or application running on multi-core, many-core, or accelerator processor architectures, by showing inherent hardware limitations and the potential benefit and priority of optimizations. By combining locality, bandwidth, and different parallelization paradigms into a single performance figure, the model can be an effective alternative to simple percent-of-peak estimates for assessing the quality of attained performance, as it provides insights on both the implementation and the inherent performance limitations.


The most basic Roofline model can be visualized by plotting floating-point performance as a function of arithmetic intensity, given the machine's peak performance and peak memory bandwidth. The resulting curve is effectively a performance bound under which kernel or application performance lies, and includes two platform-specific performance ceilings: a ceiling derived from the memory bandwidth and one derived from the processor's peak performance (see figure on the right).

Work

The work $W$ denotes the number of operations performed by a given kernel or application. [1] This metric may refer to any type of operation, from the number of array points updated, to the number of integer operations, to the number of floating-point operations (FLOPs), [2] and the choice is driven by convenience. In the majority of cases, however, $W$ is expressed as FLOPs. [1] [3] [4] [5] [6]

Note that the work $W$ is a property of the given kernel or application, and thus depends only partially on the platform characteristics.

Memory traffic

The memory traffic $Q$ denotes the number of bytes of memory transfers incurred during the execution of the kernel or application. [1] In contrast to $W$, $Q$ is heavily dependent on the properties of the chosen platform, such as the structure of the cache hierarchy. [1]

Arithmetic intensity

The arithmetic intensity $I$, also referred to as operational intensity, [3] [7] is the ratio of the work $W$ to the memory traffic $Q$: [1]

$$I = \frac{W}{Q}$$

and denotes the number of operations per byte of memory traffic. When the work is expressed as FLOPs, the resulting arithmetic intensity is the ratio of floating-point operations to total data movement (FLOPs/byte).
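As a concrete illustration, the following minimal Python sketch estimates the arithmetic intensity of a hypothetical vector-triad kernel (a[i] = b[i] + s * c[i]), assuming 8-byte double-precision elements and no cache reuse; both the kernel and the traffic model are simplifying assumptions chosen for illustration, not part of the model's definition.

```python
# Hypothetical triad kernel a[i] = b[i] + s * c[i], double precision.
# Assumes each element of a, b and c moves between DRAM and the core
# exactly once (no cache reuse; write-allocate traffic ignored).
n = 1_000_000
work_flops = 2 * n              # W: one multiply and one add per element
traffic_bytes = 3 * 8 * n       # Q: read b and c, write a (8 bytes each)
intensity = work_flops / traffic_bytes   # I = W / Q
print(f"I = {intensity:.3f} FLOPs/byte")  # 2/24 ≈ 0.083
```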

Naive Roofline

Figure: Example of a naive Roofline plot where two kernels are reported. The first (leftmost vertical dashed red line) has an arithmetic intensity $O_{1}$ that lies underneath the peak bandwidth ceiling (diagonal solid black line), and is therefore memory-bound. The second (rightmost vertical dashed red line) has an arithmetic intensity $O_{2}$ that lies underneath the peak performance ceiling (horizontal solid black line), and is thus compute-bound.

The naive Roofline [3] is obtained by applying simple bound and bottleneck analysis. [8] In this formulation of the Roofline model, there are only two parameters, the peak performance and the peak bandwidth of the specific architecture, and one variable, the arithmetic intensity. The peak performance, in general expressed as GFLOPS, can usually be derived from benchmarking, while the peak bandwidth, which here refers specifically to peak DRAM bandwidth, is instead obtained from architectural manuals. [1] [3] The resulting plot, in general with both axes in logarithmic scale, is then derived from the following formula: [1]

$$P = \min\{\pi,\ \beta \times I\}$$

where $P$ is the attainable performance, $\pi$ is the peak performance, $\beta$ is the peak bandwidth and $I$ is the arithmetic intensity. The point at which the performance saturates at the peak performance level $\pi$, that is, where the diagonal and horizontal roofs meet, is defined as the ridge point. [4] The ridge point offers insight into the machine's overall performance, by providing the minimum arithmetic intensity required to achieve peak performance, and by suggesting at a glance the amount of effort required by the programmer to achieve peak performance. [4]

A given kernel or application is then characterized by a point given by its arithmetic intensity $I$ (on the x-axis). The attainable performance $P$ is then computed by drawing a vertical line that hits the Roofline curve. Hence, the kernel or application is said to be memory-bound if $I < \pi/\beta$, i.e. its intensity lies to the left of the ridge point. Conversely, if $I \geq \pi/\beta$, the computation is said to be compute-bound. [1]
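The naive Roofline can be expressed directly in code. The sketch below uses hypothetical machine parameters (100 GFLOPS peak, 25 GB/s peak DRAM bandwidth) to evaluate $P = \min\{\pi, \beta \times I\}$, compute the ridge point, and classify a few intensities as memory- or compute-bound; the numbers are illustrative assumptions, not measurements.

```python
def attainable_performance(intensity, peak_gflops, peak_bw_gbs):
    """Naive Roofline: P = min(pi, beta * I)."""
    return min(peak_gflops, peak_bw_gbs * intensity)

# Hypothetical machine: pi = 100 GFLOPS, beta = 25 GB/s.
pi, beta = 100.0, 25.0
ridge_point = pi / beta   # minimum intensity (FLOPs/byte) needed to reach peak

for I in (0.083, 4.0, 16.0):
    P = attainable_performance(I, pi, beta)
    bound = "memory-bound" if I < ridge_point else "compute-bound"
    print(f"I = {I:6.3f} FLOPs/byte -> P = {P:6.1f} GFLOPS ({bound})")
```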

Adding ceilings to the model

The naive Roofline provides just an upper bound (the theoretical maximum) on performance. Although it can still give useful insights into the attainable performance, it does not provide a complete picture of what is actually limiting it. If, for instance, the considered kernel or application performs far below the Roofline, it might be useful to capture other performance ceilings, beyond simple peak bandwidth and peak performance, to better guide the programmer on which optimization to implement, or even to assess the suitability of the architecture with respect to the analyzed kernel or application. [3] The added ceilings then impose a limit on the attainable performance that is below the actual Roofline, and indicate that the kernel or application cannot break through any of these ceilings without first performing the associated optimization. [3] [4]

The Roofline plot can be expanded along three different aspects: communication, adding the bandwidth ceilings; computation, adding the so-called in-core ceilings; and locality, adding the locality walls.

Bandwidth ceilings

The bandwidth ceilings are bandwidth diagonals placed below the idealized peak bandwidth diagonal. Their existence is due to the lack of some kind of memory-related architectural optimization, such as cache coherence, or of software optimization, such as poor exposure of concurrency, which in turn limits bandwidth usage. [3] [4]

In-core ceilings

The in-core ceilings are Roofline-like curves beneath the actual Roofline that may be present due to the lack of some form of parallelism. These ceilings effectively limit how high performance can reach: performance cannot exceed an in-core ceiling until the underlying lack of parallelism is expressed and exploited. The ceilings can also be derived from architectural optimization manuals in addition to benchmarks. [3] [4]
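A hedged sketch of how such ceilings might be tabulated follows: it derives lower ceilings from a nominal peak by removing assumed forms of in-core parallelism (FMA and SIMD). The factors are illustrative assumptions for a generic core, not figures from any vendor manual.

```python
# Illustrative in-core ceilings, assuming a nominal peak of 100 GFLOPS,
# a 2x factor from fused multiply-add and a 4x factor from 4-wide SIMD.
peak_gflops = 100.0
ceilings = {
    "peak (FMA + SIMD)": peak_gflops,
    "without FMA":       peak_gflops / 2,      # multiply-add not fused
    "without FMA, SIMD": peak_gflops / 2 / 4,  # scalar code only
}
for name, gflops in ceilings.items():
    print(f"{name:20s} {gflops:6.1f} GFLOPS")
```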

Locality walls

If the ideal assumption that arithmetic intensity is solely a function of the kernel is removed, and the cache topology (and therefore cache misses) is taken into account, the arithmetic intensity clearly becomes dependent on a combination of kernel and architecture. This may result in a degradation of performance, depending on the balance between the resulting arithmetic intensity and the ridge point. Unlike "proper" ceilings, the resulting lines on the Roofline plot are vertical barriers through which arithmetic intensity cannot pass without optimization. For this reason, they are referred to as locality walls or arithmetic intensity walls. [3] [4]
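The effect of a locality wall can be illustrated numerically: if capacity or conflict misses add memory traffic beyond the compulsory minimum, the measured arithmetic intensity falls below its ideal value, possibly pushing a kernel from the compute-bound into the memory-bound region. The figures below are assumptions chosen purely for illustration.

```python
# Ideal traffic counts only compulsory misses; actual traffic adds assumed
# capacity/conflict miss traffic, lowering the arithmetic intensity.
work_flops = 2.0e6
compulsory_bytes = 24.0e6
extra_miss_bytes = 16.0e6     # assumed additional cache-miss traffic
ideal_I = work_flops / compulsory_bytes
actual_I = work_flops / (compulsory_bytes + extra_miss_bytes)
print(f"ideal I = {ideal_I:.3f} FLOPs/byte, actual I = {actual_I:.3f} FLOPs/byte")
```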

Extension of the model

Since its introduction, [3] [4] the model has been further extended to account for a broader set of metrics and hardware-related bottlenecks. Extensions already available in the literature take into account the impact of the NUMA organization of memory, [6] of out-of-order execution, [9] and of memory latencies, [9] [10] and model the cache hierarchy at a finer grain, [5] [9] in order to better understand what is actually limiting performance and drive the optimization process.

The model has also been extended to better suit specific architectures and their characteristics, such as FPGAs. [11]


References

  1. Ofenbeck, G.; Steinmann, R.; Caparros, V.; Spampinato, D. G.; Püschel, M. (2014-03-01). "Applying the roofline model". 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). pp. 76–85. doi:10.1109/ISPASS.2014.6844463. ISBN 978-1-4799-3606-9. S2CID 206992177.
  2. Patterson, David A.; Hennessy, John L. Computer Organization and Design. p. 543.
  3. Williams, Samuel W. (2008). Auto-tuning Performance on Multicore Computers (Ph.D.). University of California at Berkeley.
  4. Williams, Samuel; Waterman, Andrew; Patterson, David (2009-04-01). "Roofline: An Insightful Visual Performance Model for Multicore Architectures". Commun. ACM. 52 (4): 65–76. doi:10.1145/1498765.1498785. ISSN 0001-0782. S2CID 7766361.
  5. Ilic, A.; Pratas, F.; Sousa, L. (2014-01-01). "Cache-aware Roofline model: Upgrading the loft". IEEE Computer Architecture Letters. 13 (1): 21–24. doi:10.1109/L-CA.2013.6. ISSN 1556-6056. S2CID 9208032.
  6. Lorenzo, Oscar G.; Pena, Tomás F.; Cabaleiro, José C.; Pichel, Juan C.; Rivera, Francisco F. (2014-03-31). "Using an extended Roofline Model to understand data and thread affinities on NUMA systems". Annals of Multicore and GPU Programming. 1 (1): 56–67. ISSN 2341-3158.
  7. "Roofline Performance Model". Lawrence Berkeley National Laboratory. Retrieved 19 June 2016.
  8. Kourtis, Kornilios; Goumas, Georgios; Koziris, Nectarios (2008-01-01). "Optimizing sparse matrix-vector multiplication using index and value compression". Proceedings of the 5th Conference on Computing Frontiers. CF '08. New York, NY, USA: ACM. pp. 87–96. CiteSeerX 10.1.1.140.9391. doi:10.1145/1366230.1366244. ISBN 9781605580777. S2CID 8038147.
  9. Cabezas, V. C.; Püschel, M. (2014-10-01). "Extending the roofline model: Bottleneck analysis with microarchitectural constraints". 2014 IEEE International Symposium on Workload Characterization (IISWC). pp. 222–231. doi:10.1109/IISWC.2014.6983061. ISBN 978-1-4799-6454-3. S2CID 33023605.
  10. Lorenzo, O. G.; Pena, T. F.; Cabaleiro, J. C.; Pichel, J. C.; Rivera, F. F. (2014-03-26). "3DyRM: a dynamic roofline model including memory latency information". The Journal of Supercomputing. 70 (2): 696–708. doi:10.1007/s11227-014-1163-4. ISSN 0920-8542. S2CID 5318695.
  11. da Silva, Bruno; Braeken, An; D'Hollander, Erik H.; Touhafi, Abdellah (2013-01-01). "Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools". International Journal of Reconfigurable Computing. 2013: 1–10. doi:10.1155/2013/428078. ISSN 1687-7195.