Benchmark (computing)

A graphical demo running as a benchmark of the OGRE engine

In computing, a benchmark is the act of running a computer program, a set of programs, or other operations, in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it. [1]

The term benchmark is also commonly applied to elaborately designed benchmarking programs themselves.

Benchmarking is usually associated with assessing performance characteristics of computer hardware, for example, the floating point operation performance of a CPU, but there are circumstances when the technique is also applicable to software. Software benchmarks are, for example, run against compilers or database management systems (DBMS).

Benchmarks provide a method of comparing the performance of various subsystems across different chip/system architectures. Benchmarking as a part of continuous integration is called Continuous Benchmarking. [2]

Purpose

As computer architecture advanced, it became more difficult to compare the performance of various computer systems simply by looking at their specifications. Therefore, tests were developed that allowed comparison of different architectures. For example, Pentium 4 processors generally operated at a higher clock frequency than Athlon XP or PowerPC processors, which did not necessarily translate to more computational power; a processor with a slower clock frequency might perform as well as or even better than a processor operating at a higher frequency. See BogoMips and the megahertz myth.

Benchmarks are designed to mimic a particular type of workload on a component or system. Synthetic benchmarks do this with specially created programs that impose the workload on the component. Application benchmarks run real-world programs on the system. While application benchmarks usually give a much better measure of real-world performance on a given system, synthetic benchmarks are useful for testing individual components, like a hard disk or networking device.
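
As an illustration, the following is a minimal sketch of a synthetic floating-point benchmark in C. The iteration count and the arithmetic in the loop are arbitrary choices made for this example, not part of any standard test; a real synthetic benchmark derives its operation mix from measurements of actual applications.

    /* Minimal sketch of a synthetic floating-point benchmark (illustrative only;
       the loop count and workload are arbitrary assumptions, not a standard test). */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const long iterations = 100000000L;
        double x = 1.0;
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (long i = 0; i < iterations; i++) {
            x = x * 1.0000001 + 0.0000001;   /* imposed floating-point workload */
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        double seconds = (end.tv_sec - start.tv_sec)
                       + (end.tv_nsec - start.tv_nsec) / 1e9;
        /* two floating-point operations per iteration */
        printf("x=%f  %.1f MFLOP/s\n", x, 2.0 * iterations / seconds / 1e6);
        return 0;
    }

The result is reported in MFLOP/s, the same unit used by kernel benchmarks such as LINPACK (see the list of benchmark types below).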

Benchmarks are particularly important in CPU design, giving processor architects the ability to measure and make tradeoffs in microarchitectural decisions. For example, if a benchmark extracts the key algorithms of an application, it will contain the performance-sensitive aspects of that application. Running this much smaller snippet on a cycle-accurate simulator can give clues on how to improve performance.

Prior to 2000, computer and microprocessor architects used SPEC to do this, although SPEC's Unix-based benchmarks were quite lengthy and thus unwieldy to use intact.

Computer manufacturers are known to configure their systems to give unrealistically high performance on benchmark tests that are not replicated in real usage. For instance, during the 1980s some compilers could detect a specific mathematical operation used in a well-known floating-point benchmark and replace the operation with a faster mathematically equivalent operation. However, such a transformation was rarely useful outside the benchmark until the mid-1990s, when RISC and VLIW architectures emphasized the importance of compiler technology as it related to performance. Benchmarks are now regularly used by compiler companies to improve not only their own benchmark scores, but real application performance.

CPUs that have many execution units — such as a superscalar CPU, a VLIW CPU, or a reconfigurable computing CPU — typically have slower clock rates than a sequential CPU with one or two execution units when built from transistors that are just as fast. Nevertheless, CPUs with many execution units often complete real-world and benchmark tasks in less time than the supposedly faster high-clock-rate CPU.

Given the large number of benchmarks available, a manufacturer can usually find at least one benchmark that shows its system will outperform another system; the other systems can be shown to excel with a different benchmark.

Manufacturers commonly report only those benchmarks (or aspects of benchmarks) that show their products in the best light. They have also been known to misrepresent the significance of benchmarks, again to show their products in the best possible light. Taken together, these practices are called bench-marketing.

Ideally benchmarks should only substitute for real applications if the application is unavailable, or too difficult or costly to port to a specific processor or computer system. If performance is critical, the only benchmark that matters is the target environment's application suite.

Functionality

Features of benchmarking software may include recording or exporting the course of performance to a spreadsheet file, visualization such as line graphs or color-coded tiles, and pausing the process so it can be resumed without starting over. Software can also have features specific to its purpose; for example, disk benchmarking software may optionally measure disk speed within a specified range of the disk rather than across the full disk, measure random-access read speed and latency, offer a "quick scan" feature that measures the speed through samples of specified intervals and sizes, and allow a data block size to be specified, meaning the number of requested bytes per read request. [3]
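
For instance, a minimal sequential-read disk benchmark with a caller-specified block size might look like the following C sketch. The default file path and block size here are hypothetical placeholders, and effects such as operating-system caching are ignored.

    /* Sketch of a sequential-read disk benchmark with a caller-specified block
       size, in the spirit of the features described above. The file path and
       block size are hypothetical examples, and OS caching is ignored. */
    #define _POSIX_C_SOURCE 199309L
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        const char *path = (argc > 1) ? argv[1] : "testfile.bin";   /* hypothetical */
        size_t block_size = (argc > 2) ? (size_t)atol(argv[2]) : 65536;
        char *buf = malloc(block_size);
        int fd = open(path, O_RDONLY);
        if (fd < 0 || !buf) { perror("setup"); return 1; }

        struct timespec start, end;
        long long total = 0;
        ssize_t n;

        clock_gettime(CLOCK_MONOTONIC, &start);
        while ((n = read(fd, buf, block_size)) > 0)
            total += n;                          /* bytes actually transferred */
        clock_gettime(CLOCK_MONOTONIC, &end);

        double seconds = (end.tv_sec - start.tv_sec)
                       + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("%lld bytes in %.3f s = %.1f MB/s\n",
               total, seconds, total / seconds / 1e6);
        close(fd);
        free(buf);
        return 0;
    }

Running such a sketch with different block sizes shows how throughput typically rises with larger read requests until the device or bus becomes the bottleneck.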

Challenges

Benchmarking is not easy and often involves several iterative rounds in order to arrive at predictable, useful conclusions. Interpretation of benchmarking data is also extraordinarily difficult.

Benchmarking Principles

There are seven vital characteristics for benchmarks. [6] These key properties are:

  1. Relevance: Benchmarks should measure relatively vital features.
  2. Representativeness: Benchmark performance metrics should be broadly accepted by industry and academia.
  3. Equity: All systems should be fairly compared.
  4. Repeatability: Benchmark results can be verified.
  5. Cost-effectiveness: Benchmark tests are economical.
  6. Scalability: Benchmark tests should work across systems possessing a range of resources from low to high.
  7. Transparency: Benchmark metrics should be easy to understand.

Types of benchmark

  1. Real program
  2. Component Benchmark / Microbenchmark
    • core routine consists of a relatively small and specific piece of code.
    • measures the performance of a computer's basic components [7]
    • may be used for automatic detection of a computer's hardware parameters such as the number of registers, cache size, and memory latency (see the sketch after this list).
  3. Kernel
    • contains key code
    • normally abstracted from an actual program
    • popular kernel: Livermore loops
    • LINPACK benchmark (contains basic linear algebra subroutines written in Fortran)
    • results are reported in MFLOP/s.
  4. Synthetic Benchmark
    • Procedure for programming a synthetic benchmark:
      • take statistics of all types of operations from many application programs
      • get the proportion of each operation
      • write a program based on the proportions above
    • Well-known synthetic benchmarks are Whetstone and Dhrystone. These were the first general-purpose industry-standard computer benchmarks. They do not necessarily obtain high scores on modern pipelined computers.
  5. I/O benchmarks
  6. Database benchmarks
    • measure the throughput and response times of database management systems (DBMS)
  7. Parallel benchmarks
    • used on machines with multiple cores and/or processors, or systems consisting of multiple machines
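
As referenced in the microbenchmark item above, the following C sketch times strided memory reads over working sets of increasing size; a jump in the time per access suggests that the working set has outgrown a level of cache. The size range, stride, and repeat count are illustrative assumptions, and the results are sensitive to compiler optimization and background load.

    /* Minimal microbenchmark sketch: time strided reads over working sets of
       increasing size. A jump in time per access hints at a cache-size boundary.
       Sizes, stride, and repeat count are arbitrary illustrative choices. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void) {
        const size_t stride = 64;                  /* typical cache-line size */
        for (size_t size = 4 * 1024; size <= 32 * 1024 * 1024; size *= 2) {
            char *array = malloc(size);
            if (!array) return 1;
            memset(array, 1, size);                /* touch pages so they are physically allocated */

            volatile char sink = 0;
            struct timespec start, end;
            const long repeats = 100;

            clock_gettime(CLOCK_MONOTONIC, &start);
            for (long r = 0; r < repeats; r++)
                for (size_t i = 0; i < size; i += stride)
                    sink += array[i];              /* read one byte per cache line */
            clock_gettime(CLOCK_MONOTONIC, &end);

            double ns = ((end.tv_sec - start.tv_sec) * 1e9
                         + (end.tv_nsec - start.tv_nsec))
                        / (repeats * (double)(size / stride));
            printf("%8zu KiB working set: %.2f ns per access\n", size / 1024, ns);
            free(array);
        }
        return 0;
    }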

Common benchmarks

Industry standard (audited and verifiable)

Open source benchmarks

Microsoft Windows benchmarks

Others

See also

References

  1. Fleming, Philip J.; Wallace, John J. (1986-03-01). "How not to lie with statistics: the correct way to summarize benchmark results". Communications of the ACM. 29 (3): 218–221. doi:10.1145/5666.5673. ISSN 0001-0782. S2CID 1047380. Retrieved 2017-06-09.
  2. Grambow, Martin; Lehmann, Fabian; Bermbach, David (2019). Continuous Benchmarking: Using System Benchmarking in Build Pipelines. doi:10.1109/IC2E.2019.00039. Retrieved 2023-12-03.
  3. Software: HDDScan, GNOME Disks
  4. Krazit, Tom (2003). "NVidia's Benchmark Tactics Reassessed". IDG News. Archived from the original on 2011-06-06. Retrieved 2009-08-08.
  5. Castor, Kevin (2006). "Hardware Testing and Benchmarking Methodology". Archived from the original on 2008-02-05. Retrieved 2008-02-24.
  6. Dai, Wei; Berleant, Daniel (December 12–14, 2019). "Benchmarking Contemporary Deep Learning Hardware and Frameworks: a Survey of Qualitative Metrics" (PDF). 2019 IEEE First International Conference on Cognitive Machine Intelligence (CogMI). Los Angeles, CA, USA: IEEE. pp. 148–155. arXiv:1907.03626. doi:10.1109/CogMI48466.2019.00029.
  7. Ehliar, Andreas; Liu, Dake. "Benchmarking network processors" (PDF).
  8. Transaction Processing Performance Council (February 1998). "History and Overview of the TPC". Transaction Processing Performance Council. Retrieved 2018-07-02.

Further reading