Speedup

In computer architecture, speedup is a number that measures the relative performance of two systems processing the same problem. More technically, it is the improvement in speed of execution of a task executed on two similar architectures with different resources. The notion of speedup was established by Amdahl's law, which was particularly focused on parallel processing. However, speedup can be used more generally to show the effect on performance after any resource enhancement.

Definitions

Speedup can be defined for two different types of quantities: latency and throughput.[1]

Latency of an architecture is the reciprocal of the execution speed of a task:

    L = \frac{1}{v} = \frac{T}{W}

where

- v is the execution speed of the task;
- T is the execution time of the task;
- W is the execution workload of the task.

Throughput of an architecture is the execution rate of a task:

    Q = \rho v A = \frac{\rho A W}{T} = \frac{\rho A}{L}

where

- ρ is the execution density (e.g., the number of stages in an instruction pipeline for a pipelined architecture);
- A is the execution capacity (e.g., the number of processors for a parallel architecture).

Latency is often measured in seconds per unit of execution workload. Throughput is often measured in units of execution workload per second. Another unit of throughput is instructions per cycle (IPC), and its reciprocal, cycles per instruction (CPI), is another unit of latency.

Speedup is dimensionless and defined differently for each type of quantity so that it is a consistent metric.
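
A minimal Python sketch may help make the two quantities concrete; the function names and the sample numbers below are illustrative assumptions, not part of the standard terminology.

    # Latency L = T / W (seconds per unit of workload);
    # throughput Q = rho * A * W / T (units of workload per second).
    def latency(T, W):
        return T / W

    def throughput(T, W, rho=1.0, A=1.0):
        return rho * A * W / T

    # Illustrative numbers: 1e9 instructions executed in 2 seconds on one processor.
    T, W = 2.0, 1e9
    print(latency(T, W))     # 2e-09 seconds per instruction
    print(throughput(T, W))  # 500000000.0 instructions per second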

Speedup in latency

Speedup in latency is defined by the following formula:[2]

    S_\text{latency} = \frac{L_1}{L_2} = \frac{T_1 W_2}{T_2 W_1}

where

- S_latency is the speedup in latency of architecture 2 with respect to architecture 1;
- L_1 is the latency of architecture 1;
- L_2 is the latency of architecture 2.

Speedup in latency can be predicted from Amdahl's law or Gustafson's law.
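
Both laws have simple closed forms. The sketch below is an illustration rather than part of the article; p is the fraction of the task that benefits from the improvement, s is the speedup of that fraction, and the sample values are made up.

    # Predicted speedup in latency under Amdahl's law (fixed workload)
    # and Gustafson's law (workload scaled with the added resources).
    def amdahl_speedup(p, s):
        return 1.0 / ((1.0 - p) + p / s)

    def gustafson_speedup(p, s):
        return (1.0 - p) + p * s

    print(amdahl_speedup(0.9, 8))     # ~4.7 with a fixed workload
    print(gustafson_speedup(0.9, 8))  # ~7.3 when the workload scales with resources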

Speedup in throughput

Speedup in throughput is defined by the formula:[3]

    S_\text{throughput} = \frac{Q_2}{Q_1} = \frac{\rho_2 v_2 A_2}{\rho_1 v_1 A_1} = \frac{\rho_2 A_2 T_1 W_2}{\rho_1 A_1 T_2 W_1} = \frac{\rho_2 A_2}{\rho_1 A_1} S_\text{latency}

where

- S_throughput is the speedup in throughput of architecture 2 with respect to architecture 1;
- Q_1 is the throughput of architecture 1;
- Q_2 is the throughput of architecture 2.
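
The last equality above can be evaluated directly. The sketch below uses hypothetical values in which the number of processors doubles while the pipeline and the per-task latency stay the same.

    # S_throughput = (rho2 * A2) / (rho1 * A1) * S_latency
    def throughput_speedup(rho1, A1, rho2, A2, latency_speedup):
        return (rho2 * A2) / (rho1 * A1) * latency_speedup

    # Hypothetical: 4 -> 8 processors, identical pipeline, unchanged latency.
    print(throughput_speedup(1, 4, 1, 8, 1.0))  # 2.0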

Examples

Using execution times

We are testing the effectiveness of a branch predictor on the execution of a program. First, we execute the program with the standard branch predictor on the processor, which yields an execution time of 2.25 seconds. Next, we execute the program with our modified (and hopefully improved) branch predictor on the same processor, which produces an execution time of 1.50 seconds. In both cases the execution workload is the same. Using our speedup formula, we know

    S_\text{latency} = \frac{L_\text{old}}{L_\text{new}} = \frac{2.25~\text{s}}{1.50~\text{s}} = 1.5

Our new branch predictor has provided a 1.5x speedup over the original.
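
A quick Python check of the arithmetic (the variable names are ours):

    t_old, t_new = 2.25, 1.50   # execution times in seconds, same workload
    print(t_old / t_new)        # 1.5 -> speedup in latency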

Using cycles per instruction and instructions per cycle

We can also measure speedup in cycles per instruction (CPI), which is a latency. First, we execute the program with the standard branch predictor, which yields a CPI of 3. Next, we execute the program with our modified branch predictor, which yields a CPI of 2. In both cases the execution workload is the same, and neither architecture is pipelined or parallel. Using the speedup formula gives

    S_\text{latency} = \frac{\text{CPI}_\text{old}}{\text{CPI}_\text{new}} = \frac{3}{2} = 1.5

We can also measure speedup in instructions per cycle (IPC), which is a throughput and the inverse of CPI. Using the speedup formula gives

    S_\text{throughput} = \frac{\text{IPC}_\text{new}}{\text{IPC}_\text{old}} = \frac{1/2}{1/3} = 1.5

We achieve the same 1.5x speedup, though we measured different quantities.
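
The same check in Python, using both CPI and IPC (again, the variable names are ours):

    cpi_old, cpi_new = 3, 2
    print(cpi_old / cpi_new)              # 1.5 -> speedup via CPI (a latency)
    print((1 / cpi_new) / (1 / cpi_old))  # 1.5 -> speedup via IPC (a throughput), up to rounding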

Additional details

Let S be the speedup of execution of a task and s the speedup of execution of the part of the task that benefits from the improvement of the resources of an architecture. Linear speedup or ideal speedup is obtained when S = s. When running a task with linear speedup, doubling the local speedup doubles the overall speedup. As this is ideal, it is considered very good scalability.

Efficiency is a metric of the utilization of the resources of the improved system, defined as

    \eta = \frac{S}{s}

Its value is typically between 0 and 1. Programs with linear speedup and programs running on a single processor have an efficiency of 1, while many difficult-to-parallelize programs have an efficiency such as 1/ln(s)[citation needed] that approaches 0 as the number of processors A = s increases.
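
A minimal numeric illustration (the speedup and processor count below are made-up values):

    S, A = 12, 16    # measured speedup S on A = s processors (illustrative)
    print(S / A)     # 0.75 -> efficiency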

In engineering contexts, efficiency curves are more often used for graphs than speedup curves, since

- all of the area in the graph is useful (whereas in speedup curves half of the space is wasted);
- it is easy to see how well the improvement of the system is working;
- there is no need to plot a "perfect speedup" curve.

In marketing contexts, speedup curves are more often used, largely because they go up and to the right and thus appear better to the less-informed.

Super-linear speedup

Sometimes a speedup of more than A when using A processors is observed in parallel computing, which is called super-linear speedup. Super-linear speedup rarely happens and often confuses beginners, who believe the theoretical maximum speedup should be A when A processors are used.

One possible reason for super-linear speedup in low-level computations is the cache effect resulting from the different memory hierarchies of a modern computer: in parallel computing, not only does the number of processors change, but so does the size of the accumulated cache from the different processors. With the larger accumulated cache size, more of the working set, or even all of it, can fit into caches, and the memory access time drops dramatically, which causes extra speedup in addition to that from the actual computation.[4]

An analogous situation occurs when searching large datasets, such as the genomic data searched by BLAST implementations. There the accumulated RAM from each of the nodes in a cluster enables the dataset to move from disk into RAM, thereby drastically reducing the time required by, e.g., mpiBLAST to search it.[5]

Super-linear speedups can also occur when performing backtracking in parallel: an exception in one thread can cause several other threads to backtrack early, before they reach the exception themselves.[6]

Super-linear speedups can also occur in parallel implementations of branch-and-bound for optimization:[7] the processing of one node by one processor may affect the work other processors need to do for the other nodes.

See also

- Amdahl's law
- Gustafson's law
- Sun–Ni law

References

  1. Martin, Milo. "Performance and Benchmarking" (PDF). Retrieved 5 June 2014.
  2. Hennessy, John L.; Patterson, David A. (2012). Computer Architecture: A Quantitative Approach. Waltham, MA: Morgan Kaufmann. pp. 46–47. ISBN 978-0-12-383872-8.
  3. Baer, Jean-Loup (2010). Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors. New York: Cambridge University Press. p. 10. ISBN 978-0-521-76992-1.
  4. Benzi, John; Damodaran, M. (2007). "Parallel Three Dimensional Direct Simulation Monte Carlo for Simulating Micro Flows". Parallel Computational Fluid Dynamics 2007: Implementations and Experiences on Large Scale and Grid Computing. Parallel Computational Fluid Dynamics. Springer. p. 95. Retrieved 2013-03-21.
  5. http://people.cs.vt.edu/~feng/presentations/030903-ParCo.pdf [bare URL PDF]
  6. Speckenmeyer, Ewald (1988). "Superlinear speedup for parallel backtracking". Supercomputing. Lecture Notes in Computer Science. Vol. 297. pp. 985–993. doi:10.1007/3-540-18991-2_58. ISBN 978-3-540-18991-6.
  7. "Gurobi versus CPLEX benchmarks". cmu.edu. 29 January 2009. Retrieved 23 April 2018.