Automatically Tuned Linear Algebra Software

Last updated
ATLAS
Repository
Type Software library
License BSD License
Website math-atlas.sourceforge.net

Automatically Tuned Linear Algebra Software (ATLAS) is a software library for linear algebra. It provides a mature open source implementation of BLAS APIs for C and Fortran77.

Contents

ATLAS is often recommended as a way to automatically generate an optimized BLAS library. While its performance often trails that of specialized libraries written for one specific hardware platform, it is often the first or even only optimized BLAS implementation available on new systems and is a large improvement over the generic BLAS available at Netlib. For this reason, ATLAS is sometimes used as a performance baseline for comparison with other products.

ATLAS runs on most Unix-like operating systems and on Microsoft Windows (using Cygwin). It is released under a BSD-style license without advertising clause, and many well-known mathematics applications including MATLAB, Mathematica, Scilab, SageMath, and some builds of GNU Octave may use it.

Functionality

ATLAS provides a full implementation of the BLAS APIs as well as some additional functions from LAPACK, a higher-level library built on top of BLAS. In BLAS, functionality is divided into three groups called levels 1, 2 and 3.

as well as scalar dot products and vector norms, among other things.
as well as solving for with being triangular, among other things.
as well as solving for triangular matrices , among other things.

Optimization approach

The optimization approach is called Automated Empirical Optimization of Software (AEOS), which identifies four fundamental approaches to computer assisted optimization of which ATLAS employs three: [1]

  1. Parameterization—searching over the parameter space of a function, used for blocking factor, cache edge, etc.
  2. Multiple implementation—searching through various approaches to implementing the same function, e.g., for SSE support before intrinsics made them available in C code
  3. Code generation—programs that write programs incorporating what knowledge they can about what will produce the best performance for the system
Every ATLAS level 1 BLAS function has its own kernel. Since it would be difficult to maintain thousands of cases in ATLAS there is little architecture specific optimization for Level 1 BLAS. Instead multiple implementation is relied upon to allow for compiler optimization to produce high performance implementation for the system.
With data and operations to perform the function is usually limited by bandwidth to memory, and thus there is not much opportunity for optimization
All routines in the ATLAS level 2 BLAS are built from two Level 2 BLAS kernels:
Since we have ops with only data, there are many opportunities for optimization

Level 3 BLAS

Most of the Level 3 BLAS is derived from GEMM, so that is the primary focus of the optimization.

operations vs. data

The intuition that the operations will dominate over the data accesses only works for roughly square matrices. The real measure should be some kind of surface area to volume. The difference becomes important for very non-square matrices.

Can it afford to copy?

Copying the inputs allows the data to be arranged in a way that provides optimal access for the kernel functions, but this comes at the cost of allocating temporary space, and an extra read and write of the inputs.

So the first question GEMM faces is, can it afford to copy the inputs?

If so,

If not,

The actual decision is made through a simple heuristic which checks for "skinny cases".

Cache edge

For 2nd Level Cache blocking a single cache edge parameter is used. The high level choose an order to traverse the blocks: ijk, jik, ikj, jki, kij, kji. These need not be the same order as the product is done within a block.

Typically chosen orders are ijk or jik. For jik the ideal situation would be to copy A and the NB wide panel of B. For ijk swap the role of A and B.

Choosing the bigger of M or N for the outer loop reduces the footprint of the copy. But for large K ATLAS does not even allocate such a large amount of memory. Instead it defines a parameter, Kp, to give best use of the L2 cache. Panels are limited to Kp in length. It first tries to allocate (in the jik case) . If that fails it tries . (If that fails it uses the no-copy version of GEMM, but this case is unlikely for reasonable choices of cache edge.) Kp is a function of cache edge and NB.

LAPACK

When integrating the ATLAS BLAS with LAPACK an important consideration is the choice of blocking factor for LAPACK. If the ATLAS blocking factor is small enough the blocking factor of LAPACK could be set to match that of ATLAS.

To take advantage of recursive factorization, ATLAS provides replacement routines for some LAPACK routines. These simply overwrite the corresponding LAPACK routines from Netlib.

Need for installation

Installing ATLAS on a particular platform is a challenging process which is typically done by a system vendor or a local expert and made available to a wider audience.

For many systems, architectural default parameters are available; these are essentially saved searches plus the results of hand tuning. If the arch defaults work they will likely get 10-15% better performance than the install search. On such systems the installation process is greatly simplified.

Related Research Articles

Principal component analysis Conversion of observations of possibly-correlated variables into values of fewer, uncorrelated variables

The principal components of a collection of points in a real coordinate space are a sequence of unit vectors, where the -th vector is the direction of a line that best fits the data while being orthogonal to the first vectors. Here, a best-fitting line is defined as one that minimizes the average squared distance from the points to the line. These directions constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.

In continuum mechanics, the infinitesimal strain theory is a mathematical approach to the description of the deformation of a solid body in which the displacements of the material particles are assumed to be much smaller than any relevant dimension of the body; so that its geometry and the constitutive properties of the material at each point of space can be assumed to be unchanged by the deformation.

Unit quaternions, known as versors, provide a convenient mathematical notation for representing spatial orientations and rotations of elements in three dimensional space. Specifically, they encode information about an axis-angle rotation about an arbitrary axis. Rotation and orientation quaternions have applications in computer graphics, computer vision, robotics, navigation, molecular dynamics, flight dynamics, orbital mechanics of satellites, and crystallographic texture analysis.

In mathematical logic, descriptive set theory (DST) is the study of certain classes of "well-behaved" subsets of the real line and other Polish spaces. As well as being one of the primary areas of research in set theory, it has applications to other areas of mathematics such as functional analysis, ergodic theory, the study of operator algebras and group actions, and mathematical logic.

Projection (linear algebra)

In linear algebra and functional analysis, a projection is a linear transformation from a vector space to itself such that . That is, whenever is applied twice to any value, it gives the same result as if it were applied once (idempotent). It leaves its image unchanged. Though abstract, this definition of "projection" formalizes and generalizes the idea of graphical projection. One can also consider the effect of a projection on a geometrical object by examining the effect of the projection on points in the object.

In numerical linear algebra, a Givens rotation is a rotation in the plane spanned by two coordinates axes. Givens rotations are named after Wallace Givens, who introduced them to numerical analysts in the 1950s while he was working at Argonne National Laboratory.

Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the de facto standard low-level routines for linear algebra libraries; the routines have bindings for both C and Fortran. Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. BLAS implementations will take advantage of special floating point hardware such as vector registers or SIMD instructions.

Two-state quantum system Quantum system that can be measured as one of two values; sought for "quantum bits" in quantum computing

In quantum mechanics, a two-state system is a quantum system that can exist in any quantum superposition of two independent quantum states. The Hilbert space describing such a system is two-dimensional. Therefore, a complete basis spanning the space will consist of two independent states. Any two-state system can also be seen as a qubit.

Electromagnetic tensor

In electromagnetism, the electromagnetic tensor or electromagnetic field tensor is a mathematical object that describes the electromagnetic field in spacetime. The field tensor was first used after the four-dimensional tensor formulation of special relativity was introduced by Hermann Minkowski. The tensor allows related physical laws to be written very concisely.

In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems. Like the related Davidon–Fletcher–Powell method, BFGS determines the descent direction by preconditioning the gradient with curvature information. It does so by gradually improving an approximation to the Hessian matrix of the loss function, obtained only from gradient evaluations via a generalized secant method.

The Frank–Wolfe algorithm is an iterative first-order optimization algorithm for constrained convex optimization. Also known as the conditional gradient method, reduced gradient algorithm and the convex combination algorithm, the method was originally proposed by Marguerite Frank and Philip Wolfe in 1956. In each iteration, the Frank–Wolfe algorithm considers a linear approximation of the objective function, and moves towards a minimizer of this linear function.

In mathematics, more specifically in numerical linear algebra, the biconjugate gradient method is an algorithm to solve systems of linear equations

In statistics, a central composite design is an experimental design, useful in response surface methodology, for building a second order (quadratic) model for the response variable without needing to use a complete three-level factorial experiment.

In geometry, various formalisms exist to express a rotation in three dimensions as a mathematical transformation. In physics, this concept is applied to classical mechanics where rotational kinematics is the science of quantitative description of a purely rotational motion. The orientation of an object at a given instant is described with the same tools, as it is defined as an imaginary rotation from a reference placement in space, rather than an actually observed rotation from a previous placement in space.

In classical mechanics, the Udwadia–Kalaba formulation is a method for deriving the equations of motion of a constrained mechanical system. The method was first described by Firdaus E. Udwadia and Robert E. Kalaba in 1992. The approach is based on Gauss's principle of least constraint. The Udwadia–Kalaba method applies to both holonomic constraints and nonholonomic constraints, as long as they are linear with respect to the accelerations. The method generalizes to constraint forces that do not obey D'Alembert's principle.

In analytical mechanics, the mass matrix is a symmetric matrix M that expresses the connection between the time derivative of the generalized coordinate vector q of a system and the kinetic energy T of that system, by the equation

Multivariate stable distribution

The multivariate stable distribution is a multivariate probability distribution that is a multivariate generalisation of the univariate stable distribution. The multivariate stable distribution defines linear relations between stable distribution marginals. In the same way as for the univariate case, the distribution is defined in terms of its characteristic function.

In mathematics, the Kodaira–Spencer map, introduced by Kunihiko Kodaira and Donald C. Spencer, is a map associated to a deformation of a scheme or complex manifold X, taking a tangent space of a point of the deformation space to the first cohomology group of the sheaf of vector fields on X.

Graph cut optimization is a combinatorial optimization method applicable to a family of functions of discrete variables, named after the concept of cut in the theory of flow networks. Thanks to the max-flow min-cut theorem, determining the minimum cut over a graph representing a flow network is equivalent to computing the maximum flow over the network. Given a pseudo-Boolean function , if it is possible to construct a flow network with positive weights such that

Tau functions are an important ingredient in the modern theory of integrable systems, and have numerous applications in a variety of other domains. They were originally introduced by Ryogo Hirota in his direct method approach to soliton equations, based on expressing them in an equivalent bilinear form. The term Tau function, or -function, was first used systematically by Mikio Sato and his students in the specific context of the Kadomtsev–Petviashvili equation, and related integrable hierarchies. It is a central ingredient in the theory of solitons. Tau functions also appear as matrix model partition functions in the spectral theory of Random Matrices, and may also serve as generating functions, in the sense of combinatorics and enumerative geometry, especially in relation to moduli spaces of Riemann surfaces, and enumeration of branched coverings, or so-called Hurwitz numbers.

References

  1. R. Clint Whaley; Antoine Petitet & Jack J. Dongarra (2001). "Automated Empirical Optimization of Software and the ATLAS Project" (PDF). Parallel Computing. 27 (1–2): 3–35. CiteSeerX   10.1.1.35.2297 . doi:10.1016/S0167-8191(00)00087-9 . Retrieved 2006-10-06.