BLIS (software)

Last updated
BLIS
Original author(s) Science of High-Performance Computing (SHPC) group, UT-Austin
Developer(s) Field Van Zee and Devin Matthews
Initial releaseNovember 9, 2013;10 years ago (2013-11-09)
Stable release
1.0 / May 6, 2024;3 months ago (2024-05-06) [1]
Repository
Operating system Linux
Microsoft Windows
macOS
FreeBSD
Platform x86-64
ARM
ARM64
...
Type Linear algebra library; implementation of BLAS
License new/modified/3-clause BSD License
Website www.github.com/flame/blis   OOjs UI icon edit-ltr-progressive.svg

In scientific computing, BLIS (BLAS-like Library Instantiation Software) [2] [3] [4] [5] is an open-source framework for implementing a superset of BLAS (Basic Linear Algebra Subprograms) functionality for specific processor types that was awarded the J. H. Wilkinson Prize for Numerical Software in 2023. [6] It exposes that functionality through two traditional Application Programming Interfaces (APIs): the BLAS interface and the CBLAS interface. BLIS also includes two APIs native to the framework: a typed (BLAS-like) API and an object API. These native interfaces provide access to BLAS-like functionality that is not supported by, but closely related to, operations found in the BLAS (and CBLAS). The framework is developed and supported by the Science of High-Performance Computing (SHPC) group of the Oden Institute for Computational Engineering and Sciences at The University of Texas at Austin and the Matthews Research Group at Southern Methodist University.

Contents

BLIS yields high performance on many current CPU microarchitectures in both single-threaded and multithreaded modes of execution. [7] BLIS also offers competitive performance for some cases of matrix multiplication in which one or more matrix operands are unusually skinny and/or small. [8]

The framework achieves high performance by employing specialized kernels (typically written in GNU extended inline assembly syntax) along with cache and register blocking through matrix operands. BLIS also works on processors for which custom kernels have not yet been written; in those cases, the framework relies upon portable kernel implementations that perform at a lower rate of computation.

BLIS is sometimes described as a refactoring of GotoBLAS2, which was created by Kazushige Goto at the Texas Advanced Computing Center. [9]

See also

Related Research Articles

L4 is a family of second-generation microkernels, used to implement a variety of types of operating systems (OS), though mostly for Unix-like, Portable Operating System Interface (POSIX) compliant types.

In computer programming, a software framework is an abstraction in which software, providing generic functionality, can be selectively changed by additional user-written code, thus providing application-specific software. It provides a standard way to build and deploy applications and is a universal, reusable software environment that provides particular functionality as part of a larger software platform to facilitate the development of software applications, products and solutions.

Morphic is an interface construction environment which uses graphical objects called "Morphs" for simplified GUI-building which allow for flexibility and dynamism. It was originally created for Self, but later, was ported to other programming languages like Squeak, JavaScript, Python, and Objective-C.

<span class="mw-page-title-main">LAPACK</span> Software library for numerical linear algebra

LAPACK is a standard software library for numerical linear algebra. It provides routines for solving systems of linear equations and linear least squares, eigenvalue problems, and singular value decomposition. It also includes routines to implement the associated matrix factorizations such as LU, QR, Cholesky and Schur decomposition. LAPACK was originally written in FORTRAN 77, but moved to Fortran 90 in version 3.2 (2008). The routines handle both real and complex matrices in both single and double precision. LAPACK relies on an underlying BLAS implementation to provide efficient and portable computational building blocks for its routines.

Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the de facto standard low-level routines for linear algebra libraries; the routines have bindings for both C and Fortran. Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. BLAS implementations will take advantage of special floating point hardware such as vector registers or SIMD instructions.

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.

Kazushige Gotō is a software engineer specializing in high performance, hand-written, machine code.

<span class="mw-page-title-main">CUDA</span> Parallel computing platform and programming model

In computing, CUDA is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA API and its runtime: The CUDA API is an extension of the C programming language that adds the ability to specify thread-level parallelism in C and also to specify GPU device specific operations. CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels. In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries and developer tools to help programmers accelerate their applications.

<span class="mw-page-title-main">OpenCL</span> Open standard for programming heterogenous computing systems, such as CPUs or GPUs

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies programming languages for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.

Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition and in seemingly unrelated problems such as counting the paths through a graph. Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors.

Intel oneAPI Math Kernel Library, formerly know as Intel Math Kernel Library, is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector math.

UMFPACK is a set of routines for solving unsymmetric sparse linear systems of the form Ax=b, using the Unsymmetric MultiFrontal method. Written in ANSI/ISO C and interfaces with

IT++ is a C++ library of classes and functions for linear algebra, numerical optimization, signal processing, communications, and statistics. It is being developed by researchers in these areas and is widely used by researchers, both in the communications industry and universities. The IT++ library originates from the former Department of Information Theory at the Chalmers University of Technology, Gothenburg, Sweden.

In scientific computing, GotoBLAS and GotoBLAS2 are open source implementations of the BLAS API with many hand-crafted optimizations for specific processor types. GotoBLAS was developed by Kazushige Goto at the Texas Advanced Computing Center. As of 2003, it was used in seven of the world's ten fastest supercomputers.

OpenBLAS is an open-source implementation of the BLAS and LAPACK APIs with many hand-crafted optimizations for specific processor types. It is developed at the Lab of Parallel Software and Computational Science, ISCAS.

<span class="mw-page-title-main">Richard Vuduc</span>

Richard Vuduc is a tenured professor of computer science at the Georgia Institute of Technology. His research lab, The HPC Garage, studies high-performance computing, scientific computing, parallel algorithms, modeling, and engineering. He is a member of the Association for Computing Machinery (ACM). As of 2022, Vuduc serves as Vice President of the SIAM Activity Group on Supercomputing. He has co-authored over 200 articles in peer-reviewed journals and conferences.

<span class="mw-page-title-main">ROCm</span> Parallel computing platform: GPGPU libraries and application programming interface

ROCm is an Advanced Micro Devices (AMD) software stack for graphics processing unit (GPU) programming. ROCm spans several domains: general-purpose computing on graphics processing units (GPGPU), high performance computing (HPC), heterogeneous computing. It offers several programming models: HIP, OpenMP, and OpenCL.

<span class="mw-page-title-main">GraphBLAS</span> API for graph data and graph operations

GraphBLAS is an API specification that defines standard building blocks for graph algorithms in the language of linear algebra. GraphBLAS is built upon the notion that a sparse matrix can be used to represent graphs as either an adjacency matrix or an incidence matrix. The GraphBLAS specification describes how graph operations can be efficiently implemented via linear algebraic methods over different semirings.

CuPy is an open source library for GPU-accelerated computing with Python programming language, providing support for multi-dimensional arrays, sparse matrices, and a variety of numerical algorithms implemented on top of them. CuPy shares the same API set as NumPy and SciPy, allowing it to be a drop-in replacement to run NumPy/SciPy code on GPU. CuPy supports Nvidia CUDA GPU platform, and AMD ROCm GPU platform starting in v9.0.

References

  1. Releases · flame/blis – GitHub
  2. Van Zee, Field; van de Geijn, Robert (2015). "BLIS: A Framework for Rapidly Instantiating BLAS Functionality". ACM Transactions on Mathematical Software. 41 (3): 1–33. doi:10.1145/2764454.
  3. Van Zee, Field; Smith, Tyler; Igual, Francisco; Smelyanskiy, Mikhail; Zhang, Xiangyi; Kistler, Michael; Austel, Vernon; Gunnels, John; Low, Tze Meng; Marker, Bryan; Killough, Lee; van de Geijn, Robert (2016). "The BLIS Framework: Experiments in Portability". ACM Transactions on Mathematical Software. 42 (2): 1–19. doi: 10.1145/2755561 .
  4. Smith, Tyler M.; van de Geijn, Robert; Smelyanskiy, Mikhail; Hammond, Jeff R.; Van Zee, Field G. (2014). "Anatomy of High-Performance Many-Threaded Matrix Multiplication". 2014 IEEE 28th International Parallel and Distributed Processing Symposium. pp. 1049–1059. doi:10.1109/IPDPS.2014.110. ISBN   978-1-4799-3800-1.
  5. Low, Tze Meng; Igual, Francisco; Smith, Tyler; Quintana, Enrique (2016). "Analytical Modeling is Enough for High-Performance BLIS". ACM Transactions on Mathematical Software. 43 (2): 1–18. doi:10.1145/2925987. hdl: 10234/163618 .
  6. James H. Wilkinson Prize for Numerical Software, SIAM · Prizes & Recognition · Major Prizes & Lectures.
  7. Performance.md, flame/blis on GitHub.
  8. PerformanceSmall.md, flame/blis on GitHub.
  9. Goto, Kazushige; van de Geijn, Robert A. (2008). "Anatomy of high-performance matrix multiplication". ACM Transactions on Mathematical Software. 34 (3): 1–25. doi:10.1145/1356052.1356053.