Array programming

Last updated

In computer science, array programming refers to solutions that allow the application of operations to an entire set of values at once. Such solutions are commonly used in scientific and engineering settings.

Contents

Modern programming languages that support array programming (also known as vector or multidimensional languages) have been engineered specifically to generalize operations on scalars to apply transparently to vectors, matrices, and higher-dimensional arrays. These include APL, J, Fortran, MATLAB, Analytica, Octave, R, Cilk Plus, Julia, Perl Data Language (PDL). In these languages, an operation that operates on entire arrays can be called a vectorized operation, [1] regardless of whether it is executed on a vector processor, which implements vector instructions. Array programming primitives concisely express broad ideas about data manipulation. The level of concision can be dramatic in certain cases: it is not uncommon[ example needed ] to find array programming language one-liners that require several pages of object-oriented code.

Concepts of array

The fundamental idea behind array programming is that operations apply at once to an entire set of values. This makes it a high-level programming model as it allows the programmer to think and operate on whole aggregates of data, without having to resort to explicit loops of individual scalar operations.

Kenneth E. Iverson described the rationale behind array programming (actually referring to APL) as follows: [2]

most programming languages are decidedly inferior to mathematical notation and are little used as tools of thought in ways that would be considered significant by, say, an applied mathematician.

The thesis is that the advantages of executability and universality found in programming languages can be effectively combined, in a single coherent language, with the advantages offered by mathematical notation. it is important to distinguish the difficulty of describing and of learning a piece of notation from the difficulty of mastering its implications. For example, learning the rules for computing a matrix product is easy, but a mastery of its implications (such as its associativity, its distributivity over addition, and its ability to represent linear functions and geometric operations) is a different and much more difficult matter.

Indeed, the very suggestiveness of a notation may make it seem harder to learn because of the many properties it suggests for explorations.

[...]

Users of computers and programming languages are often concerned primarily with the efficiency of execution of algorithms, and might, therefore, summarily dismiss many of the algorithms presented here. Such dismissal would be short-sighted since a clear statement of an algorithm can usually be used as a basis from which one may easily derive a more efficient algorithm.

The basis behind array programming and thinking is to find and exploit the properties of data where individual elements are similar or adjacent. Unlike object orientation which implicitly breaks down data to its constituent parts (or scalar quantities), array orientation looks to group data and apply a uniform handling.

Function rank is an important concept to array programming languages in general, by analogy to tensor rank in mathematics: functions that operate on data may be classified by the number of dimensions they act on. Ordinary multiplication, for example, is a scalar ranked function because it operates on zero-dimensional data (individual numbers). The cross product operation is an example of a vector rank function because it operates on vectors, not scalars. Matrix multiplication is an example of a 2-rank function, because it operates on 2-dimensional objects (matrices). Collapse operators reduce the dimensionality of an input data array by one or more dimensions. For example, summing over elements collapses the input array by 1 dimension.

Uses

Array programming is very well suited to implicit parallelization; a topic of much research nowadays. Further, Intel and compatible CPUs developed and produced after 1997 contained various instruction set extensions, starting from MMX and continuing through SSSE3 and 3DNow!, which include rudimentary SIMD array capabilities. This has continued into the 2020s with instruction sets such as AVX-512, making modern CPUs sophisticated vector processors. Array processing is distinct from parallel processing in that one physical processor performs operations on a group of items simultaneously while parallel processing aims to split a larger problem into smaller ones (MIMD) to be solved piecemeal by numerous processors. Processors with multiple cores and GPUs with thousands of general computing cores are common as of 2023.

Languages

The canonical examples of array programming languages are Fortran, APL, and J. Others include: A+, Analytica, Chapel, IDL, Julia, K, Klong, Q, MATLAB, GNU Octave, Scilab, FreeMat, PDL, R, S-Lang, SAC, Nial, ZPL, Futhark, and TI-BASIC.

Scalar languages

In scalar languages such as C and Pascal, operations apply only to single values, so a+b expresses the addition of two numbers. In such languages, adding one array to another requires indexing and looping, the coding of which is tedious.

for(i=0;i<n;i++)for(j=0;j<n;j++)a[i][j]+=b[i][j];

In array-based languages, for example in Fortran, the nested for-loop above can be written in array-format in one line,

a=a+b

or alternatively, to emphasize the array nature of the objects,

a(:,:)=a(:,:)+b(:,:)

While scalar languages like C do not have native array programming elements as part of the language proper, this does not mean programs written in these languages never take advantage of the underlying techniques of vectorization (i.e., utilizing a CPU's vector-based instructions if it has them or by using multiple CPU cores). Some C compilers like GCC at some optimization levels detect and vectorize sections of code that its heuristics determine would benefit from it. Another approach is given by the OpenMP API, which allows one to parallelize applicable sections of code by taking advantage of multiple CPU cores.

Array languages

In array languages, operations are generalized to apply to both scalars and arrays. Thus, a+b expresses the sum of two scalars if a and b are scalars, or the sum of two arrays if they are arrays.

An array language simplifies programming but possibly at a cost known as the abstraction penalty. [3] [4] [5] Because the additions are performed in isolation from the rest of the coding, they may not produce the optimally most efficient code. (For example, additions of other elements of the same array may be subsequently encountered during the same execution, causing unnecessary repeated lookups.) Even the most sophisticated optimizing compiler would have an extremely hard time amalgamating two or more apparently disparate functions which might appear in different program sections or sub-routines, even though a programmer could do this easily, aggregating sums on the same pass over the array to minimize overhead).

Ada

The previous C code would become the following in the Ada language, [6] which supports array-programming syntax.

A:=A+B;

APL

APL uses single character Unicode symbols with no syntactic sugar.

AA+B

This operation works on arrays of any rank (including rank 0), and on a scalar and an array. Dyalog APL extends the original language with augmented assignments:

A+B

Analytica

Analytica provides the same economy of expression as Ada.

A := A + B; 

BASIC

Dartmouth BASIC had MAT statements for matrix and array manipulation in its third edition (1966).

DIMA(4),B(4),C(4)MATA=1MATB=2*AMATC=A+BMATPRINTA,B,C

Mata

Stata's matrix programming language Mata supports array programming. Below, we illustrate addition, multiplication, addition of a matrix and a scalar, element by element multiplication, subscripting, and one of Mata's many inverse matrix functions.

. mata:  : A = (1,2,3) \(4,5,6)  : A        123+-------------+1 |  123  |   2 |  456  |     +-------------+  : B = (2..4) \(1..3)  : B        123+-------------+1 |  234  |   2 |  123  |     +-------------+  : C = J(3,2,1)           // A 3 by 2 matrix of ones  : C        12+---------+1 |  11  |   2 |  11  |   3 |  11  |     +---------+  : D = A + B  : D        123+-------------+1 |  357  |   2 |  579  |     +-------------+  : E = A*C  : E         12+-----------+1 |   66  |   2 |  1515  |     +-----------+  : F = A:*B  : F         123+----------------+1 |   2612  |   2 |   41018  |     +----------------+  : G = E :+3  : G         12+-----------+1 |   99  |   2 |  1818  |     +-----------+  : H = F[(2\1), (1, 2)]    // Subscripting to get a submatrix of F and  :                         // switch row 1 and 2 : H         12+-----------+1 |   410  |   2 |   26  |     +-----------+  : I = invsym(F'*F)        // Generalized inverse (F*F^(-1)F=F) of a  :                         // symmetric positive semi-definite matrix : I [symmetric]                  123+-------------------------------------------+1 |            0                              |   2 |            03.25                |   3 |            0-1.75   .9444444444  |     +-------------------------------------------+  : end

MATLAB

The implementation in MATLAB allows the same economy allowed by using the Fortran language.

A=A+B;

A variant of the MATLAB language is the GNU Octave language, which extends the original language with augmented assignments:

A+=B;

Both MATLAB and GNU Octave natively support linear algebra operations such as matrix multiplication, matrix inversion, and the numerical solution of system of linear equations, even using the Moore–Penrose pseudoinverse. [7] [8]

The Nial example of the inner product of two arrays can be implemented using the native matrix multiplication operator. If a is a row vector of size [1 n] and b is a corresponding column vector of size [n 1].

a * b;

By contrast, the entrywise product is implemented as:

a .* b;

The inner product between two matrices having the same number of elements can be implemented with the auxiliary operator (:), which reshapes a given matrix into a column vector, and the transpose operator ':

A(:)' * B(:);

rasql

The rasdaman query language is a database-oriented array-programming language. For example, two arrays could be added with the following query:

SELECTA+BFROMA,B

R

The R language supports array paradigm by default. The following example illustrates a process of multiplication of two matrices followed by an addition of a scalar (which is, in fact, a one-element vector) and a vector:

> A<-matrix(1:6,nrow=2)# !!this has nrow=2 ... and A has 2 rows> A     [,1] [,2] [,3][1,]    1    3    5[2,]    2    4    6> B<-t(matrix(6:1,nrow=2))# t() is a transpose operator                           !!this has nrow=2 ... and B has 3 rows --- a clear contradiction to the definition of A> B     [,1] [,2][1,]    6    5[2,]    4    3[3,]    2    1> C<-A%*%B> C     [,1] [,2][1,]   28   19[2,]   40   28> D<-C+1> D     [,1] [,2][1,]   29   20[2,]   41   29> D+c(1,1)# c() creates a vector     [,1] [,2][1,]   30   21[2,]   42   30

Mathematical reasoning and language notation

The matrix left-division operator concisely expresses some semantic properties of matrices. As in the scalar equivalent, if the (determinant of the) coefficient (matrix) A is not null then it is possible to solve the (vectorial) equation A * x = b by left-multiplying both sides by the inverse of A: A−1 (in both MATLAB and GNU Octave languages: A^-1). The following mathematical statements hold when A is a full rank square matrix:

A^-1 *(A * x)==A^-1 * (b)
(A^-1 * A)* x ==A^-1 * b     (matrix-multiplication associativity)
x = A^-1 * b

where == is the equivalence relational operator. The previous statements are also valid MATLAB expressions if the third one is executed before the others (numerical comparisons may be false because of round-off errors).

If the system is overdetermined – so that A has more rows than columns – the pseudoinverse A+ (in MATLAB and GNU Octave languages: pinv(A)) can replace the inverse A−1, as follows:

pinv(A)*(A*x)==pinv(A)*(b)
(pinv(A)*A)*x==pinv(A)*b    (matrix-multiplication associativity)
x=pinv(A)*b

However, these solutions are neither the most concise ones (e.g. still remains the need to notationally differentiate overdetermined systems) nor the most computationally efficient. The latter point is easy to understand when considering again the scalar equivalent a * x = b, for which the solution x = a^-1 * b would require two operations instead of the more efficient x = b / a. The problem is that generally matrix multiplications are not commutative as the extension of the scalar solution to the matrix case would require:

(a * x)/ a ==b / a
(x * a)/ a ==b / a    (commutativity does not hold for matrices!)
x * (a / a)==b / a    (associativity also holds for matrices)
x = b / a

The MATLAB language introduces the left-division operator \ to maintain the essential part of the analogy with the scalar case, therefore simplifying the mathematical reasoning and preserving the conciseness:

A \ (A * x)==A \ b
(A \ A)* x ==A \ b    (associativity also holds for matrices, commutativity is no more required)
x = A \ b

This is not only an example of terse array programming from the coding point of view but also from the computational efficiency perspective, which in several array programming languages benefits from quite efficient linear algebra libraries such as ATLAS or LAPACK. [9]

Returning to the previous quotation of Iverson, the rationale behind it should now be evident:

it is important to distinguish the difficulty of describing and of learning a piece of notation from the difficulty of mastering its implications. For example, learning the rules for computing a matrix product is easy, but a mastery of its implications (such as its associativity, its distributivity over addition, and its ability to represent linear functions and geometric operations) is a different and much more difficult matter. Indeed, the very suggestiveness of a notation may make it seem harder to learn because of the many properties it suggests for explorations.

Third-party libraries

The use of specialized and efficient libraries to provide more terse abstractions is also common in other programming languages. In C++ several linear algebra libraries exploit the language's ability to overload operators. In some cases a very terse abstraction in those languages is explicitly influenced by the array programming paradigm, as the NumPy extension library to Python, Armadillo and Blitz++ libraries do. [10] [11]

See also

Related Research Articles

<span class="mw-page-title-main">APL (programming language)</span> Functional programming language for arrays

APL is a programming language developed in the 1960s by Kenneth E. Iverson. Its central datatype is the multidimensional array. It uses a large range of special graphic symbols to represent most functions and operators, leading to very concise code. It has been an important influence on the development of concept modeling, spreadsheets, functional programming, and computer math packages. It has also inspired several other programming languages.

<span class="mw-page-title-main">Binary operation</span> Mathematical operation with two operands

In mathematics, a binary operation or dyadic operation is a rule for combining two elements to produce another element. More formally, a binary operation is an operation of arity two.

<span class="mw-page-title-main">MATLAB</span> Numerical computing environment and programming language

MATLAB is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages.

<span class="mw-page-title-main">GNU Octave</span> Numerical analysis programming language

GNU Octave is a scientific programming language for scientific computing and numerical computation. Octave helps in solving linear and nonlinear problems numerically, and for performing other numerical experiments using a language that is mostly compatible with MATLAB. It may also be used as a batch-oriented language. As part of the GNU Project, it is free software under the terms of the GNU General Public License.

In linear algebra, a diagonal matrix is a matrix in which the entries outside the main diagonal are all zero; the term usually refers to square matrices. Elements of the main diagonal can either be zero or nonzero. An example of a 2×2 diagonal matrix is , while an example of a 3×3 diagonal matrix is. An identity matrix of any size, or any multiple of it is a diagonal matrix called scalar matrix, for example, . In geometry, a diagonal matrix may be used as a scaling matrix, since matrix multiplication with it results in changing scale (size) and possibly also shape; only a scalar matrix results in uniform change in scale.

In mathematics, an algebra over a field is a vector space equipped with a bilinear product. Thus, an algebra is an algebraic structure consisting of a set together with operations of multiplication and addition and scalar multiplication by elements of a field and satisfying the axioms implied by "vector space" and "bilinear".

Perl Data Language is a set of free software array programming extensions to the Perl programming language. PDL extends the data structures built into Perl, to include large multidimensional arrays, and adds functionality to manipulate those arrays as vector objects. It also provides tools for image processing, machine learning, computer modeling of physical systems, and graphical plotting and presentation. Simple operations are automatically vectorized across complete arrays, and higher-dimensional operations are supported.

Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the de facto standard low-level routines for linear algebra libraries; the routines have bindings for both C and Fortran. Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. BLAS implementations will take advantage of special floating point hardware such as vector registers or SIMD instructions.

<span class="mw-page-title-main">Row- and column-major order</span> Array representation in computer memory

In computing, row-major order and column-major order are methods for storing multidimensional arrays in linear storage such as random access memory.

The programming language APL is distinctive in being symbolic rather than lexical: its primitives are denoted by symbols, not words. These symbols were originally devised as a mathematical notation to describe algorithms. APL programmers often assign informal names when discussing functions and operators but the core functions and operators provided by the language are denoted by non-textual symbols.

<span class="mw-page-title-main">Vectorization (mathematics)</span> Conversion of a matrix or a tensor to a vector

In mathematics, especially in linear algebra and matrix theory, the vectorization of a matrix is a linear transformation which converts the matrix into a vector. Specifically, the vectorization of a m × n matrix A, denoted vec(A), is the mn × 1 column vector obtained by stacking the columns of the matrix A on top of one another:

In computer programming, a scientific programming language can refer to two degrees of the same concept.

The TomSym MATLAB symbolic modeling engine is a platform for modeling applied optimization and optimal control problems.

<span class="mw-page-title-main">Pure (programming language)</span> Functional programming language

Pure, successor to the equational language Q, is a dynamically typed, functional programming language based on term rewriting. It has facilities for user-defined operator syntax, macros, arbitrary-precision arithmetic, and compiling to native code through the LLVM. Pure is free and open-source software distributed (mostly) under the GNU Lesser General Public License version 3 or later.

<span class="mw-page-title-main">Matrix (mathematics)</span> Array of numbers

In mathematics, a matrix is a rectangular array or table of numbers, symbols, or expressions, arranged in rows and columns, which is used to represent a mathematical object or a property of such an object.

In computer science, array is a data type that represents a collection of elements, each selected by one or more indices that can be computed at run time during program execution. Such a collection is usually called an array variable or array value. By analogy with the mathematical concepts vector and matrix, array types with one and two indices are often called vector type and matrix type, respectively. More generally, a multidimensional array type can be called a tensor type, by analogy with the physical concept, tensor.

<span class="mw-page-title-main">Hadamard product (matrices)</span> Matrix operation

In mathematics, the Hadamard product is a binary operation that takes in two matrices of the same dimensions and returns a matrix of the multiplied corresponding elements. This operation can be thought as a "naive matrix multiplication" and is different from the matrix product. It is attributed to, and named after, either French mathematician Jacques Hadamard or German mathematician Issai Schur.

<span class="mw-page-title-main">GraphBLAS</span> API for graph data and graph operations

GraphBLAS is an API specification that defines standard building blocks for graph algorithms in the language of linear algebra. GraphBLAS is built upon the notion that a sparse matrix can be used to represent graphs as either an adjacency matrix or an incidence matrix. The GraphBLAS specification describes how graph operations can be efficiently implemented via linear algebraic methods over different semirings.

References

  1. Stéfan van der Walt; S. Chris Colbert & Gaël Varoquaux (2011). "The NumPy array: a structure for efficient numerical computation". Computing in Science and Engineering. IEEE. 13 (2): 22–30. arXiv: 1102.1523 . Bibcode:2011CSE....13b..22V. doi:10.1109/mcse.2011.37. S2CID   16907816.
  2. Iverson, K. E. (1980). "Notation as a Tool of Thought". Communications of the ACM. 23 (8): 444–465. doi: 10.1145/358896.358899 .
  3. Surana P (2006). Meta-Compilation of Language Abstractions (Thesis).
  4. Kuketayev. "The Data Abstraction Penalty (DAP) Benchmark for Small Objects in Java". Archived from the original on 2009-01-11. Retrieved 2008-03-17.
  5. Chatzigeorgiou; Stephanides (2002). "Evaluating Performance and Power Of Object-Oriented Vs. Procedural Programming Languages". In Blieberger; Strohmeier (eds.). Proceedings - 7th International Conference on Reliable Software Technologies - Ada-Europe'2002. Springer. p. 367. ISBN   978-3-540-43784-0.
  6. Ada Reference Manual: G.3.1 Real Vectors and Matrices
  7. "GNU Octave Manual. Arithmetic Operators" . Retrieved 2011-03-19.
  8. "MATLAB documentation. Arithmetic Operators". Archived from the original on 2010-09-07. Retrieved 2011-03-19.
  9. "GNU Octave Manual. Appendix G Installing Octave" . Retrieved 2011-03-19.
  10. "Reference for Armadillo 1.1.8. Examples of Matlab/Octave syntax and conceptually corresponding Armadillo syntax" . Retrieved 2011-03-19.
  11. "Blitz++ User's Guide. 3. Array Expressions". Archived from the original on 2011-03-23. Retrieved 2011-03-19.