Loop fission and fusion

Last updated

In computer science, loop fission (or loop distribution) is a compiler optimization in which a loop is broken into multiple loops over the same index range with each taking only a part of the original loop's body. [1] [2] The goal is to break down a large loop body into smaller ones to achieve better utilization of locality of reference. This optimization is most efficient in multi-core processors that can split a task into multiple tasks for each processor.

Contents

Conversely, loop fusion (or loop jamming) is a compiler optimization and loop transformation which replaces multiple loops with a single one. [3] [2] Loop fusion does not always improve run-time speed. On some architectures, two loops may actually perform better than one loop because, for example, there is increased data locality within each loop. One of the main benefits of loop fusion is that it allows temporary allocations to be avoided, which can lead to huge performance gains in numerical computing languages such as Julia when doing elementwise operations on arrays (however, Julia's loop fusion is not technically a compiler optimization, but a syntactic guarantee of the language). [4]

Other benefits of loop fusion are that it avoids the overhead of the loop control structures, and also that it allows the loop body to be parallelized by the processor [5] by taking advantage of instruction-level parallelism. This is possible when there are no data dependencies between the bodies of the two loops (this is in stark contrast to the other main benefit of loop fusion described above, which only presents itself when there are data dependencies that require an intermediate allocation to store the results). If loop fusion is able to remove redundant allocations, performance increases can be large. [4] Otherwise, there is a more complex trade-off between data locality, instruction-level parallelism, and loop overhead (branching, incrementing, etc.) that may make loop fusion, loop fission, or neither, the preferable optimization.

Fission

Example in C

inti,a[100],b[100];for(i=0;i<100;i++){a[i]=1;b[i]=2;}

is equivalent to:

inti,a[100],b[100];for(i=0;i<100;i++){a[i]=1;}for(i=0;i<100;i++){b[i]=2;}

Fusion

Example in C++ and MATLAB

Consider the following MATLAB code:

x=0:999;% Create an array of numbers from 0 to 999 (range is inclusive)y=sin(x)+4;% Take the sine of x (element-wise) and add 4 to each element

You could achieve the same syntax in C++ by using function and operator overloading:

#include<cmath>#include<cassert>#include<memory>#include<iostream>classArray{size_tlength;std::unique_ptr<float[]>data;// Internal constructor that produces an uninitialized arrayArray(size_tn):length(n),data(newfloat[n]){}public:// Factory method to produce an array over an integer range (the upper// bound is exclusive, unlike MATLAB's ranges).staticArrayRange(size_tstart,size_tend){assert(end>start);size_tlength=end-start;Arraya(length);for(size_ti=0;i<length;++i){a[i]=start+i;}returna;}// Basic array operationssize_tsize()const{returnlength;}float&operator[](size_ti){returndata[i];}constfloat&operator[](size_ti)const{returndata[i];}// Declare an overloaded addition operator as a free friend function (this// syntax defines operator+ as a free function that is a friend of this// class, despite it appearing as a member function declaration).friendArrayoperator+(constArray&a,floatb){Arrayc(a.size());for(size_ti=0;i<a.size();++i){c[i]=a[i]+b;}returnc;}// Similarly, we can define an overload for the sin() function. In practice,// it would be unwieldy to define all possible overloaded math operations as// friends inside the class like this, but this is just an example.friendArraysin(constArray&a){Arrayb(a.size());for(size_ti=0;i<a.size();++i){b[i]=std::sin(a[i]);}returnb;}};intmain(intargc,char*argv[]){// Here, we perform the same computation as the MATLAB exampleautox=Array::Range(0,1000);autoy=sin(x)+4;// Print the result out - just to make sure the optimizer doesn't remove// everything (if it's smart enough to do so).std::cout<<"The result is: "<<std::endl;for(size_ti=0;i<y.size();++i){std::cout<<y[i]<<std::endl;}return0;}

However, the above example unnecessarily allocates a temporary array for the result of sin(x). A more efficient implementation would allocate a single array for y, and compute y in a single loop. To optimize this, a C++ compiler would need to:

  1. Inline the sin and operator+ function calls.
  2. Fuse the loops into a single loop.
  3. Remove the unused stores into the temporary arrays (can use a register or stack variable instead).
  4. Remove the unused allocation and free.

All of these steps are individually possible. Even step four is possible despite the fact that functions like malloc and free have global side effects, since some compilers hardcode symbols such as malloc and free so that they can remove unused allocations from the code. [6] However, as of clang 12.0.0 and gcc 11.1, this loop fusion and redundant allocation removal does not occur - even on the highest optimization level. [7] [8]

Some languages specifically targeted towards numerical computing such as Julia might have the concept of loop fusion built into it at a high level, where the compiler will notice adjacent elementwise operations and fuse them into a single loop. [9] Currently, to achieve the same syntax in general purpose languages like C++, the sin and operator+ functions must pessimistically allocate arrays to store their results, since they do not know what context they will be called from. This issue can be avoided in C++ by using a different syntax that does not rely on the compiler to remove unnecessary temporary allocations (e.g., using functions and overloads for in-place operations, such as operator+= or std::transform).

See also

Related Research Articles

Templates are a feature of the C++ programming language that allows functions and classes to operate with generic types. This allows a function or class declaration to reference via a generic variable another different class without creating full declaration for each of these different classes.

In computer programming, an iterator is an object that enables a programmer to traverse a container, particularly lists. Various types of iterators are often provided via a container's interface. Though the interface and semantics of a given iterator are fixed, iterators are often implemented in terms of the structures underlying a container implementation and are often tightly coupled to the container to enable the operational semantics of the iterator. An iterator performs traversal and also gives access to data elements in a container, but does not itself perform iteration.

Template metaprogramming (TMP) is a metaprogramming technique in which templates are used by a compiler to generate temporary source code, which is merged by the compiler with the rest of the source code and then compiled. The output of these templates can include compile-time constants, data structures, and complete functions. The use of templates can be thought of as compile-time polymorphism. The technique is used by a number of languages, the best-known being C++, but also Curl, D, Nim, and XL.

<span class="mw-page-title-main">D (programming language)</span> Multi-paradigm system programming language

D, also known as dlang, is a multi-paradigm system programming language created by Walter Bright at Digital Mars and released in 2001. Andrei Alexandrescu joined the design and development effort in 2007. Though it originated as a re-engineering of C++, D is a profoundly different language —features of D can be considered streamlined and expanded-upon ideas from C++, however D also draws inspiration from other high-level programming languages, notably Java, Python, Ruby, C#, and Eiffel.

<span class="mw-page-title-main">C syntax</span> Set of rules defining correctly structured programs

The syntax of the C programming language is the set of rules governing writing of software in the C language. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

In computer programming, a function object is a construct allowing an object to be invoked or called as if it were an ordinary function, usually with the same syntax. In some languages, particularly C++, function objects are often called functors.

In the C++ programming language, a copy constructor is a special constructor for creating a new object as a copy of an existing object. Copy constructors are the standard way of copying objects in C++, as opposed to cloning, and have C++-specific nuances.

In computer science, array programming refers to solutions that allow the application of operations to an entire set of values at once. Such solutions are commonly used in scientific and engineering settings.

C++/CLI is a variant of the C++ programming language, modified for Common Language Infrastructure. It has been part of Visual Studio 2005 and later, and provides interoperability with other .NET languages such as C#. Microsoft created C++/CLI to supersede Managed Extensions for C++. In December 2005, Ecma International published C++/CLI specifications as the ECMA-372 standard.

In the C++ programming language, new and delete are a pair of language constructs that perform dynamic memory allocation, object construction and object destruction.

A class in C++ is a user-defined type or data structure declared with keyword class that has data and functions as its members whose access is governed by the three access specifiers private, protected or public. By default access to members of a C++ class is private. The private members are not accessible outside the class; they can be accessed only through methods of the class. The public members form an interface to the class and are accessible outside the class.

C++11 is a version of the ISO/IEC 14882 standard for the C++ programming language. C++11 replaced the prior version of the C++ standard, called C++03, and was later replaced by C++14. The name follows the tradition of naming language versions by the publication year of the specification, though it was formerly named C++0x because it was expected to be published before 2010.

A property, in some object-oriented programming languages, is a special sort of class member, intermediate in functionality between a field and a method. The syntax for reading and writing of properties is like for fields, but property reads and writes are (usually) translated to 'getter' and 'setter' method calls. The field-like syntax is easier to read and write than many method calls, yet the interposition of method calls "under the hood" allows for data validation, active updating, or implementation of what may be called "read-only fields".

In computer programming, an anonymous function is a function definition that is not bound to an identifier. Anonymous functions are often arguments being passed to higher-order functions or used for constructing the result of a higher-order function that needs to return a function. If the function is only used once, or a limited number of times, an anonymous function may be syntactically lighter than using a named function. Anonymous functions are ubiquitous in functional programming languages and other languages with first-class functions, where they fulfil the same role for the function type as literals do for other data types.

C++ doesn't have:

In the C++ programming language, placement syntax allows programmers to explicitly specify the memory management of individual objects — i.e. their "placement" in memory. Normally, when an object is created dynamically, an allocation function is invoked in such a way that it will both allocate memory for the object, and initialize the object within the newly allocated memory. The placement syntax allows the programmer to supply additional arguments to the allocation function. A common use is to supply a pointer to a suitable region of storage where the object can be initialized, thus separating memory allocation from object construction.

Expression templates are a C++ template metaprogramming technique that builds structures representing a computation at compile time, where expressions are evaluated only as needed to produce efficient code for the entire computation. Expression templates thus allow programmers to bypass the normal order of evaluation of the C++ language and achieve optimizations such as loop fusion.

C++14 is a version of the ISO/IEC 14882 standard for the C++ programming language. It is intended to be a small extension over C++11, featuring mainly bug fixes and small improvements, and was replaced by C++17. Its approval was announced on August 18, 2014. C++14 was published as ISO/IEC 14882:2014 in December 2014.

<span class="mw-page-title-main">Nim (programming language)</span> Programming language

Nim is a general-purpose, multi-paradigm, statically typed, compiled high-level systems programming language, designed and developed by a team around Andreas Rumpf. Nim is designed to be "efficient, expressive, and elegant", supporting metaprogramming, functional, message passing, procedural, and object-oriented programming styles by providing several features such as compile time code generation, algebraic data types, a foreign function interface (FFI) with C, C++, Objective-C, and JavaScript, and supporting compiling to those same languages as intermediate representations.

<span class="mw-page-title-main">Zig (programming language)</span> A general-purpose programming language, and toolchain to build Zig/C/C++ code

Zig is an imperative, general-purpose, statically typed, compiled system programming language designed by Andrew Kelley. It is intended to be a replacement for the C programming language, with the goals of being even smaller and simpler to program in while also offering modern features, new optimizations and a variety of safety mechanisms while not as demanding of runtime safety as seen in other languages. It is distinct from languages like Go, Rust and Carbon, which have similar goals but also target the C++ space.

References

  1. Y.N. Srikant; Priti Shankar (3 October 2018). The Compiler Design Handbook: Optimizations and Machine Code Generation, Second Edition. CRC Press. ISBN   978-1-4200-4383-9.
  2. 1 2 Kennedy, Ken & Allen, Randy. (2001). Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann. ISBN   1-55860-286-0.
  3. Steven Muchnick; Muchnick and Associates (15 August 1997). Advanced Compiler Design Implementation . Morgan Kaufmann. ISBN   978-1-55860-320-2. loop fusion.
  4. 1 2 Johnson, Steven G. (21 January 2017). "More Dots: Syntactic Loop Fusion in Julia". julialang.org. Retrieved 25 June 2021.
  5. "Loop Fusion". Intel. Retrieved 2021-06-25.
  6. Godbolt, Matt. "Compiler Explorer - C++ (x86-64 clang 12.0.0)". godbolt.org. Retrieved 2021-06-25.
  7. Godbolt, Matt. "Compiler Explorer - C++ (x86-64 clang 12.0.0)". godbolt.org. Retrieved 2021-06-25.
  8. Godbolt, Matt. "Compiler Explorer - C++ (x86-64 gcc 11.1)". godbolt.org. Retrieved 2021-06-25.
  9. "Functions · The Julia Language". docs.julialang.org. Retrieved 2021-06-25.