Intrinsic function

Last updated

In computer software, in compiler theory, an intrinsic function, also called built-in function or builtin function, is a function (subroutine) available for use in a given programming language whose implementation is handled specially by the compiler. Typically, it may substitute a sequence of automatically generated instructions for the original function call, similar to an inline function. [1] Unlike an inline function, the compiler has an intimate knowledge of an intrinsic function and can thus better integrate and optimize it for a given situation.

Contents

Compilers that implement intrinsic functions may enable them only when a program requests optimization, otherwise falling back to a default implementation provided by the language runtime system (environment).

Vectorization and parallelization

Intrinsic functions are often used to explicitly implement vectorization and parallelization in languages which do not address such constructs. Some application programming interfaces (API), for example, AltiVec and OpenMP, use intrinsic functions to declare, respectively, vectorizable and multiprocessing-aware operations during compiling. The compiler parses the intrinsic functions and converts them into vector math or multiprocessing object code appropriate for the target platform. Some intrinsics are used to provide additional constraints to the optimizer, such as values a variable cannot assume. [2]

By programming language

C and C++

Compilers for C and C++, of Microsoft, [3] Intel, [1] and the GNU Compiler Collection (GCC) [4] implement intrinsics that map directly to the x86 single instruction, multiple data (SIMD) instructions (MMX, Streaming SIMD Extensions (SSE), SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX512, FMA, ...). Intrinsics allow mapping to standard assembly instructions that are not normally accessible through C/C++, e.g., bit scan.

Some C and C++ compilers provide non-portable platform-specific intrinsics. Other intrinsics (such as GNU built-ins) are slightly more abstracted, approximating the abilities of several contemporary platforms, with portable fall back implementations on platforms with no appropriate instructions. [5] It is common for C++ libraries, such as glm or Sony's vector maths libraries, [6] to achieve portability via conditional compilation (based on platform specific compiler flags), providing fully portable high-level primitives (e.g., a four-element floating-point vector type) mapped onto the appropriate low level programming language implementations, while still benefiting from the C++ type system and inlining; hence the advantage over linking to hand-written assembly object files, using the C application binary interface (ABI).

Examples

The following are examples of signatures of intrinsic functions from Intel's set of intrinsic functions.

uint64_t__rdtsc();// return internal CPU clock counteruint64_t__popcnt64(uint64_tn);// count of bits set in nuint64_t_umul128(uint64_tFactor1,uint64_tFactor2,uint64_t*HighProduct);// 64 bit * 64 bit => 128 bit multiplication__m512_mm512_add_ps(__m512a,__m512b);// calculates a + b for two vectors of 16 floats__m512_mm512_fmadd_ps(__m512a,__m512b,__m512c);// calculates a*b + c for three vectors of 16 floats

[7]

Java

The HotSpot Java virtual machine's (JVM) just-in-time compiler also has intrinsics for specific Java APIs. [8] Hotspot intrinsics are standard Java APIs which may have one or more optimized implementation on some platforms.

PL/I

ANSI/ISO PL/I defines nearly 90 builtin functions. [9] These are conventionally grouped as follows: [10] :337–338

Individual compilers have added additional builtins specific to a machine architecture or operating system.

A builtin function is identified by leaving its name undeclared and allowing it to default, or by declaring it BUILTIN. A user-supplied function of the same name can be substituted by declaring it as ENTRY.

Related Research Articles

<span class="mw-page-title-main">Single instruction, multiple data</span> Type of parallel processing

Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal and it can be directly accessible through an instruction set architecture (ISA), but it should not be confused with an ISA. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.

AltiVec is a single-precision floating point and integer SIMD instruction set designed and owned by Apple, IBM, and Freescale Semiconductor — the AIM alliance. It is implemented on versions of the PowerPC processor architecture, including Motorola's G4, IBM's G5 and POWER6 processors, and P.A. Semi's PWRficient PA6T. AltiVec is a trademark owned solely by Freescale, so the system is also referred to as Velocity Engine by Apple and VMX by IBM and P.A. Semi.

SSE2 is one of the Intel SIMD processor supplementary instruction sets introduced by Intel with the initial version of the Pentium 4 in 2000. SSE2 instructions allow the use of XMM (SIMD) registers on x86 instruction set architecture processors. These registers can load up to 128 bits of data and perform instructions, such as vector addition and multiplication, simultaneously.

<span class="mw-page-title-main">OpenMP</span> Open standard for parallelizing

OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

In computing, especially digital signal processing, the multiply–accumulate (MAC) or multiply-add (MAD) operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator ; the operation itself is also often called a MAC or a MAD operation. The MAC operation modifies an accumulator a:

In computer programming, undefined behavior (UB) is the result of executing a program whose behavior is prescribed to be unpredictable, in the language specification of the programming language in which the source code is written. This is different from unspecified behavior, for which the language specification does not prescribe a result, and implementation-defined behavior that defers to the documentation of another component of the platform.

<span class="mw-page-title-main">C99</span> C programming language standard, 1999 revision

C99 is a past version of the C programming language open standard. It extends the previous version (C90) with new features for the language and the standard library, and helps implementations make better use of available computer hardware, such as IEEE 754-1985 floating-point arithmetic, and compiler technology. The C11 version of the C programming language standard, published in 2011, updates C99.

<span class="mw-page-title-main">LLVM</span> Compiler backend for multiple programming languages

LLVM is a set of compiler and toolchain technologies that can be used to develop a frontend for any programming language and a backend for any instruction set architecture. LLVM is designed around a language-independent intermediate representation (IR) that serves as a portable, high-level assembly language that can be optimized with a variety of transformations over multiple passes. The name LLVM originally stood for Low Level Virtual Machine, though the project has expanded and the name is no longer officially an initialism.

In computer programming, an inline assembler is a feature of some compilers that allows low-level code written in assembly language to be embedded within a program, among code that otherwise has been compiled from a higher-level language such as C or Ada.

<span class="mw-page-title-main">Hamming weight</span> Number of nonzero symbols in a string

The Hamming weight of a string is the number of symbols that are different from the zero-symbol of the alphabet used. It is thus equivalent to the Hamming distance from the all-zero string of the same length. For the most typical case, a string of bits, this is the number of 1's in the string, or the digit sum of the binary representation of a given number and the ₁ norm of a bit vector. In this binary case, it is also called the population count, popcount, sideways sum, or bit summation.

In C and related programming languages, long double refers to a floating-point data type that is often more precise than double precision though the language standard only requires it to be at least as precise as double. As with C's other floating-point types, it may not necessarily map to an IEEE format.

Interprocedural optimization (IPO) is a collection of compiler techniques used in computer programming to improve performance in programs containing many frequently used functions of small or medium length. IPO differs from other compiler optimizations by analyzing the entire program as opposed to a single function or block of code.

This article describes the calling conventions used when programming x86 architecture microprocessors.

Advanced Vector Extensions are SIMD extensions to the x86 instruction set architecture for microprocessors from Intel and Advanced Micro Devices (AMD). They were proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge microarchitecture shipping in Q1 2011 and later by AMD with the Bulldozer microarchitecture shipping in Q4 2011. AVX provides new features, new instructions, and a new coding scheme.

<span class="mw-page-title-main">OpenCL</span> Open standard for programming heterogenous computing systems, such as CPUs or GPUs

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies a programming language for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.

In computer software and hardware, find first set (ffs) or find first one is a bit operation that, given an unsigned machine word, designates the index or position of the least significant bit set to one in the word counting from the least significant bit position. A nearly equivalent operation is count trailing zeros (ctz) or number of trailing zeros (ntz), which counts the number of zero bits following the least significant one bit. The complementary operation that finds the index or position of the most significant set bit is log base 2, so called because it computes the binary logarithm ⌊log2(x)⌋. This is closely related to count leading zeros (clz) or number of leading zeros (nlz), which counts the number of zero bits preceding the most significant one bit. There are two common variants of find first set, the POSIX definition which starts indexing of bits at 1, herein labelled ffs, and the variant which starts indexing of bits at zero, which is equivalent to ctz and so will be called by that name.

Intel Advisor is a design assistance and analysis tool for SIMD vectorization, threading, memory use, and GPU offload optimization. The tool supports C, C++, Data Parallel C++ (DPC++), Fortran and Python languages. It is available on Windows and Linux operating systems in form of Standalone GUI tool, Microsoft Visual Studio plug-in or command line interface. It supports OpenMP. Intel Advisor user interface is also available on macOS.

AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and first implemented in the 2016 Intel Xeon Phi x200, and then later in a number of AMD and other Intel CPUs. AVX-512 consists of multiple extensions that may be implemented independently. This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F is required by all AVX-512 implementations.

Permute instructions, part of bit manipulation as well as vector processing, copy unaltered contents from a source array to a destination array, where the indices are specified by a second source array. The size (bitwidth) of the source elements is not restricted but remains the same as the destination size.

References

  1. 1 2 "Intel® C++ Compiler 19.1 Developer Guide and Reference". Intel® C++ Compiler Documentation. 16 December 2019. Retrieved 2020-01-17.
  2. The Clang Team (2020). "Clang Language Extensions". Clang 11 documentation. Retrieved 2020-01-17. Builtin Functions
  3. MSDN. "Compiler Intrinsics". Microsoft . Retrieved 2012-06-20.
  4. GCC documentation. "Built-in Functions Specific to Particular Target Machines". Free Software Foundation . Retrieved 2012-06-20.
  5. "Vector Extensions". Using the GNU Compiler Collection (GCC). Retrieved 2020-01-16.
  6. "Sony open sources Vector Math and SIMD math libraries (Cell PPU/SPU/other platforms)". Beyond3D Forum. Archived from the original on 2016-06-24. Retrieved 2020-01-17.
  7. Intel Intrinsics
  8. Mok, Kris (25 February 2013). "Intrinsic Methods in HotSpot VM". Slideshare. Retrieved 2014-12-20.
  9. ANSI X3 Committee (1976). American National Standard programming language PL/I.{{cite book}}: CS1 maint: numeric names: authors list (link)
  10. IBM Corporation (1995). IBM PL/I for MVS & VM Language Reference.