Developer(s) | Intel Developer Products |
---|---|
Stable release | 2021.4 / October 1, 2021 [1] |
Operating system | Windows and Linux (UI-only on macOS) |
Type | Profiler |
License | Free and commercial support |
Website | software |
Intel Advisor (also known as "Advisor XE", "Vectorization Advisor" or "Threading Advisor") is a design assistance and analysis tool for SIMD vectorization, threading, memory use, and GPU offload optimization. The tool supports C, C++, Data Parallel C++ (DPC++), Fortran and Python languages. It is available on Windows and Linux operating systems in form of Standalone GUI tool, Microsoft Visual Studio plug-in or command line interface. [2] It supports OpenMP (and usage with MPI). Intel Advisor user interface is also available on macOS.
Intel Advisor is available for free as a stand-alone tool or as part of the Intel oneAPI Base Toolkit. Optional paid commercial support is available for the oneAPI Base Toolkit.
Vectorization is the operation of Single Instruction Multiple Data (SIMD) instructions (like Intel Advanced Vector Extensions and Intel Advanced Vector Extensions 512) on multiple objects in parallel within a single CPU core. This can greatly increase performance by reducing loop overhead and making better use of the multiple math units in each core.
Intel Advisor helps find the loops that will benefit from better vectorization, identify where it is safe to force compiler vectorization. [3] It supports analysis of scalar, SSE, AVX, AVX2 and AVX-512-enabled codes generated by Intel, GNU and Microsoft compilers auto-vectorization. It also supports analysis of "explicitly" vectorized codes which use OpenMP 4.x and newer as well as codes or written using C vector intrinsics or assembly language. [4] [5]
Intel Advisor automates the Roofline Performance Model first proposed at Berkeley [6] and extended at the University of Lisbon. [7]
Advisor "Roofline Analysis" helps to identify if given loop/function is memory or CPU bound. It also identifies under optimized loops that can have a high impact on performance if optimized. [8] [9] [10] [11]
Intel Advisor also provides an automated memory-level roofline implementation that is closer to the classical Roofline model. Classical Roofline is especially instrumental for high performance computing applications that are DRAM-bound. Advisor memory level roofline analyzes cache data and evaluates the data transactions between different memory layers to provide guidance for improvement. [12]
Intel Advisor roofline analysis supports code running on CPU or GPU. [13] [14] It also supports integer based applications - that is heavily used in machine learning, big data domains, database applications, financial applications like crypto-coins. [15]
Software architects add code annotations to describe threading that are understood by Advisor, but ignored by the compiler. Advisor then projects the scalability of the threading and checks for synchronization errors. Advisor Threading "Suitability" feature helps to predict and compare the parallel SMP scalability and performance losses for different possible threading designs. Typical Suitability reports are shown on Suitability CPU screen-shot on the right side. Advisor Suitability provides dataset size (iteration space) modeling capabilities and performance penalties break-down (exposing negative impact caused by Load Imbalance, Parallel Runtimes Overhead and Lock Contention). [16]
Advisor adds GPU offload performance modeling feature in the 2021 release. It collects application performance characteristics on a baseline platform and builds analytical performance model for target (modelled) platform.
This provides performance speedup estimation on target GPUs and overhead estimations for offloading, data transfer and scheduling region execution and pinpoints performance bottlenecks. [17] [18] [19] This information can serve for choosing offload strategy: selecting regions to offload and anticipate potential code restructuring needed to make it GPU-ready.
Intel Advisor is used by Schlumberger, [20] Sandia national lab, and others [21] for design and parallel algorithm research and Vectorization Advisor capabilities known to be used by LRZ and ICHEC, [22] Daresbury Lab, [23] Pexip. [24]
The step-by-step workflow is used by academia for educational purposes. [25]
Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal and it can be directly accessible through an instruction set architecture (ISA), but it should not be confused with an ISA. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.
OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.
A graphics processing unit (GPU) is a specialized electronic circuit initially designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal computers, workstations, and game consoles. After their initial design, GPUs were found to be useful for non-graphic calculations involving embarrassingly parallel problems due to their parallel structure. Other non-graphical uses include the training of neural networks and cryptocurrency mining.
In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.
VTune Profiler is a performance analysis tool for x86-based machines running Linux or Microsoft Windows operating systems. Many features work on both Intel and AMD hardware, but the advanced hardware-based sampling features require an Intel-manufactured CPU.
In computing, CUDA is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA API and its runtime: The CUDA API is an extension of the C programming language that adds the ability to specify thread-level parallelism in C and also to specify GPU device specific operations. CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels. In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries and developer tools to help programmers accelerate their applications.
Intel oneAPI DPC++/C++ Compiler and Intel C++ Compiler Classic are Intel’s C, C++, SYCL, and Data Parallel C++ (DPC++) compilers for Intel processor-based systems, available for Windows, Linux, and macOS operating systems.
Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like arrays and matrices by working on each element in parallel. It contrasts to task parallelism as another form of parallelism.
Larrabee is the codename for a cancelled GPGPU chip that Intel was developing separately from its current line of integrated graphics accelerators. It is named after either Mount Larrabee or Larrabee State Park in the state of Washington. The chip was to be released in 2010 as the core of a consumer 3D graphics card, but these plans were cancelled due to delays and disappointing early performance figures. The project to produce a GPU retail product directly from the Larrabee research project was terminated in May 2010 and its technology was passed on to the Xeon Phi. The Intel MIC multiprocessor architecture announced in 2010 inherited many design elements from the Larrabee project, but does not function as a graphics processing unit; the product is intended as a co-processor for high performance computing.
Intel Fortran Compiler, as part of Intel OneAPI HPC toolkit, is a group of Fortran compilers from Intel for Windows, macOS, and Linux.
OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies programming languages for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.
Intel Parallel Studio XE was a software development product developed by Intel that facilitated native code development on Windows, macOS and Linux in C++ and Fortran for parallel computing. Parallel programming enables software programs to take advantage of multi-core processors from Intel and other processor vendors.
Intel Inspector is a memory and thread checking and debugging tool to increase the reliability, security, and accuracy of C/C++ and Fortran applications.
Manycore processors are special kinds of multi-core processors designed for a high degree of parallel processing, containing numerous simpler, independent processor cores. Manycore processors are used extensively in embedded computers and high-performance computing.
Xeon Phi is a discontinued series of x86 manycore processors designed and made by Intel. It was intended for use in supercomputers, servers, and high-end workstations. Its architecture allowed use of standard programming languages and application programming interfaces (APIs) such as OpenMP.
Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.
SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators. It is a single-source embedded domain-specific language (eDSL) based on pure C++17. It is a standard developed by Khronos Group, announced in March 2014.
Intel Xe, earlier known unofficially as Gen12, is a GPU architecture developed by Intel.
oneAPI is an open standard, adopted by Intel, for a unified application programming interface (API) intended to be used across different computing accelerator (coprocessor) architectures, including GPUs, AI accelerators and field-programmable gate arrays. It is intended to eliminate the need for developers to maintain separate code bases, multiple programming languages, tools, and workflows for each architecture.