Intel Advisor

Developer(s): Intel Developer Products
Stable release: 2021.4 / October 1, 2021 [1]
Operating system: Windows and Linux (UI-only on macOS)
Type: Profiler
License: Freeware; optional paid commercial support
Website: software.intel.com/content/www/us/en/develop/tools/oneapi/components/advisor.html

Intel Advisor (also known as "Advisor XE", "Vectorization Advisor" or "Threading Advisor") is a design-assistance and analysis tool for SIMD vectorization, threading, memory use, and GPU offload optimization. It supports the C, C++, Data Parallel C++ (DPC++), Fortran, and Python languages. On Windows and Linux it is available as a standalone GUI tool, a Microsoft Visual Studio plug-in, or a command-line interface. [2] It supports OpenMP (and usage with MPI). The Intel Advisor user interface is also available on macOS.

Intel Advisor is available for free as a stand-alone tool or as part of the Intel oneAPI Base Toolkit. Optional paid commercial support is available for the oneAPI Base Toolkit.

Features

Vectorization optimization

Vectorization uses Single Instruction, Multiple Data (SIMD) instructions (such as Intel Advanced Vector Extensions (AVX) and AVX-512) to operate on multiple data elements in parallel within a single CPU core. This can greatly increase performance by reducing loop overhead and making better use of each core's SIMD and math units.
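
The difference a vectorizable loop makes can be illustrated with a minimal sketch (not tied to any particular compiler): the first loop's iterations are independent, so a compiler can map them onto SIMD lanes; the second has a loop-carried dependence that blocks vectorization.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: a compiler can process several elements per
// SIMD instruction (SSE/AVX/AVX-512), since no iteration reads another
// iteration's result.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = a * x[i] + y[i];
}

// Loop-carried dependence: y[i] needs the freshly written y[i-1], so the
// iterations cannot run in parallel SIMD lanes as written. Advisor's
// Dependencies analysis flags loops like this.
void running_sum(std::vector<float>& y) {
    for (std::size_t i = 1; i < y.size(); ++i)
        y[i] += y[i - 1];
}
```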

Intel Advisor helps find the loops that will benefit most from vectorization and identifies where it is safe to force compiler vectorization. [3] It supports analysis of scalar, SSE, AVX, AVX2, and AVX-512 code auto-vectorized by the Intel, GNU, and Microsoft compilers. It also supports analysis of "explicitly" vectorized code that uses OpenMP 4.x and newer, as well as code written using C vector intrinsics or assembly language. [4] [5]
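
As a sketch of the explicit style, an OpenMP 4.x `simd` directive tells the compiler that the iterations are safe to vectorize (compilers built without OpenMP support simply ignore the pragma):

```cpp
#include <cstddef>

// The OpenMP `simd` construct asserts that iterations are independent and
// requests SIMD code generation even where auto-vectorization heuristics
// would decline; `reduction` handles the cross-lane accumulation of `sum`.
float dot(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
#pragma omp simd reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```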

Automated Roofline analysis

Intel Advisor automates the Roofline Performance Model first proposed at Berkeley [6] and extended at the University of Lisbon. [7]

Roofline Performance Model automation integrated with other features in Intel Advisor. Each circle corresponds to one loop or function.

Advisor "Roofline Analysis" helps identify whether a given loop or function is memory-bound or compute-bound. It also identifies under-optimized loops that can have a high impact on performance if optimized. [8] [9] [10] [11]

Intel Advisor also provides an automated memory-level roofline implementation that is closer to the classical Roofline model. The classical Roofline is especially instrumental for high-performance computing applications that are DRAM-bound. Advisor's memory-level roofline analyzes cache data and evaluates the data transfers between different memory levels to provide guidance for improvement. [12]
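
The underlying model is simple to state: attainable performance is the lesser of peak compute throughput and arithmetic intensity (FLOPs per byte of memory traffic) times memory bandwidth. A sketch with illustrative numbers, not Advisor measurements:

```cpp
#include <algorithm>

// Roofline bound: min(peak compute, arithmetic intensity * bandwidth).
double roofline_gflops(double ai_flops_per_byte,
                       double peak_gflops,
                       double bandwidth_gb_per_s) {
    return std::min(peak_gflops, ai_flops_per_byte * bandwidth_gb_per_s);
}

// A STREAM-triad-like loop a[i] = b[i] + s * c[i] on doubles performs
// 2 FLOPs per iteration and moves 3 * 8 bytes (two loads, one store),
// giving AI = 2 / 24, about 0.083 FLOP/byte: firmly in the memory-bound
// region of the chart for typical CPUs.
double triad_intensity() { return 2.0 / 24.0; }
```

With, say, 100 GFLOP/s peak and 120 GB/s of DRAM bandwidth, the triad kernel is capped near 10 GFLOP/s by memory, far below the compute roof.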

Intel Advisor roofline analysis supports code running on the CPU or GPU. [13] [14] It also supports integer-based applications, which are heavily used in machine learning, big data, database, and financial domains such as cryptocurrency processing. [15]

Threading prototyping

Software architects add code annotations that describe a proposed threading design; the annotations are understood by Advisor but ignored by the compiler. Advisor then projects the scalability of that threading and checks for synchronization errors. The Threading "Suitability" feature helps predict and compare the parallel SMP scalability and performance losses of different candidate threading designs. Typical Suitability reports are shown in the Suitability CPU screenshot on the right. Suitability also provides dataset-size (iteration-space) modeling and a breakdown of performance penalties, exposing the negative impact of load imbalance, parallel-runtime overhead, and lock contention. [16]

Suitability "CPU model"
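
A sketch of what such annotations look like, using the macro names from Advisor's `advisor-annotate.h` header (no-op fallbacks are defined here so the snippet compiles and runs serially without Advisor installed):

```cpp
// With Advisor installed, #include <advisor-annotate.h> supplies these
// macros; the stubs below keep the sketch self-contained. Either way the
// loop executes serially -- the annotations only describe a threading
// design for Advisor's Suitability and Dependencies analyses to model.
#ifndef ANNOTATE_SITE_BEGIN
#define ANNOTATE_SITE_BEGIN(site)
#define ANNOTATE_SITE_END()
#define ANNOTATE_ITERATION_TASK(task)
#endif

#include <cstddef>
#include <vector>

void scale_all(std::vector<double>& v, double s) {
    ANNOTATE_SITE_BEGIN(scale_site);          // proposed parallel region
    for (std::size_t i = 0; i < v.size(); ++i) {
        ANNOTATE_ITERATION_TASK(scale_task);  // each iteration = one task
        v[i] *= s;
    }
    ANNOTATE_SITE_END();
}
```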

Offload modelling

Advisor added a GPU offload performance-modeling feature in the 2021 release. It collects application performance characteristics on a baseline platform and builds an analytical performance model for the target (modeled) platform.

This yields estimated speedups on the target GPU, estimates of the overheads of offloading (data transfer, scheduling, and region execution), and pinpoints performance bottlenecks. [17] [18] [19] This information helps in choosing an offload strategy: selecting which regions to offload and anticipating the code restructuring needed to make them GPU-ready.
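
On the command line the workflow looks roughly like the following; the exact flag spellings vary across Advisor releases, so treat this as an illustrative sketch and check `advisor --help collect` for the installed version:

```shell
# Run the application on the baseline CPU, collect its performance
# characteristics, and model offload to a target GPU. The project
# directory name is arbitrary; the collection writes an interactive
# report summarizing modeled speedups and data-transfer costs.
advisor --collect=offload --project-dir=./advi_results -- ./myapp
```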

Customer usage

Intel Advisor is used by Schlumberger, [20] Sandia National Laboratories, and others [21] for design and parallel-algorithm research, and its Vectorization Advisor capabilities are known to be used by LRZ, ICHEC, [22] Daresbury Laboratory, [23] and Pexip. [24]

Its step-by-step workflow is also used in academia for educational purposes. [25]

Related Research Articles

Single instruction, multiple data

Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. SIMD may be used internally by a processor or exposed directly through an instruction set architecture (ISA), but it should not be confused with an ISA itself.

OpenMP

OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

Graphics processing unit

A graphics processing unit (GPU) is a specialized electronic circuit initially designed for digital image processing and to accelerate computer graphics, being present either as a discrete video card or embedded on motherboards, mobile phones, personal computers, workstations, and game consoles. After their initial design, GPUs were found to be useful for non-graphic calculations involving embarrassingly parallel problems due to their parallel structure. Other non-graphical uses include the training of neural networks and cryptocurrency mining.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

VTune Profiler is a performance analysis tool for x86-based machines running Linux or Microsoft Windows operating systems. Many features work on both Intel and AMD hardware, but the advanced hardware-based sampling features require an Intel-manufactured CPU.

CUDA

In computing, CUDA is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). The CUDA API is an extension of the C programming language that adds the ability to specify thread-level parallelism and GPU-device-specific operations. CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels. In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries, and developer tools to help programmers accelerate their applications.

Intel oneAPI DPC++/C++ Compiler and Intel C++ Compiler Classic are Intel’s C, C++, SYCL, and Data Parallel C++ (DPC++) compilers for Intel processor-based systems, available for Windows, Linux, and macOS operating systems.

Data parallelism

Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like arrays and matrices by working on each element in parallel. It contrasts to task parallelism as another form of parallelism.

Larrabee (microarchitecture)

Larrabee is the codename for a cancelled GPGPU chip that Intel was developing separately from its current line of integrated graphics accelerators. It is named after either Mount Larrabee or Larrabee State Park in the state of Washington. The chip was to be released in 2010 as the core of a consumer 3D graphics card, but these plans were cancelled due to delays and disappointing early performance figures. The project to produce a GPU retail product directly from the Larrabee research project was terminated in May 2010 and its technology was passed on to the Xeon Phi. The Intel MIC multiprocessor architecture announced in 2010 inherited many design elements from the Larrabee project, but does not function as a graphics processing unit; the product is intended as a co-processor for high performance computing.

Intel Fortran Compiler, as part of the Intel oneAPI HPC Toolkit, is a group of Fortran compilers from Intel for Windows, macOS, and Linux.

OpenCL

OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies programming languages for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.

Intel Parallel Studio XE was a software development product developed by Intel that facilitated native code development on Windows, macOS and Linux in C++ and Fortran for parallel computing. Parallel programming enables software programs to take advantage of multi-core processors from Intel and other processor vendors.

Intel Inspector is a memory and thread checking and debugging tool to increase the reliability, security, and accuracy of C/C++ and Fortran applications.

Manycore processors are special kinds of multi-core processors designed for a high degree of parallel processing, containing numerous simpler, independent processor cores. Manycore processors are used extensively in embedded computers and high-performance computing.

Xeon Phi

Xeon Phi is a discontinued series of x86 manycore processors designed and made by Intel. It was intended for use in supercomputers, servers, and high-end workstations. Its architecture allowed use of standard programming languages and application programming interfaces (APIs) such as OpenMP.

Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.

SYCL

SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators. It is a single-source embedded domain-specific language (eDSL) based on pure C++17. It is a standard developed by Khronos Group, announced in March 2014.

Intel Xe

Intel Xe, earlier known unofficially as Gen12, is a GPU architecture developed by Intel.

oneAPI (compute acceleration)

oneAPI is an open standard, adopted by Intel, for a unified application programming interface (API) intended to be used across different computing accelerator (coprocessor) architectures, including GPUs, AI accelerators and field-programmable gate arrays. It is intended to eliminate the need for developers to maintain separate code bases, multiple programming languages, tools, and workflows for each architecture.

References

  1. "Intel® Advisor Release Notes and New Features".
  2. "Command Line Use Cases". Intel. Retrieved 2021-01-05.
  3. "Optimize Vectorization Aspects of a Real-Time 3D Cardiac..." Intel. Retrieved 2021-01-07.
  4. "HPC Code Modernization Tools" (PDF).
  5. "Новый инструмент анализа SIMD программ — Vectorization Advisor". habr.com (in Russian). Retrieved 2021-01-05.
  6. Williams, Samuel (April 2009). "Roofline: An insightful Visual Performance model for multicore Architectures" (PDF). University of Berkeley. Archived from the original (PDF) on 2016-12-06. Retrieved 2017-03-29.
  7. Ilic, Aleksandar. "Cache-aware Roofline model: Upgrading the loft" (PDF). Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa.
  8. "Roofline Analysis in Intel Advisor 2017: YouTube how-to video". YouTube.
  9. "Intel Advisor Roofline step-by-step Tutorial".
  10. "Using Roofline Model and Intel Advisor, presented by Sam Williams, Roofline performance model author".
  11. "Case Study: SimYog Improves a Simulation Tool Performance by 2x with..." Intel. Retrieved 2021-01-07.
  12. "Memory-Level Roofline Model with Intel® Advisor". Intel. Retrieved 2021-01-05.
  13. "CPU / Memory Roofline Insights Perspective". Intel. Retrieved 2021-01-05.
  14. "GPU Roofline Insights Perspective". Intel. Retrieved 2021-01-05.
  15. "Integer Roofline Modeling in Intel® Advisor". Intel. Retrieved 2021-01-05.
  16. "How to model suitability using Advisor XE 2015?".
  17. "Offload Modeling Resources for Intel® Advisor Users". Intel. Retrieved 2021-01-05.
  18. "Identify Code Regions to Offload to GPU and Visualize GPU Usage (Beta)". Intel. Retrieved 2021-01-05.
  19. "Offload Modeling Perspective". Intel. Retrieved 2021-01-05.
  20. "Schlumberger* - Parallelize Oil and Gas software with Intel Software products" (PDF).
  21. ""Leading design" company Advisor XE case study" (PDF).
  22. "Design Code for Parallelism and Offloading with Intel® Advisor".
  23. "Computer-Aided Formulation case study: getting helping hand from the Vectorization Advisor".
  24. "Pexip Speeds Enterprise-Grade Videoconferencing" (PDF).
  25. "Supercomputing'2012 HPC educator with Slippery Rock University".