Heterogeneous computing

Last updated

Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks. [1]

Contents

Heterogeneity

Usually heterogeneity in the context of computing refers to different instruction-set architectures (ISA), where the main processor has one and other processors have another - usually a very different - architecture (maybe more than one), not just a different microarchitecture (floating point number processing is a special case of this - not usually referred to as heterogeneous).

In the past heterogeneous computing meant different ISAs had to be handled differently, while in a modern example, Heterogeneous System Architecture (HSA) systems [2] eliminate the difference (for the user) while using multiple processor types (typically CPUs and GPUs), usually on the same integrated circuit, to provide the best of both worlds: general GPU processing (apart from the GPU's well-known 3D graphics rendering capabilities, it can also perform mathematically intensive computations on very large data-sets), while CPUs can run the operating system and perform traditional serial tasks.

The level of heterogeneity in modern computing systems is gradually increasing as further scaling of fabrication technologies allows for formerly discrete components to become integrated parts of a system-on-chip, or SoC.[ citation needed ] For example, many new processors now include built-in logic for interfacing with other devices (SATA, PCI, Ethernet, USB, RFID, radios, UARTs, and memory controllers), as well as programmable functional units and hardware accelerators (GPUs, cryptography co-processors, programmable network processors, A/V encoders/decoders, etc.).

Recent findings show that a heterogeneous-ISA chip multiprocessor that exploits diversity offered by multiple ISAs can outperform the best same-ISA homogeneous architecture by as much as 21% with 23% energy savings and a reduction of 32% in Energy Delay Product (EDP). [3] AMD's 2014 announcement on its pin-compatible ARM and x86 SoCs, codename Project Skybridge, [4] suggested a heterogeneous-ISA (ARM+x86) chip multiprocessor in the making.[ citation needed ]

Heterogeneous CPU topology

A system with heterogeneous CPU topology is a system where the same ISA is used, but the cores themselves are different in speed. [5] The setup is more similar to a symmetric multiprocessor. (Although such systems are technically asymmetric multiprocessors, the cores do not differ in roles or device access.) There are typically two types of cores: a higher performance core usually known as the "big" or P-core and a more power efficient core usually known as the "small" or E-core. The terms P- and E-cores are usually used in relation to Intel's implementation of hetereogeneous computing, while the terms big and little cores are usually used in relation to the ARM architecture.

A common use of such topology is to provide better power efficiency, especially in mobile SoCs.

Challenges

Heterogeneous computing systems present new challenges not found in typical homogeneous systems. [7] The presence of multiple processing elements raises all of the issues involved with homogeneous parallel processing systems, while the level of heterogeneity in the system can introduce non-uniformity in system development, programming practices, and overall system capability. Areas of heterogeneity can include: [8]

ISA or instruction-set architecture
Compute elements may have different instruction set architectures, leading to binary incompatibility.
ABI or application binary interface
Compute elements may interpret memory in different ways. [9] This may include both endianness, calling convention, and memory layout, and depends on both the architecture and compiler being used.
API or application programming interface
Library and OS services may not be uniformly available to all compute elements. [10]
Low-Level Implementation of Language Features
Language features such as functions and threads are often implemented using function pointers, a mechanism which requires additional translation or abstraction when used in heterogeneous environments.
Memory Interface and Hierarchy
Compute elements may have different cache structures, cache coherency protocols, and memory access may be uniform or non-uniform memory access (NUMA). Differences can also be found in the ability to read arbitrary data lengths as some processors/units can only perform byte-, word-, or burst accesses. [11]
Interconnect
Compute elements may have differing types of interconnect aside from basic memory/bus interfaces. This may include dedicated network interfaces, Direct memory access (DMA) devices, mailboxes, FIFOs, and scratchpad memories, etc. Furthermore, certain portions of a heterogeneous system may be cache-coherent, whereas others may require explicit software-involvement for maintaining consistency and coherency.
Performance
A heterogeneous system may have CPUs that are identical in terms of architecture, but have underlying micro-architectural differences that lead to various levels of performance and power consumption. Asymmetries in capabilities paired with opaque programming models and operating system abstractions can sometimes lead to performance predictability problems, especially with mixed workloads.
Development tools
Different types of processors would typically require different tools (editors, compilers, ...) for software developers, which introduces complexity when partitioning the application across those. [12]
Data Partitioning
While partitioning data on homogeneous platforms is often trivial, it has been shown that for the general heterogeneous case, the problem is NP-Complete. [13] For small numbers of partitions, optimal partitionings that perfectly balance load and minimize communication volume have been shown to exist. [14]

Example hardware

Heterogeneous computing hardware can be found in every domain of computing—from high-end servers and high-performance computing machines all the way down to low-power embedded devices including mobile phones and tablets.

See also

Related Research Articles

<span class="mw-page-title-main">Field-programmable gate array</span> Array of logic gates that are reprogrammable

A field-programmable gate array (FPGA) is a type of configurable integrated circuit that can be programmed or reprogrammed after manufacturing. FPGAs are part of a broader set of logic devices referred to as programmable logic devices (PLDs). They consist of an array of programmable logic blocks and interconnects that can be configured to perform various digital functions. FPGAs are commonly used in applications where flexibility, speed, and parallel processing capabilities are required, such as in telecommunications, automotive, aerospace, and industrial sectors.

<span class="mw-page-title-main">System on a chip</span> Micro-electronic component

A system on a chip or system-on-chip is an integrated circuit that integrates most or all components of a computer or other electronic system. These components almost always include on-chip central processing unit (CPU), memory interfaces, input/output devices and interfaces, and secondary storage interfaces, often alongside other components such as radio modems and a graphics processing unit (GPU) – all on a single substrate or microchip. SoCs may contain digital and also analog, mixed-signal and often radio frequency signal processing functions.

HyperTransport (HT), formerly known as Lightning Data Transport, is a technology for interconnection of computer processors. It is a bidirectional serial/parallel high-bandwidth, low-latency point-to-point link that was introduced on April 2, 2001. The HyperTransport Consortium is in charge of promoting and developing HyperTransport technology.

<span class="mw-page-title-main">Digital signal processor</span> Specialized microprocessor optimized for digital signal processing

A digital signal processor (DSP) is a specialized microprocessor chip, with its architecture optimized for the operational needs of digital signal processing. DSPs are fabricated on metal–oxide–semiconductor (MOS) integrated circuit chips. They are widely used in audio signal processing, telecommunications, digital image processing, radar, sonar and speech recognition systems, and in common consumer electronic devices such as mobile phones, disk drives and high-definition television (HDTV) products.

Reconfigurable computing is a computer architecture combining some of the flexibility of software with the high performance of hardware by processing with flexible hardware platforms like field-programmable gate arrays (FPGAs). The principal difference when compared to using ordinary microprocessors is the ability to add custom computational blocks using FPGAs. On the other hand, the main difference from custom hardware, i.e. application-specific integrated circuits (ASICs) is the possibility to adapt the hardware during runtime by "loading" a new circuit on the reconfigurable fabric, thus providing new computational blocks without the need to manufacture and add new chips to the existing system.

<span class="mw-page-title-main">Coprocessor</span> Type of computer processor

A coprocessor is a computer processor used to supplement the functions of the primary processor. Operations performed by the coprocessor may be floating-point arithmetic, graphics, signal processing, string processing, cryptography or I/O interfacing with peripheral devices. By offloading processor-intensive tasks from the main processor, coprocessors can accelerate system performance. Coprocessors allow a line of computers to be customized, so that customers who do not need the extra performance do not need to pay for it.

<span class="mw-page-title-main">Xilinx</span> American technology company

Xilinx, Inc. was an American technology and semiconductor company that primarily supplied programmable logic devices. The company is known for inventing the first commercially viable field-programmable gate array (FPGA) and creating the first fabless manufacturing model.

The PowerPC 400 family is a line of 32-bit embedded RISC processor cores based on the PowerPC or Power ISA instruction set architectures. The cores are designed to fit inside specialized applications ranging from system-on-a-chip (SoC) microcontrollers, network appliances, application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) to set-top boxes, storage devices and supercomputers.

<span class="mw-page-title-main">Multi-core processor</span> Microprocessor with more than one processing unit

A multi-core processor is a microprocessor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions. The instructions are ordinary CPU instructions but the single processor can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package. The microprocessors currently used in almost all personal computers are multi-core.

<span class="mw-page-title-main">Larrabee (microarchitecture)</span>

Larrabee is the codename for a cancelled GPGPU chip that Intel was developing separately from its current line of integrated graphics accelerators. It is named after either Mount Larrabee or Larrabee State Park in Whatcom County, Washington, near the town of Bellingham. The chip was to be released in 2010 as the core of a consumer 3D graphics card, but these plans were cancelled due to delays and disappointing early performance figures. The project to produce a GPU retail product directly from the Larrabee research project was terminated in May 2010 and its technology was passed on to the Xeon Phi. The Intel MIC multiprocessor architecture announced in 2010 inherited many design elements from the Larrabee project, but does not function as a graphics processing unit; the product is intended as a co-processor for high performance computing.

The Advanced Learning and Research Institute (ALaRI), a faculty of informatics, was established in 1999 at the University of Lugano to promote research and education in embedded systems. The Faculty of Informatics within very few years has become one of the Switzerland major destinations for teaching and research, ranking third after the two Federal Institutes of Technology, Zurich and Lausanne.

<span class="mw-page-title-main">VideoCore</span> Low-power mobile multimedia processor

VideoCore is a series of low-power mobile multimedia processors originally developed by Alphamosaic Ltd and now owned by Broadcom. Alphamosaic marketed its first version as a two-dimensional DSP architecture that makes it flexible and efficient enough to decode a number of multimedia codecs in software while maintaining low power usage. The semiconductor intellectual property core has been found so far only on Broadcom SoCs.

Manycore processors are special kinds of multi-core processors designed for a high degree of parallel processing, containing numerous simpler, independent processor cores. Manycore processors are used extensively in embedded computers and high-performance computing.

Rockchip is a Chinese fabless semiconductor company based in Fuzhou, Fujian province. Rockchip has been providing SoC products for tablets & PCs, streaming media TV boxes, AI audio & vision, IoT hardware since founded in 2001. It has offices in Shanghai, Beijing, Shenzhen, Hangzhou and Hong Kong. It designs system on a chip (SoC) products, using the ARM architecture licensed from ARM Holdings for the majority of its projects.

Virtex is the flagship family of FPGA products currently developed by AMD, originally Xilinx before being acquired by the former. Other current product lines include Kintex (mid-range) and Artix (low-cost), each including configurations and models optimized for different applications. In addition, AMD offers the Spartan low-cost series, which continues to be updated and is nearing production utilizing the same underlying architecture and process node as the larger 7-series devices.

<span class="mw-page-title-main">Xeon Phi</span> Series of x86 manycore processors from Intel

Xeon Phi was a series of x86 manycore processors designed and made by Intel. It was intended for use in supercomputers, servers, and high-end workstations. Its architecture allowed use of standard programming languages and application programming interfaces (APIs) such as OpenMP.

Heterogeneous System Architecture (HSA) is a cross-vendor set of specifications that allow for the integration of central processing units and graphics processors on the same bus, with shared memory and tasks. The HSA is being developed by the HSA Foundation, which includes AMD and ARM. The platform's stated aim is to reduce communication latency between CPUs, GPUs and other compute devices, and make these various devices more compatible from a programmer's perspective, relieving the programmer of the task of planning the moving of data between devices' disjoint memories.

A vision processing unit (VPU) is an emerging class of microprocessor; it is a specific type of AI accelerator, designed to accelerate machine vision tasks.

Coherent Accelerator Processor Interface (CAPI), is a high-speed processor expansion bus standard for use in large data center computers, initially designed to be layered on top of PCI Express, for directly connecting central processing units (CPUs) to external accelerators like graphics processing units (GPUs), ASICs, FPGAs or fast storage. It offers low latency, high speed, direct memory access connectivity between devices of different instruction set architectures.

The ARM Cortex-A55 is a central processing unit implementing the ARMv8.2-A 64-bit instruction set designed by ARM Holdings' Cambridge design centre. The Cortex-A55 is a 2-wide decode in-order superscalar pipeline.

References

  1. Shan, Amar (2006). Heterogeneous Processing: a Strategy for Augmenting Moore's Law. Linux Journal.
  2. "Hetergeneous System Architecture (HSA) Foundation". Archived from the original on 2014-04-23. Retrieved 2014-11-01.
  3. Venkat, Ashish; Tullsen, Dean M. (2014). Harnessing ISA Diversity: Design of a Heterogeneous-ISA Chip Multiprocessor. Proceedings of the 41st Annual International Symposium on Computer Architecture.
  4. Anand Lal Shimpi (2014-05-05). "AMD Announces Project SkyBridge: Pin-Compatible ARM and x86 SoCs in 2015, Android Support". AnandTech. Retrieved 2017-06-11. Next year, AMD will release a low-power 20nm Cortex A57 based SoC with integrated Graphics Core Next GPU.
  5. "Energy Aware Scheduling". The Linux Kernel documentation.
  6. A Survey Of Techniques for Architecting and Managing Asymmetric Multicore Processors, ACM Computing Surveys, 2015.
  7. Kunzman, D.M. (2011). Programming Heterogeneous Systems. International Symposium on Parallel and Distributed Processing Workshops. doi:10.1109/IPDPS.2011.377.
  8. Flachs, Brian (2009). Bringing Heterogeneous Processors Into The Mainstream (PDF). Symposium on Application Accelerators in High-Performance Computing (SAAHPC).
  9. "Cost-aware multimedia data allocation for heterogeneous memory using genetic algorithm in cloud computing" (PDF). IEEE. 2016.{{cite journal}}: Cite journal requires |journal= (help)
  10. Agron, Jason; Andrews, David (2009). Hardware Microkernels for Heterogeneous Manycore Systems. Parallel Processing Workshops, 2009. International Conference on Parallel Processing (ICPPW). doi:10.1109/ICPPW.2009.21.
  11. Lang, Johannes (2020). Heterogenes Rechnen mit ARM und DSP Multiprozessor-Ein-Chip-Systemen (MSc.). Fachhochschule Vorarlberg. doi:10.25924/opus-4525.
  12. Wong, William G. (30 September 2002). "Tools Matter In Mixed-Processor Software Development". www.electronicdesign.com. Retrieved 2023-08-09.
  13. Beaumont, Olivier; Boudet, Vincent; Rastello, Fabrice; Robert, Yves (August 2002). "Partitioning a square into rectangles: NP-completeness and approximation algorithms" (PDF). Algorithmica. 34 (3): 217–239. CiteSeerX   10.1.1.3.4967 . doi:10.1007/s00453-002-0962-9. S2CID   9729067.
  14. Beaumont, Olivier; Becker, Brett; DeFlumere, Ashley; Eyraud-Dubois, Lionel; Lastovetsky, Alexey (July 2018). "Recent Advances in Matrix Partitioning for Parallel Computing on Heterogeneous Platforms" (PDF). IEEE Transactions on Parallel and Distributed Computing.
  15. Gschwind, Michael (2005). A novel SIMD architecture for the Cell heterogeneous chip-multiprocessor (PDF). Hot Chips: A Symposium on High Performance Chips. Archived from the original (PDF) on 2020-06-18. Retrieved 2014-10-28.