Simultaneous and heterogeneous multithreading

Last updated

Simultaneous and heterogeneous multithreading (SHMT) is a software framework that takes advantage of heterogeneous computing systems that contain a mixture of central processing units (CPUs), graphics processing units (GPUs), and special purpose machine learning hardware, for example Tensor Processing Units (TPUs). [1] [2]

Contents

Each component processes information differently. Often data has to move among processors, which can create bottlenecks, with one processor starving while waiting on another to finish. [1]

Architecture

The system defines virtual processors and virtual operations (VOPs). VOPs decompose into one or more high-level operations (HLOPs). It then distributes the operations across the processors. The runtime system then dynamically maps virtual processors to physical processors, assessing resource availability in order to keep all the processors busy. The scheduler employs a light-weight, quality-aware work-stealing (QAWS) policy. [1]

Conventional runtimes use assign one processor (set) to each subtask, leaving other types of processors idle. In other words, the CPU(s) run (possibly in parallel), then when that subtask completes, the next subtask is handed to the GPU(s). When they finish the next subtask is handed to the TPU(s). [2]

Adding software pipelining allows the second subtask to run using partial results from the first subtask, which improves resource utilization. [2]

SHMT takes things a step further, identifying subtasks that can run independently of others to the appropriate processor type, allow even better parallelism. Some subtasks can be performed on multiple processor types. SHMT can divide a single subtask across such processor types. Thus the fundamental breakthrough is to keep more processors working simultaneously, reducing time and energy costs. [2]

Benchmark

Researchers tested the concept using a typical smartphone configuration tweaked so that it resembled a data center server. [1]

The hardware was Nvidia's Jetson Nano module containing a quad-core ARM Cortex-A57 processor (CPU) and 128 Maxwell architecture GPU cores. A Google Edge TPU was connected via its M.2 Key E slot. The processors communicated via an onboard PCI Express (PCIe) interface. Shared data was hosted in a 4 GB 64-bit LPDDR4. The Edge TPU adds an 8 MB device memory. Ubuntu Linux 18.04 was the operating system. [1]

Compared to a conventional system performance increased by 1.95X boost, while energy consumption was reduced by 51%, on a range of benchmarks, including Black–Scholes, DCT8X8, DWT, FFT, Histogram, Hotspot, Laplacian, MF, Sobel, SRAD, and GMEAN. [1]

See also

Related Research Articles

<span class="mw-page-title-main">Central processing unit</span> Central computer component which executes instructions

A central processing unit (CPU)—also called a central processor or main processor—is the most important processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, logic, controlling, and input/output (I/O) operations. This role contrasts with that of external components, such as main memory and I/O circuitry, and specialized coprocessors such as graphics processing units (GPUs).

<span class="mw-page-title-main">Process (computing)</span> Particular execution of a computer program

In computing, a process is the instance of a computer program that is being executed by one or many threads. There are many different process models, some of which are light weight, but almost all processes are rooted in an operating system (OS) process which comprises the program code, assigned system resources, physical and logical access permissions, and data structures to initiate, control and coordinate execution activity. Depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently.

<span class="mw-page-title-main">Thread (computing)</span> Smallest sequence of programmed instructions that can be managed independently by a scheduler

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. In many cases, a thread is a component of a process.

<span class="mw-page-title-main">Symmetric multiprocessing</span> The equal sharing of all resources by multiple identical processors

Symmetric multiprocessing or shared-memory multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes. Most multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors.

Very long instruction word (VLIW) refers to instruction set architectures that are designed to exploit instruction level parallelism (ILP). A VLIW processor allows programs to explicitly specify instructions to execute in parallel, whereas conventional central processing units (CPUs) mostly allow programs to specify instructions to execute in sequence only. VLIW is intended to allow higher performance without the complexity inherent in some other designs.

Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to the ability of a system to support more than one processor or the ability to allocate tasks between them. There are many variations on this basic theme, and the definition of multiprocessing can vary with context, mostly as a function of how CPUs are defined.

<span class="mw-page-title-main">Parallel computing</span> Programming paradigm in which many processes are executed simultaneously

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

<span class="mw-page-title-main">Instruction-level parallelism</span> Ability of computer instructions to be executed simultaneously with correct results

Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution.

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern processor architectures.

In computing, single program, multiple data (SPMD) is a term that has been used to refer to computational models for exploiting parallelism where-by multiple processors cooperate in the execution of a program in order to obtain results faster.

<span class="mw-page-title-main">Hardware acceleration</span> Specialized computer hardware

Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently when compared to software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

<span class="mw-page-title-main">Binary Modular Dataflow Machine</span>

Binary Modular Dataflow Machine (BMDFM) is a software package that enables running an application in parallel on shared memory symmetric multiprocessing (SMP) computers using the multiple processors to speed up the execution of single applications. BMDFM automatically identifies and exploits parallelism due to the static and mainly dynamic scheduling of the dataflow instruction sequences derived from the formerly sequential program.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

<span class="mw-page-title-main">Multi-core processor</span> Microprocessor with more than one processing unit

A multi-core processor is a microprocessor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions. The instructions are ordinary CPU instructions but the single processor can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package. The microprocessors currently used in almost all personal computers are multi-core.

<span class="mw-page-title-main">Multithreading (computer architecture)</span> Ability of a CPU to provide multiple threads of execution concurrently

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

LynxSecure is a least privilege real-time separation kernel hypervisor from Lynx Software Technologies designed for safety and security critical applications found in military, avionic, industrial, and automotive markets.

<span class="mw-page-title-main">Asymmetric multiprocessing</span>

An asymmetric multiprocessing system is a multiprocessor computer system where not all of the multiple interconnected central processing units (CPUs) are treated equally. For example, a system might allow only one CPU to execute operating system code or might allow only one CPU to perform I/O operations. Other AMP systems might allow any CPU to execute operating system code and perform I/O operations, so that they were symmetric with regard to processor roles, but attached some or all peripherals to particular CPUs, so that they were asymmetric with respect to the peripheral attachment.

Heterogeneous System Architecture (HSA) is a cross-vendor set of specifications that allow for the integration of central processing units and graphics processors on the same bus, with shared memory and tasks. The HSA is being developed by the HSA Foundation, which includes AMD and ARM. The platform's stated aim is to reduce communication latency between CPUs, GPUs and other compute devices, and make these various devices more compatible from a programmer's perspective, relieving the programmer of the task of planning the moving of data between devices' disjoint memories.

Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.

A domain-specific architecture (DSA) is a programmable computer architecture specifically tailored to operate very efficiently within the confines of a given application domain. The term is often used in contrast to general-purpose architectures, such as CPUs, that are designed to operate on any computer program.

References

  1. 1 2 3 4 5 6 McClure, Paul (February 22, 2024). "Software tweak doubles computer processing speed, halves energy use". New Atlas. Retrieved 2024-02-25.
  2. 1 2 3 4 Hsu, Kuan-Chieh; Tseng, Hung-Wei (2023-12-08). "Simultaneous and Heterogenous Multithreading". Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '23. New York, NY, USA: Association for Computing Machinery: 137–152. doi: 10.1145/3613424.3614285 . ISBN   979-8-4007-0329-4.