Simultaneous and heterogeneous multithreading

Last updated March 12, 2024

Simultaneous and heterogeneous multithreading (SHMT) is a software framework that takes advantage of heterogeneous computing systems that contain a mixture of central processing units (CPUs), graphics processing units (GPUs), and special purpose machine learning hardware, for example Tensor Processing Units (TPUs).^[1]^[2]

Architecture

The system defines virtual processors and virtual operations (VOPs). VOPs decompose into one or more high-level operations (HLOPs). It then distributes the operations across the processors. The runtime system then dynamically maps virtual processors to physical processors, assessing resource availability in order to keep all the processors busy. The scheduler employs a light-weight, quality-aware work-stealing (QAWS) policy.^[1]

Conventional runtimes use assign one processor (set) to each subtask, leaving other types of processors idle. In other words, the CPU(s) run (possibly in parallel), then when that subtask completes, the next subtask is handed to the GPU(s). When they finish the next subtask is handed to the TPU(s).^[2]

Adding software pipelining allows the second subtask to run using partial results from the first subtask, which improves resource utilization.^[2]

SHMT takes things a step further, identifying subtasks that can run independently of others to the appropriate processor type, allow even better parallelism. Some subtasks can be performed on multiple processor types. SHMT can divide a single subtask across such processor types. Thus the fundamental breakthrough is to keep more processors working simultaneously, reducing time and energy costs.^[2]

Benchmark

Researchers tested the concept using a typical smartphone configuration tweaked so that it resembled a data center server.^[1]

The hardware was Nvidia's Jetson Nano module containing a quad-core ARM Cortex-A57 processor (CPU) and 128 Maxwell architecture GPU cores. A Google Edge TPU was connected via its M.2 Key E slot. The processors communicated via an onboard PCI Express (PCIe) interface. Shared data was hosted in a 4 GB 64-bit LPDDR4. The Edge TPU adds an 8 MB device memory. Ubuntu Linux 18.04 was the operating system.^[1]

Compared to a conventional system performance increased by 1.95X boost, while energy consumption was reduced by 51%, on a range of benchmarks, including Black–Scholes, DCT8X8, DWT, FFT, Histogram, Hotspot, Laplacian, MF, Sobel, SRAD, and GMEAN.^[1]

Related Research Articles

A central processing unit (CPU)—also called a central processor or main processor—is the most important processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, logic, controlling, and input/output (I/O) operations. This role contrasts with that of external components, such as main memory and I/O circuitry, and specialized coprocessors such as graphics processing units (GPUs).

In computing, a process is the instance of a computer program that is being executed by one or many threads. There are many different process models, some of which are light weight, but almost all processes are rooted in an operating system (OS) process which comprises the program code, assigned system resources, physical and logical access permissions, and data structures to initiate, control and coordinate execution activity. Depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently.

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. In many cases, a thread is a component of a process.

Symmetric multiprocessing or shared-memory multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes. Most multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors.

Very long instruction word (VLIW) refers to instruction set architectures that are designed to exploit instruction level parallelism (ILP). A VLIW processor allows programs to explicitly specify instructions to execute in parallel, whereas conventional central processing units (CPUs) mostly allow programs to specify instructions to execute in sequence only. VLIW is intended to allow higher performance without the complexity inherent in some other designs.

Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to the ability of a system to support more than one processor or the ability to allocate tasks between them. There are many variations on this basic theme, and the definition of multiprocessing can vary with context, mostly as a function of how CPUs are defined.

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution.

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern processor architectures.

In computing, single program, multiple data (SPMD) is a term that has been used to refer to computational models for exploiting parallelism where-by multiple processors cooperate in the execution of a program in order to obtain results faster.

<span class="mw-page-title-main">Hardware acceleration</span> Specialized computer hardware

Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently when compared to software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

<span class="mw-page-title-main">Binary Modular Dataflow Machine</span>

Binary Modular Dataflow Machine (BMDFM) is a software package that enables running an application in parallel on shared memory symmetric multiprocessing (SMP) computers using the multiple processors to speed up the execution of single applications. BMDFM automatically identifies and exploits parallelism due to the static and mainly dynamic scheduling of the dataflow instruction sequences derived from the formerly sequential program.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

A multi-core processor is a microprocessor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions. The instructions are ordinary CPU instructions but the single processor can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package. The microprocessors currently used in almost all personal computers are multi-core.

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

LynxSecure is a least privilege real-time separation kernel hypervisor from Lynx Software Technologies designed for safety and security critical applications found in military, avionic, industrial, and automotive markets.

An asymmetric multiprocessing system is a multiprocessor computer system where not all of the multiple interconnected central processing units (CPUs) are treated equally. For example, a system might allow only one CPU to execute operating system code or might allow only one CPU to perform I/O operations. Other AMP systems might allow any CPU to execute operating system code and perform I/O operations, so that they were symmetric with regard to processor roles, but attached some or all peripherals to particular CPUs, so that they were asymmetric with respect to the peripheral attachment.

Heterogeneous System Architecture (HSA) is a cross-vendor set of specifications that allow for the integration of central processing units and graphics processors on the same bus, with shared memory and tasks. The HSA is being developed by the HSA Foundation, which includes AMD and ARM. The platform's stated aim is to reduce communication latency between CPUs, GPUs and other compute devices, and make these various devices more compatible from a programmer's perspective, relieving the programmer of the task of planning the moving of data between devices' disjoint memories.

Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.

A domain-specific architecture (DSA) is a programmable computer architecture specifically tailored to operate very efficiently within the confines of a given application domain. The term is often used in contrast to general-purpose architectures, such as CPUs, that are designed to operate on any computer program.

References

1 2 3 4 5 6 McClure, Paul (February 22, 2024). "Software tweak doubles computer processing speed, halves energy use". New Atlas. Retrieved 2024-02-25.
1 2 3 4 Hsu, Kuan-Chieh; Tseng, Hung-Wei (2023-12-08). "Simultaneous and Heterogenous Multithreading". Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '23. New York, NY, USA: Association for Computing Machinery: 137–152. doi: 10.1145/3613424.3614285 . ISBN 979-8-4007-0329-4.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[:2-1] 1 2 3 4 5 6 McClure, Paul (February 22, 2024). "Software tweak doubles computer processing speed, halves energy use". New Atlas. Retrieved 2024-02-25.

[:3-2] 1 2 3 4 Hsu, Kuan-Chieh; Tseng, Hung-Wei (2023-12-08). "Simultaneous and Heterogenous Multithreading". Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '23. New York, NY, USA: Association for Computing Machinery: 137–152. doi: 10.1145/3613424.3614285 . ISBN 979-8-4007-0329-4.

[1]

[2]

v t e Parallel computing
General	Distributed computing Parallel computing Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Simultaneous and heterogenous Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing

Simultaneous and heterogeneous multithreading

Contents

Architecture

Benchmark

See also

Related Research Articles

References