Micro-thread (multi-core)

Last updated

Micro-threads for multi-core and many-cores processors is a mechanism to hide memory latency similar to multi-threading architectures. However, it is done in software for multi-core processors such as the Cell Broadband Engine to dynamically hide latencies that occur due to memory latency or I/O operations.

Contents

Introduction

Micro-threading is a software-based threading framework that creates small threads inside multi-core or many-core processors. Each core may have two or more tiny threads that utilize its idle time. It is like hyper-threading invented by Intel or the general multi-threading architecture in modern micro-processors. It enables the existence of more than one thread running on the same core without performing expensive context switching to system's main memory, even if this core does not have multi-threading hardware logic. Micro-threads mainly hide memory latency inside each core by over lapping computations with memory requests. The main difference between micro-threads and current threading models is that micro-threads context switching overhead is very small. For example, the overhead micro-threads implementation on Cell Broadband Engine is 160 nano seconds; meanwhile, the overhead of context switching of the whole core's (SPE) thread is around 2000 micro-seconds. This low overhead is due to three main factors. First, micro-threads are very small. Each micro-thread runs one or two simple but critical functions. Second, micro-threads context include only the register file of the core currently the micro-thread is executing on. Third, micro-threads are context switched to core's dedicated cache, which makes this process very fast and efficient.

Background

As microprocessors are becoming faster, mainly because of the cores being added every few months, memory latency gap is becoming wider. Memory latency was few cycles in 1980 and it is reaching nowadays almost 1000 cycles. If the micro-processor has enough cores and hopefully they are not sending requests to the main memory at the same time, there will be partial aggregate hiding of memory latency. Some cores might be executing while others are waiting for memory response. This is not the best situation for multi-core processors. High performance computing experts are striving to keep all cores busy all the time. So, if each core is kept busy all the time, a complete utilization of the whole micro-processor is possible. Creating software based threads won't solve the problem for one obvious reason. Context switching threads to main memory is much expensive operation when compared to memory latency. For example, in Cell Broadband Engine context switching any of the core's thread takes 2000 micro-seconds in best cases. Some software techniques like double or multi-buffering may solve the memory latency problem. However, they can be used in regular algorithms, where the program knows where is the next data chunk to retrieve from memory; in this case it sends request to memory while it is processing previously request data. However, this technique won't work if it the program does not know the next data chunk to retrieve from memory. In other words, it won't work in combinatorial algorithms, such as tree spanning or random list ranking. In addition, multi-buffering assumes that memory latency is constant and can be hidden by statically. However, reality shows that memory latency changes from application to another. It depends on the overall load on microprocessor's shared resources, such as the rate of memory requests shared cores interconnections.

Current implementation

Currently micro-threading is implemented on the Cell Broadband Engine. [1] Three to fivefold performance improvement could be achieved. Currently it is proven for regular and combinatorial algorithms. Some other efforts are trying to prove its viability for scientific algorithms.

Performance

Micro-threads provide a very good solution to hide memory latency best based on the run-time utilization of the microprocessor. For example, if the memory latency is very high compared to processing and context switching time, more micro-threads can be added; this happens when large data chunks are requested from memory or there are many memory hot-spots. If this ration is small, less micro-threads might be introduced at run-time. This depends on factors related to the implemented application and system's run-time factors.

Critique

Although micro-threads provide a promising model to hide memory latency for multi and many-core processors, it has some important critiques that need to be addressed:

Related Research Articles

A real-time operating system (RTOS) is an operating system (OS) intended to serve real-time applications that process data as it comes in, typically without buffer delays. Processing time requirements are measured in tenths of seconds or shorter increments of time. A real-time system is a time-bound system which has well-defined, fixed time constraints. Processing must be done within the defined constraints or the system will fail. They either are event-driven or time-sharing. Event-driven systems switch between tasks based on their priorities, while time-sharing systems switch the task based on clock interrupts. Most RTOSs use a pre-emptive scheduling algorithm.

Thread (computing) Smallest sequence of programmed instructions that can be managed independently by a scheduler

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. The implementation of threads and processes differs between operating systems, but in most cases a thread is a component of a process. Multiple threads can exist within one process, executing concurrently and sharing resources such as memory, while different processes do not share these resources. In particular, the threads of a process share its executable code and the values of its dynamically allocated variables and non-thread-local global variables at any given time.

Embedded system Computer system with a dedicated function within a larger mechanical or electrical system

An embedded system is a computer system—a combination of a computer processor, computer memory, and input/output peripheral devices—that has a dedicated function within a larger mechanical or electronic system. It is embedded as part of a complete device often including electrical or electronic hardware and mechanical parts. Because an embedded system typically controls physical operations of the machine that it is embedded within, it often has real-time computing constraints. Embedded systems control many devices in common use today. In 2009 it was estimated that ninety-eight percent of all microprocessors manufactured were used in embedded systems.

System on a chip Integrated circuit that incorporates the components of a computer

A system on a chip is an integrated circuit that integrates all or most components of a computer or other electronic system. These components almost always include a central processing unit (CPU), memory, input/output ports and secondary storage, often alongside other components such as radio modems and a graphics processing unit (GPU) – all on a single substrate or microchip. It may contain digital, analog, mixed-signal, and often radio frequency signal processing functions.

Parallel computing Programming paradigm in which many processes are executed simultaneously

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

Digital signal processor Specialized microprocessor optimized for digital signal processing

A digital signal processor (DSP) is a specialized microprocessor chip, with its architecture optimized for the operational needs of digital signal processing. DSPs are fabricated on MOS integrated circuit chips. They are widely used in audio signal processing, telecommunications, digital image processing, radar, sonar and speech recognition systems, and in common consumer electronic devices such as mobile phones, disk drives and high-definition television (HDTV) products.

In computing, scheduling is the method by which work is assigned to resources that complete the work. The work may be virtual computation elements such as threads, processes or data flows, which are in turn scheduled onto hardware resources such as processors, network links or expansion cards.

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.

A translation lookaside buffer (TLB) is a memory cache that is used to reduce the time taken to access a user memory location. It is a part of the chip's memory-management unit (MMU). The TLB stores the recent translations of virtual memory to physical memory and can be called an address-translation cache. A TLB may reside between the CPU and the CPU cache, between CPU cache and the main memory or between the different levels of the multi-level cache. The majority of desktop, laptop, and server processors include one or more TLBs in the memory-management hardware, and it is nearly always present in any processor that utilizes paged or segmented virtual memory.

Thread pool

In computer programming, a thread pool is a software design pattern for achieving concurrency of execution in a computer program. Often also called a replicated workers or worker-crew model, a thread pool maintains multiple threads waiting for tasks to be allocated for concurrent execution by the supervising program. By maintaining a pool of threads, the model increases performance and avoids latency in execution due to frequent creation and destruction of threads for short-lived tasks. The number of available threads is tuned to the computing resources available to the program, such as a parallel task queue after completion of execution.

Cell is a multi-core microprocessor microarchitecture that combines a general-purpose PowerPC core of modest performance with streamlined coprocessing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with separate instruction-specific and data-specific caches at level 1.

Hardware acceleration is the use of computer hardware made to perform some functions more efficiently than in software running on a general-purpose central processing unit (CPU). Any transformation of data or routine that can be computed can be calculated purely in software running on a generic CPU, purely in custom-made hardware, or in some mix of both. An operation can be computed faster in application-specific integrated circuits (ASICs) designed or programmed to compute the operation than specified in software and performed on a general-purpose computer processor. Each approach has advantages and disadvantages. The implementation of computing tasks in hardware to decrease latency and increase throughput is known as hardware acceleration.

MAJC was a Sun Microsystems multi-core, multithreaded, very long instruction word (VLIW) microprocessor design from the mid-to-late 1990s. Originally called the UltraJava processor, the MAJC processor was targeted at running Java programs, whose "late compiling" allowed Sun to make several favourable design decisions. The processor was released into two commercial graphical cards from Sun. Lessons learned regarding multi-threads on a multi-core processor provided a basis for later OpenSPARC implementations such as the UltraSPARC T1.

Multi-core processor Microprocessor with more than one processing unit

A multi-core processor is a computer processor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions. The instructions are ordinary CPU instructions but the single processor can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package. The microprocessors currently used in almost all personal computers are multi-core.

In computer storage, fragmentation is a phenomenon in which storage space is used inefficiently, reducing capacity or performance and often both. The exact consequences of fragmentation depend on the specific system of storage allocation in use and the particular form of fragmentation. In many cases, fragmentation leads to storage space being "wasted", and in that case the term also refers to the wasted space itself. For other systems the space used to store given data is the same regardless of the degree of fragmentation.

Scratchpad memory (SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is a high-speed internal memory used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor ("CPU"), scratchpad refers to a special high-speed memory circuit used to hold small items of data for rapid retrieval. It is similar to the usage and size of a scratchpad in life: a pad of paper for preliminary notes or sketches or writings, etc.

The Sieve C++ Parallel Programming System is a C++ compiler and parallel runtime designed and released by Codeplay that aims to simplify the parallelization of code so that it may run efficiently on multi-processor or multi-core systems. It is an alternative to other well-known parallelisation methods such as OpenMP, the RapidMind Development Platform and Threading Building Blocks (TBB).

The Cray MTA, formerly known as the Tera MTA, is a supercomputer architecture based on thousands of independent threads, fine-grain communication and synchronization between threads, and latency tolerance for irregular computations.

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading. It is different from SPMD in that all instructions in all "threads" are executed in lock-step. The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU), e.g. some supercomputers combine CPUs with GPUs.

References

  1. Ahmed, M.; R. Ammar; S. Rajasekaran (2008), "SPENK: adding another level of parallelism on the cell broadband engine" (pdf), 1st international forum on Next-generation multicore/manycore technologies, Cairo, Egypt: ACM, pp. 1–10, retrieved 2009-03-04