Barrel processor

Last updated

A barrel processor is a CPU that switches between threads of execution on every cycle. This CPU design technique is also known as "interleaved" or "fine-grained" temporal multithreading. Unlike simultaneous multithreading in modern superscalar architectures, it generally does not allow execution of multiple instructions in one cycle.

Contents

Like preemptive multitasking, each thread of execution is assigned its own program counter and other hardware registers (each thread's architectural state). A barrel processor can guarantee that each thread will execute one instruction every n cycles, unlike a preemptive multitasking machine, that typically runs one thread of execution for tens of millions of cycles, while all other threads wait their turn.

A technique called C-slowing can automatically generate a corresponding barrel processor design from a single-tasking processor design. An n-way barrel processor generated this way acts much like n separate multiprocessing copies of the original single-tasking processor, each one running at roughly 1/n the original speed.[ citation needed ]

History

One of the earliest examples of a barrel processor was the I/O processing system in the CDC 6000 series supercomputers. These executed one instruction (or a portion of an instruction) from each of 10 different virtual processors (called peripheral processors) before returning to the first processor. [1]

One motivation for barrel processors was to reduce hardware costs. In the case of the CDC 6x00 PPUs, the digital logic of the processor was much faster than the core memory, so rather than having ten separate processors, there are ten separate core memory units for the PPUs, but they all share the single set of processor logic.

Another example is the Honeywell 800, which had 8 groups of registers, allowing up to 8 concurrent programs. After each instruction, the processor would (in most cases) switch to the next active program in sequence. [2]

Barrel processors have also been used as large-scale central processors. The Tera MTA (1988) was a large-scale barrel processor design with 128 threads per core. [3] [4] The MTA architecture has seen continued development in successive products, such as the Cray Urika-GD, originally introduced in 2012 (as the YarcData uRiKA) and targeted at data-mining applications. [5]

Barrel processors are also found in embedded systems, where they are particularly useful for their deterministic real-time thread performance. An example is the XMOS XCore XS1 (2007), a four-stage barrel processor with eight threads per core. The XS1 is found in Ethernet, USB, audio, and control devices, and other applications where I/O performance is critical. Barrel processors have also been used in specialized devices such as the eight-thread Ubicom IP3023 network I/O processor (2004). Some 8-bit microcontrollers by Padauk Technology feature barrel processors with up to 8 threads per core.

Advantages compared to single-threaded processors

A single-tasking processor spends a lot of time idle, not doing anything useful whenever a cache miss or pipeline stall occurs. Advantages to employing barrel processors over single-tasking processors include:

Disadvantages compared to single-threaded processors

There are a few disadvantages to barrel processors.

See also

Related Research Articles

Process (computing) particular execution of a computer program

In computing, a process is the instance of a computer program that is being executed by one or many threads. It contains the program code and its activity. Depending on the operating system (OS), a process may be made up of multiple threads of execution that execute instructions concurrently.

Thread (computing) smallest sequence of programmed instructions that can be managed independently by a scheduler

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. The implementation of threads and processes differs between operating systems, but in most cases a thread is a component of a process. Multiple threads can exist within one process, executing concurrently and sharing resources such as memory, while different processes do not share these resources. In particular, the threads of a process share its executable code and the values of its dynamically allocated variables and non-thread-local global variables at any given time.

Superscalar processor CPU that implements instruction-level parallelism within a single processor

A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor that can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows for more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.

CDC 6600 computer

The CDC 6600 was the flagship of the 6000 series of mainframe computer systems manufactured by Control Data Corporation. Generally considered to be the first successful supercomputer, it outperformed the industry's prior recordholder, the IBM 7030 Stretch, by a factor of three. With performance of up to three megaFLOPS, the CDC 6600 was the world's fastest computer from 1964 to 1969, when it relinquished that status to its successor, the CDC 7600.

Hyper-threading Intels proprietary simultaneous multithreading implementation on x86 microprocessors

Hyper-threading is Intel's proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations performed on x86 microprocessors. It first appeared in February 2002 on Xeon server processors and in November 2002 on Pentium 4 desktop CPUs. Later, Intel included this technology in Itanium, Atom, and Core 'i' Series CPUs, among others.

In computer science, instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps performed by different processor units with different parts of instructions processed in parallel.

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.

The NetBurst microarchitecture, called P68 inside Intel, was the successor to the P6 microarchitecture in the x86 family of CPUs made by Intel. The first CPU to use this architecture was the Willamette-core Pentium 4, released on November 20, 2000 and the first of the Pentium 4 CPUs; all subsequent Pentium 4 and Pentium D variants have also been based on NetBurst. In mid-2004, Intel released the Foster core, which was also based on NetBurst, thus switching the Xeon CPUs to the new architecture as well. Pentium 4-based Celeron CPUs also use the NetBurst architecture.

CDC 7600 supercomputer

The CDC 7600 was the Seymour Cray-designed successor to the CDC 6600, extending Control Data's dominance of the supercomputer field into the 1970s. The 7600 ran at 36.4 MHz and had a 65 Kword primary memory using magnetic core and variable-size secondary memory. It was generally about ten times as fast as the CDC 6600 and could deliver about 10 MFLOPS on hand-compiled code, with a peak of 36 MFLOPS. In addition, in benchmark tests in early 1970 it was shown to be slightly faster than its IBM rival, the IBM System/360, Model 195. When the system was released in 1969, it sold for around $5 million in base configurations, and considerably more as options and features were added.

Temporal multithreading is one of the two main forms of multithreading that can be implemented on computer processor hardware, the other being simultaneous multithreading. The distinguishing difference between the two forms is the maximum number of concurrent threads that can execute in any given pipeline stage in a given cycle. In temporal multithreading the number is one, while in simultaneous multithreading the number is greater than one. Some authors use the term super-threading synonymously.

Microarchitecture the way a given instruction set architecture (ISA) is implemented on a processor

In computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as µarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

MAJC was a Sun Microsystems multi-core, multithreaded, very long instruction word (VLIW) microprocessor design from the mid-to-late 1990s. Originally called the UltraJava processor, the MAJC processor was targeted at running Java programs, whose "late compiling" allowed Sun to make several favourable design decisions. The processor was released into two commercial graphical cards from Sun. Lessons learned regarding multi-threads on a multi-core processor provided a basis for later OpenSPARC implementations such as the UltraSPARC T1.

The Cray XMT is the third generation of the Cray MTA supercomputer architecture originally developed by Tera. The earlier generations were called the Cray MTA and the Cray MTA-2. The XMT makes the MTA's multithreaded processors, now dubbed Threadstorm, compatible with the 1207-pin Socket F used by AMD Opteron processors. The Threadstorm processors are plugged into systems otherwise identical to the Cray XT4.

Multithreading (computer architecture) ability of a central processing unit (CPU) or a single core in a multi-core processor to execute multiple processes or threads concurrently

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

XCore XS1-G4

The XS1-G4 is a processor designed by XMOS. It is a 32-bit quad-core processor, where each core runs up to 8 concurrent threads. It was available as of Autumn 2008 running at 400 MHz. Each thread can run at up to 100 MHz; four threads follow each other through the pipeline, resulting in a top speed of 1.6 GIPS for four cores if 16 threads are running. The XS1-G4 is a distributed memory multi core processor, requiring the end user and compiler to deal with data distribution. When more than 4 threads execute, the 400 MIPS of each core is equally distributed over all active threads. This allows the use of extra threads in order to hide latency.

XCore XS1-L1

The XS1-L1 is a 32-bit processor designed by XMOS, featuring support for up to 8 concurrent threads. It was available as of June 2009 running at 400 MHz. As of April 2010 500 MHz versions are available. Each thread can run at up to 125 MHz; four threads follow each other through the pipeline, resulting in a top speed of 500 MIPS if at least four threads are active. The 500 MIPS of each core is equally distributed over all active threads. This allows the use of extra threads in order to hide latency.

XCore XS1-AnA

The XS1-AnA is a family of processors designed by XMOS. It is based on a 32-bit architecture, that runs up to eight concurrent threads, with built-in analog-to-digital converters (ADC), oscillator, and power supplies. It will be available from autumn 2013 running at 500 MHz. Each thread can run at up to 125 MHz.

Power Processing Element in microprocessor architecture

The Power Processing Unit (PPU) is a 64-bit dual-threaded in-order PowerPC 2.02 microprocessor core designed by IBM for use primarily in the game consoles PlayStation 3 and Xbox 360, but has also found applications in high performance computing in supercomputers such as the record setting IBM Roadrunner.

Latency oriented processor architecture is the microarchitecture of a microprocessor designed to serve a serial computing thread with a low latency. This is typical of most Central Processing Units (CPU) being developed since the 1970s. These architectures, in general, aim to execute as many instructions as possible belonging to a single serial thread, in a given window of time; however, the time to execute a single instruction completely from fetch to retire stages may vary from a few cycles to even a few hundred cycles in some cases. Latency oriented processor architectures are the opposite of throughput-oriented processors which concern themselves more with the total throughput of the system, rather than the service latencies for all individual threads that they work on.

XCORE-200

The xCORE200 is a 32-bit processor designed by XMOS, featuring support for two tiles with up to 8 concurrent threads each. It was launched in March 2015, and available as of June 2015 running at 500 MHz. Each thread can run at up to 100 MHz, and threads may be able to execute 2 instructions in a clock cycle. Five threads follow each other through the pipeline, resulting in a top speed of 2000 MIPS, and a speed of at least 1000 MIPS. The issue slots for each tile are equally distributed over all active threads on that tile. This allows the use of extra threads in order to hide latency. The XCORE-200 processor is used for, amongst others, voice interfaces and audio connectivity.

References

  1. CDC Cyber 170 Computer Systems; Models 720, 730, 750, and 760; Model 176 (Level B); CPU Instruction Set; PPU Instruction Set -- See page 2-44 for an illustration of the rotating "barrel".
  2. Honeywell 800 Programmers' Reference Manual (PDF). 1960. p. 17.
  3. "Archived copy". Archived from the original on 2012-02-22. Retrieved 2012-08-11.CS1 maint: archived copy as title (link)
  4. "Archived copy". Archived from the original on 2014-07-12. Retrieved 2014-08-19.CS1 maint: archived copy as title (link)
  5. "Cray's YarcData division launches new big data graph appliance" (Press release). Seattle, WA and Santa Clara, CA: Cray Inc. February 29, 2012. Retrieved 2017-08-24.