Barrel processor

Last updated December 21, 2024

A barrel processor is a CPU that switches between threads of execution on every cycle. This CPU design technique is also known as "interleaved" or "fine-grained" temporal multithreading. Unlike simultaneous multithreading in modern superscalar architectures, it generally does not allow execution of multiple instructions in one cycle.

Like preemptive multitasking, each thread of execution is assigned its own program counter and other hardware registers (each thread's architectural state). A barrel processor can guarantee that each thread will execute one instruction every n cycles, unlike a preemptive multitasking machine, that typically runs one thread of execution for tens of millions of cycles, while all other threads wait their turn.

A technique called C-slowing can automatically generate a corresponding barrel processor design from a single-tasking processor design. An n-way barrel processor generated this way acts much like n separate multiprocessing copies of the original single-tasking processor, each one running at roughly 1/n the original speed.^{[ citation needed ]}

History

One of the earliest examples of a barrel processor was the I/O processing system in the CDC 6000 series supercomputers. These executed one instruction (or a portion of an instruction) from each of 10 different virtual processors (called peripheral processors or PPs) before returning to the first processor.^[1] From CDC 6000 series we read that "The peripheral processors are collectively implemented as a barrel processor. Each executes routines independently of the others. They are a loose predecessor of bus mastering or direct memory access."

One motivation for barrel processors was to reduce hardware costs. In the case of the CDC 6x00 PPUs, the digital logic of the processor was much faster than the core memory, so rather than having ten separate processors, there are ten separate core memory units for the PPUs, but they all share the single set of processor logic.

Another example is the Honeywell 800, which had 8 groups of registers, allowing up to 8 concurrent programs. After each instruction, the processor would (in most cases) switch to the next active program in sequence.^[2]

Barrel processors have also been used as large-scale central processors. The Tera MTA (1988) was a large-scale barrel processor design with 128 threads per core.^[3]^[4] The MTA architecture has seen continued development in successive products, such as the Cray Urika-GD, originally introduced in 2012 (as the YarcData uRiKA) and targeted at data-mining applications.^[5]

Barrel processors are also found in embedded systems, where they are particularly useful for their deterministic real-time thread performance.

An early example is the “Dual CPU” version of the four-bit COP400 that was introduced by National Semiconductor in 1981. This single-chip microcontroller contains two ostensibly independent CPUs that share instructions, memory, and most IO devices. In reality, the dual CPUs are a single two-thread barrel processor. It works by duplicating certain sections of the processor—those that store the architectural state—but not duplicating the main execution resources such as ALU, buses, and memory. Separate architectural states are established with duplicated A (accumulators), B (pointer registers), C (carry flags), N (stack pointers), and PC (program counters).^[6]

Another example is the XMOS XCore XS1 (2007), a four-stage barrel processor with eight threads per core. (Newer processors from XMOS also have the same type of architecture.) The XS1 is found in Ethernet, USB, audio, and control devices, and other applications where I/O performance is critical. When the XS1 is programmed in the 'XC' language, software controlled direct memory access may be implemented.

Barrel processors have also been used in specialized devices such as the eight-thread Ubicom IP3023 network I/O processor (2004). Some 8-bit microcontrollers by Padauk Technology feature barrel processors with up to 8 threads per core.

Comparison with single-threaded processors

Advantages

A single-tasking processor spends a lot of time idle, not doing anything useful whenever a cache miss or pipeline stall occurs. Advantages to employing barrel processors over single-tasking processors include:

The ability to do useful work on the other threads while the stalled thread is waiting.
Designing an n-way barrel processor with an n-deep pipeline is much simpler than designing a single-tasking processor because a barrel processor never has a pipeline stall and doesn't need feed-forward circuits.
For real-time applications, a barrel processor can guarantee that a "real-time" thread can execute with precise timing, no matter what happens to the other threads, even if some other thread locks up in an infinite loop or is continuously interrupted by hardware interrupts.

Disadvantages

There are a few disadvantages to barrel processors.

The state of each thread must be kept on-chip, typically in registers, to avoid costly off-chip context switches. This requires a large number of registers compared to typical processors.
Either all threads must share the same cache, which slows overall system performance, or there must be one unit of cache for each execution thread, which can significantly increase the transistor count and thus the cost of such a CPU. However, in hard real-time embedded systems where barrel processors are often found, memory access costs are typically calculated assuming worst-case cache behavior, so this is a minor concern.^{[ citation needed ]} Some barrel processors such as the XMOS XS1 do not have a cache at all.

Related Research Articles

A central processing unit (CPU), also called a central processor, main processor, or just processor, is the most important processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, logic, controlling, and input/output (I/O) operations. This role contrasts with that of external components, such as main memory and I/O circuitry, and specialized coprocessors such as graphics processing units (GPUs).

<span class="mw-page-title-main">Computer multitasking</span> Concurrent execution of multiple processes

In computing, multitasking is the concurrent execution of multiple tasks over a certain period of time. New tasks can interrupt already started ones before they finish, instead of waiting for them to end. As a result, a computer executes segments of multiple tasks in an interleaved manner, while the tasks share common processing resources such as central processing units (CPUs) and main memory. Multitasking automatically interrupts the running program, saving its state and loading the saved state of another program and transferring control to it. This "context switch" may be initiated at fixed time intervals, or the running program may be coded to signal to the supervisory software when it can be interrupted.

In computing, a process is the instance of a computer program that is being executed by one or many threads. There are many different process models, some of which are light weight, but almost all processes are rooted in an operating system (OS) process which comprises the program code, assigned system resources, physical and logical access permissions, and data structures to initiate, control and coordinate execution activity. Depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently.

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. In many cases, a thread is a component of a process.

Symmetric multiprocessing or shared-memory multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes. Most multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors.

<span class="mw-page-title-main">Superscalar processor</span> CPU that implements instruction-level parallelism within a single processor

A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.

The CDC 6600 was the flagship of the 6000 series of mainframe computer systems manufactured by Control Data Corporation. Generally considered to be the first successful supercomputer, it outperformed the industry's prior recordholder, the IBM 7030 Stretch, by a factor of three. With performance of up to three megaFLOPS, the CDC 6600 was the world's fastest computer from 1964 to 1969, when it relinquished that status to its successor, the CDC 7600.

Multiprocessing is the use of two or more central processing units (CPUs) within a single computer system. The term also refers to the ability of a system to support more than one processor or the ability to allocate tasks between them. There are many variations on this basic theme, and the definition of multiprocessing can vary with context, mostly as a function of how CPUs are defined.

Hyper-threading is Intel's proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations performed on x86 microprocessors. It was introduced on Xeon server processors in February 2002 and on Pentium 4 desktop processors in November 2002. Since then, Intel has included this technology in Itanium, Atom, and Core 'i' Series CPUs, among others.

Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically, ILP refers to the average number of instructions run per step of this parallel execution.

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern processor architectures.

The NetBurst microarchitecture, called P68 inside Intel, was the successor to the P6 microarchitecture in the x86 family of central processing units (CPUs) made by Intel. The first CPU to use this architecture was the Willamette-core Pentium 4, released on November 20, 2000 and the first of the Pentium 4 CPUs; all subsequent Pentium 4 and Pentium D variants have also been based on NetBurst. In mid-2001, Intel released the Foster core, which was also based on NetBurst, thus switching the Xeon CPUs to the new architecture as well. Pentium 4-based Celeron CPUs also use the NetBurst architecture.

The CDC 7600 was designed by Seymour Cray to be the successor to the CDC 6600, extending Control Data's dominance of the supercomputer field into the 1970s. The 7600 ran at 36.4 MHz and had a 65 Kword primary memory using magnetic core and variable-size secondary memory. It was generally about ten times as fast as the CDC 6600 and could deliver about 10 MFLOPS on hand-compiled code, with a peak of 36 MFLOPS. In addition, in benchmark tests in early 1970 it was shown to be slightly faster than its IBM rival, the IBM System/360, Model 195. When the system was released in 1967, it sold for around $5 million in base configurations, and considerably more as options and features were added.

In electronics, computer science and computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as μarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

A multi-core processor (MCP) is a microprocessor on a single integrated circuit (IC) with two or more separate central processing units (CPUs), called cores to emphasize their multiplicity. Each core reads and executes program instructions, specifically ordinary CPU instructions. However, the MCP can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single IC die, known as a chip multiprocessor (CMP), or onto multiple dies in a single chip package. As of 2024, the microprocessors used in almost all new personal computers are multi-core.

The CDC 6000 series is a discontinued family of mainframe computers manufactured by Control Data Corporation in the 1960s. It consisted of the CDC 6200, CDC 6300, CDC 6400, CDC 6500, CDC 6600 and CDC 6700 computers, which were all extremely rapid and efficient for their time. Each is a large, solid-state, general-purpose, digital computer that performs scientific and business data processing as well as multiprogramming, multiprocessing, Remote Job Entry, time-sharing, and data management tasks under the control of the operating system called SCOPE. By 1970 there also was a time-sharing oriented operating system named KRONOS. They were part of the first generation of supercomputers. The 6600 was the flagship of Control Data's 6000 series.

Cray XMT is a scalable multithreaded shared memory supercomputer architecture by Cray, based on the third generation of the Tera MTA architecture, targeted at large graph problems. Presented in 2005, it supersedes the earlier unsuccessful Cray MTA-2. It uses the Threadstorm3 CPUs inside Cray XT3 blades. Designed to make use of commodity parts and existing subsystems for other commercial systems, it alleviated the shortcomings of Cray MTA-2's high cost of fully custom manufacture and support. It brought various substantial improvements over Cray MTA-2, most notably nearly tripling the peak performance, and vastly increased maximum CPU count to 8,192 and maximum memory to 128 TB, with a data TLB of maximal 512 TB.

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution.

The Power Processing Element (PPE) comprises a Power Processing Unit (PPU) and a 512 KB L2 cache. In most instances the PPU is used in a PPE. The PPU is a 64-bit dual-threaded in-order PowerPC 2.02 microprocessor core designed by IBM for use primarily in the game consoles PlayStation 3 and Xbox 360, but has also found applications in high performance computing in supercomputers such as the record setting IBM Roadrunner.

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading. It is different from SPMD in that all instructions in all "threads" are executed in lock-step. The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU), e.g. some supercomputers combine CPUs with GPUs.

References

↑ CDC Cyber 170 Computer Systems; Models 720, 730, 750, and 760; Model 176 (Level B); CPU Instruction Set; PPU Instruction Set Archived 2016-03-03 at the Wayback Machine -- See page 2-44 for an illustration of the rotating "barrel".
↑ Honeywell 800 Programmers' Reference Manual (PDF). 1960. p. 17.
↑ "Archived copy". Archived from the original on 2012-02-22. Retrieved 2012-08-11.{{cite web}}: CS1 maint: archived copy as title (link)
↑ "Cray History". Archived from the original on 2014-07-12. Retrieved 2014-08-19.
↑ "Cray's YarcData division launches new big data graph appliance" (Press release). Seattle, WA and Santa Clara, CA: Cray Inc. February 29, 2012. Archived from the original on 2017-03-18. Retrieved 2017-08-24.
↑ "COPS Microcontrollers Data Book". National Semiconductor. Retrieved 19 January 2022.

External links

Soft peripherals Embedded.com article examines Ubicom's IP3023 processor
An Evaluation of the Design of the Gamma 60
Histoire et architecture du Gamma 60 (French and English)

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[cyber-1] CDC Cyber 170 Computer Systems; Models 720, 730, 750, and 760; Model 176 (Level B); CPU Instruction Set; PPU Instruction Set Archived 2016-03-03 at the Wayback Machine -- See page 2-44 for an illustration of the rotating "barrel".

[2] Honeywell 800 Programmers' Reference Manual (PDF). 1960. p. 17.

[tera_mta-3] "Archived copy". Archived from the original on 2012-02-22. Retrieved 2012-08-11.{{cite web}}: CS1 maint: archived copy as title (link)

[cray_mta-4] "Cray History". Archived from the original on 2014-07-12. Retrieved 2014-08-19.

[urika-5] "Cray's YarcData division launches new big data graph appliance" (Press release). Seattle, WA and Santa Clara, CA: Cray Inc. February 29, 2012. Archived from the original on 2017-03-18. Retrieved 2017-08-24.

[:1-6] "COPS Microcontrollers Data Book". National Semiconductor. Retrieved 19 January 2022.

[1]

[2]

[3]

[4]

[5]

[6]