Cell microprocessor implementations

Cell microprocessors are multi-core processors that use a cellular architecture for high-performance distributed computing. The first commercial Cell microprocessor, the Cell BE, was designed for the Sony PlayStation 3. IBM designed the PowerXCell 8i for use in the Roadrunner supercomputer. [1]

Implementation

First edition Cell on 90 nm CMOS

Known Cell variants in 90 nm process

| Designation | Die area | First disclosed | Enhancement |
|---|---|---|---|
| DD1 | 221 mm² | ISSCC 2005 | |
| DD2 | 235 mm² | Cool Chips April 2005 | Enhanced PPE core |

IBM has published information concerning two different versions of Cell in this process, an early engineering sample designated DD1, and an enhanced version designated DD2 intended for production.

The main enhancement in DD2 was a small lengthening of the die to accommodate a larger PPE core, which is reported to "contain more SIMD/vector execution resources". Some preliminary information released by IBM references the DD1 variant. As a result, some early journalistic accounts of the Cell's capabilities differ from production hardware.

Cell floorplan

Cell function units and footprint

| Cell function unit | Area | Description |
|---|---|---|
| XDR interface | 5.7% | Interface to Rambus system memory |
| memory controller | 4.4% | Manages external memory and L2 cache |
| 512 KiB L2 cache | 10.3% | Cache memory for the PPE |
| PPE core | 11.1% | PowerPC processor |
| test | 2.0% | Unspecified "test and decode logic" |
| EIB | 3.1% | Element interconnect bus linking processors |
| SPE (each) × 8 | 6.2% | Synergistic coprocessing element |
| I/O controller | 6.6% | External I/O logic |
| Rambus FlexIO | 5.7% | External signalling for I/O pins |

PowerPoint material accompanying an STI presentation given by Dr. Peter Hofstee includes a photograph of the DD2 Cell die overdrawn with functional unit boundaries, which are also captioned by name. It reveals the breakdown of silicon area by function unit shown in the table above.
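As a rough cross-check of the floorplan table, the listed shares can be summed and converted to absolute areas. This is a minimal sketch that assumes the percentages refer to the 235 mm² DD2 die given in the variants table; the names `CELL_UNITS`, `DD2_DIE_AREA_MM2` and `area_mm2` are illustrative and do not come from the source.

```python
# Area shares (percent of die) as listed in the floorplan table.
CELL_UNITS = {
    "XDR interface": 5.7,
    "memory controller": 4.4,
    "512 KiB L2 cache": 10.3,
    "PPE core": 11.1,
    "test": 2.0,
    "EIB": 3.1,
    "SPE (x8)": 8 * 6.2,  # eight SPEs at 6.2% each
    "I/O controller": 6.6,
    "Rambus FlexIO": 5.7,
}

# Assumption: the shares apply to the 235 mm^2 DD2 die.
DD2_DIE_AREA_MM2 = 235.0


def area_mm2(percent, die_area=DD2_DIE_AREA_MM2):
    """Convert a percentage share of the die into square millimetres."""
    return die_area * percent / 100.0


# Total share of the die covered by the listed units.
total_percent = sum(CELL_UNITS.values())

if __name__ == "__main__":
    for name, pct in CELL_UNITS.items():
        print(f"{name:20s} {pct:5.1f}%  {area_mm2(pct):6.1f} mm^2")
    print(f"accounted for: {total_percent:.1f}% of the die")
```

The listed units add up to 98.5% of the die, so a small remainder is not attributed to any named block. Under the same assumption, a single SPE at 6.2% corresponds to roughly 14.6 mm².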

SPE floorplan

SPU function units and footprint

| SPU function unit | Area | Description | Pipe |
|---|---|---|---|
| single precision | 10.0% | single precision FP execution unit | even |
| double precision | 4.4% | double precision FP execution unit | even |
| simple fixed | 3.25% | fixed point execution unit | even |
| issue control | 2.5% | feeds execution units | |
| forward macro | 3.75% | feeds execution units | |
| GPR | 6.25% | general purpose register file | |
| permute | 3.25% | permute execution unit | odd |
| branch | 2.5% | branch execution unit | odd |
| channel | 6.75% | channel interface (three discrete blocks) | odd |
| LS0–LS3 | 30.0% | four 64 KiB blocks of local store | odd |
| MMU | 4.75% | memory management unit | |
| DMA | 7.5% | direct memory access unit | |
| BIU | 9.0% | bus interface unit | |
| RTB | 2.5% | array built-in test block (ABIST) | |
| ATO | 1.6% | atomic unit for atomic DMA updates | |
| HB | 0.5% | obscure | |

Additional details concerning the internal SPE implementation have been disclosed by IBM engineers, including Peter Hofstee, IBM's chief architect of the synergistic processing element, in a scholarly IEEE publication.

This document includes a photograph of the 2.54 mm × 5.81 mm SPE, as implemented in 90 nm SOI. In this technology, the SPE contains 21 million transistors, of which 14 million are contained in arrays (a term presumably designating register files and the local store) and 7 million are logic. The photograph is overdrawn with functional unit boundaries, which are also captioned by name, and reveals the breakdown of silicon area by function unit shown in the table above.
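The SPE dimensions quoted above can be checked against the per-SPE share in the Cell floorplan table. The sketch below assumes the photographed SPE is the one on the 235 mm² DD2 die, which the source does not state explicitly; all variable names are illustrative.

```python
# SPE dimensions quoted in the text (90 nm SOI implementation).
SPE_WIDTH_MM = 2.54
SPE_HEIGHT_MM = 5.81

# Assumption: the SPE belongs to the 235 mm^2 DD2 die.
DIE_AREA_MM2 = 235.0

spe_area_mm2 = SPE_WIDTH_MM * SPE_HEIGHT_MM
spe_share_percent = 100.0 * spe_area_mm2 / DIE_AREA_MM2

# Transistor counts in millions, as quoted in the text.
ARRAY_TRANSISTORS_M = 14
LOGIC_TRANSISTORS_M = 7
total_transistors_m = ARRAY_TRANSISTORS_M + LOGIC_TRANSISTORS_M
array_fraction = ARRAY_TRANSISTORS_M / total_transistors_m

if __name__ == "__main__":
    print(f"SPE area:       {spe_area_mm2:.2f} mm^2")
    print(f"share of die:   {spe_share_percent:.1f}%")
    print(f"array fraction: {array_fraction:.0%} of {total_transistors_m}M transistors")
```

Under that assumption the SPE covers about 14.8 mm², or roughly 6.3% of the die, which is close to the 6.2% per-SPE figure in the floorplan table. Two thirds of the SPE's transistors sit in arrays.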

Understanding the dispatch pipes is important for writing efficient code. In the SPU architecture, two instructions can be dispatched (started) in each clock cycle using dispatch pipes designated even and odd. The two pipes provide different execution units, as shown in the table above. As IBM partitioned this, most of the arithmetic instructions execute on the even pipe, while most of the memory instructions execute on the odd pipe. The permute unit is closely associated with memory instructions, as it serves to pack and unpack data structures located in memory into the SIMD multiple-operand format that the SPU computes on most efficiently.

Unlike other processor designs providing distinct execution pipes, each SPU instruction can only dispatch on one designated pipe. In competing designs, more than one pipe might be designed to handle extremely common instructions such as add, permitting two or more of these instructions to be executed concurrently, which can serve to increase efficiency on unbalanced workloads. In keeping with the extremely Spartan design philosophy, no execution units in the SPU are multiply provisioned.

Understanding the limitations of the restrictive two pipeline design is one of the key concepts a programmer must grasp to write efficient SPU code at the lowest level of abstraction. For programmers working at higher levels of abstraction, a good compiler will automatically balance pipeline concurrency where possible.
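The effect of the two-pipe restriction can be illustrated with a toy issue model. This is a simplified sketch, not a description of the real SPU issue logic: it assumes that two adjacent instructions issue together only when the first targets the even pipe and the second targets the odd pipe, and it ignores data dependencies, latencies and instruction address alignment. The unit names in `PIPE` follow the SPU table; `issue_cycles` is a hypothetical helper.

```python
# Pipe assignment per execution unit, following the SPU function unit table.
PIPE = {
    "single_precision": "even",
    "double_precision": "even",
    "simple_fixed": "even",
    "permute": "odd",
    "branch": "odd",
    "channel": "odd",
    "local_store": "odd",
}


def issue_cycles(units):
    """Count issue cycles for an in-order stream of execution-unit names.

    Simplified model (an assumption, not the documented hardware rule):
    two adjacent instructions issue in the same cycle only when the first
    uses the even pipe and the second uses the odd pipe.
    """
    cycles = 0
    i = 0
    while i < len(units):
        pair_ok = (
            i + 1 < len(units)
            and PIPE[units[i]] == "even"
            and PIPE[units[i + 1]] == "odd"
        )
        i += 2 if pair_ok else 1  # consume one or two instructions
        cycles += 1
    return cycles


if __name__ == "__main__":
    balanced = ["single_precision", "local_store"] * 4
    arithmetic_only = ["single_precision"] * 8
    print("balanced stream:       ", issue_cycles(balanced), "cycles")
    print("arithmetic-only stream:", issue_cycles(arithmetic_only), "cycles")
```

In this model, eight instructions that alternate between arithmetic and local store accesses issue in 4 cycles, while eight arithmetic instructions in a row need 8 cycles because no second even-pipe unit exists to absorb them.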

SPE power and performance

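Relationship of speed to temperature

| Voltage | Frequency | Power | Die temp. |
|---|---|---|---|
| 0.9 V | 2.0 GHz | 1 W | 25 °C |
| 0.9 V | 3.0 GHz | 2 W | 27 °C |
| 1.0 V | 3.8 GHz | 3 W | 31 °C |
| 1.1 V | 4.0 GHz | 4 W | 38 °C |
| 1.2 V | 4.4 GHz | 7 W | 47 °C |
| 1.3 V | 5.0 GHz | 11 W | 63 °C |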

As tested by IBM under a heavy transformation and lighting workload (average IPC of 1.4), the performance profile of this implementation for a single SPU is shown in the table above.

The entry for 2.0 GHz operation at 0.9 V represents a low-power configuration. Other entries show the peak stable operating frequency achieved with each voltage increment. As a general rule in CMOS circuits, power dissipation rises in a rough relationship to V²F, the square of the voltage times the operating frequency.
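The V²F rule of thumb can be compared against the measured figures. The sketch below normalises V²F to the first row of the table and prints it next to the measured power; the helper names `v2f` and `relative_v2f` are illustrative, and the comparison is only as good as the 1 W resolution of the published numbers.

```python
# (voltage in V, frequency in GHz, measured power in W) from the table.
MEASUREMENTS = [
    (0.9, 2.0, 1),
    (0.9, 3.0, 2),
    (1.0, 3.8, 3),
    (1.1, 4.0, 4),
    (1.2, 4.4, 7),
    (1.3, 5.0, 11),
]


def v2f(voltage, frequency):
    """Dynamic power proxy: voltage squared times frequency."""
    return voltage ** 2 * frequency


def relative_v2f(rows):
    """Scale V^2*F for every row so that the first row equals 1.0."""
    base = v2f(rows[0][0], rows[0][1])
    return [v2f(v, f) / base for v, f, _ in rows]


if __name__ == "__main__":
    first_power = MEASUREMENTS[0][2]
    for (v, f, p), est in zip(MEASUREMENTS, relative_v2f(MEASUREMENTS)):
        print(f"{v:.1f} V  {f:.1f} GHz  V2F x{est:4.2f}  measured x{p / first_power:4.1f}")
```

For the 1.3 V, 5.0 GHz row the V²F estimate is about 5.2 times the first row, while the measured power is 11 times higher, which fits the text's description of the relationship as rough.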

Though the power measurements provided by the IBM authors lack precision, they convey a good sense of the overall trend. These figures show the part is capable of running above 5 GHz under test lab conditions, though at a die temperature too hot for standard commercial configurations. The first Cell processors made commercially available were rated by IBM to run at 3.2 GHz, an operating speed where this chart suggests an SPU die temperature in the comfortable vicinity of 30 °C.

Note that a single SPU represents 6% of the Cell processor's die area. The power figures given in the table above represent just a small portion of the overall power budget.

IBM has publicly announced their intention to implement Cell on a future technology below the 90 nm node to improve power consumption. Reduced power consumption could potentially allow the existing design to be boosted to 5 GHz or above without exceeding the thermal constraints of existing products.

Cell at 65 nm

The first shrink of Cell was at the 65 nm node. The move to 65 nm cut the existing die of about 230 mm² on the 90 nm process to roughly half its size, about 120 mm², greatly reducing IBM's manufacturing cost as well.

On 12 March 2007, IBM announced that it had started producing 65 nm Cells in its East Fishkill fab. The chips produced there were apparently only for IBM's own Cell blade servers, which were the first to get the 65 nm Cells. Sony introduced the third generation of the PS3 in November 2007, the 40 GB model without PS2 compatibility, which was confirmed to use the 65 nm Cell. Thanks to the shrunk Cell, power consumption was reduced from 200 W to 135 W.

At first it was known only that the 65 nm Cells clock up to 6 GHz and run on a 1.3 V core voltage, as demonstrated at ISSCC 2007. This would have given the chip a theoretical peak performance of 384 GFLOPS in single precision, a significant improvement over the 204.8 GFLOPS peak that a 90 nm 3.2 GHz Cell could provide with 8 active SPUs. IBM further announced that it had implemented new power-saving features and a dual power supply for the SRAM array. This version was not yet the long-rumoured "Cell+" with enhanced double-precision floating-point performance, which first saw the light of day in mid-2008 in the Roadrunner supercomputer in the form of QS22 PowerXCell blades. Although IBM had talked about and even shown higher-clocked Cells before, clock speed has remained constant at 3.2 GHz, even for the double-precision-enabled "Cell+" of the Roadrunner. By keeping clock speed constant, IBM has instead opted to reduce power consumption. PowerXCell clusters even best IBM's Blue Gene clusters (371 MFLOPS/watt), which are already far more power-efficient than clusters made up of conventional CPUs (265 MFLOPS/watt and lower).
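The 204.8 and 384 GFLOPS peak figures quoted above can be reproduced with a short calculation. This is a sketch under stated assumptions: each SPU is taken to operate on 4 single-precision lanes per cycle, and a fused multiply-add is counted as 2 floating-point operations per lane. The function `peak_gflops` and its parameters are illustrative names, not part of any IBM tool.

```python
def peak_gflops(spus, clock_ghz, lanes=4, flops_per_lane=2):
    """Theoretical peak in GFLOPS.

    Assumptions: `lanes` single-precision SIMD lanes per SPU and
    `flops_per_lane` operations per lane per cycle (a fused
    multiply-add counted as two operations).
    """
    return spus * lanes * flops_per_lane * clock_ghz


if __name__ == "__main__":
    print(f"90 nm at 3.2 GHz: {peak_gflops(8, 3.2):.1f} GFLOPS")
    print(f"65 nm at 6.0 GHz: {peak_gflops(8, 6.0):.1f} GFLOPS")
```

With 8 active SPUs this gives 204.8 GFLOPS at 3.2 GHz and 384 GFLOPS at 6 GHz. The ratio between the two is simply the ratio of the clock rates, since the per-cycle throughput is unchanged.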

Future editions in CMOS

Prospects at 45 nm

At ISSCC 2008, IBM announced Cell at the 45 nm node. IBM said it would require 40 percent less power at the same clock speed than its 65 nm predecessor and that the die area would shrink by 34 percent. The 45 nm Cell requires less cooling and allows for cheaper production, in part through the use of a much smaller heatsink. Mass production was initially slated to begin in late 2008 but was moved to early 2009.

Prospects beyond 45 nm

In January 2006, Sony, IBM and Toshiba announced that they would begin work on a Cell as small as 32 nm, but since process shrinks in fabs usually happen on a global and not an individual chip scale, this was merely a public commitment to take Cell to 32 nm.

References

  1. Kevin J. Barker, Kei Davis, Adolfy Hoisie, Darren J. Kerbyson, Mike Lang, Scott Pakin, Jose C. Sancho. "Entering the Petaflop Era: The Architecture and Performance of Roadrunner".