R8000

Last updated November 13, 2023

The R8000 is a microprocessor chipset developed by MIPS Technologies, Inc. (MTI), Toshiba, and Weitek.^[1] It was the first implementation of the MIPS IV instruction set architecture. The R8000 is also known as the TFP, for Tremendous Floating-Point, its name during development.

History

Development of the R8000 started in the early 1990s at Silicon Graphics, Inc. (SGI). The R8000 was specifically designed to provide the performance of circa 1990s supercomputers with a microprocessor instead of a central processing unit (CPU) built from many discrete components such as gate arrays. At the time, the performance of traditional supercomputers was not advancing as rapidly as reduced instruction set computer (RISC) microprocessors. It was predicted that RISC microprocessors would eventually match the performance of more expensive and larger supercomputers at a fraction of the cost and size, making computers with this level of performance more accessible and enabling deskside workstations and servers to replace supercomputers in many situations.

First details of the R8000 emerged in April 1992 in an announcement by MIPS Computer Systems detailing future MIPS microprocessors. In March 1992, SGI announced it was acquiring MIPS Computer Systems, which became a subsidiary of SGI called MIPS Technologies, Inc. (MTI) in mid-1992. Development of the R8000 was transferred to MTI, where it continued. The R8000 was expected to be introduced in 1993, but it was delayed until mid-1994. The first R8000, a 75 MHz part, was introduced on 7 June 1994. It was priced at US$2,500 at the time. In mid-1995, a 90 MHz part appeared in systems from SGI. The R8000's high cost and narrow market (technical and scientific computing) restricted its market share, and although it was popular in its intended market, it was largely replaced with the cheaper and generally better performing R10000 introduced January 1996.

Users of the R8000 were SGI, who used it in their Power Indigo2 workstation, Power Challenge server, Power ChallengeArray cluster and Power Onyx visualization system. In the November 1994 TOP500 list, 50 systems out of 500 used the R8000. The highest ranked R8000-based systems were four Power Challenges at positions 154 to 157. Each had 18 R8000s.^[2] SGI claimed a floating-point throughput of 300 million instructions per second, a SPECfp92 rating of 310, and a more modest SPECint92 rating of 108.^[3]

Description

The chip set consisted of the R8000 microprocessor, the R8010 floating-point unit, two Tag RAMs, and the streaming cache. The R8000 is superscalar, capable of issuing up to four instructions per cycle, and executes instructions in program order. It has a five-stage integer pipeline.

R8000

The R8000 controlled the chip set and executed integer instructions. It contained the integer execution units, integer register file, primary caches and hardware for instruction fetch, branch prediction the translation lookaside buffers (TLBs).

In stage one, four instructions are fetched from the instruction cache. The instruction cache is 16 kB large, direct-mapped, virtually tagged and virtually indexed, and has a 32-byte line size. Instruction decoding and register reads occur during stage two, and branch instructions are resolved as well, leading to a one-cycle branch mispredict penalty. Load and store instructions begin execution in stage three, and integer instructions in stage four. Integer execution was delayed until stage four so that integer instructions which use the result of a load as an operand may be issued in the cycle after the load. Results are written to the integer register file in stage five.

The integer register file has nine read ports and four write ports. Four read ports supply operands to the two integer execution units (the branch unit was considered part of an integer unit). Another four read ports supply operands to the two address generators. Four ports are needed, rather than two, because of the base(register) + index(register) address style added in the MIPS IV ISA. The R8000 issues at most one integer store per cycle, and one final read port delivers the integer store data.

Two register file write ports are used to write results from the two integer functional units. The R8000 issues two integer loads per cycle, and the other two write ports are used to write the results of integer loads to the register file.

The level 1 data cache was organized as two redundant arrays, each of which had one read port and one write port. Integer stores were written to both arrays. Two loads could be processed in parallel, one on each array.

Integer functional units consisted of two integer units, a shift unit, a multiply-divide unit, and two address generator units. Multiply and divide instructions are executed in the multiply-divide unit, which is not pipelined. As a result, the latency for a multiply instruction is four cycles for 32-bit operands and six cycles for 64-bit. The latency for a divide instruction depends on the number of significant digits in the result and thus it varies from 21 to 73 cycles.

Loads and stores

Loads and stores begin execution in stage three. The R8000 has two address generation units (AGUs) that calculate virtual address for loads and stores. In stage four, the virtual addresses are translated to physical addresses by a dual-ported TLB that contains 384 entries and is three-way set associative. The 16 kB data cache is accessed in the same cycle. It is dual-ported, and is accessed via two 64-bit buses. It can service two loads or one load and one store per cycle. The cache is not protected by parity or by error correcting code (ECC). In the event of a cache miss, the data must be loaded from the streaming cache with an eight-cycle penalty. The cache is virtually indexed, physically tagged, direct mapped, has a 32-byte line size and uses a write-through with allocate protocol. If the loads hit in the data cache, the result is written to the integer register file in stage five.

R8010

The R8010 executed floating-point instructions provided by an instruction queue on the R8000. The queue decoupled the floating-point pipeline from the integer pipeline, implementing a limited form of out-of-order execution by allowing floating-point instructions to execute when possible after or before the integer instructions from the same group are issued. The pipelines were decoupled to help mitigate some of the streaming cache latency.

It contained the floating-point register file, a load queue, a store queue, and two identical floating-point units. All instructions except for divide and square-root are pipelined. The R8010 implements an iterative division and square-root algorithm that uses the multiplier for a key part, requiring the pipeline to be stalled the unit for the duration of the operation.

Arithmetic instructions except for compares have a four-cycle latency. Single and double precision divides have latencies of 14 and 20 cycles, respectively;^[1] and single and double precision square-roots have latencies of 14 and 23 cycles, respectively.^[4]

Streaming cache and Tag RAMs

The streaming cache is an external 1 to 16 MB cache that serves as the R8000's L2 unified cache and the R8010's L1 data cache. It operates at the same clock rate as the R8000 and is built from commodity synchronous static RAMs.^[1] This scheme was used to attain sustained floating point performance, which requires frequent access to data. A small low-latency primary cache would not contain enough data and frequently miss, necessitating long latency refiles that reduce performance.

The streaming cache is two-way interleaved. It has two independent banks, each containing data from even or odd addresses. It can therefore perform two reads, two writes, or a read and a write every cycle, provided that the two accesses are to separate banks.^[1]^[5] Each bank is accessed via two 64-bit unidirectional buses, one for reads, and the other for writes. This scheme was used to avoid bus turnover, which is required by bidirectional buses. By avoiding bus turnover, the cache can be read from in one cycle and then written to in the next cycle without an intervening cycle for turnover, resulting in improved performance.^[5]

The streaming cache's tags are contained on two Tag RAM chips, one for each bank. Both chips contain identical data. Each chip contains 1.189 Mbit of cache tags implemented by four-transistor SRAM cells. The chips are implemented in a 0.7 μm BiCMOS process with two levels of polysilicon and two levels of aluminium interconnect. BiCMOS circuitry was used in the decoders and combined sense amplifier and comparator portions of the chip to reduce cycle time. Each Tag RAM is 14.8 mm by 14.8 mm large, packaged in a 155-pin CPGA, and dissipates 3 W at 75 MHz.^[6] In addition to providing the cache tags, the Tag RAMs are responsible for the streaming cache being four-way set associative. To avoid high a pin count, the cache tags are four-way set associative and logic selects which set to access after lookup instead of the usual way of implementing set-associative caches.^[1]

Access to the streaming cache is pipelined to mitigate some of the latency. The pipeline has five stages: in stage one, addresses are sent to the Tag RAMs, which are accessed in stage two. Stage three is for the signals from the Tag RAMs to propagate to the SSRAMs. In stage four, the SSRAMs are accessed and data is returned to the R8000 or R8010 in stage five.

Physical

The R8000 contained 2.6 million transistors and measured 17.34 mm by 17.30 mm (299.98 mm²). The R8010 contained 830,000 transistors. In total, the two chips contained 3.43 million transistors. Both were fabricated by Toshiba in their VHMOSIII process, a 0.7 μm, triple-layer metal complementary metal–oxide–semiconductor (CMOS) process. Both are packaged in 591-pin ceramic pin grid array (CPGA) packages. Both chips used a 3.3 V power supply, and the R8000 dissipated 13 W at 75 MHz.

Notes

1 2 3 4 5 Hsu 1994
↑ Dongarra 1994
↑ Lavin, Paul (October 1995). "Indy goes to Hollywood". Computer Shopper. pp. 556–559.
↑ MIPS Technologies, Inc., 1994
1 2 MIPS 1994
↑ Unekawa 1993

Related Research Articles

The Pentium is a fifth generation, 32-bit x86 microprocessor that was introduced by Intel on March 22, 1993, as the very first CPU in the Pentium brand. It was instruction set compatible with the 80486 but was a new and very different microarchitecture design from previous iterations. The P5 Pentium was the first superscalar x86 microarchitecture and the world's first superscalar microprocessor to be in mass production—meaning it generally executes at least 2 instructions per clock mainly because of a design-first dual integer pipeline design previously thought impossible to implement on a CISC microarchitecture. Additional features include a faster floating-point unit, wider data bus, separate code and data caches, and many other techniques and features to enhance performance and support security, encryption, and multiprocessing, for workstations and servers when compared to the next best previous industry standard processor implementation before it, the Intel 80486.

In the history of computer hardware, some early reduced instruction set computer central processing units used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: MIPS, SPARC, Motorola 88000, and later the notional CPU DLX invented for education.

In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instruction refers to a particular logical register, the processor transposes this name to one specific physical register on the fly. The physical registers are opaque and cannot be referenced directly but only via the canonical names.

The UltraSPARC is a microprocessor developed by Sun Microsystems and fabricated by Texas Instruments, introduced in mid-1995. It is the first microprocessor from Sun to implement the 64-bit SPARC V9 instruction set architecture (ISA). Marc Tremblay was a co-microarchitect.

The Emotion Engine is a central processing unit developed and manufactured by Sony Computer Entertainment and Toshiba for use in the PlayStation 2 video game console. It was also used in early PlayStation 3 models sold in Japan and North America to provide PlayStation 2 game support. Mass production of the Emotion Engine began in 1999 and ended in late 2012 with the discontinuation of the PlayStation 2.

The AMD Am29000, commonly shortened to 29k, is a family of 32-bit RISC microprocessors and microcontrollers developed and fabricated by Advanced Micro Devices (AMD). Based on the seminal Berkeley RISC, the 29k added a number of significant improvements. They were, for a time, the most popular RISC chips on the market, widely used in laser printers from a variety of manufacturers.

A register file is an array of processor registers in a central processing unit (CPU). Register banking is the method of using a single name to access multiple different physical registers depending on the operating mode. Modern integrated circuit-based register files are usually implemented by way of fast static RAMs with multiple ports. Such RAMs are distinguished by having dedicated read and write ports, whereas ordinary multiported SRAMs will usually read and write through the same ports.

<span class="mw-page-title-main">POWER3</span> 1998 family of microprocessors by IBM

The POWER3 is a microprocessor, designed and exclusively manufactured by IBM, that implemented the 64-bit version of the PowerPC instruction set architecture (ISA), including all of the optional instructions of the ISA such as instructions present in the POWER2 version of the POWER ISA but not in the PowerPC ISA. It was introduced on 5 October 1998, debuting in the RS/6000 43P Model 260, a high-end graphics workstation. The POWER3 was originally supposed to be called the PowerPC 630 but was renamed, probably to differentiate the server-oriented POWER processors it replaced from the more consumer-oriented 32-bit PowerPCs. The POWER3 was the successor of the P2SC derivative of the POWER2 and completed IBM's long-delayed transition from POWER to PowerPC, which was originally scheduled to conclude in 1995. The POWER3 was used in IBM RS/6000 servers and workstations at 200 MHz. It competed with the Digital Equipment Corporation (DEC) Alpha 21264 and the Hewlett-Packard (HP) PA-8500.

The R10000, code-named "T5", is a RISC microprocessor implementation of the MIPS IV instruction set architecture (ISA) developed by MIPS Technologies, Inc. (MTI), then a division of Silicon Graphics, Inc. (SGI). The chief designers are Chris Rowen and Kenneth C. Yeager. The R10000 microarchitecture is known as ANDES, an abbreviation for Architecture with Non-sequential Dynamic Execution Scheduling. The R10000 largely replaces the R8000 in the high-end and the R4400 elsewhere. MTI was a fabless semiconductor company; the R10000 was fabricated by NEC and Toshiba. Previous fabricators of MIPS microprocessors such as Integrated Device Technology (IDT) and three others did not fabricate the R10000 as it was more expensive to do so than the R4000 and R4400.

The R4000 is a microprocessor developed by MIPS Computer Systems that implements the MIPS III instruction set architecture (ISA). Officially announced on 1 October 1991, it was one of the first 64-bit microprocessors and the first MIPS III implementation. In the early 1990s, when RISC microprocessors were expected to replace CISC microprocessors such as the Intel i486, the R4000 was selected to be the microprocessor of the Advanced Computing Environment (ACE), an industry standard that intended to define a common RISC platform. ACE ultimately failed for a number of reasons, but the R4000 found success in the workstation and server markets.

The R5000 is a 64-bit, bi-endian, superscalar, in-order execution 2-issue design microprocessor, that implements the MIPS IV instruction set architecture (ISA) developed by Quantum Effect Design (QED) in 1996. The project was funded by MIPS Technologies, Inc (MTI), also the licensor. MTI then licensed the design to Integrated Device Technology (IDT), NEC, NKK, and Toshiba. The R5000 succeeded the QED R4600 and R4700 as their flagship high-end embedded microprocessor. IDT marketed its version of the R5000 as the 79RV5000, NEC as VR5000, NKK as the NR5000, and Toshiba as the TX5000. The R5000 was sold to PMC-Sierra when the company acquired QED. Derivatives of the R5000 are still in production today for embedded systems.

The Challenge, code-named Eveready and Terminator, is a family of server computers and supercomputers developed and manufactured by Silicon Graphics in the early to mid-1990s that succeeded the earlier Power Series systems. The Challenge was later succeeded by the NUMAlink-based Origin 200 and Origin 2000 in 1996.

The Alpha 21064 is a microprocessor developed and fabricated by Digital Equipment Corporation that implemented the Alpha instruction set architecture (ISA). It was introduced as the DECchip 21064 before it was renamed in 1994. The 21064 is also known by its code name, EV4. It was announced in February 1992 with volume availability in September 1992. The 21064 was the first commercial implementation of the Alpha ISA, and the first microprocessor from Digital to be available commercially. It was succeeded by a derivative, the Alpha 21064A in October 1993. This last version was replaced by the Alpha 21164 in 1995.

The Alpha 21164, also known by its code name, EV5, is a microprocessor developed and fabricated by Digital Equipment Corporation that implemented the Alpha instruction set architecture (ISA). It was introduced in January 1995, succeeding the Alpha 21064A as Digital's flagship microprocessor. It was succeeded by the Alpha 21264 in 1998.

The Alpha 21264 is a Digital Equipment Corporation RISC microprocessor launched on 19 October 1998. The 21264 implemented the Alpha instruction set architecture (ISA).

The SPARC64 V (Zeus) is a SPARC V9 microprocessor designed by Fujitsu. The SPARC64 V was the basis for a series of successive processors designed for servers, and later, supercomputers.

The PA-8000 (PCX-U), code-named Onyx, is a microprocessor developed and fabricated by Hewlett-Packard (HP) that implemented the PA-RISC 2.0 instruction set architecture (ISA). It was a completely new design with no circuitry derived from previous PA-RISC microprocessors. The PA-8000 was introduced on 2 November 1995 when shipments began to members of the Precision RISC Organization (PRO). It was used exclusively by PRO members and was not sold on the merchant market. All follow-on PA-8x00 processors are based on the basic PA-8000 processor core.

The R2000 is a 32-bit microprocessor chip set developed by MIPS Computer Systems that implemented the MIPS I instruction set architecture (ISA). Introduced in January 1986, it was the first commercial implementation of the MIPS architecture and the first commercial RISC processor available to all companies. The R2000 competed with Digital Equipment Corporation (DEC) VAX minicomputers and with Motorola 68000 and Intel Corporation 80386 microprocessors. R2000 users included Ardent Computer, DEC, Silicon Graphics, Northern Telecom and MIPS's own Unix workstations.

The R4600, code-named "Orion", is a microprocessor developed by Quantum Effect Design (QED) that implemented the MIPS III instruction set architecture (ISA). As QED was a design firm that did not fabricate or sell their designs, the R4600 was first licensed to Integrated Device Technology (IDT), and later to Toshiba and then NKK. These companies fabricated the microprocessor and marketed it. The R4600 was designed as a low-end workstation or high-end embedded microprocessor. Users included Silicon Graphics, Inc. (SGI) for their Indy workstation and DeskStation Technology for their Windows NT workstations. The R4600 was instrumental in making the Indy successful by providing good integer performance at a competitive price. In embedded systems, prominent users included Cisco Systems in their network routers and Canon in their printers.

Since 1985, many processors implementing some version of the MIPS architecture have been designed and used widely.

References

"Aligning Instructions for Multiple Dispatch". Sidebar in Pountain, Dick, "The Last Bastion", Byte , August 1994.
Dongarra, Jack J.; Meuer, Hans W. and Strohmaier, Erich (9 November 1994). TOP500 Supercomputer Sites.
Gwennap, Linley (15 February 1993). "SGI Provides Overview of TFP CPU". Microprocessor Report , vol. 7, no. 2.
Gwennap, Linley (23 August 1993). "TFP Designed for Tremendous Floating Point". Microprocessor Report , vol. 7, no. 11.
Hsu, Peter Yan-Tek (2 June 1994). Design of the R8000 Microprocessor ^{[ permanent dead link ]}.
MIPS Technologies, Inc. (August 1994). R8000 Microprocessor Chip Set Product Overview ^{[ permanent dead link ]}.
Pountain, Dick (September 1994). "The Last Bastion". Byte .
Shen, John Paul; Lipasti, Mikko H. (2005). Modern Processor Design: Fundamentals of Superscalar Microprocessors. McGraw-Hill Professional. pp. 418–419.
Unekawa, Yasuo et al. (1993). "A 110MHz / 1Mbit synchronous Tag RAM". Symposium on VLSI Circuits . pp. 15–16.

R8000

Contents

History

Description

R8000

Loads and stores

R8010

Streaming cache and Tag RAMs

Physical

Notes

Related Research Articles

References

Further reading