TRIPS architecture

Last updated
Processor TRIPS. Trips floorplan.jpg
Processor TRIPS.

TRIPS was a microprocessor architecture designed by a team at the University of Texas at Austin in conjunction with IBM, Intel, and Sun Microsystems. TRIPS uses an instruction set architecture designed to be easily broken down into large groups of instructions (Graphs) that can be run on independent processing elements. The design collects related data into the graphs, attempting to avoid expensive data reads and writes and keeping the data in high speed memory close to the processing elements. The prototype TRIPS processor contains 16 such elements. TRIPS hoped to reach 1 TFLOP on a single processor per papers published from 2003 to 2006. [1]

Contents

EDGE

TRIPS is a processor based on the Explicit Data Graph Execution (EDGE) concept. EDGE attempts to bypass certain performance bottlenecks that have come to dominate modern systems. [2]

EDGE is based on the processor being able to better understand the instruction stream being sent to it, not seeing it as a linear stream of individual instructions, but rather blocks of instructions related to a single task using isolated data. EDGE attempts to run all of these instructions as a block, distributing them internally along with any data they need to process. [3] The compilers examine the code and find blocks of code that share information in a specific way. These are then assembled into compiled "hyper-blocks" and fed into the CPU. Since the compiler is guaranteeing that these blocks have specific inter-dependencies between them, the processor can isolate the code in a single functional unit with its own local memory.

For example, with a program that adds two numbers from memory, then adds that result to another value in memory, a traditional processor would have to notice the dependency and schedule the instructions to run one after the other, storing the intermediate results in the registers. In an EDGE processor, the inter-dependencies between the data in the code would be noticed by the compiler, which would compile these instructions into a single block. That block would then be fed, along with all the data it needed to complete, into a single functional unit and its own private set of registers. This ensures that no additional memory fetching is required, as well as keeping the registers physically close to the functional unit that needs those values.

Code that did not rely on this intermediate data would be compiled into separate hyper-blocks. Of course its possible that an entire program would use the same data, so the compilers also look for instances where data is handed off to other code and then effectively abandoned by the original block, which is a common access pattern. In this case the compiler will still produce two separate hyper-blocks, but explicitly encode the hand-off of the data rather than simply leaving it stored in some shared memory location. In doing so, the processor can "see" these communications events and schedule them to run in proper order. Blocks that have considerable inter-dependencies are re-arranged by the compiler to spread out the communications in order to avoid bottle-necking the transport.

This greatly increased the isolation of the individual functional units. EDGE processors are limited in parallelism by the capabilities of the compiler, not the on-chip systems. Whereas modern processors are reaching a plateau at four-wide parallelism, EDGE designs can scale out much wider. They can also scale "deeper" as well, handing off blocks from one unit to another in a chain that is scheduled to reduce the contention due to shared values.

TRIPS

University of Texas at Austin's implementation of the EDGE concept is the TRIPS processor, the Tera-op, Reliable, Intelligently adaptive Processing System. A TRIPS CPU is built by repeating a single basic functional unit as many times as needed. The TRIPS design's use of hyper-blocks that are loaded en-masse allows for dramatic gains in speculative execution. Whereas a traditional design might have a few hundred instructions to examine for possible scheduling into the functional units, the TRIPS design has thousands, hundreds of instructions per hyper-block, and hundreds of hyper-blocks being examined. This leads to greatly improved functional unit utilization; scaling its performance to a typical four-issue superscalar design, TRIPS can process about three times as many instructions per cycle.

Whilst traditional designs uses varying units, allowing more parallelism than the four-wide schedulers would otherwise allow, TRIPS, in order to keep all of the units active includes all types of instructions in the instruction stream. As this is often not the case in practice, traditional CPUs often have many idle functional units. In TRIPS, the individual units are general purpose, allowing any instruction to run on any core. Not only does this avoid the need to carefully balance the number of different kinds of cores, but it also means that a TRIPS design can be built with any number of cores needed to reach a particular performance requirement. A single-core TRIPS CPU, with a simplified (or eliminated) scheduler will run a set of hyperblocks exactly like one with hundreds of cores, only slower.

The performance is also not dependent on the types of data being fed in, meaning that a TRIPS CPU will run a much wider variety of tasks at the same level performance. For instance, if a traditional CPU is fed a math-heavy workload, it will bog as soon as all the floating point units are busy, with the integer units lying idle. If it is fed a data intensive program like a database job, the floating point units will lie idle while the integer units bog. In a TRIPS CPU every functional unit will add to the performance of every task, because every task can run on every unit. The designers refer to as a "polymorphic processor".

Like TRIPS, DSPs gain additional performance by limiting data inter-dependencies, but unlike TRIPS they do so by allowing only a very limited workflow to run on them. TRIPS would be just as fast as a custom DSP on these workloads, but equally able to run other workloads at the same time. Though a TRIPS processor likely couldn't be used to replace highly customized designs like GPUs in modern graphics cards, they may be able to replace or outperform many lower-performance chips like those used for media processing.

The reduction of the global register file also results in non-obvious gains. The addition of new circuitry to modern processors has meant that their overall size has remained about the same even as they move to smaller process sizes. As a result, the relative distance to the register file has grown, and this limits the possible cycle speed due to communications delays. In EDGE the data is generally more local or isolated in well defined inter-core links, eliminating large "cross-chip" delays. This means the individual cores can be run at higher speeds, limited by the signalling time of the much shorter data paths.

The combination of these two design changes effects greatly improves system performance. However, as of 2008, GPUs from ATI and NVIDIA have already exceeded the 1 teraflop barrier (albeit for specialized applications). As for traditional CPUs, a contemporary (2007) Mac Pro using a 2-core Intel Xeon can only perform about 5 GFLOPs on single applications. [4]

In 2003, the TRIPS team started implementing a prototype chip. Each chip has two complete cores, each one with 16 functional units in a four-wide, four-deep arrangement. In the current implementation, the compiler constructs "hyperblocks" of 128 instructions each, and allows the system to keep eight blocks "in flight" at the same time, for a total of 1,024 instructions per core. The basic design can include up to 32 chips interconnected, approaching 500 GFLOPS. [5]

Related Research Articles

<span class="mw-page-title-main">Central processing unit</span> Central computer component which executes instructions

A central processing unit (CPU), also called a central processor, main processor, or just processor, is the most important processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, logic, controlling, and input/output (I/O) operations. This role contrasts with that of external components, such as main memory and I/O circuitry, and specialized coprocessors such as graphics processing units (GPUs).

<span class="mw-page-title-main">Process (computing)</span> Particular execution of a computer program

In computing, a process is the instance of a computer program that is being executed by one or many threads. There are many different process models, some of which are light weight, but almost all processes are rooted in an operating system (OS) process which comprises the program code, assigned system resources, physical and logical access permissions, and data structures to initiate, control and coordinate execution activity. Depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently.

<span class="mw-page-title-main">Superscalar processor</span> CPU that implements instruction-level parallelism within a single processor

A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.

Very long instruction word (VLIW) refers to instruction set architectures that are designed to exploit instruction-level parallelism (ILP). A VLIW processor allows programs to explicitly specify instructions to execute in parallel, whereas conventional central processing units (CPUs) mostly allow programs to specify instructions to execute in sequence only. VLIW is intended to allow higher performance without the complexity inherent in some other designs.

<span class="mw-page-title-main">Transputer</span> Series of pioneering microprocessors from the 1980s

The transputer is a series of pioneering microprocessors from the 1980s, intended for parallel computing. To support this, each transputer had its own integrated memory and serial communication links to exchange data with other transputers. They were designed and produced by Inmos, a semiconductor company based in Bristol, United Kingdom.

<span class="mw-page-title-main">Parallel computing</span> Programming paradigm in which many processes are executed simultaneously

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

<span class="mw-page-title-main">Instruction-level parallelism</span> Ability of computer instructions to be executed simultaneously with correct results

Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution.

In computer science, instruction scheduling is a compiler optimization used to improve instruction-level parallelism, which improves performance on machines with instruction pipelines. Put more simply, it tries to do the following without changing the meaning of the code:

<span class="mw-page-title-main">Microarchitecture</span> Component of computer engineering

In electronics, computer science and computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as µarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

<span class="mw-page-title-main">Hardware acceleration</span> Specialized computer hardware

Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently when compared to software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

In computer science, stream processing is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation. The software stack for these systems includes components such as programming models and query languages, for expressing computation; stream management systems, for distribution and scheduling; and hardware components for acceleration including floating-point units, graphics processing units, and field-programmable gate arrays.

MAJC was a Sun Microsystems multi-core, multithreaded, very long instruction word (VLIW) microprocessor design from the mid-to-late 1990s. Originally called the UltraJava processor, the MAJC processor was targeted at running Java programs, whose "late compiling" allowed Sun to make several favourable design decisions. The processor was released into two commercial graphical cards from Sun. Lessons learned regarding multi-threads on a multi-core processor provided a basis for later OpenSPARC implementations such as the UltraSPARC T1.

<span class="mw-page-title-main">Multi-core processor</span> Microprocessor with more than one processing unit

A multi-core processor is a microprocessor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions. The instructions are ordinary CPU instructions but the single processor can run instructions on separate cores at the same time, increasing overall speed for programs that support multithreading or other parallel computing techniques. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package. The microprocessors currently used in almost all personal computers are multi-core.

PRISM was Apollo Computer's high-performance CPU used in their DN10000 series workstations. It was for some time the fastest microprocessor available, a high fraction of a Cray-1 in a workstation. Hewlett-Packard purchased Apollo in 1989, ending development of PRISM, although some of PRISM's ideas were later used in HP's own HP-PA Reduced instruction set computer (RISC) and Itanium processors.

<span class="mw-page-title-main">History of general-purpose CPUs</span>

The history of general-purpose CPUs is a continuation of the earlier history of computing hardware.

The Sieve C++ Parallel Programming System is a C++ compiler and parallel runtime designed and released by Codeplay that aims to simplify the parallelization of code so that it may run efficiently on multi-processor or multi-core systems. It is an alternative to other well-known parallelisation methods such as OpenMP, the RapidMind Development Platform and Threading Building Blocks (TBB).

<span class="mw-page-title-main">Multithreading (computer architecture)</span> Ability of a CPU to provide multiple threads of execution concurrently

In computer architecture, multithreading is the ability of a central processing unit (CPU) to provide multiple threads of execution concurrently, supported by the operating system. This approach differs from multiprocessing. In a multithreaded application, the threads share the resources of a single or multiple cores, which include the computing units, the CPU caches, and the translation lookaside buffer (TLB).

Explicit data graph execution, or EDGE, is a type of instruction set architecture (ISA) which intends to improve computing performance compared to common processors like the Intel x86 line. EDGE combines many individual instructions into a larger group known as a "hyperblock". Hyperblocks are designed to be able to easily run in parallel.

Stream Processors, Inc. (SPI), was a Silicon Valley-based fabless semiconductor company specializing in the design and manufacture of high-performance digital signal processors for applications including video surveillance, multi-function printers and video conferencing. The company ceased operations in 2009.

<span class="mw-page-title-main">Kepler (microarchitecture)</span> GPU microarchitecture by Nvidia

Kepler is the codename for a GPU microarchitecture developed by Nvidia, first introduced at retail in April 2012, as the successor to the Fermi microarchitecture. Kepler was Nvidia's first microarchitecture to focus on energy efficiency. Most GeForce 600 series, most GeForce 700 series, and some GeForce 800M series GPUs were based on Kepler, all manufactured in 28 nm. Kepler found use in the GK20A, the GPU component of the Tegra K1 SoC, and in the Quadro Kxxx series, the Quadro NVS 510, and Tesla computing modules.

References