Transport triggered architecture

In computer architecture, a transport triggered architecture (TTA) is a kind of processor design in which programs directly control the internal transport buses of a processor. Computation happens as a side effect of data transports: writing data into a triggering port of a functional unit triggers the functional unit to start a computation. This is similar to what happens in a systolic array. Due to its modular structure, TTA is an ideal processor template for application-specific instruction set processors (ASIP) with customized datapath but without the inflexibility and design cost of fixed function hardware accelerators.

Typically a transport triggered processor has multiple transport buses and multiple functional units connected to the buses, which provides opportunities for instruction level parallelism. The parallelism is statically defined by the programmer. In this respect (and obviously due to the large instruction word width), the TTA architecture resembles the very long instruction word (VLIW) architecture. A TTA instruction word is composed of multiple slots, one slot per bus, and each slot determines the data transport that takes place on the corresponding bus. The fine-grained control allows some optimizations that are not possible in a conventional processor. For example, software can transfer data directly between functional units without using registers.
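
For illustration, consider a hypothetical TTA with three transport buses B1–B3; the bus, unit and register names below are purely illustrative, and the move notation is the one introduced in the Programming section below. A single instruction word then has three slots, each programming one transport, so up to three moves can take place in the same clock cycle:

B1: r1 -> ALU.operand1
B2: r2 -> ALU.add.trigger
B3: LSU.result -> r5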

Transport triggering exposes some microarchitectural details that are normally hidden from programmers. This greatly simplifies the control logic of a processor, because many decisions normally done at run time are fixed at compile time. However, it also means that a binary compiled for one TTA processor will not run on another one without recompilation if there is even a small difference in the architecture between the two. The binary incompatibility problem, in addition to the complexity of implementing a full context switch, makes TTAs more suitable for embedded systems than for general purpose computing.

Of all the one-instruction set computer architectures, the TTA architecture is one of the few that has had processors built around it, and the only one that has had such processors sold commercially.

Benefits in comparison to VLIW architectures

TTAs can be seen as "exposed datapath" VLIW architectures. While a VLIW is programmed in terms of operations, a TTA splits the execution of an operation into multiple move operations. This low-level programming model enables several benefits in comparison to standard VLIW. For example, a TTA architecture can provide more parallelism with simpler register files than a VLIW can. Since the programmer is in control of the timing of the operand and result data transports, the complexity (the number of input and output ports) of the register file (RF) need not be scaled to the worst-case issue/completion scenario of the multiple parallel instructions.

An important software optimization unique to transport programming is called software bypassing. In software bypassing, the programmer bypasses the register file write-back by moving data directly to the next functional unit's operand ports. When this optimization is applied aggressively, the original move that transports the result to the register file can be eliminated completely, reducing the register file port pressure and freeing a general-purpose register for other temporary variables. The reduced register pressure, in addition to simplifying the required RF hardware, can lead to significant CPU energy savings, an important benefit especially in mobile embedded systems. [1] [2]
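
As a sketch of the idea (using the move notation of the Programming section below, with purely illustrative unit and register names), consider an addition whose result is consumed by a following multiplication. Without bypassing, the result travels through general-purpose register r3; with aggressive bypassing, that register and its transports disappear entirely:

; without software bypassing: the result passes through register r3
r1 -> ALU.operand1
r2 -> ALU.add.trigger
ALU.result -> r3
r3 -> MUL.operand1
r4 -> MUL.mul.trigger
MUL.result -> r5

; with software bypassing: the ALU result is moved straight to the multiplier
r1 -> ALU.operand1
r2 -> ALU.add.trigger
ALU.result -> MUL.operand1
r4 -> MUL.mul.trigger
MUL.result -> r5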

Structure

TTA processors are built of independent function units and register files, which are connected with transport buses and sockets.

Figure: Parts of a transport triggered architecture.

Function unit

Each function unit implements one or more operations, ranging from a simple integer addition to an arbitrary user-defined application-specific computation. Operands for operations are transferred through function unit ports.

Each function unit may have an independent pipeline. If a function unit is fully pipelined, a new operation can be started every clock cycle even though each operation takes multiple clock cycles to finish. On the other hand, a pipeline may not always accept new operation start requests while an old one is still executing.

Data memory access and communication outside the processor are handled by special function units. Function units that implement memory access operations and connect to a memory module are often called load/store units.
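
For example, a memory load through a load/store unit (here called LSU; the port and operation names are illustrative) can be expressed as a pair of moves in the notation of the Programming section below, where writing the address triggers the load and the loaded value is read from the result port after the unit's latency:

r1 -> LSU.ld.trigger     ; writing the address triggers the load operation
LSU.result -> r2         ; after the load latency, read the loaded value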

Control unit

The control unit is a special case of a function unit that controls the execution of programs. It has access to the instruction memory in order to fetch the instructions to be executed. To allow an executing program to transfer execution (jump) to an arbitrary position in the program, the control unit provides control flow operations. A control unit usually has an instruction pipeline, which consists of stages for fetching, decoding and executing program instructions.

Register files

Register files contain general-purpose registers, which are used to store variables in programs. Like function units, register files have input and output ports. The number of read and write ports, that is, how many registers can be read and written in the same clock cycle, can vary between register files.

Transport buses and sockets

The interconnect architecture consists of transport buses which are connected to function unit ports by means of sockets. Because connectivity is expensive, it is usual to reduce the number of connections between units (function units and register files). A TTA is said to be fully connected if there is a path from every unit's output port to every unit's input port.

Sockets provide the means for programming TTA processors by allowing the selection of which bus-to-port connections of the socket are enabled at any given time. Thus, the data transports taking place in a clock cycle can be programmed by defining, for each bus, the source and destination socket/port connection to enable.

Conditional execution

Some TTA implementations support conditional execution.

Conditional execution is implemented with the aid of guards. Each data transport can be conditionalized by a guard, which is connected to a register (often a 1-bit conditional register) and to a bus. If the value of the guarded register evaluates to false (zero), the data transport programmed for the bus the guard is connected to is squashed, that is, not written to its destination. Unconditional data transports are not connected to any guard and are always executed.
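
As an illustration (the "?"/"!" guard prefixes and the names below are illustrative, not the syntax of any particular TTA toolchain), a 1-bit guard register b1 can predicate individual moves:

r1 -> ALU.operand1           ; unconditional move, always executed
?b1 r2 -> ALU.add.trigger    ; takes place only if guard register b1 is true
!b1 r3 -> ALU.sub.trigger    ; inverted guard: takes place only if b1 is false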

Branches

All processors, including TTA processors, include control flow instructions that alter the program counter, which are used to implement subroutines, if-then-else, for-loop, etc. The assembly language for TTA processors typically includes control flow instructions such as unconditional branches (JUMP), conditional relative branches (BNZ), subroutine call (CALL), conditional return (RETNZ), etc. that look the same as the corresponding assembly language instructions for other processors.

Like all other operations on a TTA machine, these instructions are implemented as "move" instructions to a special function unit.

TTA implementations that support conditional execution, such as the sTTAck and the first MOVE prototype, can implement most of these control flow instructions as a conditional move to the program counter. [3] [4]
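
For example (in the illustrative move and guard notation used elsewhere in this article, with a hypothetical control unit port name), a conditional branch becomes a guarded move of the target address to the control unit's jump trigger port, and an unconditional jump is the same move without a guard:

?b1 loop_start -> CU.jump.trigger    ; branch to loop_start only if guard b1 is true
exit_label -> CU.jump.trigger        ; unconditional jump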

TTA implementations that only support unconditional data transports, such as the Maxim Integrated MAXQ, [5] typically have a special function unit tightly connected to the program counter that responds to a variety of destination addresses. Each such address, when used as the destination of a "move", has a different effect on the program counter: each "relative branch <condition>" instruction has a different destination address for each condition, and other destination addresses are used for CALL, RETNZ, etc.

Programming

In more traditional processor architectures, a processor is usually programmed by defining the executed operations and their operands. For example, an addition instruction in a RISC architecture could look like the following.

add r3, r1, r2 

This example operation adds the values of general-purpose registers r1 and r2 and stores the result in register r3. Coarsely, executing the instruction in the processor typically results in translating the instruction into control signals which control the interconnection network connections and the function units. The interconnection network is used to transfer the current values of registers r1 and r2 to the function unit that is capable of executing the add operation, often called an ALU (arithmetic logic unit). Finally, a control signal selects and triggers the addition operation in the ALU, whose result is transferred back to register r3.

TTA programs do not define the operations, but only the data transports needed to write and read the operand values. The operation itself is triggered by writing data to the triggering operand of the operation. Thus, an operation is executed as a side effect of the triggering data transport. Therefore, executing an addition operation in a TTA requires three data transport definitions, also called moves. A move defines the endpoints for a data transport taking place on a transport bus. For instance, a move can state that a data transport from function unit F, port 1, to register file R, register index 2, should take place on bus B1. If there are multiple buses in the target processor, each bus can be utilized in parallel in the same clock cycle. Thus, it is possible to exploit data transport level parallelism by scheduling several data transports in the same instruction.

An addition operation can be executed in a TTA processor as follows:

r1 -> ALU.operand1
r2 -> ALU.add.trigger
ALU.result -> r3

The second move, a write to the second operand of the function unit called ALU, triggers the addition operation. This makes the result of addition available in the output port 'result' after the execution latency of the 'add'.

The ports associated with the ALU may act as an accumulator, allowing creation of macro instructions that abstract away the underlying TTA:

lda r1   ; "load ALU": move value to ALU operand 1
add r2   ; add: move value to add trigger
sta r3   ; "store ALU": move value from ALU result

Programmer visible operation latency

The leading philosophy of TTAs is to move complexity from hardware to software. Because of this, several additional hazards are exposed to the programmer. One of them is the delay slot, the programmer-visible operation latency of the function units. Timing is completely the responsibility of the programmer: the instructions must be scheduled so that a result is neither read too early nor too late. There is no hardware interlock to stall the processor if a result is read too early. Consider, for example, an architecture that has an operation add with a latency of 1 and an operation mul with a latency of 3. When triggering the add operation, the result can be read in the next instruction (the next clock cycle), but for mul, one has to wait two instructions before the result can be read; the result is ready for the 3rd instruction after the triggering instruction.
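
To make the timing concrete (same illustrative notation as above, assuming at least two transport buses so two moves fit in one instruction), the schedule below triggers the mul with latency 3 in the first instruction and fills the two following delay slots with an independent add of latency 1 before reading the multiplication result:

; instruction 1: start the multiplication
r1 -> MUL.operand1
r2 -> MUL.mul.trigger
; instruction 2: mul result not ready yet - trigger an independent add
r4 -> ALU.operand1
r5 -> ALU.add.trigger
; instruction 3: the add (latency 1) can be read; mul is still in flight
ALU.result -> r6
; instruction 4: the 3rd instruction after the trigger - mul result is now valid
MUL.result -> r3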

Reading a result too early returns the result of a previously triggered operation, or, if no operation has been triggered in the function unit before, an undefined value. On the other hand, a result must be read early enough to ensure that the next operation's result does not overwrite the as yet unread result in the output port.

Because the programmer-visible processor context includes, in addition to the register file contents, the function unit pipeline registers and/or the function unit input and output ports, the context saves required for external interrupt support can become complex and expensive to implement in a TTA processor. Therefore, interrupts are usually not supported by TTA processors; their task is instead delegated to external hardware (e.g., an I/O processor), or their need is avoided by using an alternative synchronization/communication mechanism such as polling.

References

  1. V. Guzma, P. Jääskeläinen, P. Kellomäki, and J. Takala, "Impact of Software Bypassing on Instruction Level Parallelism and Register File Traffic".
  2. Johan Janssen. "Compiler Strategies for Transport Triggered Architectures". 2001. p. 168.
  3. Henk Corporaal. "Transport Triggered Architectures Examined for General Purpose Applications". p. 6.
  4. Aliaksei V. Chapyzhenka. "sTTAck: Stack Transport Triggered Architecture".
  5. "MAXQ Family User's Guide". Maxim Integrated. Section "1.1 Instruction Set": "A register-based, transport-triggered architecture allows all instructions to be coded as simple transfer operations. All instructions reduce to either writing an immediate value to a destination register or memory location or moving data between registers and/or memory locations."
  6. Catsoulis, John (2005), Designing Embedded Hardware (2nd ed.), O'Reilly Media, pp. 327–333, ISBN 978-0-596-00755-3.
  7. "Introduction to the MAXQ Architecture". Maxim Integrated. Centralized access to resources (includes transfer map diagram).
  8. Dr. Dobb's article with a 32-bit FPGA CPU in Verilog.
  9. Web site with more details on the Dr. Dobb's CPU. Archived 2013-02-18 at archive.today.