Designer | Ivan Godard Mill Computing, Inc. |
---|---|
Bits | 64 |
Design | Belt machine |
Open | No |
Registers | |
33 |
The Mill architecture [1] is a novel belt machine-based computer architecture for general-purpose computing. It has been under development since about 2003 by Ivan Godard and his startup Mill Computing, Inc., formerly named Out Of The Box Computing, in East Palo Alto, California. [2] [3] [4] [5] Mill Computing claims it has a "10x single-thread power/performance gain over conventional out-of-order superscalar architectures" but "runs the same programs, without rewrite". [6]
Mill Computing was founded by persons who formerly worked together on a family of digital signal processors (DSPs), the Philips Trimedia.
The designers claim that the power and cost improvements come by adapting a DSP-like deeply-pipelined processor to general-purpose code. The timing hazards from branches and memory access are said to be handled not via speculative execution, pipelining, and other late binding, but statically-scheduled logic. The claimed improvements in power and area are said to come from eliminating dynamic optimizing hardware: register-renaming, out-of-order execution hazard management, and dynamic cache optimizing. To replace that hardware, each Mill processor is designed to have timing and memory-access behavior predictable to single cycles, so that all the scheduling is subject to a highly-optimizing compiler.
Mill uses a very long instruction word (VLIW)-style encoding to place up to 33 simple operations, a.k.a. opcodes in wide instruction words. These words are not contiguous in memory, but are split into two instruction bundles and placed into two data streams. Each stream is handled by a mostly independent decoder with its own top-level instruction cache.
Instructions are arranged in extended basic blocks, and decoding for both halves begin at the same address in the middle of the block. As the instruction bundles are decoded the program counter in one stream increases and the counter in the other decoder decreases [7]
Each instruction bundle has a header containing the byte count of the instruction, and offset of opcode bundles within. The Mill uses 3 bundles per instruction. Each opcode in the bundle is a fixed bit length, and controls a specific execution slot or pipeline. However each bundle has a variable number of instructions actually encoded. Each execution slot can start speculatively executing its instruction field in the bundle buffer, and if later found to be outside the actual instruction bundle, the slot is stopped and results discarded.
If one of the decoders has no instructions in its stream, a no-op delay can be issued by a small fixed-format data item located in an alignment hole of the instruction bundle in the opposite decoder. This helps maintain code density by reducing the incidence of no-operation codes in Mill code.
Ivan Godard, the designer of Mill, quotes research finding that during operation of a standard register machine, of values stored to processor registers: 6% are accessed never; 80%, once; and only 14%, more than once.
Thus, the Mill uses a novel temporal register addressing scheme, the belt by analogy to a conveyor belt . The operands of the arithmetic logic units (ALUs) and other functional units may be taken from any position on the belt, and the result from the computation is dropped (stored) in the front position of the belt, advancing the belt to make room. As the belt is fixed length, drops in the front are matched by older operands falling off the back; pushed-off operands become inaccessible and must be explicitly saved if still needed for later work. Most operations of the instruction set work only with data on the belt, not on data registers or main memory cells. [8]
For a typical instruction like add
, both argument operands come from explicitly named positions on the belt, and the result is dropped on the front, ready for the next instruction. Operations with multiple results simply drop more values at the belt front. For example, division can produce a quotient and a remainder. Operations with overflow can produce double-wide results. Most belt instructions are encoded as just an operation code (opcode) and two belt positions, with no added fields to specify a result register, memory address, or literal constant. This encoding is easily extended to richer operations with more than two inputs or more than one result. Constant operands are dropped by separate load immediate
instructions. All access of program variables in main random-access memory (RAM) is segregated into separate load
or store
instructions containing one memory address, or some way to calculate that address from belt operands.
All belt machines have variants of the load/store opcodes to access local variables and the heap. This can be by offsets, from a pointer on the belt, or from various special-purpose base registers. Similarly, there will be instructions to branch to an address taken from the belt, along with branches relative to the program counter.
Because each drop of a result moves the prior belt content along to later positions in the queue, a given operand continually changes its position (and hence address) as a result of later execution. In effect, an access to the operand at position zero is a request for the most recent value dropped to the belt, while (for example) a reference to position five is to the sixth most recent drop. Thus the addresses of belt operands reflect the belt history over time. This is temporal addressing. It is hard for human programmers to keep track of belt contents, and hence operand addresses, when writing assembly code for a belt machine. However, it is easy for a compiler to track the changing contents and emit the correct position addresses in generated code.
The belt is fixed length and may be too short to hold all live transient operands before they are pushed off the end. If an operand is needed for longer than its belt lifetime, it must be saved while still on the belt (spill) and later restored to the belt when needed again (fill). This situation is equivalent to the need to spill registers to memory when a program runs out of registers in a general-register machine. Spilled operands may be written to memory using normal store instructions, and restored using normal load instructions, or spill and fill may use special-purpose storage and associated operations that are faster or offer other advantages over load and store.
The operands on the belt are read-only. New results do not overwrite prior values. The belt is thus a single-assignment structure, and is immune to the data hazards that must be dealt with by modern out-of-order general-register machines.
Dense machine code was very valuable in the 1960s, when main memory was very costly and limited, even on mainframe computers. It became important again on the initially-tiny memories of minicomputers, and then microprocessors. Density remains important today, for applications for smartphone, or downloaded into browsers over slow Internet connections, and in read-only memory (ROM) for embedded applications. A more general advantage of increased density is improved effectiveness of caches and instruction prefetch.
Belt machines have smaller instructions than register-based machines, due to not needing a destination address for results. This saving can make a significant difference for fixed-length instruction formats, which normally use power-of-two instruction widths. If there are thirty-two addressable elements (registers on a general-register machine, belt positions on a belt machine), then each element address occupies five bits in the instruction, needing 15 bits for the three-address format of a general-register machine, but only 10 bits using the two-address format of a belt machine. Because bits are also needed for opcode and other information in the instruction, the (power-of-two constrained) instruction width often determines the maximum number of addressable elements possible in a design. Typically a belt machine instruction can support the encoding of double the number of addressable elements compared to a general-register machine of the same instruction width. There are similar gains in variable-length instruction encodings.
Overall, belt machine code is less compact than for stack machines, which use no operand addresses, but often must introduce stack-manipulation instructions unneeded on a belt machine. The instructions for accumulator machines are not padded out with multiple register fields, instead, they use the return stack and need no extra memory reference instructions.
While a belt machine presents an operand queue as the program model, the mill architecture doesn't implement the belt as a physical queue (shift register) in the implemented hardware. Instead it is a semantic representation of the bypass network present in most fast computers, which intercepts pipelined accesses to registers, routing them directly to the execution units that need the result. The number of registers is reasonably small: those needed to pipeline the output of each functional unit, and one for each possible belt item. The small number of registers reduces the size, power and complexity of the network to access the registers. Live data values are kept in conveniently addressable physical resources (individual registers, register files, static random-access memory (SRAM), or operand forwarding from functional units) and generally not moved for the duration of their belt lifetime. Instruction decoder maps logical belt positions to physical locations. The mapping is updated to reflect the changes of logical position arising from newly dropped results.
A patent US 9513921 on the belt was granted in 2016.
Depending on the type and success of load operations, the Mill also assigns metadata to each belt item, including status, width, and vectorization count. Operations operate on the item described. Thus, the width and vector count are not part of the instruction coding. If an operation fails, the failure information is hashed, and placed in the destination, with its metadata, for use in debugging.
The Mill also uses the metadata to assist speculative execution and pipelining. For example, if a vector load operation fails (e.g., part of it leaves a protection boundary) those parts of that belt entry will be marked as not a result
(NaR) in the metadata. This allows speculatively-executed vector code to emulate per-vector-item fault behavior. The NaR items create a fault only if an attempt occurs to store them or perform other non-speculative code on them. If they are never used, no fault is ever created.
The Mill's architecture appears able to reduce the size and complexity of pipelined loop code. In the pipeline video, every operation was required to cope with a special operand value called None
(not to be confused with not a number
in floating point formats), which has special semantics for this purpose: operations where at least one argument is a None
generally produce a None
as output, and when a None
is attempted to be stored to memory, that store (or portion of a store for vectors where only some elements are None
) is ignored, leaving that memory location undisturbed. This special None
value is not implemented as a reserved bit pattern, but by using the extra metadata bits that are associated with each belt item. In the first few iterations of a pipelined loop, the code drops a group of Nones
on the belt using a special retire
operation, which tells the CPU how many items should drop in that cycle (that is, however many real items drop onto the belt from previous scheduled operations, retire
drops only enough additional Nones
to bring the total number of drops that cycle to the requested amount - once steady state is reached, generally no Nones
will be generated). This way, the belt always has the expected number of new items for the usual steady-state loop body to operate on, with the Nones
acting as placeholders for data that isn't ready. As the operations scheduled from previous iterations of the loop begin dropping results, the belt therefore starts each new loop iteration with more real data items and fewer placeholder Nones
(the ordering rules for simultaneous drops from different-latency operations and retire
ensure that as more real results appear in new iterations, they always occupy a position that had a None
in all previous iterations, so that each operation can use the same input belt number for all iterations). Meanwhile, the store operations in the loop body receive None
values until the loop's steady state is reached, and therefore have no effect until real results are available for storing. Thus, the loop body that handles the steady-state of the pipeline code, which includes appropriate retire
operations, acts as its own prologue code. The processing of the final elements through the pipeline can generally be finished by having this same loop deliberately execute extra iterations, such that the remaining scheduled operations have time to finish and be stored to memory, because nearly all operations have no side effects (attempted invalid memory reads merely produce a NaR
value on the belt, which does not cause a fault unless it is then used by a store or flow control operation).
To pipeline nested loops, the Mill treats each loop almost like a subroutine call, with automatic saves and restores of appropriate state (belt and scratchpad).
Mill instructions are phased, with the up to 33 operations in a single instruction word being issued over three clock cycles. Mill phasing can capture very short traces and data-flows in a single instruction, and is claimed to improve the available instruction-level parallelism, particularly around control flow. Each phase is overlapped with different phases from neighboring instructions. The phases are also closely tied to the decode bundle arrangement, and allow the decode hardware to be simpler and pipe-lined.
Within an instruction, the reader phase occurs first. These are operations that require no inputs and create output that is available the next cycle. These drop values directly from the instruction stream or from reading static byte addresses of the scratchpad.
Next is the op or compute phase that takes inputs from the belt and drops values onto the belts, such as loads or arithmetic operations. Outputs from the compute phase may take several cycles to retire, dropping onto the belt with hardware-enforced static latency and order.
Then writer phase reads value from the belt and changes global state, but not does not create belt values. Stores and branches occur here, as well as writes to scratchpad addresses.
There are a few other conceptual phases that aren't part of the 3-cycle skew. The pick operation is equivalent to the ternary if operator (?:) and is implemented on bypass network control between the compute and writer phase and adds no computational delay. The Call phase is implemented in the same place, and the hardware saves/restores state such that from the program model no cycles have occurred in the callee, even though many real cycles may have passed until the return.
There are several versions of the Mill processor in development, spanning Tin (low-end uses) to Gold (high-performance uses). The company estimates that dual-core Gold chips implemented with 28 nm lithography may work at 1.2 GHz with a typical thermal design power (TDP) of 28 watts and performance of 79 billion operations per second (GIPS). [7]
Different versions of the Mill are intended for different markets, and are said to have different instruction set architectures, different numbers of execution units, different pipeline timings, and thus, very different binaries. To accommodate these, compilers are required to emit a specification which is then recompiled into an executable binary by a recompiler supplied by the Mill Computing company. In this way, code that can be distributed is adapted to specifics of the exact model's pipeline, binary coding, etc.
The development of so many tool sets and processor designs may be impractically costly. Ivan Godard said that Mill's plan is to develop software tools that accept a specification for a Mill processor, and then write the software tools (assembler, compiler backend, and simulator), and the Verilog describing the CPU. In a demo video, Mill claimed to show early versions of the software to create an assembler and simulator. The bulk of the compiler is said to be a port of LLVM. As of 2014 [update] , it is incomplete.
In computer programming, assembly language, often abbreviated asm, is any low-level programming language in which there is a very strong correspondence between the instructions in the language and the architecture's machine code instructions. Because assembly depends on the machine code instructions, every assembly language is designed for exactly one specific computer architecture. Assembly language may also be called symbolic machine code.
A central processing unit (CPU), also called a central processor, main processor or just processor, is the electronic circuitry within a computer that executes instructions that make up a computer program. The CPU performs basic arithmetic, logic, controlling, and input/output (I/O) operations specified by the instructions in the program. This contrasts with external components such as main memory and I/O circuitry, and specialized processors such as graphics processing units (GPUs).
Microcode is a computer hardware technique that interposes a layer of organisation between the CPU hardware and the programmer-visible instruction set architecture of the computer. As such, the microcode is a layer of hardware-level instructions that implement higher-level machine code instructions or internal state machine sequencing in many digital processing elements. Microcode is used in general-purpose central processing units, although in current desktop CPUs, it is only a fallback path for cases that the faster hardwired control unit cannot handle.
In computing, an optimizing compiler is a compiler that tries to minimize or maximize some attributes of an executable computer program. Common requirements are to minimize a program's execution time, memory footprint, storage size, and power consumption.
In computer science, threaded code is a programming technique where the code has a form that essentially consists entirely of calls to subroutines. It is often used in compilers, which may generate code in that form or be implemented in that form themselves. The code may be processed by an interpreter or it may simply be a sequence of machine code call instructions.
In computer science, an instruction set architecture (ISA) is an abstract model of a computer. It is also referred to as architecture or computer architecture. A realization of an ISA, such as a central processing unit (CPU), is called an implementation.
In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, compared to the scalar processors, whose instructions operate on single data items. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to the vector supercomputer's demise in the later 1990s.
The Intel MCS-51 is a single chip microcontroller (MCU) series developed by Intel in 1980 for use in embedded systems. The architect of the Intel MCS-51 instruction set was John H. Wharton. Intel's original versions were popular in the 1980s and early 1990s and enhanced binary compatible derivatives remain popular today. It is an example of a complex instruction set computer, and has separate memory spaces for program instructions and data.
In computer science, instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps performed by different processor units with different parts of instructions processed in parallel.
In computing, an opcode is the portion of a machine language instruction that specifies the operation to be performed. Beside the opcode itself, most instructions also specify the data they will process, in the form of operands. In addition to opcodes used in the instruction set architectures of various CPUs, which are hardware devices, they can also be used in abstract computing machines as part of their byte code specifications.
In the history of computer hardware, some early reduced instruction set computer central processing units used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: MIPS, SPARC, Motorola 88000, and later the notional CPU DLX invented for education.
In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. While a programmer in assembly language refers for instance to a logical register accu
, the processor transposes this name to one specific physical register on the fly. The physical registers are opaque and cannot be referenced directly but only via the canonical names.
In computer science, computer engineering and programming language implementations, a stack machine is a mode of computation where executive control is maintained wholly through append (push), readoff and truncation (pop), of a first-in-last-out memory buffer, known as a stack, requiring very few processor registers. A stack machine is sufficient to coordinate operation of an entire computer and operating system, for example the Burroughs B5000, may define a particular software program, for example the interpreter for Adobe's PostScript print formatting language, or may be used in only part of the execution thread of a program.
The instruction cycle is the cycle that the central processing unit (CPU) follows from boot-up until the computer has shut down in order to process instructions. It is composed of three main stages: the fetch stage, the decode stage, and the execute stage.
Addressing modes are an aspect of the instruction set architecture in most central processing unit (CPU) designs. The various addressing modes that are defined in a given instruction set architecture define how the machine language instructions in that architecture identify the operand(s) of each instruction. An addressing mode specifies how to calculate the effective memory address of an operand by using information held in registers and/or constants contained within a machine instruction or elsewhere.
In computer engineering, an orthogonal instruction set is an instruction set architecture where all instruction types can use all addressing modes. It is "orthogonal" in the sense that the instruction type and the addressing mode vary independently. An orthogonal instruction set does not impose a limitation that requires a certain instruction to use a specific register so there is little overlapping of instruction functionality.
In computer architecture, a transport triggered architecture (TTA) is a kind of processor design in which programs directly control the internal transport buses of a processor. Computation happens as a side effect of data transports: writing data into a triggering port of a functional unit triggers the functional unit to start a computation. This is similar to what happens in a systolic array. Due to its modular structure, TTA is an ideal processor template for application-specific instruction-set processors (ASIP) with customized datapath but without the inflexibility and design cost of fixed function hardware accelerators.
In computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler instead of the processor. Some computer architectures have explicit support for software pipelining, notably Intel's IA-64 architecture.
The Little Man Computer (LMC) is an instructional model of a computer, created by Dr. Stuart Madnick in 1965. The LMC is generally used to teach students, because it models a simple von Neumann architecture computer—which has all of the basic features of a modern computer. It can be programmed in machine code or assembly code.
In computing, an arithmetic logic unit (ALU) is a combinational digital circuit that performs arithmetic and bitwise operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on floating point numbers. It is a fundamental building block of many types of computing circuits, including the central processing unit (CPU) of computers, FPUs, and graphics processing units (GPUs).