The control unit (CU) is a component of a computer's central processing unit (CPU) that directs the operation of the processor. It tells the computer's memory, arithmetic and logic unit and input and output devices how to respond to the instructions that have been sent to the processor.
It directs the operation of the other units by providing timing and control signals. Most computer resources are managed by the CU. It directs the flow of data between the CPU and the other devices. John von Neumann included the control unit as part of the von Neumann architecture.In modern computer designs, the control unit is typically an internal part of the CPU with its overall role and operation unchanged since its introduction.
The simplest computers use a multicycle microarchitecture. These were the earliest designs. They are still popular in the very smallest computers, such as the embedded systems that operate machinery.
In a multicycle computer, the control unit often steps through the Von Neumann Cycle: Fetch the instruction, Fetch the operands, do the instruction, write the results. When the next instruction is placed in the control unit, it changes the behavior of the control unit to finish the instruction correctly. So, the bits of the instruction directly control the control unit, which in turn controls the computer.
The control unit may include a binary counter to tell the control unit's logic what step it should do.
Multicycle control units typically use both the rising and falling edges of their square-wave timing clock. They operate a step of their operation on each edge of the timing clock, so that a four-step operation completes in two clock cycles.
Many computers have two different types of unexpected events. An interrupt occurs because some type of input or output needs software attention in order to operate correctly. An exception is caused by the computer's operation. One crucial difference is that the timing of an interrupt cannot be predicted. Another is that some exceptions (e.g. a memory-not-available exception) can be caused by an instruction that needs to be restarted.
Control units can be designed to handle interrupts in one of two typical ways. If a quick response is most important, a control unit is designed to abandon work to handle the interrupt. In this case, the work in process will be restarted after the last completed instruction. If the computer is to be very inexpensive, very simple, very reliable, or to get more work done, the control unit will finish the work in process before handling the interrupt. Finishing the work is inexpensive, because it needs no register to record the last finished instruction. It is simple and reliable because it has the least number of states. It also wastes the least amount of work.
Exceptions can be made to operate like interrupts in very simple computers. If virtual memory is required, then a memory-not-available exception must retry the failing instruction.
It is common for multicycle computers to use more cycles. Sometimes it takes longer to take a conditional jump, because the program counter has to be reloaded. Sometimes they do multiplication or division instructions by a process something like binary long multiplication and division. Very small computers might do arithmetic one or a few bits at a time. Some computers have very complex instructions that take many steps.
Many medium-complexity computers pipeline instructions. This design is popular because of its economy and speed.
In a pipelined computer, instructions flow through the computer. This design has several stages. For example, it might have one stage for each step of the Von Neumann cycle. A pipelined computer usually has "pipeline registers" after each stage. These store the bits calculated by a stage so that the logic gates of the next stage can use the bits to do the next step. It is common for even numbered stages to operate on one edge of the square-wave clock, while odd-numbered stages operate on the other edge.
In a pipelined computer, the control unit arranges for the flow to start, continue, and stop as a program commands. The instruction data is usually passed in pipeline registers from one stage to the next, with a somewhat separated piece of control logic for each stage. The control unit also assures that the instruction in each stage does not harm the operation of instructions in other stages. For example, if two stages must use the same piece of data, the control logic assures that the uses are done in the correct sequence.
When operating efficiently, a pipelined computer will have an instruction in each stage. It is then working on all of those instructions at the same time. It can finish about one instruction for each cycle of its clock. When a program makes a decision, and switches to a different sequence of instructions, the pipeline sometimes must discard the data in process and restart. This is called a "stall." When two instructions could interfere, sometimes the control unit must stop processing a later instruction until an earlier instruction completes. This is called a "pipeline bubble" because a part of the pipeline is not processing instructions. Pipeline bubbles can occur when two instructions operate on the same register.
Interrupts and unexpected exceptions also stall the pipeline. If a pipelined computer abandons work for an interrupt, more work is lost than in a multicycle computer. Predictable exceptions do not need to stall. For example, if an exception instruction is used to enter the operating system, it does not cause a stall.
Speed? For the same speed of electronic logic, it can do more instructions per second than a multicycle computer. Also, even though the electronic logic has a fixed maximum speed, a pipelined computer can be made faster or slower by varying the number of stages in the pipeline. With more stages, each stage does less work, and so the stage has fewer delays from the logic gates.
Economy? A pipelined model of a computer often has the least logic gates per instruction per second, less than either a multicycle or out-of-order computer. Why? The average stage is less complex than a multicycle computer. An out of order computer usually has large amounts of idle logic at any given instant. Similar calculations usually show that a pipelined computer uses less energy per instruction.
However, a pipelined computer is usually more complex and more costly than a comparable multicycle computer. It typically has more logic gates, registers and a more complex control unit. In a like way, it might use more total energy, while using less energy per instruction. Out of order CPUs can usually do more instructions per second because they can do several instructions at once.
Control units use many methods to keep a pipeline full and avoid stalls. For example, even simple control units can assume that a backwards branch, to a lower-numbered, earlier instruction, is a loop, and will be repeated.So, a control unit with this design will always fill the pipeline with the backwards branch path. If a compiler can detect the most frequently-taken direction of a branch, the compiler can just produce instructions so that the most frequently taken branch is the preferred direction of branch. In a like way, a control unit might get hints from the compiler: Some computers have instructions that can encode hints from the compiler about the direction of branch.
Some control units do branch prediction: A control unit keeps an electronic list of the recent branches, encoded by the address of the branch instruction.This list has a few bits for each branch to remember the direction that was taken most recently.
Some control units can do speculative execution, in which a computer might have two or more pipelines, calculate both directions of a branch,then discard the calculations of the unused direction.
Results from memory can become available at unpredictable times because very fast computers cache memory. That is, they copy limited amounts of memory data into very fast memory. The CPU must be designed to process at the very fast speed of the cache memory. Therefore, the CPU might stall when it must access main memory directly. In modern PCs, main memory is as much as three hundred times slower than cache.
To help this, out-of-order CPUs and control units were developed to process data as it becomes available. (See next section)
But what if all the calculations are complete, but the CPU is still stalled, waiting for main memory? Then, a control unit can switch to an alternative thread of execution whose data has been fetched while the thread was idle. A thread has its own program counter, a stream of instructions and a separate set of registers. Designers vary the number of threads depending on current memory technologies and the type of computer. Typical computers such as PCs and smart phones usually have control units with a few threads, just enough to keep busy with affordable memory systems. Database computers often have about twice as many threads, to keep their much larger memories busy. Graphic processing units (GPUs) usually have hundreds or thousands of threads, because they have hundreds or thousands of execution units doing repetitive graphic calculations.
When a control unit permits threads, the software also has to be designed to handle them. In general-purpose CPUs like PCs and smartphones, the threads are usually made to look very like normal time-sliced processes. At most, the operating system might need some awareness of them. In GPUs, the thread scheduling usually cannot be hidden from the application software, and is often controlled with a specialized subroutine library.
A control unit can be designed to finish what it can. If several instructions can be completed at the same time, the control unit will arrange it. So, the fastest computers can process instructions in a sequence that can vary somewhat, depending on when the operands or instruction destinations become available. Most supercomputers and many PC CPUs use this method. The exact organization of this type of control unit depends on the slowest part of the computer.
When the execution of calculations is the slowest, instructions flow from memory into pieces of electronics called "issue units." An issue unit holds an instruction until both its operands and an execution unit are available. Then, the instruction and its operands are "issued" to an execution unit. The execution unit does the instruction. Then the resulting data is moved into a queue of data to be written back to memory or registers. If the computer has multiple execution units, it can usually do several instructions per clock cycle.
It is common to have specialized execution units. For example, a modestly priced computer might have only one floating-point execution unit, because floating point units are expensive. The same computer might have several integer units, because these are relatively inexpensive, and can do the bulk of instructions.
One kind of control unit for issuing uses an array of electronic logic, a "scoreboard"that detects when an instruction can be issued. The "height" of the array is the number of execution units, and the length and width are the possible sources of operands. When all the items come together, the signals from the operands and execution unit will cross. The logic at this intersection detects that the instruction can work, so the instruction is "issued" to the free execution unit. An alternative style of issuing control unit implements the Tomasulo algorithm, which reorders a hardware queue of instructions. In some sense, both styles utilize a queue. The scoreboard is an alternative way to encode and reorder a queue of instructions, and some designers call it a queue table.
With some additional logic, a scoreboard can compactly combine execution reordering, register renaming and precise exceptions and interrupts. Further it can do this without the power-hungry, complex content-addressable memory used by the Tomasulo algorithm.
If the execution is slower than writing the results, the memory write-back queue always has free entries. But what if the memory writes slowly? Or what if the destination register will be used by an "earlier" instruction that has not yet issued? Then the write-back step of the instruction might need to be scheduled. This is sometimes called "retiring" an instruction. In this case, there must be scheduling logic on the back end of execution units. It schedules access to the registers or memory that will get the results.
Retiring logic can also be designed into an issuing scoreboard or a Tomasulo queue, by including memory or register access in the issuing logic.
Out of order controllers require special design features to handle interrupts. When there are several instructions in progress, it is not clear where in the instruction stream an interrupt occurs. For input and output interrupts, almost any solution works. However when a computer has virtual memory, an interrupt occurs to indicate that a memory access failed. This memory access must be associated with an exact instruction and an exact processor state, so that the processor's state can be saved and restored by the interrupt. A usual solution preserves copies of registers until a memory access completes.
Also, out of order CPUs have even more problems with stalls from branching, because they can complete several instructions per clock cycle, and usually have many instructions in various stages of progress. So, these control units might use all of the solutions used by pipelined processors.
Some computers translate each single instruction into a sequence of simpler instructions. The advantage is that an out of order computer can be simpler in the bulk of its logic, while handling complex multi-step instructions. x86 Intel CPUs since the Pentium Pro translate complex CISC x86 instructions to more RISC-like internal micro-operations.
In these, the "front" of the control unit manages the translation of instructions. Operands are not translated. The "back" of the CU is an out-of-order CPU that issues the micro-operations and operands to the execution units and data paths.
Many modern computers have controls that minimize power usage. In battery-powered computers, such as those in cell-phones, the advantage is longer battery life. In computers with utility power, the justification is to reduce the cost of power, cooling or noise.
Most modern computers use CMOS logic. CMOS wastes power in two common ways: By changing state, i.e. "active power," and by unintended leakage. The active power of a computer can be reduced by turning off control signals. Leakage current can be reduced by reducing the electrical pressure, the voltage, making the transistors with larger depletion regions or turning off the logic completely.
Active power is easier to reduce because data stored in the logic is not affected. The usual method reduces the CPU's clock rate. Most computer systems use this method. It's common for a CPU to idle during the transition to avoid side-effects from the changing clock.
Most computers also have a "halt" instruction. This was invented to stop non-interrupt code so that interrupt code has reliable timing. However, designers soon noticed that a halt instruction was also a good time to turn off a CPU's clock completely, reducing the CPU's active power to zero. The interrupt controller might continue to need a clock, but that usually uses much less power than the CPU.
These methods are relatively easy to design, and became so common that others were invented for commercial advantage. Many modern low-power CMOS CPUs stop and start specialized execution units and bus interfaces depending on the needed instruction. Some computerseven arrange the CPU's microarchitecture to use transfer-triggered multiplexers so that each instruction only utilises the exact pieces of logic needed.
Theoretically, computers at lower clock speeds could also reduce leakage by reducing the voltage of the power supply. This affects the reliability of the computer in many ways, so the engineering is expensive, and it is uncommon except in relatively expensive computers such as PCs or cellphones.
Some designs can use very low leakage transistors, but these usually add cost. The depletion barriers of the transistors can be made larger to have less leakage, but this makes the transistor larger and thus both slower and more expensive. Some vendors use this technique in selected portions of an IC by constructing low leakage logic from large transistors that some processes provide for analog circuits. Some processes place the transistors above the surface of the silicon, in "fin fets", but these processes have more steps, so are more expensive. Special transistor doping materials (e.g. hafnium) can also reduce leakage, but this adds steps to the processing, making it more expensive. Some semiconductors have a larger band-gap than silicon. However, these materials and processes are currently (2020) more expensive than silicon.
Managing leakage is more difficult, because before the logic can be turned-off, the data in it must be moved to some type of low-leakage storage.
One common method is to spread the load to many CPUs, and turn off unused CPUs as the load reduces. The operating system's task switching logic saves the CPUs' data to memory. In some cases,one of the CPUs can be simpler and smaller, literally with fewer logic gates. So, it has low leakage, and it is the last to be turned off, and the first to be turned on. Also it then is the only CPU that requires special low-power features. A similar method is used in most PCs, which usually have an auxiliary embedded CPU that manages the power system. However in PCs, the software is usually in the BIOS, not the operating system.
Some CPUsmake use of a special type of flip-flop (to store a bit) that couples a fast, high-leakage storage cell to a slow, large (expensive) low-leakage cell. These two cells have separated power supplies. When the CPU enters a power saving mode (e.g. because of a halt that waits for an interrupt), data is transferred to the low-leakage cells, and the others are turned off. When the CPU leaves a low-leakage mode (e.g. because of an interrupt), the process is reversed.
Older designs would copy the CPU state to memory, or even disk, sometimes with specialized software. Very simple embedded systems sometimes just restart.
All modern CPUs have control logic to attach the CPU to the rest of the computer. In modern computers, this is usually a bus controller. When an instruction reads or writes memory, the control unit either controls the bus directly, or controls a bus controller. Many modern computers use the same bus interface for memory, input and output. This is called "memory-mapped I/O". To a programmer, the registers of the I/O devices appear as numbers at specific memory addresses. x86 PCs use an older method, a separate I/O bus accessed by I/O instructions.
A modern CPU also tends to include an interrupt controller. It handles interrupt signals from the system bus. The control unit is the part of the computer that responds to the interrupts.
There is often a cache controller to cache memory. The cache controller and the associated cache memory is often the largest physical part of a modern, higher-performance CPU. When the memory, bus or cache is shared with other CPUs, the control logic must communicate with them to assure that no computer ever gets out-of-date old data.
Many historic computers built some type of input and output directly into the control unit. For example, many historic computers had a front panel with switches and lights directly controlled by the control unit. These let a programmer directly enter a program and debug it. In later production computers, the most common use of a front panel was to a enter a small bootstrap program to read the operating system from disk. This was annoying. So, front panels were replaced by bootstrap programs in read-only memory.
Most PDP-8 models had a data bus designed to let I/O devices borrow the control unit's memory read and write logic.This reduced the complexity and expense of high speed I/O controllers, e.g. for disk.
The Xerox Alto had a multitasking microprogammable control unit that performed almost all I/O.This design provided most of the features of a modern PC with only a tiny fraction of the electronic logic. The dual-thread computer was run by the two lowest-priority microthreads. These performed calculations whenever I/O was not required. High priority microthreads provided (in decreasing priority) video, network, disk, a periodic timer, mouse, and keyboard. The microprogram did the complex logic of the I/O device, as well as the logic to integrate the device with the computer. For the actual hardware I/O, the microprogram read and wrote shift registers for most I/O, sometimes with resistor networks and transistors to shift output voltage levels (e.g. for video). To handle outside events, the microcontroller had microinterrupts to switch threads at the end of a thread's cycle, e.g. at the end of an instruction, or after a shift-register was accessed. The microprogram could be rewritten and reinstalled, which was very useful for a research computer.
Thus a program of instructions in memory will cause the CU to configure a CPU's data flows to manipulate the data correctly between instructions. This results in a computer that could run a complete program and require no human intervention to make hardware changes between instructions (as had to be done when using only punch cards for computations before stored programmed computers with CUs were invented).
Hardwired control units are implemented through use of combinational logic units, featuring a finite number of gates that can generate specific results based on the instructions that were used to invoke those responses. Hardwired control units are generally faster than the microprogrammed designs.
This design uses a fixed architecture—it requires changes in the wiring if the instruction set is modified or changed. It can be convienient for simple, fast computers.
A controller that uses this approach can operate at high speed; however, it has little flexibility. A complex instruction set can overwhelm a designer who uses ad-hoc logic design.
The hardwired approach has become less popular as computers have evolved. Previously, control units for CPUs used ad-hoc logic, and they were difficult to design.
The idea of microprogramming was introduced by Maurice Wilkes in 1951 as an intermediate level to execute computer program instructions. Microprograms were organized as a sequence of microinstructions and stored in special control memory. The algorithm for the microprogram control unit, unlike the hardwired control unit, is usually specified by flowchart description.The main advantage of the microprogram control unit is the simplicity of its structure. Outputs of the controller are organized in microinstructions and they can be easily replaced.
A popular variation on microcode is to debug the microcode using a software simulator. Then, the microcode is a table of bits. This is a logical truth table, that translates a microcode address into the control unit outputs. This truth table can be fed to a computer program that produces optimized electronic logic. The resulting control unit is almost as easy to design as microprogramming, but it has the fast speed and low number of logic elements of a hard wired control unit. The practical result resembles a Mealy machine or Richards controller.
A central processing unit (CPU), also called a central processor or main processor, is the electronic circuitry within a computer that executes instructions that make up a computer program. The CPU performs basic arithmetic, logic, controlling, and input/output (I/O) operations specified by the instructions. The computer industry has used the term "central processing unit" at least since the early 1960s. Traditionally, the term "CPU" refers to a processor, more specifically to its processing unit and control unit (CU), distinguishing these core elements of a computer from external components such as main memory and I/O circuitry.
A control store is the part of a CPU's control unit that stores the CPU's microprogram. It is usually accessed by a microsequencer. Early types of control store took the form of diode-arrays that were accessed via address decoders, but were later implemented as writable microcode that was stored in a form of read-only memory called a writable control store. The outputs generally had to go through a register to prevent a race condition from occurring. The register was clocked by the clock signal of the system it was running on.
Processor design is the design engineering task of creating a processor, a key component of computer hardware. It is a subfield of computer engineering and electronics engineering (fabrication). The design process involves choosing an instruction set and a certain execution paradigm and results in a microarchitecture, which might be described in e.g. VHDL or Verilog. For microprocessor design, this description is then manufactured employing some of the various semiconductor device fabrication processes, resulting in a die which is bonded onto a chip carrier. This chip carrier is then soldered onto, or inserted into a socket on, a printed circuit board (PCB).
The 8086 is a 16-bit microprocessor chip designed by Intel between early 1976 and June 8, 1978, when it was released. The Intel 8088, released July 1, 1979, is a slightly modified chip with an external 8-bit data bus, and is notable as the processor used in the original IBM PC design.
The Intel 8088 microprocessor is a variant of the Intel 8086. Introduced on June 1, 1979, the 8088 had an eight-bit external data bus instead of the 16-bit bus of the 8086. The 16-bit registers and the one megabyte address range were unchanged, however. In fact, according to the Intel documentation, the 8086 and 8088 have the same execution unit (EU)—only the bus interface unit (BIU) is different. The original IBM PC was based on the 8088, as were its clones.
Microcode is a computer hardware technique that interposes a layer of organisation between the CPU hardware and the programmer-visible instruction set architecture of the computer. As such, the microcode is a layer of hardware-level instructions that implement higher-level machine code instructions or internal state machine sequencing in many digital processing elements. Microcode is used in general-purpose central processing units, although in current desktop CPUs it is only a fallback path for cases that the faster hardwired control unit cannot handle.
A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor that can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows for more throughput than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor, but an execution resource within a single CPU such as an arithmetic logic unit.
The transputer is a series of pioneering microprocessors from the 1980s, featuring integrated memory and serial communication links, intended for parallel computing. They were designed and produced by Inmos, a semiconductor company based in Bristol, United Kingdom.
Central processing unit power dissipation or CPU power dissipation is the process in which central processing units (CPUs) consume electrical energy, and dissipate this energy in the form of heat due to the resistance in the electronic circuits.
In computer science, instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps performed by different processor units with different parts of instructions processed in parallel.
In the domain of central processing unit (CPU) design, hazards are problems with the instruction pipeline in CPU microarchitectures when the next instruction cannot execute in the following clock cycle, and can potentially lead to incorrect computation results. Three common types of hazards are data hazards, structural hazards, and control hazards.
The instruction cycle is the cycle which the central processing unit (CPU) follows from boot-up until the computer has shut down in order to process instructions. It is composed of three main stages: the fetch stage, the decode stage, and the execute stage.
In computer engineering, out-of-order execution is a paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.
A branch is an instruction in a computer program that can cause a computer to begin executing a different instruction sequence and thus deviate from its default behavior of executing instructions in order. Branch may also refer to the act of switching execution to a different instruction sequence as a result of executing a branch instruction. Branch instructions are used to implement control flow in program loops and conditionals.
In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage is often inserted between elements.
In computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as µarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.
The POWER4 is a microprocessor developed by International Business Machines (IBM) that implemented the 64-bit PowerPC and PowerPC AS instruction set architectures. Released in 2001, the POWER4 succeeded the POWER3 and RS64 microprocessors, and was used in RS/6000 and AS/400 computers, ending a separate development of PowerPC microprocessors for the AS/400. The POWER4 was a multicore microprocessor, with two cores on a single die, the first non-embedded microprocessor to do so. POWER4 Chip was first commercially available multiprocessor chip. The original POWER4 had a clock speed of 1.1 and 1.3 GHz, while an enhanced version, the POWER4+, reached a clock speed of 1.9 GHz. The PowerPC 970 is a derivative of the POWER4.
In computer architecture, a transport triggered architecture (TTA) is a kind of processor design in which programs directly control the internal transport buses of a processor. Computation happens as a side effect of data transports: writing data into a triggering port of a functional unit triggers the functional unit to start a computation. This is similar to what happens in a systolic array. Due to its modular structure, TTA is an ideal processor template for application-specific instruction-set processors (ASIP) with customized datapath but without the inflexibility and design cost of fixed function hardware accelerators.
The history of general-purpose CPUs is a continuation of the earlier history of computing hardware.
In computer engineering, computer architecture is a set of rules and methods that describe the functionality, organization, and implementation of computer systems. Some definitions of architecture define it as describing the capabilities and programming model of a computer but not a particular implementation. In other definitions computer architecture involves instruction set architecture design, microarchitecture design, logic design, and implementation.