Register file

Last updated

A register file is an array of processor registers in a central processing unit (CPU). Modern integrated circuit-based register files are usually implemented by way of fast static RAMs with multiple ports. Such RAMs are distinguished by having dedicated read and write ports, whereas ordinary multiported SRAMs will usually read and write through the same ports.

In computer architecture, a processor register is a quickly accessible location available to a computer's central processing unit (CPU). Registers usually consist of a small amount of fast storage, although some registers have specific hardware functions, and may be read-only or write-only. Registers are typically addressed by mechanisms other than main memory, but may in some cases be assigned a memory address e.g. DEC PDP-10, ICT 1900.

Central processing unit electronic circuitry within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logical, control and input/output (I/O) operations specified by the instructions

A central processing unit (CPU), also called a central processor or main processor, is the electronic circuitry within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logic, controlling, and input/output (I/O) operations specified by the instructions. The computer industry has used the term "central processing unit" at least since the early 1960s. Traditionally, the term "CPU" refers to a processor, more specifically to its processing unit and control unit (CU), distinguishing these core elements of a computer from external components such as main memory and I/O circuitry.

Integrated circuit electronic circuit manufactured by lithography; set of electronic circuits on one small flat piece (or "chip") of semiconductor material, normally silicon

An integrated circuit or monolithic integrated circuit is a set of electronic circuits on one small flat piece of semiconductor material that is normally silicon. The integration of large numbers of tiny transistors into a small chip results in circuits that are orders of magnitude smaller, faster, and less expensive than those constructed of discrete electronic components. The IC's mass production capability, reliability, and building-block approach to circuit design has ensured the rapid adoption of standardized ICs in place of designs using discrete transistors. ICs are now used in virtually all electronic equipment and have revolutionized the world of electronics. Computers, mobile phones, and other digital home appliances are now inextricable parts of the structure of modern societies, made possible by the small size and low cost of ICs.

Contents

The instruction set architecture of a CPU will almost always define a set of registers which are used to stage data between memory and the functional units on the chip. In simpler CPUs, these architectural registers correspond one-for-one to the entries in a physical register file (PRF) within the CPU. More complicated CPUs use register renaming, so that the mapping of which physical entry stores a particular architectural register changes dynamically during execution. [1] The register file is part of the architecture and visible to the programmer, as opposed to the concept of transparent caches.

An instruction set architecture (ISA) is an abstract model of a computer. It is also referred to as architecture or computer architecture. A realization of an ISA is called an implementation. An ISA permits multiple implementations that may vary in performance, physical size, and monetary cost ; because the ISA serves as the interface between software and hardware. Software that has been written for an ISA can run on different implementations of the same ISA. This has enabled binary compatibility between different generations of computers to be easily achieved, and the development of computer families. Both of these developments have helped to lower the cost of computers and to increase their applicability. For these reasons, the ISA is one of the most important abstractions in computing today.

In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. While a programmer in assembly language refers for instance to a logical register accu, the processor transposes this name to one specific physical register on the fly. The physical registers are opaque and can not be referenced directly but only via the canonical names.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have different independent caches, including instruction and data caches, where the data cache is usually organized as a hierarchy of more cache levels.

Register bank switching

Register files may be clubbed together as register banks. [2] Some processors have several register banks.

ARM processors use ARM register banks for fast interrupt request. x86 processors use context switching and fast interrupt for switching between instruction, decoder, GPRs and register files, if there is more than one, before the instruction is issued, but this is only existing on processors that support superscalar. However, context switching is a totally different mechanism to ARM's register bank within the registers.

Fast Interrupt Requests (FIQs) are a specialized type of Interrupt Request, a standard technique used in computer CPUs to deal with events which need to be processed as they occur such as receiving data from a network card, or keyboard or mouse actions. FIQs are specific to the ARM CPU architecture, which supports two types of interrupts; FIQs for fast, low latency interrupt handling and Interrupt Requests (IRQs), for more general interrupts.

x86 family of instruction set architectures

x86 is a family of instruction set architectures based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor end in "86", including the 80186, 80286, 80386 and 80486 processors.

The MODCOMP and the later 8051-compatible processors use bits in the program status word to select the currently active register bank.

MODCOMP was a small minicomputer vendor that specialized in real-time applications. They were founded in 1970 in Fort Lauderdale, Florida. In the 1970s and 1980s, they produced a line of 16 and 32-bit mini-computers. Through the 1980s, MODCOMP lost market share as more powerful micro-computers became popular, and Digital Equipment Corporation's VAX and Alpha systems continued to grow. The company successfully survives today as a systems integrator.

Implementation

Regfile array.png

The usual layout convention is that a simple array is read out vertically. That is, a single word line, which runs horizontally, causes a row of bit cells to put their data on bit lines, which run vertically. Sense amps, which convert low-swing read bitlines into full-swing logic levels, are usually at the bottom (by convention). Larger register files are then sometimes constructed by tiling mirrored and rotated simple arrays.

A bit cell is the length of tape, the area of disc surface, or the part of an integrated circuit in which a single bit is recorded. The smaller the bit cells are, the better the storage density of the medium is.

In modern computer memory, a sense amplifier is one of the elements which make up the circuitry on a semiconductor memory chip ; the term itself dates back to the era of magnetic core memory. A sense amplifier is part of the read circuitry that is used when data is read from the memory; its role is to sense the low power signals from a bitline that represents a data bit stored in a memory cell, and amplify the small voltage swing to recognizable logic levels so the data can be interpreted properly by logic outside the memory.

Register files have one word line per entry per port, one bit line per bit of width per read port, and two bit lines per bit of width per write port. Each bit cell also has a Vdd and Vss. Therefore, the wire pitch area increases as the square of the number of ports, and the transistor area increases linearly. [3] At some point, it may be smaller and/or faster to have multiple redundant register files, with smaller numbers of read ports, rather than a single register file with all the read ports. The MIPS R8000's integer unit, for example, had a 9 read 4 write port 32 entry 64-bit register file implemented in a 0.7 µm process, which could be seen when looking at the chip from arm's length.

Two popular approaches to dividing registers into multiple register files are the distributed register file configuration and the partitioned register file configuration. [3]

In principle, any operation that could be done with a 64-bit-wide register file with many read and write ports could be done with a single 8-bit-wide register file with a single read port and a single write port. However, the bit-level parallelism of wide register files with many ports allows them to run much faster and thus, they can do operations in a single cycle that would take many cycles with fewer ports or a narrower bit width or both. [1]

The width in bits of the register file is usually the number of bits in the processor word size. Occasionally it is slightly wider in order to attach "extra" bits to each register, such as the poison bit. If the width of the data word is different than the width of an address—or in some cases, such as the 68000, even when they are the same width—the address registers are in a separate register file than the data registers.

Decoder

Array

A typical register file -- "triple-ported", able to read from 2 registers and write to 1 register simultaneously -- is made of bit cells like this one. Regfile cell.png
A typical register file -- "triple-ported", able to read from 2 registers and write to 1 register simultaneously -- is made of bit cells like this one.

The basic scheme for a bit cell:

Many optimizations are possible:

Microarchitecture

Most register files make no special provision to prevent multiple write ports from writing the same entry simultaneously. Instead, the instruction scheduling hardware ensures that only one instruction in any particular cycle writes a particular entry. If multiple instructions targeting the same register are issued, all but one have their write enables turned off.

The crossed inverters take some finite time to settle after a write operation, during which a read operation will either take longer or return garbage. It is common to have bypass multiplexers that bypass written data to the read ports when a simultaneous read and write to the same entry is commanded. These bypass multiplexers are often part of a larger bypass network that forwards results which have not yet been committed between functional units.

The register file is usually pitch-matched to the datapath that it serves. Pitch matching avoids having many busses passing over the datapath turn corners, which would use a lot of area. But since every unit must have the same bit pitch, every unit in the datapath ends up with the bit pitch forced by the widest unit, which can waste area in the other units. Register files, because they have two wires per bit per write port, and because all the bit lines must contact the silicon at every bit cell, can often set the pitch of a datapath.

Area can sometimes be saved, on machines with multiple units in a datapath, by having two datapaths side-by-side, each of which has smaller bit pitch than a single datapath would have. This case usually forces multiple copies of a register file, one for each datapath.

The Alpha 21264 (EV6), for instance, was the first large micro-architecture to implement "Shadow Register File Architecture". It had two copies of the integer register file and two copies of floating point register that locate in its front end (future and scaled file, each contain 2 read and 2 write port), and took an extra cycle to propagate data between the two during context switch. The issue logic attempted to reduce the number of operations forwarding data between the two and greatly improved its integer performance and help reduce the impact of limited number of GPR in superscalar and speculative execution. The design was later adapted by SPARC, MIPS and some later x86 implementation.

The MIPS uses multiple register file as well, R8000 floating-point unit had two copies of the floating-point register file, each with four write and four read ports, and wrote both copies at the same time with context switch. However it does not support integer operation and integer register file still remain one. Later shadow register file was abandoned in newer design in favor of embedded market.

The SPARC uses "Shadow Register File Architecture" as well for its high end line, It had up to 4 copies of integer register files (future, retired, scaled, scratched, each contain 7 read 4 write port) and 2 copies of floating point register file. but unlike Alpha and x86, they are locate in back end as retire unit right after its Out of Order Unit and renaming register files and do not load instruction during instruction fetch and decoding stage and context switch is needless in this design.

IBM uses the same mechanism as many major microprocessors, deeply merging the register file with the decoder but its register file are work independently by the decoder side and do not involve context switch, which is different from Alpha and x86. most of its register file not just serve for its dedicate decoder only but up to the thread level. For example, POWER8 has up to 8 instruction decoders, but up to 32 register files of 32 general purpose registers each (4 read and 4 write port), to facilitate simultaneous multithreading, which its instruction cannot be used cross any other register file (lack of context switch.).

In the x86 processor line, a typical pre-486 CPU did not have an individual register file, as all general purpose register were directly work with its decoder, and the x87 push stack was located within the floating-point unit itself. Starting with Pentium, a typical Pentium-compatible x86 processor is integrated with one copy of the single-port architectural register file containing 8 architectural registers, 8 control registers, 8 debug registers, 8 condition code registers, 8 unnamed based register,[ clarification needed ] one instruction pointer, one flag register and 6 segment registers in one file.

One copy of 8 x87 FP push down stack by default, MMX register were virtually simulated from x87 stack and require x86 register to supplying MMX instruction and aliases to exist stack. On P6, the instruction independently can be stored and executed in parallel in early pipeline stages before decoding into micro-operations and renaming in out-of-order execution. Beginning with P6, all register files do not require additional cycle to propagate the data, register files like architectural and floating point are located between code buffer and decoders, called "retire buffer", Reorder buffer and OoOE and connected within the ring bus (16 bytes). The register file itself still remains one x86 register file and one x87 stack and both serve as retirement storing. Its x86 register file increased to dual ported to increase bandwidth for result storage. Registers like debug/condition code/control/unnamed/flag were stripped from the main register file and placed into individual files between the micro-op ROM and instruction sequencer. Only inaccessible registers like the segment register are now separated from the general-purpose register file (except the instruction pointer); they are now located between the scheduler and instruction allocator, in order to facilitate register renaming and out-of-order execution. The x87 stack was later merged with the floating-point register file after a 128-bit XMM register debuted in Pentium III, but the XMM register file is still located separately from x86 integer register files.

Later P6 implementations (Pentium M, Yonah) introduced "Shadow Register File Architecture" that expanded to 2 copies of dual ported integer architectural register file and consist with context switch (between future&retirered file and scaled file using the same trick that used between integer and floating point). It was in order to solve the register bottleneck that exist in x86 architecture after micro op fusion is introduced, but it is still have 8 entries 32 bit architectural registers for total 32 bytes in capacity per file (segment register and instruction pointer remain within the file, though they are inaccessible by program) as speculative file. The second file is served as a scaled shadow register file, which without context switch the scaled file cannot store some instruction independently. Some instruction from SSE2/SSE3/SSSE3 require this feature for integer operation, for example instruction like PSHUFB, PMADDUBSW, PHSUBW, PHSUBD, PHSUBSW, PHADDW, PHADDD, PHADDSW would require loading EAX/EBX/ECX/EDX from both of register file, though it was uncommon for x86 processor to take use of another register file with same instruction; most of time the second file is served as a scale retirered file. The Pentium M architecture still remains one dual-ported FP register file (8 entries MM/XMM) shared with three decoder and FP register does not have shadow register file with it as its Shadow Register File Architecture did not including floating point function. Processor after P6, the architectural register file are external and locate in processor's backend after retired, opposite to internal register file that are locate in inner core for register renaming/reorder buffer. However, in Core 2 it is now within a unit called "register alias table" RAT, located with instruction allocator but have same size of register size as retirement. Core 2 increased the inner ring bus to 24 bytes (allow more than 3 instructions to be decoded) and extended its register file from dual ported (one read/one write) to quad ported (two read/two write), register still remain 8 entries in 32 bit and 32 bytes (not including 6 segment register and one instruction pointer as they are unable to be access in the file by any code/instruction) in total file size and expanded to 16 entries in x64 for total 128 bytes size per file. From Pentium M as its pipeline port and decoder increased, but they're located with allocator table instead of code buffer. Its FP XMM register file are also increase to quad ported (2 read/2 write), register still remain 8 entries in 32 bit and extended to 16 entries in x64 mode and number still remain 1 as its shadow register file architecture is not including floating point/SSE functions.

In later x86 implementations, like Nehalem and later processors, both integer and floating point registers are now incorporated into a unified octa-ported (six read and two write) general-purpose register file (8 + 8 in 32-bit and 16 + 16 in x64 per file), while the register file extended to 2 with enhanced "Shadow Register File Architecture" in favorite of executing hyper threading and each thread uses independent register files for its decoder. Later Sandy bridge and onward replaced shadow register table and architectural registers with much large and yet more advance physical register file before decoding to the reorder buffer. Randered that Sandy Bridge and onward no longer carry an architectural register.

On the Atom line was the modern simplified revision of P5. It includes single copies of register file share with thread and decoder. The register file is a dual-port design, 8/16 entries GPRS, 8/16 entries debug register and 8/16 entries condition code are integrated in the same file. However it has an eight-entries 64 bit shadow based register and an eight-entries 64 bit unnamed register that are now separated from main GPRs unlike the original P5 design and located after the execution unit, and the file of these registers is single-ported and not expose to instruction like scaled shadow register file found on Core/Core2 (shadow register file are made of architectural registers and Bonnell did not due to not have "Shadow Register File Architecture"), however the file can be use for renaming purpose due to lack of out of order execution found on Bonnell architecture. It also had one copy of XMM floating point register file per thread. The difference from Nehalem is Bonnell do not have a unified register file and has no dedicated register file for its hyper threading. Instead, Bonnell uses a separate rename register for its thread despite it is not out of order. Similar to Bonnell, Larrabee and Xeon Phi also each have only one general-purpose integer register file, but the Larrabee has up to 16 XMM register files (8 entries per file), and the Xeon Phi has up to 128 AVX-512 register files, each containing 32 512-bit ZMM registers for vector instruction storage, which can be as big as L2 cache.

There are some other of Intel's x86 lines that don't have a register file in their internal design, Geode GX and Vortex86 and many embedded processors that aren't Pentium-compatible or reverse-engineered early 80x86 processors. Therefore, most of them don't have a register file for their decoders, but their GPRs are used individually. Pentium 4, on the other hand, does not have a register file for its decoder, as its x86 GPRs didn't exist within its structure, due to the introduction of a physical unified renaming register file (similar to Sandy Bridge, but slightly different due to the inability of Pentium 4 to use the register before naming) for attempting to replace the architectural register file and skip the x86 decoding scheme. Instead it uses SSE for integer execution and storage before the ALU and after result, SSE2/SSE3/SSSE3 use the same mechanism as well for its integer operation.

AMD's early design like K6 do not have a register file like Intel and do not support "Shadow Register File Architecture" as its lack of context switch and bypass inverter that are necessary require for a register file to function appropriately. Instead they use a separate GPRs that directly link to a rename register table for its OoOE CPU with a dedicated integer decoder and floating decoder. The mechanism is similar to Intel's pre-Pentium processor line. For example, the K6 processor has four int (one eight-entries temporary scratched register file + one eight-entries future register file + one eight-entries fetched register file + an eight-entries unnamed register file) and two FP rename register files (two eight-entries x87 ST file one goes fadd and one goes fmov) that directly link with its x86 EAX for integer renaming and XMM0 register for floating point renaming, but later Athlon included "shadow register" in its front end, it's scaled up to 40 entries unified register file for in order integer operation before decoded, the register file contain 8 entries scratch register + 16 future GPRs register file + 16 unnamed GPRs register file. In later AMD designs it abandons the shadow register design and favored to K6 architecture with individual GPRs direct link design. Like Phenom, it has three int register files and two SSE register files that are located in the physical register file directly linked with GPRs. However, it scales down to one integer + one floating-point on Bulldozer. Like early AMD designs, most of the x86 manufacturers like Cyrix, VIA, DM&P, and SIS used the same mechanism as well, resulting in a lack of integer performance without register renaming for their in-order CPU. Companies like Cyrix and AMD had to increase cache size in hope to reduce the bottleneck. AMD's SSE integer operation work in a different way than Core 2 and Pentium 4; it uses its separate renaming integer register to load the value directly before the decode stage. Though theoretically it will only need a shorter pipeline than Intel's SSE implementation, but generally the cost of branch prediction are much greater and higher missing rate than Intel, and it would have to take at least two cycles for its SSE instruction to be executed regardless of instruction wide, as early AMDs implementations could not execute both FP and Int in an SSE instruction set like Intel's implementation did.

Unlike Alpha, Sparc, and MIPS that only allows one register file to load/fetch one operand at the time; it would require multiple register files to achieve superscale. The ARM processor on the other hand does not integrate multiple register files to load/fetch instructions. ARM GPRs have no special purpose to the instruction set (the ARM ISA does not require accumulator, index, and stack/base points. Registers do not have an accumulator and base/stack point can only be used in thumb mode). Any GPRs can propagate and store multiple instructions independently in smaller code size that is small enough to be able to fit in one register and its architectural register act as a table and shared with all decoder/instructions with simple bank switching between decoders. The major difference between ARM and other designs is that ARM allows to run on the same general-purpose register with quick bank switching without requiring additional register file in superscalar. Despite x86 sharing the same mechanism with ARM that its GPRs can store any data individually, x86 will confront data dependency if more than three non-related instructions are stored, as its GPRs per file are too small (eight in 32 bit mode and 16 in 64 bit, compared to ARM's 13 in 32 bit and 31 in 64 bit) for data, and it is impossible to have superscalar without multiple register files to feed to its decoder (x86 code is big and complex compared to ARM). Because most x86's front-ends have become much larger and much more power hungry than the ARM processor in order to be competitive (example: Pentium M & Core 2 Duo, Bay Trail). Some third-party x86 equivalent processors even became noncompetitive with ARM due to having no dedicated register file architecture. Particularly for AMD, Cyrix and VIA that cannot bring any reasonable performance without register renaming and out of order execution, which leave only Intel Atom to be the only in-order x86 processor core in the mobile competition. This was until the x86 Nehalem processor merged both of its integer and floating point register into one single file, and the introduction of a large physical register table and enhanced allocator table in its front-end before renaming in its out-of-order internal core.

Register renaming

Processors that perform register renaming can arrange for each functional unit to write to a subset of the physical register file. This arrangement can eliminate the need for multiple write ports per bit cell, for large savings in area. The resulting register file, effectively a stack of register files with single write ports, then benefits from replication and subsetting the read ports. At the limit, this technique would place a stack of 1-write, 2-read regfiles at the inputs to each functional unit. Since regfiles with a small number of ports are often dominated by transistor area, it is best not to push this technique to this limit, but it is useful all the same.

Register windows

The SPARC ISA defines register windows, in which the 5-bit architectural names of the registers actually point into a window on a much larger register file, with hundreds of entries. Implementing multiported register files with hundreds of entries requires a large area. The register window slides by 16 registers when moved, so that each architectural register name can refer to only a small number of registers in the larger array, e.g. architectural register r20 can only refer to physical registers #20, #36, #52, #68, #84, #100, #116, if there are just seven windows in the physical file.

To save area, some SPARC implementations implement a 32-entry register file, in which each cell has seven "bits". Only one is read and writeable through the external ports, but the contents of the bits can be rotated. A rotation accomplishes in a single cycle a movement of the register window. Because most of the wires accomplishing the state movement are local, tremendous bandwidth is possible with little power.

This same technique is used in the R10000 register renaming mapping file, which stores a 6-bit virtual register number for each of the physical registers. In the renaming file, the renaming state is checkpointed whenever a branch is taken, so that when a branch is detected to be mispredicted, the old renaming state can be recovered in a single cycle. (See Register renaming.)

See also

Related Research Articles

IA-32 is the 32-bit version of the x86 instruction set architecture, designed by Intel and first implemented in the 80386 microprocessor in 1985. IA-32 is the first incarnation of x86 that supports 32-bit computing; as a result, the "IA-32" term may be used as a metonym to refer to all x86 versions that support 32-bit computing.

P5 (microarchitecture) Intel microporocessor

The first Pentium microprocessor was introduced by Intel on March 22, 1993. Dubbed P5, its microarchitecture was the fifth generation for Intel, and the first superscalar IA-32 microarchitecture. As a direct extension of the 80486 architecture, it included dual integer pipelines, a faster floating-point unit, wider data bus, separate code and data caches and features for further reduced address calculation latency. In 1996, the Pentium with MMX Technology was introduced with the same basic microarchitecture complemented with an MMX instruction set, larger caches, and some other enhancements.

MMX (instruction set) instruction set designed by Intel

MMX is a single instruction, multiple data (SIMD) instruction set designed by Intel, introduced in 1997 with its P5-based Pentium line of microprocessors, designated as "Pentium with MMX Technology". It developed out of a similar unit introduced on the Intel i860, and earlier the Intel i750 video pixel processor. MMX is a processor supplementary capability that is supported on recent IA-32 processors by Intel and other vendors.

In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series of Central processing units (CPUs) shortly after the appearance of Advanced Micro Devices (AMD's) 3DNow!. SSE contains 70 new instructions, most of which work on single precision floating point data. SIMD instructions can greatly increase performance when the exact same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.

In computer architecture, 64-bit computing is the use of processors that have datapath widths, integer size, and memory address widths of 64 bits. Also, 64-bit computer architectures for central processing units (CPUs) and arithmetic logic units (ALUs) are those that are based on processor registers, address buses, or data buses of that size. From the software perspective, 64-bit computing means the use of code with 64-bit virtual memory addresses. However, not all 64-bit instruction sets support full 64-bit virtual memory addresses; x86-64 and ARMv8, for example, support only 48 bits of virtual address, with the remaining 16 bits of the virtual address required to be all 0's or all 1's, and several 64-bit instruction sets support fewer than 64 bits of physical memory address.

The Pentium Pro is a sixth-generation x86 microprocessor developed and manufactured by Intel introduced in November 1, 1995. It introduced the P6 microarchitecture and was originally intended to replace the original Pentium in a full range of applications. While the Pentium and Pentium MMX had 3.1 and 4.5 million transistors, respectively, the Pentium Pro contained 5.5 million transistors. Later, it was reduced to a more narrow role as a server and high-end desktop processor and was used in supercomputers like ASCI Red, the first computer to reach the teraFLOPS performance mark. The Pentium Pro was capable of both dual- and quad-processor configurations. It only came in one form factor, the relatively large rectangular Socket 8. The Pentium Pro was succeeded by the Pentium II Xeon in 1998.

3DNow! is an extension to the x86 instruction set developed by Advanced Micro Devices (AMD). It adds single instruction multiple data (SIMD) instructions to the base x86 instruction set, enabling it to perform vector processing, which improves the performance of many graphic-intensive applications. The first microprocessor to implement 3DNow was the AMD K6-2, which was introduced in 1998. When the application was appropriate this raised the speed by about 2–4 times.

x86 assembly language is a family of backward-compatible assembly languages, which provide some level of compatibility all the way back to the Intel 8008 introduced in April 1972. x86 assembly languages are used to produce object code for the x86 class of processors. Like all assembly languages, it uses short mnemonics to represent the fundamental instructions that the CPU in a computer can understand and follow. Compilers sometimes produce assembly code as an intermediate step when translating a high level program into machine code. Regarded as a programming language, assembly coding is machine-specific and low level. Assembly languages are more typically used for detailed and time critical applications such as small real-time embedded systems or operating system kernels and device drivers.

x86-64 64-bit version of the x86 instruction set

x86-64 is the 64-bit version of the x86 instruction set. It introduces two new modes of operation, 64-bit mode and compatibility mode, along with a new 4-level paging mode. With 64-bit mode and the new paging mode, it supports vastly larger amounts of virtual memory and physical memory than is possible on its 32-bit predecessors, allowing programs to store larger amounts of data in memory. x86-64 also expands general-purpose registers to 64-bit, as well extends the number of them from 8 to 16, and provides numerous other enhancements. Floating point operations are supported via mandatory SSE2-like instructions, and x87/MMX style registers are generally not used ; instead, a set of 32 vector registers, 128 bits each, is used. In 64-bit mode, instructions are modified to support 64-bit operands and 64-bit addressing mode. The compatibility mode allows 16- and 32-bit user applications to run unmodified coexisting with 64-bit applications if the 64-bit operating system supports them. As the full x86 16-bit and 32-bit instruction sets remain implemented in hardware without any intervening emulation, these older executables can run with little or no performance penalty, while newer or modified applications can take advantage of new features of the processor design to achieve performance improvements. Also, a processor supporting x86-64 still powers on in real mode for full backward compatibility.

SSE2 is one of the Intel SIMD processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4 in 2000. It extends the earlier SSE instruction set, and is intended to fully replace MMX. Intel extended SSE2 to create SSE3 in 2004. SSE2 added 144 new instructions to SSE, which has 70 instructions. Competing chip-maker AMD added support for SSE2 with the introduction of their Opteron and Athlon 64 ranges of AMD64 64-bit CPUs in 2003.

SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions (PNI), is the third iteration of the SSE instruction set for the IA-32 (x86) architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU. In April 2005, AMD introduced a subset of SSE3 in revision E of their Athlon 64 CPUs. The earlier SIMD instruction sets on the x86 platform, from oldest to newest, are MMX, 3DNow!, SSE, and SSE2.

The x86 instruction set refers to the set of instructions that x86-compatible microprocessors support. The instructions are usually part of an executable program, often stored as a computer file and executed on the processor.

POWER3

The POWER3 is a microprocessor, designed and exclusively manufactured by IBM, that implemented the 64-bit version of the PowerPC instruction set architecture (ISA), including all of the optional instructions of the ISA such as instructions present in the POWER2 version of the POWER ISA but not in the PowerPC ISA. It was introduced on 5 October 1998, debuting in the RS/6000 43P Model 260, a high-end graphics workstation. The POWER3 was originally supposed to be called the PowerPC 630 but was renamed, probably to differentiate the server-oriented POWER processors it replaced from the more consumer-oriented 32-bit PowerPCs. The POWER3 was the successor of the P2SC derivative of the POWER2 and completed IBM's long-delayed transition from POWER to PowerPC, which was originally scheduled to conclude in 1995. The POWER3 was used in IBM RS/6000 servers and workstations at 200 MHz. It competed with the Digital Equipment Corporation (DEC) Alpha 21264 and the Hewlett-Packard (HP) PA-8500.

Advanced Vector Extensions are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge processor shipping in Q1 2011 and later on by AMD with the Bulldozer processor shipping in Q3 2011. AVX provides new features, new instructions and a new coding scheme.

The F16C instruction set is an x86 instruction set architecture extension which provides support for converting between half-precision and standard IEEE single-precision floating-point formats.

Goldmont Plus is a microarchitecture for low-power Atom, Celeron and Pentium Silver branded processors used in systems on a chip (SoCs) made by Intel. The Gemini Lake platform with 14 nm Goldmont Plus core was officially launched on December 11, 2017.

References