Prefetch input queue

Last updated

Fetching the instruction opcodes from program memory well in advance is known as prefetching and it is served by using a prefetch input queue (PIQ). The pre-fetched instructions are stored in a queue. The fetching of opcodes well in advance, prior to their need for execution, increases the overall efficiency of the processor boosting its speed. The processor no longer has to wait for the memory access operations for the subsequent instruction opcode to complete. This architecture was prominently used in the Intel 8086 microprocessor.

Contents

Introduction

Pipelining was brought to the forefront of computing architecture design during the 1960s due to the need for faster and more efficient computing. Pipelining is the broader concept and most modern processors load their instructions some clock cycles before they execute them. This is achieved by pre-loading machine code from memory into a prefetch input queue.

This behavior[ clarification needed ] only applies to von Neumann computers (that is, not Harvard architecture computers) that can run self-modifying code and have some sort of instruction pipelining. Nearly all modern high-performance computers fulfill these three requirements. [1]

Usually, the prefetching behavior of the PIQ is invisible to the programming model of the CPU. However, there are some circumstances where the behavior of PIQ is visible, and needs to be taken into account by the programmer.

When an x86 processor changes mode from real mode to protected mode and vice versa, the PIQ has to be flushed, or else the CPU will continue to translate the machine code as if it were written in its last mode. If the PIQ is not flushed, the processor might translate its codes wrong and generate an invalid instruction exception.

When executing self-modifying code, a change in the processor code immediately in front of the current location of execution might not change how the processor interprets the code, as it is already loaded into its PIQ. It simply executes its old copy already loaded in the PIQ instead of the new and altered version of the code in its RAM and/or cache.

This behavior of the PIQ can be used to determine if code is being executed inside an emulator or directly on the hardware of a real CPU.[ citation needed ] Most emulators will probably never simulate this behavior. If the PIQ-size is zero (changes in the code always affect the state of the processor immediately), it can be deduced that either the code is being executed in an emulator or the processor invalidates the PIQ upon writes to addresses loaded in the PIQ.

Performance evaluation based on queuing theory

It was A.K Erlang (1878-1929) who first conceived of a queue as a solution to congestion in telephone traffic. Different queueing models are proposed in order to approximately simulate the real time queuing systems so that those can be analysed mathematically for different performance specifications.

Queuing models can be represented using Kendall's notation:

A1/A2/A3/A4

where:

  1. M/M/1 Model (Single Queue Single Server/ Markovian): In this model, elements of queue are served on a first-come, first-served basis. Given the mean arrival and service rates, then actual rates vary around these average values randomly and hence have to be determined using a cumulative probability distribution function. [2]
  2. M/M/r Model: This model is a generalization of the basic M/M/1 model where multiple servers operate in parallel. This kind of model can also model scenarios with impatient users who leave the queue immediately if they are not receiving service. This can also be modeled using a Bernoulli process having only two states, success and failure. The best example of this model is our regular land-line telephone systems. [3]
  3. M/G/1 Model (Takacs' finite input Model) : This model is used to analyze advanced cases. Here the service time distribution is no longer a Markov process. This model considers the case of more than one failed machine being repaired by single repairman. Service time for any user is going to increase in this case. [4]

Generally in applications like prefetch input queue, M/M/1 Model is popularly used because of limited use of queue features. In this model in accordance with microprocessors, the user takes the role of the execution unit and server is the bus interface unit.

Instruction queue

The processor executes a program by fetching the instructions from memory and executing them. Usually the processor execution speed is much faster than the memory access speed. Instruction queue is used to prefetch the next instructions in a separate buffer while the processor is executing the current instruction.

With a four stage pipeline, the rate at which instructions are executed can be up to four times that of sequential execution. [5]

The processor usually has two separate units for fetching the instructions and for executing the instructions. [6] [7]

The implementation of a pipeline architecture is possible only if the bus interface unit and the execution unit are independent. While the execution unit is decoding or executing an instruction which does not require the use of the data and address buses, the bus interface unit fetches instruction opcodes from the memory.

This process is much faster than sending out an address, reading the opcode and then decoding and executing it. Fetching the next instruction while the current instruction is being decoded or executed is called pipelining. [8]

The 8086 processor has a six-byte prefetch instruction pipeline, while the 8088 has a four-byte prefetch. As the Execution Unit is executing the current instruction, the bus interface unit reads up to six (or four) bytes of opcodes in advance from the memory. The queue lengths were chosen based on simulation studies. [9]

An exception is encountered when the execution unit encounters a branch instruction i.e. either a jump or a call instruction. In this case, the entire queue must be dumped and the contents pointed to by the instruction pointer must be fetched from memory.

Drawbacks

Processors implementing the instruction queue prefetch algorithm are rather technically advanced. The CPU design level complexity of the such processors is much higher than for regular processors. This is primarily because of the need to implement two separate units, the BIU and EU, operating separately.

As the complexity of these chips increases, the cost also increases. These processors are relatively costlier than their counterparts without the prefetch input queue.

However, these disadvantages are greatly offset by the improvement in processor execution time. After the introduction of prefetch instruction queue in the 8086 processor, all successive processors have incorporated this feature.

x86 example code

code_starts_here:movbx,aheadmovwordptrcs:[bx],9090hahead:jmpnearto_the_end; Some other codeto_the_end:

This self-modifying program will overwrite the jmp to_the_end with two NOPs (which is encoded as 0x9090). The jump jmp near to_the_end is assembled into two bytes of machine code, so the two NOPs will just overwrite this jump and nothing else. (That is, the jump is replaced with a do-nothing-code.)

Because the machine code of the jump is already read into the PIQ, and probably also already executed by the processor (superscalar processors execute several instructions at once, but they "pretend" that they don't because of the need for backward compatibility), the change of the code will not have any change of the execution flow.

Example program to detect size

This is an example NASM-syntax self-modifying x86-assembly language algorithm that determines the size of the PIQ:

code_starts_here:xorbx,bx; zero register bxxorax,ax; zero register axmovdx,csmov[code_segment],dx; "calculate" codeseg in the far jump below (edx here too)around:cmpax,1; check if ax has been alteredjefound_size; 0x90 = opcode "nop" (NO oPeration)movbyte[nop_field+bx],0x90incbxdb0xEA; 0xEA = opcode "far jump"dwflush_queue; should be followed by offset (rm = "dw", pm = "dd")code_segment:dw0; and then the code segment (calculated above)flush_queue:; 0x40 = opcode "inc ax" (INCrease ax)movbyte[nop_field+bx],0x40nop_field:times256nopjmparoundfound_size:;;    register bx now contains the size of the PIQ;    this code is for [[real mode]] and [[16-bit protected mode]], but it could easily be changed into;    running for [[32-bit protected mode]] as well. just change the "dw" for;    the offset to "dd". you need also change dx to edx at the top as;    well. (dw and dx = 16 bit addressing, dd and edx = 32 bit addressing);

What this code does is basically that it changes the execution flow, and determines by brute force how large the PIQ is. "How far away do I have to change the code in front of me for it to affect me?" If it is too near (it is already in the PIQ) the update will not have any effect. If it is far enough, the change of the code will affect the program and the program has then found the size of the processor's PIQ. If this code is being executed under multitasking OS, the context switch may lead to the wrong value.

See also

Related Research Articles

<span class="mw-page-title-main">Assembly language</span> Low-level programming language

In computer programming, assembly language, often referred to simply as Assembly and commonly abbreviated as ASM or asm, is any low-level programming language with a very strong correspondence between the instructions in the language and the architecture's machine code instructions. Assembly language usually has one statement per machine instruction (1:1), but constants, comments, assembler directives, symbolic labels of, e.g., memory locations, registers, and macros are generally also supported.

<span class="mw-page-title-main">Intel 8086</span> 16-bit microprocessor

The 8086 is a 16-bit microprocessor chip designed by Intel between early 1976 and June 8, 1978, when it was released. The Intel 8088, released July 1, 1979, is a slightly modified chip with an external 8-bit data bus, and is notable as the processor used in the original IBM PC design.

<span class="mw-page-title-main">Intel 8088</span> Intel microprocessor model

The Intel 8088 microprocessor is a variant of the Intel 8086. Introduced on June 1, 1979, the 8088 has an eight-bit external data bus instead of the 16-bit bus of the 8086. The 16-bit registers and the one megabyte address range are unchanged, however. In fact, according to the Intel documentation, the 8086 and 8088 have the same execution unit (EU)—only the bus interface unit (BIU) is different. The original IBM PC is based on the 8088, as are its clones.

<span class="mw-page-title-main">Machine code</span> Set of instructions executed by a computer

In computer programming, machine code is computer code consisting of machine language instructions, which are used to control a computer's central processing unit (CPU). Although decimal computers were once common, the contemporary marketplace is dominated by binary computers; for those computers, machine code is "the binary representation of a computer program which is actually read and interpreted by the computer. A program in machine code consists of a sequence of machine instructions ."

x86 Family of instruction set architectures

x86 is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introduced in 1978 as a fully 16-bit extension of Intel's 8-bit 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor end in "86", including the 80186, 80286, 80386 and 80486 processors. Colloquially, their names were "186", "286", "386" and "486".

<span class="mw-page-title-main">Intel 8085</span> 8-bit microprocessor by Intel

The Intel 8085 ("eighty-eighty-five") is an 8-bit microprocessor produced by Intel and introduced in March 1976. It is software-binary compatible with the more-famous Intel 8080 with only two minor instructions added to support its added interrupt and serial input/output features. However, it requires less support circuitry, allowing simpler and less expensive microcomputer systems to be built. The "5" in the part number highlighted the fact that the 8085 uses a single +5-volt (V) power supply by using depletion-mode transistors, rather than requiring the +5 V, −5 V and +12 V supplies needed by the 8080. This capability matched that of the competing Z80, a popular 8080-derived CPU introduced the year before. These processors could be used in computers running the CP/M operating system.

x86 memory segmentation refers to the implementation of memory segmentation in the Intel x86 computer instruction set architecture. Segmentation was introduced on the Intel 8086 in 1978 as a way to allow programs to address more than 64 KB (65,536 bytes) of memory. The Intel 80286 introduced a second version of segmentation in 1982 that added support for virtual memory and memory protection. At this point the original mode was renamed to real mode, and the new version was named protected mode. The x86-64 architecture, introduced in 2003, has largely dropped support for segmentation in 64-bit mode.

In computing, protected mode, also called protected virtual address mode, is an operational mode of x86-compatible central processing units (CPUs). It allows system software to use features such as segmentation, virtual memory, paging and safe multi-tasking designed to increase an operating system's control over application software.

x86 assembly language is the name for the family of assembly languages which provide some level of backward compatibility with CPUs back to the Intel 8008 microprocessor, which was launched in April 1972. It is used to produce object code for the x86 class of processors.

In computer engineering, Halt and Catch Fire, known by the assembly mnemonic HCF, is an idiom referring to a computer machine code instruction that causes the computer's central processing unit (CPU) to cease meaningful operation, typically requiring a restart of the computer. It originally referred to a fictitious instruction in IBM System/360 computers, making a joke about its numerous non-obvious instruction mnemonics.

<span class="mw-page-title-main">General protection fault</span>

A general protection fault (GPF) in the x86 instruction set architectures (ISAs) is a fault initiated by ISA-defined protection mechanisms in response to an access violation caused by some running code, either in the kernel or a user program. The mechanism is first described in Intel manuals and datasheets for the Intel 80286 CPU, which was introduced in 1983; it is also described in section 9.8.13 in the Intel 80386 programmer's reference manual from 1986. A general protection fault is implemented as an interrupt. Some operating systems may also classify some exceptions not related to access violations, such as illegal opcode exceptions, as general protection faults, even though they have nothing to do with memory protection. If a CPU detects a protection violation, it stops executing the code and sends a GPF interrupt. In most cases, the operating system removes the failing process from the execution queue, signals the user, and continues executing other processes. If, however, the operating system fails to catch the general protection fault, i.e. another protection violation occurs before the operating system returns from the previous GPF interrupt, the CPU signals a double fault, stopping the operating system. If yet another failure occurs, the CPU is unable to recover; since 80286, the CPU enters a special halt state called "Shutdown", which can only be exited through a hardware reset. The IBM PC AT, the first PC-compatible system to contain an 80286, has hardware that detects the Shutdown state and automatically resets the CPU when it occurs. All descendants of the PC AT do the same, so in a PC, a triple fault causes an immediate system reset.

The x86 instruction set refers to the set of instructions that x86-compatible microprocessors support. The instructions are usually part of an executable program, often stored as a computer file and executed on the processor.

<span class="mw-page-title-main">Intel 8087</span> Floating-point microprocessor made by Intel

The Intel 8087, announced in 1980, was the first x87 floating-point coprocessor for the 8086 line of microprocessors.

The Program Segment Prefix (PSP) is a data structure used in DOS systems to store the state of a program. It resembles the Zero Page in the CP/M operating system. The PSP has the following structure:

<span class="mw-page-title-main">WDC 65C02</span> CMOS microprocessor in the 6502 family

The Western Design Center (WDC) 65C02 microprocessor is an enhanced CMOS version of the popular nMOS-based 8-bit MOS Technology 6502. The 65C02 fixed several problems in the original 6502 and added some new instructions, but its main feature was greatly lowered power usage, on the order of 10 to 20 times less than the original 6502 running at the same speed. The reduced power consumption made the 65C02 useful in portable computer roles and microcontroller systems in industrial settings. It has been used in some home computers, as well as in embedded applications, including medical-grade implanted devices.

In computing, the x86 memory models are a set of six different memory models of the x86 CPU operating in real mode which control how the segment registers are used and the default size of pointers.

A Trace Vector Decoder (TVD) is computer software that uses the trace facility of its underlying microprocessor to decode encrypted instruction opcodes just-in-time prior to execution and possibly re-encode them afterwards. It can be used to hinder reverse engineering when attempting to prevent software cracking as part of an overall copy protection strategy.

<span class="mw-page-title-main">TI-990</span> Series of 16-bit computers by Texas Instruments.

The TI-990 was a series of 16-bit minicomputers sold by Texas Instruments (TI) in the 1970s and 1980s. The TI-990 was a replacement for TI's earlier minicomputer systems, the TI-960 and the TI-980. It had several unique features, and was easier to program than its predecessors.

A stack register is a computer central processor register whose purpose is to keep track of a call stack. On an accumulator-based architecture machine, this may be a dedicated register. On a machine with mulitple general-purpose registers, it may be a register that is reserved by convention, such as on the IBM System/360 through z/Architecture architecture and RISC architectures, or it may be a register that procedure call and return instructions are hardwired to use, such as on the PDP-11, VAX, and Intel x86 architectures. Some designs such as the Data General Eclipse had no dedicated register, but used a reserved hardware memory address for this function.

A trap flag permits operation of a processor in single-step mode. If such a flag is available, debuggers can use it to step through the execution of a computer program.

References

  1. "ARM Information Center". ARM Technical Support Knowledge Articles.
  2. Hayes, John (1998). Computer Architecture and Organization (Second ed.). McGraw-Hill.
  3. Feller, William (1968). An Introduction to Probability theory and its applications (Second ed.). John Wiley and Sons.
  4. Papoulis, Athanasios; S.Unnikrishna Pillai (2008). Probability, Random Variables and Stochastic Processes (Fourth ed.). McGraw-Hill. pp. 784 to 800.
  5. Zaky, Safwat; V. Carl Hamacher; Zvonko G. Vranesic (1996). Computer Organization (Fourth ed.). McGraw-Hill. pp.  310–329. ISBN   0-07-114309-2.
  6. "Block diagram of 8086 CPU".
  7. Hall, Douglas (2006). Microprocessors and Interfacing. Tata McGraw-Hill. p. 2.12. ISBN   0-07-060167-4.
  8. Hall, Douglas (2006). Microprocessors and Interfacing. New Delhi: Tata McGraw-Hill. pp. 2.13–2.14. ISBN   0-07-060167-4.
  9. McKevitt, James; Bayliss, John (March 1979). "New options from big chips". IEEE Spectrum. 16 (3): 28–34. doi:10.1109/MSPEC.1979.6367944. S2CID   25154920.