Trace cache

Last updated December 27, 2024

In computer architecture, a trace cache or execution trace cache is a specialized instruction cache which stores the dynamic stream of instructions known as trace. It helps in increasing the instruction fetch bandwidth and decreasing power consumption (in the case of Intel Pentium 4) by storing traces of instructions that have already been fetched and decoded.^[1] A trace processor^[2] is an architecture designed around the trace cache and processes the instructions at trace level granularity. The formal mathematical theory of traces is described by trace monoids.

Background

The earliest academic publication of trace cache was "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching".^[1] This widely acknowledged paper was presented by Eric Rotenberg, Steve Bennett, and Jim Smith at 1996 International Symposium on Microarchitecture (MICRO) conference. An earlier publication is US patent 5381533,^[3] by Alex Peleg and Uri Weiser of Intel, "Dynamic flow instruction cache memory organized around trace segments independent of virtual address line", a continuation of an application filed in 1992, later abandoned.

Necessity

Wider superscalar processors demand multiple instructions to be fetched in a single cycle for higher performance. Instructions to be fetched are not always in contiguous memory locations (basic blocks) because of branch and jump instructions. So processors need additional logic and hardware support to fetch and align such instructions from non-contiguous basic blocks. If multiple branches are predicted as not-taken, then processors can fetch instructions from multiple contiguous basic blocks in a single cycle. However, if any of the branches is predicted as taken, then processor should fetch instructions from the taken path in that same cycle. This limits the fetch capability of a processor.

Basic blocks of a simple if-else loop BasicBlocks.png — Basic blocks of a simple if-else loop

Consider these four basic blocks (A, B, C, D) as shown in the figure that correspond to a simple if-else loop. These blocks will be stored contiguously as ABCD in the memory. If the branch D is predicted not-taken, the fetch unit can fetch the basic blocks A, B, C which are placed contiguously. However, if D is predicted taken, the fetch unit has to fetch A,B,D which are non-contiguously placed. Hence, fetching these blocks which are non contiguously placed, in a single cycle will be very difficult. So, in situations like these, the trace cache comes in aid to the processor.

Once fetched, the trace cache stores the instructions in their dynamic sequence. When these instructions are encountered again, the trace cache allows the instruction fetch unit of a processor to fetch several basic blocks from it without having to worry about branches in the execution flow. Instructions will be stored in the trace cache either after they have been decoded, or as they are retired. However, instruction sequence is speculative if they are stored just after decode stage.

Trace structure

A trace, also called a dynamic instruction sequence, is an entry in the trace cache. It can be characterized by maximum number of instructions and maximum basic blocks. Traces can start at any dynamic instruction. Multiple traces can have same starting instruction i.e., same starting program counter (PC) and instructions from different basic blocks as per the branch outcomes. For the figure above, ABC and ABD are valid traces. They both start at the same PC (address of A) and have different basic blocks as per D's prediction.

Traces usually terminate when one of the following occurs:

Trace got filled with allowable maximum number of instructions
Trace has allowable maximum basic blocks
Return instructions
Indirect branches
System calls

Trace control information

A single trace will have following information:

Starting PC - PC of the first instruction in trace
Branch flag - ( maximum basic blocks -1) branch predictions
Branch mask - number of branches in the trace and whether trace ends in a branch or not
Trace fall through - Next PC if last instruction is not-taken branch or not a branch
Trace target - address of last branch's taken target

Trace cache design

Following are the factors that need to be considered while designing a trace cache.

Trace selection policies - maximum number of instructions and maximum basic blocks in a trace
Associativity - number of ways a cache can have
Cache indexing method - concatenation or XOR with PC bits
Path associativity - traces with same starting PC but with different basic blocks can be mapped to different sets
Trace cache fill choices -
1. After decode stage (speculative)
2. After retire stage

A trace cache is not on the critical path of instruction fetch^[4]

Hit/miss logic

Trace lines are stored in the trace cache based on the PC of the first instruction in the trace and a set of branch predictions. This allows for storing different trace paths that start on the same address, each representing different branch outcomes. This method of tagging helps to provide path associativity to the trace cache. Other method can include having only starting PC as tag in trace cache. In the instruction fetch stage of a pipeline, the current PC along with a set of branch predictions is checked in the trace cache for a hit. If there is a hit, a trace line is supplied to fetch unit which does not have to go to a regular cache or to memory for these instructions. The trace cache continues to feed the fetch unit until the trace line ends or until there is a misprediction in the pipeline. If there is a miss, a new trace starts to be built.

The Pentium 4's execution trace cache stores micro-operations resulting from decoding x86 instructions, providing also the functionality of a micro-operation cache. Having this, the next time an instruction is needed, it does not have to be decoded into micro-ops again.^[5]

Disadvantages

The disadvantages of trace cache are:

Redundant instruction storage between trace cache and instruction cache and within trace cache itself.^[6]
Power inefficiency and hardware complexity^[4]

Execution trace cache

Within the L1 cache of the NetBurst CPUs, Intel incorporated its execution trace cache.^[7]^[8] It stores decoded micro-operations, so that when executing a new instruction, instead of fetching and decoding the instruction again, the CPU directly accesses the decoded micro-ops from the trace cache, thereby saving considerable time. Moreover, the micro-ops are cached in their predicted path of execution, which means that when instructions are fetched by the CPU from the cache, they are already present in the correct order of execution. Intel later introduced a similar but simpler concept with Sandy Bridge called micro-operation cache (UOP cache).

Related Research Articles

In processor design, microcode serves as an intermediary layer situated between the central processing unit (CPU) hardware and the programmer-visible instruction set architecture of a computer, also known as its machine code. It consists of a set of hardware-level instructions that implement the higher-level machine code instructions or control internal finite-state machine sequencing in many digital processing components. While microcode is utilized in Intel and AMD general-purpose CPUs in contemporary desktops and laptops, it functions only as a fallback path for scenarios that the faster hardwired control unit is unable to manage.

The Pentium is a microprocessor introduced by Intel on March 22, 1993. It is the first CPU using the Pentium brand. Considered the fifth generation in the x86 (8086) compatible line of processors, succeeding the i486, its implementation and microarchitecture was internally called P5.

The Pentium Pro is a sixth-generation x86 microprocessor developed and manufactured by Intel and introduced on November 1, 1995. It introduced the P6 microarchitecture and was originally intended to replace the original Pentium in a full range of applications. Later, it was reduced to a more narrow role as a server and high-end desktop processor. The Pentium Pro was also used in supercomputers, most notably ASCI Red, which used two Pentium Pro CPUs on each computing node and was the first computer to reach over one teraFLOPS in 1996, holding the number one spot in the TOP500 list from 1997 to 2000.

In computer engineering, instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps performed by different processor units with different parts of instructions processed in parallel.

In the history of computer hardware, some early reduced instruction set computer central processing units used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: MIPS, SPARC, Motorola 88000, and later the notional CPU DLX invented for education.

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch will go before this is known definitively. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high performance in many modern pipelined microprocessor architectures.

The NetBurst microarchitecture, called P68 inside Intel, was the successor to the P6 microarchitecture in the x86 family of central processing units (CPUs) made by Intel. The first CPU to use this architecture was the Willamette-core Pentium 4, released on November 20, 2000 and the first of the Pentium 4 CPUs; all subsequent Pentium 4 and Pentium D variants have also been based on NetBurst. In mid-2001, Intel released the Foster core, which was also based on NetBurst, thus switching the Xeon CPUs to the new architecture as well. Pentium 4-based Celeron CPUs also use the NetBurst architecture.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.

In computer engineering, out-of-order execution is a paradigm used in high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

In electronics, computer science and computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as μarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

The P6 microarchitecture is the sixth-generation Intel x86 microarchitecture, implemented by the Pentium Pro microprocessor that was introduced in November 1995. It is frequently referred to as i686. It was planned to be succeeded by the NetBurst microarchitecture used by the Pentium 4 in 2000, but was revived for the Pentium M line of microprocessors. The successor to the Pentium M variant of the P6 microarchitecture is the Core microarchitecture which in turn is also derived from P6.

<span class="mw-page-title-main">Micro-operation</span> Low-level instructions used in some designs to implement complex machine instructions

In computer central processing units, micro-operations are detailed low-level instructions used in some designs to implement complex machine instructions.

Sandy Bridge is the codename for Intel's 32 nm microarchitecture used in the second generation of the Intel Core processors. The Sandy Bridge microarchitecture is the successor to Nehalem and Westmere microarchitecture. Intel demonstrated an A1 stepping Sandy Bridge processor in 2009 during Intel Developer Forum (IDF), and released first products based on the architecture in January 2011 under the Core brand.

Goldmont is a microarchitecture for low-power Atom, Celeron and Pentium branded processors used in systems on a chip (SoCs) made by Intel. They allow only one thread per core.

Goldmont Plus is a microarchitecture for low-power Celeron and Pentium Silver branded processors used in systems on a chip (SoCs) made by Intel. The Gemini Lake platform with 14 nm Goldmont Plus core was officially launched on December 11, 2017. Intel launched the Gemini Lake Refresh platform on November 4, 2019.

Intel microcode is microcode that runs inside x86 processors made by Intel. Since the P6 microarchitecture introduced in the mid-1990s, the microcode programs can be patched by the operating system or BIOS firmware to work around bugs found in the CPU after release. Intel had originally designed microcode updates for processor debugging under its design for testing (DFT) initiative.

Tremont is a microarchitecture for low-power Atom, Celeron and Pentium Silver branded processors used in systems on a chip (SoCs) made by Intel. It is the successor to Goldmont Plus. Intel officially launched Elkhart Lake platform with 10 nm Tremont core on September 23, 2020. Intel officially launched Jasper Lake platform with 10 nm Tremont core on January 11, 2021.

Golden Cove is a codename for a CPU microarchitecture developed by Intel and released in November 2021. It succeeds four microarchitectures: Sunny Cove, Skylake, Willow Cove, and Cypress Cove. It is fabricated using Intel's Intel 7 process node, previously referred to as 10 nm Enhanced SuperFin (10ESF).

Lion Cove is a 64-bit, two-way, x86 CPU core architecture designed by Intel. The Lion Cove core is featured in Core Ultra Series 2 Arrow Lake and Lunar Lake processors.

References

1 2 Rotenberg, Eric; Bennett, Steve; Smith, James E.; Rotenberg, Eric (1996-01-01). "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching". In Proceedings of the 29th International Symposium on Microarchitecture: 24–34.
↑ Eric Rotenberg, Quinn Jacobson, Yiannakis Sazeides, and James E. Smith. Trace Processors. Proceedings of the30th IEEE/ACM International Symposium on Microarchitecture (MICRO-30), pp. 138-148, December 1997
↑ Peleg, Alexander; Weiser, Uri (Jan 10, 1995), Dynamic flow instruction cache memory organized around trace segments independent of virtual address line , retrieved 2016-10-18
1 2 Leon Gu; Dipti Motiani (October 2003). "Trace Cache" (PDF). Retrieved2013-10-06.
↑ Agner Fog (2014-02-19). "The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers" (PDF). agner.org. Retrieved 2014-03-21.
↑ Co, Michele. "Trace Cache". www.cs.virginia.edu. Retrieved 2016-10-21.^{[ dead link ‍]}
↑ Carmean, Doug (Spring 2002). "The Intel® Pentium® 4 Processor" (PDF). Archived from the original (PDF) on 19 April 2018.^{[ unreliable source? ]}
↑ "X-bit labs - Print version". www.xbitlabs.com. Archived from the original on 6 March 2016. Retrieved 12 January 2022.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[:0-1] 1 2 Rotenberg, Eric; Bennett, Steve; Smith, James E.; Rotenberg, Eric (1996-01-01). "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching". In Proceedings of the 29th International Symposium on Microarchitecture: 24–34.

[2] Eric Rotenberg, Quinn Jacobson, Yiannakis Sazeides, and James E. Smith. Trace Processors. Proceedings of the30th IEEE/ACM International Symposium on Microarchitecture (MICRO-30), pp. 138-148, December 1997

[3] Peleg, Alexander; Weiser, Uri (Jan 10, 1995), Dynamic flow instruction cache memory organized around trace segments independent of virtual address line , retrieved 2016-10-18

[:1-4] 1 2 Leon Gu; Dipti Motiani (October 2003). "Trace Cache" (PDF). Retrieved2013-10-06.

[5] Agner Fog (2014-02-19). "The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers" (PDF). agner.org. Retrieved 2014-03-21.

[6] Co, Michele. "Trace Cache". www.cs.virginia.edu. Retrieved 2016-10-21.^{[ dead link ‍]}

[7] Carmean, Doug (Spring 2002). "The Intel® Pentium® 4 Processor" (PDF). Archived from the original (PDF) on 19 April 2018.^{[ unreliable source? ]}

[8] "X-bit labs - Print version". www.xbitlabs.com. Archived from the original on 6 March 2016. Retrieved 12 January 2022.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]