Translation lookaside buffer

A translation lookaside buffer (TLB) is a memory cache that stores recent translations of virtual memory addresses to physical memory addresses. It is used to reduce the time taken to access a user memory location. [1] It is sometimes called an address-translation cache, and it is part of the chip's memory-management unit (MMU). A TLB may reside between the CPU and the CPU cache, between the CPU cache and main memory, or between the different levels of a multi-level cache. The majority of desktop, laptop, and server processors include one or more TLBs in the memory-management hardware, and a TLB is nearly always present in any processor that uses paged or segmented virtual memory.

The TLB is sometimes implemented as content-addressable memory (CAM). The CAM search key is the virtual address, and the search result is a physical address. If the requested address is present in the TLB, the CAM search yields a match quickly and the retrieved physical address can be used to access memory. This is called a TLB hit. If the requested address is not in the TLB, it is a miss, and the translation proceeds by looking up the page table in a process called a page walk. The page walk is time-consuming when compared to the processor speed, as it involves reading the contents of multiple memory locations and using them to compute the physical address. After the physical address is determined by the page walk, the virtual address to physical address mapping is entered into the TLB. The PowerPC 604, for example, has a two-way set-associative TLB for data loads and stores. [2] Some processors have different instruction and data address TLBs.

Overview

General working of TLB

A TLB has a fixed number of slots containing page-table entries and segment-table entries; page-table entries map virtual addresses to physical addresses and intermediate-table addresses, while segment-table entries map virtual addresses to segment addresses, intermediate-table addresses and page-table addresses. The virtual memory is the memory space as seen from a process; this space is often split into pages of a fixed size (in paged memory), or less commonly into segments of variable sizes (in segmented memory). The page table, generally stored in main memory, keeps track of where the virtual pages are stored in the physical memory. This method uses two memory accesses (one for the page-table entry, one for the byte) to access a byte. First, the page table is looked up for the frame number. Second, the frame number with the page offset gives the actual address. Thus, any straightforward virtual memory scheme would have the effect of doubling the memory access time. Hence, the TLB is used to reduce the time taken to access the memory locations in the page-table method. The TLB is a cache of the page table, representing only a subset of the page-table contents.
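The two-step lookup described above can be sketched as follows. This is a minimal illustration, assuming 4 KiB pages and a flat, single-level page table held in a dictionary; the names `split_address`, `translate`, and the sample mapping are hypothetical and not taken from any particular system.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages
OFFSET_BITS = 12   # log2(PAGE_SIZE)

def split_address(vaddr):
    """Split a virtual address into (virtual page number, page offset)."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_SIZE - 1)

def translate(vaddr, page_table):
    """Translate a virtual address using a flat page table (vpn -> frame)."""
    vpn, offset = split_address(vaddr)
    frame = page_table[vpn]                 # first access: read the page-table entry
    return (frame << OFFSET_BITS) | offset  # address used for the second access

# Hypothetical mapping: virtual page 2 is stored in physical frame 7.
page_table = {2: 7}
paddr = translate(0x2ABC, page_table)
print(hex(paddr))  # 0x7abc
```

The page offset passes through unchanged; only the page number is replaced by the frame number, which is why a TLB only needs to cache page-number-to-frame-number pairs.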

A TLB may reside between the CPU and the CPU cache, between the CPU cache and primary storage memory, or between levels of a multi-level cache. The placement determines whether the cache uses physical or virtual addressing. If the cache is virtually addressed, requests are sent directly from the CPU to the cache, and the TLB is accessed only on a cache miss. If the cache is physically addressed, the CPU does a TLB lookup on every memory operation, and the resulting physical address is sent to the cache.

In a Harvard architecture or modified Harvard architecture, a separate virtual address space or memory-access hardware may exist for instructions and data. This can lead to distinct TLBs for each access type, an instruction translation lookaside buffer (ITLB) and a data translation lookaside buffer (DTLB). Various benefits have been demonstrated with separate data and instruction TLBs. [4]

The TLB can be used as a fast lookup hardware cache. The figure shows the working of a TLB. Each entry in the TLB consists of two parts: a tag and a value. If the tag of the incoming virtual address matches the tag in the TLB, the corresponding value is returned. Since the TLB lookup is usually a part of the instruction pipeline, searches are fast and cause essentially no performance penalty. However, to be able to search within the instruction pipeline, the TLB has to be small.

A common optimization for physically addressed caches is to perform the TLB lookup in parallel with the cache access. Upon each virtual memory reference, the hardware checks the TLB to see whether the page number is held therein. If it is, the reference is a TLB hit, and the translation is made: the frame number is returned and is used to access the memory. If the page number is not in the TLB, the page table must be checked. Depending on the CPU, this can be done automatically in hardware or by raising an interrupt to the operating system. When the frame number is obtained, it can be used to access the memory. In addition, the page number and frame number are added to the TLB, so that they will be found quickly on the next reference. If the TLB is already full, a suitable entry must be selected for replacement. Replacement policies include least recently used (LRU) and first in, first out (FIFO); see the address translation section in the cache article for more details about virtual addressing as it pertains to caches and TLBs.
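The hit, miss, and replacement steps can be modelled in a few lines. The sketch below assumes an LRU policy and uses a dictionary lookup as a stand-in for the page walk; the class name `LRUTLB` and its fields are hypothetical, and real hardware typically approximates LRU rather than tracking exact recency.

```python
from collections import OrderedDict

class LRUTLB:
    """Toy TLB: maps page numbers to frame numbers with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)     # mark as most recently used
            return self.entries[vpn]
        self.misses += 1
        frame = page_table[vpn]               # stand-in for the page walk
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        self.entries[vpn] = frame
        return frame

page_table = {n: n + 100 for n in range(8)}  # hypothetical mapping
tlb = LRUTLB(capacity=2)
for vpn in [0, 1, 0, 2, 1]:
    tlb.lookup(vpn, page_table)
print(tlb.hits, tlb.misses)  # 1 4
```

Swapping `move_to_end` out (so that a hit does not refresh an entry) would turn the same structure into a FIFO policy.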

Performance implications

Flowchart shows the working of a translation lookaside buffer. For simplicity, the page-fault routine is not mentioned.

The CPU has to access main memory for an instruction-cache miss, data-cache miss, or TLB miss. The third case (the simplest one) is where the desired information itself actually is in a cache, but the information for virtual-to-physical translation is not in a TLB. These are all slow, due to the need to access a slower level of the memory hierarchy, so a well-functioning TLB is important. Indeed, a TLB miss can be more expensive than an instruction or data cache miss, due to the need for not just a load from main memory, but a page walk, requiring several memory accesses.

The flowchart provided explains the working of a TLB. On a TLB miss, the CPU checks the page table for the page-table entry. If the present bit is set, then the page is in main memory, and the processor can retrieve the frame number from the page-table entry to form the physical address. [6] The processor also updates the TLB to include the new page-table entry. If the present bit is not set, then the desired page is not in main memory, and a page fault is raised, which invokes the page-fault handling routine.
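The miss path in the flowchart can be sketched as below. This is an illustrative model only: the page table is assumed to map a page number to a `(present, frame)` pair, the TLB is a plain dictionary, and `PageFault` and `handle_tlb_miss` are hypothetical names.

```python
class PageFault(Exception):
    """Raised when the requested page is not present in main memory."""

def handle_tlb_miss(vpn, page_table, tlb):
    """Consult the page table after a TLB miss and refill the TLB."""
    present, frame = page_table.get(vpn, (False, None))
    if not present:
        raise PageFault(vpn)  # the page-fault routine would take over here
    tlb[vpn] = frame          # cache the translation for the next reference
    return frame

# Hypothetical page table: page 0 is resident in frame 5, page 1 is not resident.
page_table = {0: (True, 5), 1: (False, None)}
tlb = {}
handle_tlb_miss(0, page_table, tlb)
try:
    handle_tlb_miss(1, page_table, tlb)
    faulted = False
except PageFault:
    faulted = True
print(tlb, faulted)  # {0: 5} True
```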

If the page working set does not fit into the TLB, then TLB thrashing occurs, where frequent TLB misses occur, with each newly cached page displacing one that will soon be used again, degrading performance in exactly the same way as thrashing of the instruction or data cache does. TLB thrashing can occur even if instruction-cache or data-cache thrashing are not occurring, because these are cached in different-size units. Instructions and data are cached in small blocks (cache lines), not entire pages, but address lookup is done at the page level. Thus, even if the code and data working sets fit into cache, if the working sets are fragmented across many pages, the virtual-address working set may not fit into TLB, causing TLB thrashing. Appropriate sizing of the TLB thus requires considering not only the size of the corresponding instruction and data caches, but also how these are fragmented across multiple pages.
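A small simulation makes the effect concrete. The sketch below assumes 4 KiB pages, an 8-entry TLB with LRU replacement, and two hypothetical access patterns that each touch sixteen 64-byte locations per pass: one packed into a single page, the other spread across sixteen pages.

```python
from collections import OrderedDict

def count_tlb_misses(addresses, tlb_entries, page_size=4096):
    """Count TLB misses for an address trace, assuming LRU replacement."""
    tlb = OrderedDict()
    misses = 0
    for addr in addresses:
        vpn = addr // page_size
        if vpn in tlb:
            tlb.move_to_end(vpn)      # hit: refresh recency
            continue
        misses += 1
        if len(tlb) >= tlb_entries:
            tlb.popitem(last=False)   # evict the least recently used page
        tlb[vpn] = True
    return misses

# Both traces touch the same number of locations, repeated for 10 passes.
dense = [i * 64 for i in range(16)] * 10     # all locations within one page
sparse = [i * 4096 for i in range(16)] * 10  # one location on each of 16 pages

print(count_tlb_misses(dense, tlb_entries=8))   # 1
print(count_tlb_misses(sparse, tlb_entries=8))  # 160
```

In the sparse trace every access misses, because the sixteen pages cycle through an 8-entry TLB and each page is evicted before it is referenced again, even though the data touched is the same size in both cases.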

Multiple TLBs

Similar to caches, TLBs may have multiple levels. CPUs can be (and nowadays usually are) built with multiple TLBs, for example a small L1 TLB (potentially fully associative) that is extremely fast, and a larger L2 TLB that is somewhat slower. When instruction-TLB (ITLB) and data-TLB (DTLB) are used, a CPU can have three (ITLB1, DTLB1, TLB2) or four TLBs.
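A two-level lookup can be sketched as follows. The sizes, the FIFO eviction, and the choice to fill both levels after a page walk are simplifying assumptions made for illustration; the function names `fill` and `lookup` are hypothetical, and real designs differ in how the levels are filled and evicted.

```python
def fill(level, capacity, vpn, frame):
    """Insert into a dict-based TLB level, evicting the oldest entry if full."""
    if vpn not in level and len(level) >= capacity:
        del level[next(iter(level))]  # dicts keep insertion order: oldest first
    level[vpn] = frame

def lookup(vpn, l1, l2, page_table, l1_size=2, l2_size=8):
    """Return (frame, outcome), trying the L1 TLB, then L2, then the page table."""
    if vpn in l1:
        return l1[vpn], "L1 hit"
    if vpn in l2:
        fill(l1, l1_size, vpn, l2[vpn])  # promote the entry into the L1 TLB
        return l2[vpn], "L2 hit"
    frame = page_table[vpn]              # stand-in for the page walk
    fill(l2, l2_size, vpn, frame)
    fill(l1, l1_size, vpn, frame)
    return frame, "page walk"

page_table = {n: n + 100 for n in range(16)}  # hypothetical mapping
l1, l2 = {}, {}
outcomes = [lookup(v, l1, l2, page_table)[1] for v in [0, 1, 2, 0, 0]]
print(outcomes)
# ['page walk', 'page walk', 'page walk', 'L2 hit', 'L1 hit']
```

The fourth reference shows the benefit of the larger second level: page 0 has already been pushed out of the two-entry L1 TLB, but it is still found in L2 without a page walk.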

For instance, Intel's Nehalem microarchitecture has a four-way set associative L1 DTLB with 64 entries for 4 KiB pages and 32 entries for 2/4 MiB pages, an L1 ITLB with 128 entries for 4 KiB pages using four-way associativity and 14 fully associative entries for 2/4 MiB pages (both parts of the ITLB divided statically between two threads) [7] and a unified 512-entry L2 TLB for 4 KiB pages, [8] both 4-way associative. [9]

Some TLBs may have separate sections for small pages and huge pages. For example, Intel Skylake microarchitecture separates the TLB entries for 1 GiB pages from those for 4 KiB/2 MiB pages. [10]

TLB-miss handling

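Two schemes for handling TLB misses are commonly found in modern architectures: hardware TLB management, in which the CPU walks the page table itself, and software TLB management, in which a TLB miss raises an exception that the operating system handles.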

The MIPS architecture specifies a software-managed TLB. [12]

The SPARC V9 architecture allows an implementation of SPARC V9 to have no MMU, an MMU with a software-managed TLB, or an MMU with a hardware-managed TLB, [13] and the UltraSPARC Architecture 2005 specifies a software-managed TLB. [14]

The Itanium architecture provides an option of using either software- or hardware-managed TLBs. [15]

The Alpha architecture's TLB is managed in PALcode, rather than in the operating system. As the PALcode for a processor can be processor-specific and operating-system-specific, this allows different versions of PALcode to implement different page-table formats for different operating systems, without requiring that the TLB format, and the instructions to control the TLB, be specified by the architecture. [16]

Typical TLB

These are typical performance levels of a TLB: [17]

The average effective memory cycle rate is defined as $m + (1 - p)h + pm$ cycles, where $m$ is the number of cycles required for a memory read, $p$ is the miss rate, and $h$ is the hit time in cycles. If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, a memory read takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate is an average of $30 + 0.99 \times 1 + 0.01 \times 30$ (31.29 clock cycles per memory access). [18]
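The worked figure can be checked numerically. The sketch below assumes the rate takes the form m + (1 - p)h + pm, with m the memory-read cycles, p the miss rate, and h the hit time; this form reproduces the 31.29-cycle result for the values given. The function name `effective_cycles` is hypothetical.

```python
def effective_cycles(m, p, h):
    """Average cycles per memory access: m + (1 - p) * h + p * m.

    m: cycles for a memory read, p: TLB miss rate, h: TLB hit time in cycles.
    """
    return m + (1 - p) * h + p * m

print(round(effective_cycles(m=30, p=0.01, h=1), 2))  # 31.29
```

Under the same assumptions, raising the miss rate to 10% gives about 33.9 cycles, which shows how directly the miss rate feeds into average access time.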

Address-space switch

On an address-space switch, as occurs when context switching between processes (but not between threads), some TLB entries can become invalid, since the virtual-to-physical mapping is different. The simplest strategy to deal with this is to completely flush the TLB. This means that after a switch, the TLB is empty, and any memory reference will be a miss, so it will be some time before things are running back at full speed. Newer CPUs use more effective strategies marking which process an entry is for. This means that if a second process runs for only a short time and jumps back to a first process, the TLB may still have valid entries, saving the time to reload them. [19]

Other strategies avoid flushing the TLB on a context switch: (a) A single address space operating system uses the same virtual-to-physical mapping for all processes. (b) Some CPUs have a process ID register, and the hardware uses TLB entries only if they match the current process ID.

For example, in the Alpha 21264, each TLB entry is tagged with an address space number (ASN), and only TLB entries with an ASN matching the current task are considered valid. Another example is the Intel Pentium Pro, in which the page global enable (PGE) flag in the register CR4 and the global (G) flag of a page-directory or page-table entry can be used to prevent frequently used pages from being automatically invalidated in the TLBs on a task switch or a load of register CR3. Since the 2010 Westmere microarchitecture, Intel 64 processors also support 12-bit process-context identifiers (PCIDs), which allow retaining TLB entries for multiple linear-address spaces, with only those that match the current PCID being used for address translation. [20] [21]
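The tagging idea can be illustrated with a toy model in which each entry is keyed by an address-space identifier as well as the page number. The class `TaggedTLB` and its methods are hypothetical and do not model any specific processor's ASN or PCID mechanism.

```python
class TaggedTLB:
    """Toy TLB whose entries are tagged with an address-space identifier."""

    def __init__(self):
        self.entries = {}      # (asid, vpn) -> frame
        self.current_asid = 0

    def switch_to(self, asid):
        """Address-space switch: change the active tag without flushing."""
        self.current_asid = asid

    def insert(self, vpn, frame):
        self.entries[(self.current_asid, vpn)] = frame

    def lookup(self, vpn):
        """Return the frame for vpn in the current address space, or None on a miss."""
        return self.entries.get((self.current_asid, vpn))

tlb = TaggedTLB()
tlb.switch_to(1)
tlb.insert(0x10, 7)        # process 1 maps page 0x10 to frame 7
tlb.switch_to(2)
miss = tlb.lookup(0x10)    # process 2 cannot use process 1's entry
tlb.switch_to(1)
hit = tlb.lookup(0x10)     # the entry survived the round trip
print(miss, hit)  # None 7
```

Because the tag is part of the match, entries from another address space are simply ignored rather than flushed, so a process that is switched back in soon afterwards can still hit on its earlier translations.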

While selective flushing of the TLB is an option in software-managed TLBs, the only option in some hardware TLBs (for example, the TLB in the Intel 80386) is the complete flushing of the TLB on an address-space switch. Other hardware TLBs (for example, the TLB in the Intel 80486 and later x86 processors, and the TLB in ARM processors) allow the flushing of individual entries from the TLB indexed by virtual address.

Flushing of the TLB can be an important security mechanism for memory isolation between processes to ensure a process can't access data stored in memory pages of another process. Memory isolation is especially critical during switches between the privileged operating system kernel process and the user processes – as was highlighted by the Meltdown security vulnerability. Mitigation strategies such as kernel page-table isolation (KPTI) rely heavily on performance-impacting TLB flushes and benefit greatly from hardware-enabled selective TLB entry management such as PCID. [22]

Virtualization and x86 TLB

With the advent of virtualization for server consolidation, a lot of effort has gone into making the x86 architecture easier to virtualize and to ensure better performance of virtual machines on x86 hardware. [23] [24]

Normally, entries in the x86 TLBs are not associated with a particular address space; they implicitly refer to the current address space. Hence, every time there is a change in address space, such as a context switch, the entire TLB has to be flushed. Maintaining a tag that associates each TLB entry with an address space in software and comparing this tag during TLB lookup and TLB flush is very expensive, especially since the x86 TLB is designed to operate with very low latency and completely in hardware. In 2008, both Intel (Nehalem) [25] and AMD (SVM) [26] introduced tags as part of the TLB entry and dedicated hardware that checks the tag during lookup. Not all operating systems made full use of these tags immediately, but Linux 4.14 started using them to identify recently used address spaces, since the 12-bit PCIDs (4096 different values) are insufficient for all tasks running on a given CPU. [27]

References

  1. Arpaci-Dusseau, Remzi H.; Arpaci-Dusseau, Andrea C. (2014), Operating Systems: Three Easy Pieces [Chapter: Faster Translations (TLBs)] (PDF), Arpaci-Dusseau Books
  2. S. Peter Song; Marvin Denman; Joe Chang (October 1994). "The PowerPC 604 RISC Microprocessor" (PDF). IEEE Micro. 14 (5): 13–14. doi:10.1109/MM.1994.363071. S2CID 11603864. Archived from the original (PDF) on 1 June 2016.
  3. Silberschatz, Abraham; Galvin, Peter B.; Gagne, Greg (2009). Operating System Concepts. United States of America: John Wiley & Sons, Inc. ISBN 978-0-470-12872-5.
  4. Chen, J. Bradley; Borg, Anita; Jouppi, Norman P. (1992). "A Simulation Based Study of TLB Performance". SIGARCH Computer Architecture News. 20 (2): 114–123. doi:10.1145/146628.139708.
  5. Stallings, William (2014). Operating Systems: Internals and Design Principles. United States of America: Pearson. ISBN 978-0133805918.
  6. Solihin, Yan (2016). Fundamentals of Parallel Multicore Architecture. Boca Raton, FL: Taylor & Francis Group. ISBN 978-0-9841630-0-7.
  7. "Inside Nehalem: Intel's Future Processor and System". Real World Technologies.
  8. "Intel Core i7 (Nehalem): Architecture By AMD?". Tom's Hardware. 14 October 2008. Retrieved 24 November 2010.
  9. "Inside Nehalem: Intel's Future Processor and System". Real World Technologies. Retrieved 24 November 2010.
  10. Srinivas, Suresh; Pawar, Uttam; Aribuki, Dunni; Manciu, Catalin; Schulhof, Gabriel; Prasad, Aravinda (1 November 2019). "Runtime Performance Optimization Blueprint: Intel® Architecture Optimization with Large Code Pages". Retrieved 22 October 2022.
  11. J. Smith and R. Nair. Virtual Machines: Versatile Platforms for Systems and Processes (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann Publishers Inc., 2005.
  12. Welsh, Matt. "MIPS r2000/r3000 Architecture". Archived from the original on 14 October 2008. Retrieved 16 November 2008. If no matching TLB entry is found, a TLB miss exception occurs
  13. SPARC International, Inc. The SPARC Architecture Manual, Version 9. PTR Prentice Hall.
  14. Sun Microsystems. UltraSPARC Architecture 2005. Draft D0.9.2, 19 June 2008. Sun Microsystems.
  15. Virtual Memory in the IA-64 Kernel > Translation Lookaside Buffer.
  16. Compaq Computer Corporation. Alpha Architecture Handbook (PDF). Version 4. Compaq Computer Corporation. Archived from the original (PDF) on 9 October 2014. Retrieved 1 December 2010.
  17. David A. Patterson; John L. Hennessy (2009). Computer Organization And Design. Hardware/Software interface. 4th edition. Burlington, MA 01803, USA: Morgan Kaufmann Publishers. p. 503. ISBN 978-0-12-374493-7.
  18. "Translation Lookaside Buffer (TLB) in Paging". GeeksforGeeks. 26 February 2019. Retrieved 10 February 2021.
  19. Ulrich Drepper (9 October 2014). "Memory part 3: Virtual Memory". LWN.net.
  20. David Kanter (17 March 2010). "Westmere Arrives". Real World Tech. Retrieved 6 January 2018.
  21. Intel Corporation (2017). "4.10.1 Process-Context Identifiers (PCIDs)". Intel 64 and IA-32 Architectures Software Developer's Manual (PDF). Vol. 3A: System Programming Guide, Part 1.
  22. Gil Tene (8 January 2018). "PCID is now a critical performance/security feature on x86". Retrieved 23 March 2018.
  23. D. Abramson; J. Jackson; S. Muthrasanallur; G. Neiger; G. Regnier; R. Sankaran; I. Schoinas; R. Uhlig; B. Vembu; J. Wiegert. "Intel Virtualization Technology for Directed I/O". Intel Technology Journal. 10 (3): 179–192.
  24. Advanced Micro Devices. AMD Secure Virtual Machine Architecture Reference Manual. Advanced Micro Devices, 2008.
  25. G. Neiger; A. Santoni; F. Leung; D. Rodgers; R. Uhlig. "Intel Virtualization Technology: Hardware Support for Efficient Processor Virtualization". Intel Technology Journal. 10 (3).
  26. Advanced Micro Devices. AMD Secure Virtual Machine Architecture Reference Manual. Advanced Micro Devices, 2008.
  27. "Longer-lived TLB Entries with PCID". Kernelnewbies. 30 December 2017. Retrieved 31 July 2023.