In a computer operating system that uses paging for virtual memory management, page replacement algorithms decide which memory pages to page out, sometimes called swap out, or write to disk, when a page of memory needs to be allocated. Page replacement happens when a requested page is not in memory (page fault) and a free page cannot be used to satisfy the allocation, either because there are none, or because the number of free pages is lower than some threshold.
When the page that was selected for replacement and paged out is referenced again it has to be paged in (read in from disk), and this involves waiting for I/O completion. This determines the quality of the page replacement algorithm: the less time waiting for page-ins, the better the algorithm. A page replacement algorithm looks at the limited information about accesses to the pages provided by hardware, and tries to guess which pages should be replaced to minimize the total number of page misses, while balancing this with the costs (primary storage and processor time) of the algorithm itself.
The page replacing problem is a typical online problem from the competitive analysis perspective in the sense that the optimal deterministic algorithm is known.
Page replacement algorithms were a hot topic of research and debate in the 1960s and 1970s. That mostly ended with the development of sophisticated LRU (least recently used) approximations and working set algorithms. Since then, some basic assumptions made by the traditional page replacement algorithms were invalidated, resulting in a revival of research. In particular, the following trends in the behavior of underlying hardware and user-level software have affected the performance of page replacement algorithms:
Requirements for page replacement algorithms have changed due to differences in operating system kernel architectures. In particular, most modern OS kernels have unified virtual memory and file system caches, requiring the page replacement algorithm to select a page from among the pages of both user program virtual address spaces and cached files. The latter pages have specific properties. For example, they can be locked, or can have write ordering requirements imposed by journaling. Moreover, as the goal of page replacement is to minimize total time waiting for memory, it has to take into account memory requirements imposed by other kernel sub-systems that allocate memory. As a result, page replacement in modern kernels (Linux, FreeBSD, and Solaris) tends to work at the level of a general purpose kernel memory allocator, rather than at the higher level of a virtual memory subsystem.
Replacement algorithms can be local or global.
When a process incurs a page fault, a local page replacement algorithm selects for replacement some page that belongs to that same process (or a group of processes sharing a memory partition). A global replacement algorithm is free to select any page in memory.
Local page replacement assumes some form of memory partitioning that determines how many pages are to be assigned to a given process or a group of processes. Most popular forms of partitioning are fixed partitioning and balanced set algorithms based on the working set model. The advantage of local page replacement is its scalability: each process can handle its page faults independently, leading to more consistent performance for that process. However global page replacement is more efficient on an overall system basis. [1]
Modern general purpose computers and some embedded processors have support for virtual memory. Each process has its own virtual address space. A page table maps a subset of the process virtual addresses to physical addresses. In addition, in most architectures the page table holds an "access" bit and a "dirty" bit for each page in the page table. The CPU sets the access bit when the process reads or writes memory in that page. The CPU sets the dirty bit when the process writes memory in that page. The operating system can modify the access and dirty bits. The operating system can detect accesses to memory and files through the following means:
read
and write
in POSIX.Most replacement algorithms simply return the target page as their result. This means that if target page is dirty (that is, contains data that have to be written to the stable storage before page can be reclaimed), I/O has to be initiated to send that page to the stable storage (to clean the page). In the early days of virtual memory, time spent on cleaning was not of much concern, because virtual memory was first implemented on systems with full duplex channels to the stable storage, and cleaning was customarily overlapped with paging. Contemporary commodity hardware, on the other hand, does not support full duplex transfers, and cleaning of target pages becomes an issue.
To deal with this situation, various precleaning policies are implemented. Precleaning is the mechanism that starts I/O on dirty pages that are (likely) to be replaced soon. The idea is that by the time the precleaned page is actually selected for the replacement, the I/O will complete and the page will be clean. Precleaning assumes that it is possible to identify pages that will be replaced next. Precleaning that is too eager can waste I/O bandwidth by writing pages that manage to get re-dirtied before being selected for replacement.
The (h,k)-paging problem is a generalization of the model of paging problem: Let h,k be positive integers such that . We measure the performance of an algorithm with cache of size relative to the theoretically optimal page replacement algorithm. If , we provide the optimal page replacement algorithm with strictly less resource.
The (h,k)-paging problem is a way to measure how an online algorithm performs by comparing it with the performance of the optimal algorithm, specifically, separately parameterizing the cache size of the online algorithm and optimal algorithm.
This section may be confusing or unclear to readers.(December 2015) |
Marking algorithms is a general class of paging algorithms. For each page, we associate it with a bit called its mark. Initially, we set all pages as unmarked. During a stage (a period of operation or a sequence of requests) of page requests, we mark a page when it is first requested in this stage. A marking algorithm is such an algorithm that never pages out a marked page.
If ALG is a marking algorithm with a cache of size k, and OPT is the optimal algorithm with a cache of size h, where , then ALG is -competitive. So every marking algorithm attains the -competitive ratio.
LRU is a marking algorithm while FIFO is not a marking algorithm.
An algorithm is conservative, if on any consecutive request sequence containing k or fewer distinct page references, the algorithm will incur k or fewer page faults.
If ALG is a conservative algorithm with a cache of size k, and OPT is the optimal algorithm with a cache of , then ALG is -competitive. So every conservative algorithm attains the -competitive ratio.
LRU, FIFO and CLOCK are conservative algorithms.
There are a variety of page replacement algorithms: [2]
The theoretically optimal page replacement algorithm (also known as OPT, clairvoyant replacement algorithm, or Bélády's optimal page replacement policy) [3] [4] [2] is an algorithm that works as follows: when a page needs to be swapped in, the operating system swaps out the page whose next use will occur farthest in the future. For example, a page that is not going to be used for the next 6 seconds will be swapped out over a page that is going to be used within the next 0.4 seconds.
This algorithm cannot be implemented in a general purpose operating system because it is impossible to compute reliably how long it will be before a page is going to be used, except when all software that will run on a system is either known beforehand and is amenable to static analysis of its memory reference patterns, or only a class of applications allowing run-time analysis. Despite this limitation, algorithms exist [5] that can offer near-optimal performance — the operating system keeps track of all pages referenced by the program, and it uses those data to decide which pages to swap in and out on subsequent runs. This algorithm can offer near-optimal performance, but not on the first run of a program, and only if the program's memory reference pattern is relatively consistent each time it runs.
Analysis of the paging problem has also been done in the field of online algorithms. Efficiency of randomized online algorithms for the paging problem is measured using amortized analysis.
The not recently used (NRU) page replacement algorithm is an algorithm that favours keeping pages in memory that have been recently used. This algorithm works on the following principle: when a page is referenced, a referenced bit is set for that page, marking it as referenced. Similarly, when a page is modified (written to), a modified bit is set. The setting of the bits is usually done by the hardware, although it is possible to do so on the software level as well.
At a certain fixed time interval, a timer interrupt triggers and clears the referenced bit of all the pages, so only pages referenced within the current timer interval are marked with a referenced bit. When a page needs to be replaced, the operating system divides the pages into four classes:
Although it does not seem possible for a page to be modified yet not referenced, this happens when a class 3 page has its referenced bit cleared by the timer interrupt. The NRU algorithm picks a random page from the lowest category for removal. So out of the above four page categories, the NRU algorithm will replace a not-referenced, not-modified page if such a page exists. Note that this algorithm implies that a modified but not-referenced (within the last timer interval) page is less important than a not-modified page that is intensely referenced.
NRU is a marking algorithm, so it is -competitive.
The simplest page-replacement algorithm is a FIFO algorithm. The first-in, first-out (FIFO) page replacement algorithm is a low-overhead algorithm that requires little bookkeeping on the part of the operating system. The idea is obvious from the name – the operating system keeps track of all the pages in memory in a queue, with the most recent arrival at the back, and the oldest arrival in front. When a page needs to be replaced, the page at the front of the queue (the oldest page) is selected. While FIFO is cheap and intuitive, it performs poorly in practical application. Thus, it is rarely used in its unmodified form. This algorithm experiences Bélády's anomaly. In simple words, on a page fault, the frame that has been in memory the longest is replaced.
FIFO page replacement algorithm is used by the OpenVMS operating system, with some modifications. [6] Partial second chance is provided by skipping a limited number of entries with valid translation table references, [7] and additionally, pages are displaced from process working set to a systemwide pool from which they can be recovered if not already re-used.
FIFO is a conservative algorithm, so it is -competitive.
A modified form of the FIFO page replacement algorithm, known as the Second-chance page replacement algorithm, fares relatively better than FIFO at little cost for the improvement. It works by looking at the front of the queue as FIFO does, but instead of immediately paging out that page, it checks to see if its referenced bit is set. If it is not set, the page is swapped out. Otherwise, the referenced bit is cleared, the page is inserted at the back of the queue (as if it were a new page) and this process is repeated. This can also be thought of as a circular queue. If all the pages have their referenced bit set, on the second encounter of the first page in the list, that page will be swapped out, as it now has its referenced bit cleared. If all the pages have their reference bit cleared, then second chance algorithm degenerates into pure FIFO.
As its name suggests, Second-chance gives every page a "second-chance" – an old page that has been referenced is probably in use, and should not be swapped out over a new page that has not been referenced.
Clock is a more efficient version of FIFO than Second-chance because pages don't have to be constantly pushed to the back of the list, but it performs the same general function as Second-Chance. The clock algorithm keeps a circular list of pages in memory, with the "hand" (iterator) pointing to the last examined page frame in the list. When a page fault occurs and no empty frames exist, then the R (referenced) bit is inspected at the hand's location. If R is 0, the new page is put in place of the page the "hand" points to, and the hand is advanced one position. Otherwise, the R bit is cleared, then the clock hand is incremented and the process is repeated until a page is replaced. [8] This algorithm was first described in 1969 by Fernando J. Corbató. [9]
CLOCK is a conservative algorithm, so it is -competitive.
The least recently used (LRU) page replacement algorithm, though similar in name to NRU, differs in the fact that LRU keeps track of page usage over a short period of time, while NRU just looks at the usage in the last clock interval. LRU works on the idea that pages that have been most heavily used in the past few instructions are most likely to be used heavily in the next few instructions too. While LRU can provide near-optimal performance in theory (almost as good as adaptive replacement cache), it is rather expensive to implement in practice. There are a few implementation methods for this algorithm that try to reduce the cost yet keep as much of the performance as possible.
The most expensive method is the linked list method, which uses a linked list containing all the pages in memory. At the back of this list is the least recently used page, and at the front is the most recently used page. The cost of this implementation lies in the fact that items in the list will have to be moved about every memory reference, which is a very time-consuming process.
Another method that requires hardware support is as follows: suppose the hardware has a 64-bit counter that is incremented at every instruction. Whenever a page is accessed, it acquires the value equal to the counter at the time of page access. Whenever a page needs to be replaced, the operating system selects the page with the lowest counter and swaps it out.
Because of implementation costs, one may consider algorithms (like those that follow) that are similar to LRU, but which offer cheaper implementations.
One important advantage of the LRU algorithm is that it is amenable to full statistical analysis. It has been proven, for example, that LRU can never result in more than N-times more page faults than OPT algorithm, where N is proportional to the number of pages in the managed pool.
On the other hand, LRU's weakness is that its performance tends to degenerate under many quite common reference patterns. For example, if there are N pages in the LRU pool, an application executing a loop over array of N + 1 pages will cause a page fault on each and every access. As loops over large arrays are common, much effort has been put into modifying LRU to work better in such situations. Many of the proposed LRU modifications try to detect looping reference patterns and to switch into suitable replacement algorithm, like Most Recently Used (MRU).
A comparison of ARC with other algorithms (LRU, MQ, 2Q, LRU-2, LRFU, LIRS) can be found in Megiddo & Modha 2004. [19]
LRU is a marking algorithm, so it is -competitive.
Random replacement algorithm replaces a random page in memory. This eliminates the overhead cost of tracking page references. Usually it fares better than FIFO, and for looping memory references it is better than LRU, although generally LRU performs better in practice. OS/390 uses global LRU approximation and falls back to random replacement when LRU performance degenerates, and the Intel i860 processor used a random replacement policy (Rhodehamel 1989 [20] ).
The not frequently used (NFU) page replacement algorithm requires a counter, and every page has one counter of its own which is initially set to 0. At each clock interval, all pages that have been referenced within that interval will have their counter incremented by 1. In effect, the counters keep track of how frequently a page has been used. Thus, the page with the lowest counter can be swapped out when necessary.
The main problem with NFU is that it keeps track of the frequency of use without regard to the time span of use. Thus, in a multi-pass compiler, pages which were heavily used during the first pass, but are not needed in the second pass will be favoured over pages which are comparably lightly used in the second pass, as they have higher frequency counters. This results in poor performance. Other common scenarios exist where NFU will perform similarly, such as an OS boot-up. Thankfully, a similar and better algorithm exists, and its description follows.
The not frequently used page-replacement algorithm generates fewer page faults than the least recently used page replacement algorithm when the page table contains null pointer values.
The aging algorithm is a descendant of the NFU algorithm, with modifications to make it aware of the time span of use. Instead of just incrementing the counters of pages referenced, putting equal emphasis on page references regardless of the time, the reference counter on a page is first shifted right (divided by 2), before adding the referenced bit to the left of that binary number. For instance, if a page has referenced bits 1,0,0,1,1,0 in the past 6 clock ticks, its referenced counter will look like this in chronological order: 10000000, 01000000, 00100000, 10010000, 11001000, 01100100. Page references closer to the present time have more impact than page references long ago. This ensures that pages referenced more recently, though less frequently referenced, will have higher priority over pages more frequently referenced in the past. Thus, when a page needs to be swapped out, the page with the lowest counter will be chosen.
The following Python code simulates the aging algorithm. Counters are initialized with 0 and updated as described above via , using arithmetic shift operators.
fromcollections.abcimportSequencedefsimulate_aging(Rs:Sequence,k:int)->None:# Simulate agingprint(" t | R-bits (0-{length}) | Counters for pages 0-{length}".format(length=len(Rs)))Vs=[0]*len(Rs[0])fort,Rinenumerate(Rs):Vs[:]=[R[i]<<(k-1)|V>>1fori,Vinenumerate(Vs)]print("{:02d} | {} | [{}]".format(t,R,", ".join(["{:0{}b}".format(V,k)forVinVs])))
In the given example of R-bits for 6 pages over 5 clock ticks, the function prints the following output, which lists the R-bits for each clock tick t and the individual counter values for each page in binary representation. [21]
>>> Rs=[[1,0,1,0,1,1],[1,1,0,0,1,0],[1,1,0,1,0,1],[1,0,0,0,1,0],[0,1,1,0,0,0]]>>> k=8>>> simulate_aging(Rs,k) t | R-bits (0-5) | Counters for pages 0-500 | [1, 0, 1, 0, 1, 1] | [10000000, 00000000, 10000000, 00000000, 10000000, 10000000]01 | [1, 1, 0, 0, 1, 0] | [11000000, 10000000, 01000000, 00000000, 11000000, 01000000]02 | [1, 1, 0, 1, 0, 1] | [11100000, 11000000, 00100000, 10000000, 01100000, 10100000]03 | [1, 0, 0, 0, 1, 0] | [11110000, 01100000, 00010000, 01000000, 10110000, 01010000]04 | [0, 1, 1, 0, 0, 0] | [01111000, 10110000, 10001000, 00100000, 01011000, 00101000]
Note that aging differs from LRU in the sense that aging can only keep track of the references in the latest 16/32 (depending on the bit size of the processor's integers) time intervals. Consequently, two pages may have referenced counters of 00000000, even though one page was referenced 9 intervals ago and the other 1000 intervals ago. Generally speaking, knowing the usage within the past 16 intervals is sufficient for making a good decision as to which page to swap out. Thus, aging can offer near-optimal performance for a moderate price.
The basic idea behind this algorithm is Locality of Reference as used in LRU but the difference is that in LDF, locality is based on distance not on the used references. In the LDF, replace the page which is on longest distance from the current page. If two pages are on same distance then the page which is next to current page in anti-clock rotation will get replaced.[ citation needed ]
Many of the techniques discussed above assume the presence of a reference bit associated with each page. Some hardware has no such bit, so its efficient use requires techniques that operate well without one.
One notable example is VAX hardware running OpenVMS. This system knows if a page has been modified, but not necessarily if a page has been read. Its approach is known as Secondary Page Caching. Pages removed from working sets (process-private memory, generally) are placed on special-purpose lists while remaining in physical memory for some time. Removing a page from a working set is not technically a page-replacement operation, but effectively identifies that page as a candidate. A page whose backing store is still valid (whose contents are not dirty, or otherwise do not need to be preserved) is placed on the tail of the Free Page List. A page that requires writing to backing store will be placed on the Modified Page List. These actions are typically triggered when the size of the Free Page List falls below an adjustable threshold.
Pages may be selected for working set removal in an essentially random fashion, with the expectation that if a poor choice is made, a future reference may retrieve that page from the Free or Modified list before it is removed from physical memory. A page referenced this way will be removed from the Free or Modified list and placed back into a process working set. The Modified Page List additionally provides an opportunity to write pages out to backing store in groups of more than one page, increasing efficiency. These pages can then be placed on the Free Page List. The sequence of pages that works its way to the head of the Free Page List resembles the results of a LRU or NRU mechanism and the overall effect has similarities to the Second-Chance algorithm described earlier.
Another example is used by the Linux kernel on ARM. The lack of hardware functionality is made up for by providing two page tables – the processor-native page tables, with neither referenced bits nor dirty bits, and software-maintained page tables with the required bits present. The emulated bits in the software-maintained table are set by page faults. In order to get the page faults, clearing emulated bits in the second table revokes some of the access rights to the corresponding page, which is implemented by altering the native table.
Linux uses a unified page cache for
brk
and anonymous mmap
ed-regions. This includes the heap and stack of user-space programs. It is written to swap when paged out.mmap
ed regions. If present in memory and not privately modified the physical page is shared with file cache or buffer.shm_open
.The unified page cache operates on units of the smallest page size supported by the CPU (4 KiB in ARMv8, x86 and x86-64) with some pages of the next larger size (2 MiB in x86-64) called "huge pages" by Linux. The pages in the page cache are divided in an "active" set and an "inactive" set. Both sets keep a LRU list of pages. In the basic case, when a page is accessed by a user-space program it is put in the head of the inactive set. When it is accessed repeatedly, it is moved to the active list. Linux moves the pages from the active set to the inactive set as needed so that the active set is smaller than the inactive set. When a page is moved to the inactive set it is removed from the page table of any process address space, without being paged out of physical memory. [22] [23] When a page is removed from the inactive set, it is paged out of physical memory. The size of the "active" and "inactive" list can be queried from /proc/meminfo
in the fields "Active", "Inactive", "Active(anon)", "Inactive(anon)", "Active(file)" and "Inactive(file)".
The working set of a process is the set of pages expected to be used by that process during some time interval.
The "working set model" isn't a page replacement algorithm in the strict sense (it's actually a kind of medium-term scheduler)[ clarification needed ]
In computing, virtual memory, or virtual storage, is a memory management technique that provides an "idealized abstraction of the storage resources that are actually available on a given machine" which "creates the illusion to users of a very large (main) memory".
In computing, scheduling is the action of assigning resources to perform tasks. The resources may be processors, network links or expansion cards. The tasks may be threads, processes or data flows.
In computer operating systems, memory paging is a memory management scheme by which a computer stores and retrieves data from secondary storage for use in main memory. In this scheme, the operating system retrieves data from secondary storage in same-size blocks called pages. Paging is an important part of virtual memory implementations in modern operating systems, using secondary storage to let programs exceed the size of available physical memory.
A translation lookaside buffer (TLB) is a memory cache that stores the recent translations of virtual memory to physical memory. It is used to reduce the time taken to access a user memory location. It can be called an address-translation cache. It is a part of the chip's memory-management unit (MMU). A TLB may reside between the CPU and the CPU cache, between CPU cache and the main memory or between the different levels of the multi-level cache. The majority of desktop, laptop, and server processors include one or more TLBs in the memory-management hardware, and it is nearly always present in any processor that utilizes paged or segmented virtual memory.
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed ; the more items added, the larger the probability of false positives.
In computer science, compare-and-swap (CAS) is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response, or by returning the value read from the memory location.
In computer science, thrashing occurs in a system with virtual memory when a computer's real storage resources are overcommitted, leading to a constant state of paging and page faults, slowing most application-level processing. This causes the performance of the computer to degrade or collapse. The situation can continue indefinitely until the user closes some running applications or the active processes free up additional virtual memory resources.
In computing, cache replacement policies are optimizing instructions or algorithms which a computer program or hardware-maintained structure can utilize to manage a cache of information. Caching improves performance by keeping recent or often-used data items in memory locations which are faster, or computationally cheaper to access, than normal memory stores. When the cache is full, the algorithm must choose which items to discard to make room for new data.
In concurrent programming, an operation is linearizable if it consists of an ordered list of invocation and response events, that may be extended by adding response events such that:
The THE multiprogramming system or THE OS was a computer operating system designed by a team led by Edsger W. Dijkstra, described in monographs in 1965-66 and published in 1968. Dijkstra never named the system; "THE" is simply the abbreviation of "Technische Hogeschool Eindhoven", then the name of the Eindhoven University of Technology of the Netherlands. The THE system was primarily a batch system that supported multitasking; it was not designed as a multi-user operating system. It was much like the SDS 940, but "the set of processes in the THE system was static".
In computing, a cache-oblivious algorithm is an algorithm designed to take advantage of a processor cache without having the size of the cache as an explicit parameter. An optimal cache-oblivious algorithm is a cache-oblivious algorithm that uses the cache optimally. Thus, a cache-oblivious algorithm is designed to perform well, without modification, on multiple machines with different cache sizes, or for a memory hierarchy with different levels of cache having different sizes. Cache-oblivious algorithms are contrasted with explicit loop tiling, which explicitly breaks a problem into blocks that are optimally sized for a given cache.
Pseudo-LRU or PLRU is a family of cache algorithms which improve on the performance of the Least Recently Used (LRU) algorithm by replacing values using approximate measures of age rather than maintaining the exact age of every value in the cache.
Working set is a concept in computer science which defines the amount of memory that a process requires in a given time interval.
In computer storage, Bélády's anomaly is the phenomenon in which increasing the number of page frames results in an increase in the number of page faults for certain memory access patterns. This phenomenon is commonly experienced when using the first-in first-out (FIFO) page replacement algorithm. In FIFO, the page fault may or may not increase as the page frames increase, but in optimal and stack-based algorithms like LRU, as the page frames increase, the page fault decreases. László Bélády demonstrated this in 1969.
Adaptive Replacement Cache (ARC) is a page replacement algorithm with better performance than LRU. This is accomplished by keeping track of both frequently used and recently used pages plus a recent eviction history for both. The algorithm was developed at the IBM Almaden Research Center. In 2006, IBM was granted a patent for the adaptive replacement cache policy.
Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition and in seemingly unrelated problems such as counting the paths through a graph. Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors.
In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.
In applied mathematics, a bit-reversal permutation is a permutation of a sequence of items, where is a power of two. It is defined by indexing the elements of the sequence by the numbers from to , representing each of these numbers by its binary representation, and mapping each item to the item whose representation has the same bits in the reversed order.
LIRS is a page replacement algorithm with an improved performance over LRU and many other newer replacement algorithms. This is achieved by using "reuse distance" as the locality metric for dynamically ranking accessed pages to make a replacement decision.
A CPU cache is a piece of hardware that reduces access time to data in memory by keeping some part of the frequently used data of the main memory in a 'cache' of smaller and faster memory.
/mm/workingset.c
in the Linux source