NOR flash replacement

Last updated

While flash memory remains one of the most popular storages in embedded systems because of its non-volatility, shock-resistance, small size, and low energy consumption, its application has grown much beyond its original design. Based on its original design, NOR flash memory is designed to store binary code of programs because it supports XIP (eXecute-In-Place) and high performance in read operations, while NAND flash memory is used as a data storage because of its lower price and higher performance in write/erase operations, compared to NOR flash. In recent years, the price of NAND flash has gone down much faster than that of NOR flash. Thus, to reduce the hardware cost ultimately, using NAND flash to replace NOR flash (motivated by a strong market demand) becomes a new trend in embedded-system designs, especially on mobile phones and arcade games.

Contents

Overview

The replacement depends on well-designed management of flash memory, which is carried out by either software on a host system (as a raw medium) or hardware circuits/firmware inside its devices. Here, an efficient prediction mechanism with limited memory-space requirements and an efficient implementation is proposed. The prediction mechanism collects the access patterns of program execution to construct a prediction graph by adopting the working set concept. According to the prediction graph, the prediction mechanism prefetches data (/code) to the SRAM cache, so as to reduce the cache miss rate. Therefore, the performance of the program execution is improved and the read performance gap between NAND and NOR is filled up effectively. Using NAND Flash for boot code requires the use of DRAM to shadow the code. [1]

Another method is the CPU loads a small boot loader from NAND flash, such small boot loader do initialize DRAM using CPU cache as RAM (SRAM), then load a full version of boot loader.

An effective prefetching strategy

Different from the popular caching ideas in the memory hierarchy, this approach aims at an application-oriented caching mechanism, which adopts prediction-assisted prefetching based on given execution traces of applications. The designs of embedded systems are considered with a limited set of applications, such as a set of selected system programs in mobile phones or arcade games of amusement-park machines. Besides, SRAM capacity and computing power are constrained in the implementation.

Hardware architecture

An Architecture for the Performance Improvement of NAND Flash Memory NR2.JPG
An Architecture for the Performance Improvement of NAND Flash Memory

Four essential components are included in the hardware design: host interface, SRAM (cache), NAND flash memory, and control logic. In order to fill up the performance gap between NAND and NOR, SRAM serves as a cache layer for data access over NAND. The host interface is responsible to the communication with the host system via address and data buses. Most importantly, the control logic manages the caching activity and provides the service emulation of NOR flash with NAND flash and SRAM; it must have an intelligent prediction mechanism implemented to improve the system performance. There are two major components in the control logic: The converter emulates NOR flash access over NAND flash with an SRAM cache, where address translation must be done from byte addressing (for NOR) to Logical Block Address (LBA) addressing (for NAND). Note that each 512B/2KB NAND page corresponds to one and four LBA's, respectively. The prefetch procedure tries to prefetch data from NAND to SRAM so that the hit rate of the NOR access is high over SRAM. The procedure should parse and extract the behavior of the target application via a set of collected traces. According to the extracted access patterns from the collected traces, the procedure generates prediction information, referred to as a prediction graph.

Prediction graph

The access pattern of an application execution over NOR (or NAND) consists of a sequence of LBA's. As an application runs for multiple times, the “virtually” complete picture of the possible access pattern of an application execution might appear. Since most application executions are input-dependent or data-driven, there can be more than one subsequent LBA's following a given LBA, where each LBA corresponds to one node in the graph. Nodes with more than one subsequent LBA's are called branch nodes, and the others are called regular nodes. The graph that corresponds to the access patterns is referred to as the prediction graph of the specific application. If pages in NAND flash could be prefetched in an on-time fashion, and there is enough SRAM space for caching, then all data accesses could be done over SRAM.

To save the prediction graph over flash memory with overheads (SRAM capacity) minimized, the subsequent LBA information of each regular node is saved at the spare area of the corresponding page. It is because that the spare area of a page in current implementations has unused space, and the reading of a page usually comes with the reading of its data and spare areas simultaneously. In such a way, the accessing of the subsequent LBA information of a regular node comes with no extra cost. Since a branch node has more than one subsequent LBA's, the spare area of the corresponding page might not have enough free space to store the information. Thus, a branch table is maintained to save the subsequent LBA information of all branch nodes. The starting entry address of the branch table that corresponds to a branch node can be saved at the spare area of the corresponding page. The starting entry records the number of subsequent LBA's of the branch node, and the subsequent LBA's are stored in the entries following the starting entry. The branch table can be saved on flash memory. During the run time, the entire table can be loaded into SRAM for better performance. If there is not enough SRAM space, parts of the table can be loaded in an on-demand fashion.

Prefetch procedure

The objective of the prefetch procedure is to prefetch data from NAND based on a given prediction graph such that most data accesses occur over SRAM. The basic idea is to prefetch data by following the LBA order in the graph. In order to efficiently look up a selected page in the cache, a cyclic queue is adopted in the cache management. Data prefetched from NAND flash is enqueued, while those transferred to the host is dequeued, on the other hand. The prefetch procedure is done in a greedy way: Let P1 be the last prefetched page. If P1 corresponds to a regular node, then the page that corresponds to the subsequent LBA is prefetched. If P1 corresponds to a branch node, then the procedure should prefetch pages by following all possible next LBA links in an equal base and a round-robin way.

2D NAND replacement by 3D NAND

As of 2021, 2D NAND replacement by 3D NAND has commenced. [2] 3D NAND has the advantages of lower cost per bit and the latest controller technology for better reliability. Consequently, even at the low GB level, 3D NAND is the preferred option for code storage. [3] [4]

Implementations

Modern iOS and Android smartphones, such as iPhone 6 and later, are usually not use a dedicated NOR flash in the boot process; in such boot process, the CPU / SoC built-in Boot ROM will find and load boot loader from NAND flash such as eUFS. Such boot process is called "NOR less".

Related Research Articles

<span class="mw-page-title-main">Computer memory</span> Computer component that stores information for immediate use

Computer memory stores information, such as data and programs, for immediate use in the computer. The term memory is often synonymous with the terms RAM,main memory, or primary storage. Archaic synonyms for main memory include core and store.

<span class="mw-page-title-main">Cache (computing)</span> Additional storage that enables faster access to main storage

In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.

<span class="mw-page-title-main">Kendall Square Research</span> Former American manufacturer of supercomputers

Kendall Square Research (KSR) was a supercomputer company headquartered originally in Kendall Square in Cambridge, Massachusetts in 1986, near Massachusetts Institute of Technology (MIT). It was co-founded by Steven Frank and Henry Burkhardt III, who had formerly helped found Data General and Encore Computer and was one of the original team that designed the PDP-8. KSR produced two models of supercomputer, the KSR1 and KSR2. It went bankrupt in 1994.

<span class="mw-page-title-main">Flash memory</span> Electronic non-volatile computer storage device

Flash memory is an electronic non-volatile computer memory storage medium that can be electrically erased and reprogrammed. The two main types of flash memory, NOR flash and NAND flash, are named for the NOR and NAND logic gates. Both use the same cell design, consisting of floating gate MOSFETs. They differ at the circuit level depending on whether the state of the bit line or word lines is pulled high or low: in NAND flash, the relationship between the bit line and the word lines resembles a NAND gate; in NOR flash, it resembles a NOR gate.

In computer science, self-modifying code is code that alters its own instructions while it is executing – usually to reduce the instruction path length and improve performance or simply to reduce otherwise repetitively similar code, thus simplifying maintenance. The term is usually only applied to code where the self-modification is intentional, not in situations where code accidentally modifies itself due to an error such as a buffer overflow.

Speculative execution is an optimization technique where a computer system performs some task that may not be needed. Work is done before it is known whether it is actually needed, so as to prevent a delay that would have to be incurred by doing the work after it is known that it is needed. If it turns out the work was not needed after all, most changes made by the work are reverted and the results are ignored.

In computer science, computer engineering and programming language implementations, a stack machine is a computer processor or a virtual machine in which the primary interaction is moving short-lived temporary values to and from a push down stack. In the case of a hardware processor, a hardware stack is used. The use of a stack significantly reduces the required number of processor registers. Stack machines extend push-down automata with additional load/store operations or multiple stacks and hence are Turing-complete.

<span class="mw-page-title-main">Bootloader</span> Software responsible for starting the Computer and Load other software to the CPU memory

A bootloader, also spelled as boot loader or called bootstrap loader, is a computer program that is responsible for booting a computer. If it also provides an interactive menu with multiple boot choices then it's often called a boot manager.

Explicitly parallel instruction computing (EPIC) is a term coined in 1997 by the HP–Intel alliance to describe a computing paradigm that researchers had been investigating since the early 1980s. This paradigm is also called Independence architectures. It was the basis for Intel and HP development of the Intel Itanium architecture, and HP later asserted that "EPIC" was merely an old term for the Itanium architecture. EPIC permits microprocessors to execute software instructions in parallel by using the compiler, rather than complex on-die circuitry, to control parallel instruction execution. This was intended to allow simple performance scaling without resorting to higher clock frequencies.

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels, with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far the largest part of them by chip area, but SRAM is not always used for all levels, or even any level, sometimes some latter or all levels are implemented with eDRAM.

Reading is an action performed by computers, to acquire data from a source and place it into their volatile memory for processing. Computers may read information from a variety of sources, such as magnetic storage, the Internet, or audio and video input ports. Reading is one of the core functions of a Turing machine.

In computer science, execute in place (XIP) is a method of executing programs directly from long-term storage rather than copying it into RAM. It is an extension of using shared memory to reduce the total amount of memory required.

The Prefetcher is a component of Microsoft Windows which was introduced in Windows XP. It is a component of the Memory Manager that can speed up the Windows boot process and shorten the amount of time it takes to start up programs. It accomplishes this by caching files that are needed by an application to RAM as the application is launched, thus consolidating disk reads and reducing disk seeks. This feature was covered by US patent 6,633,968.

Memory timings or RAM timings describe the timing information of a memory module or the onboard LPDDRx. Due to the inherent qualities of VLSI and microelectronics, memory chips require time to fully execute commands. Executing commands too quickly will result in data corruption and results in system instability. With appropriate time between commands, memory modules/chips can be given the opportunity to fully switch transistors, charge capacitors and correctly signal back information to the memory controller. Because system performance depends on how fast memory can be used, this timing directly affects the performance of the system.

<span class="mw-page-title-main">Read-only memory</span> Electronic memory that cannot be changed

Read-only memory (ROM) is a type of non-volatile memory used in computers and other electronic devices. Data stored in ROM cannot be electronically modified after the manufacture of the memory device. Read-only memory is useful for storing software that is rarely changed during the life of the system, also known as firmware. Software applications, such as video games, for programmable devices can be distributed as plug-in cartridges containing ROM.

<span class="mw-page-title-main">PA-8000</span> HP microprocessor

The PA-8000 (PCX-U), code-named Onyx, is a microprocessor developed and fabricated by Hewlett-Packard (HP) that implemented the PA-RISC 2.0 instruction set architecture (ISA). It was a completely new design with no circuitry derived from previous PA-RISC microprocessors. The PA-8000 was introduced on 2 November 1995 when shipments began to members of the Precision RISC Organization (PRO). It was used exclusively by PRO members and was not sold on the merchant market. All follow-on PA-8x00 processors are based on the basic PA-8000 processor core.

This glossary of computer hardware terms is a list of definitions of terms and concepts related to computer hardware, i.e. the physical and structural components of computers, architectural issues, and peripheral devices.

In computing, a cache control instruction is a hint embedded in the instruction stream of a processor intended to improve the performance of hardware caches, using foreknowledge of the memory access pattern supplied by the programmer or compiler. They may reduce cache pollution, reduce bandwidth requirement, bypass latencies, by providing better control over the working set. Most cache control instructions do not affect the semantics of a program, although some can.

In computing, a memory access pattern or IO access pattern is the pattern with which a system or program reads and writes memory on secondary storage. These patterns differ in the level of locality of reference and drastically affect cache performance, and also have implications for the approach to parallelism and distribution of workload in shared memory systems. Further, cache coherency issues can affect multiprocessor performance, which means that certain memory access patterns place a ceiling on parallelism.

Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed. Most modern computer processors have fast and local cache memory in which prefetched data is held until it is required. The source for the prefetch operation is usually main memory. Because of their design, accessing cache memories is typically much faster than accessing main memory, so prefetching data and then accessing it from caches is usually many orders of magnitude faster than accessing it directly from main memory. Prefetching can be done with non-blocking cache control instructions.

References

  1. Booting from NAND Flash Memory
  2. https://media.digikey.com/pdf/PCNs/Delkin/EOL_412-0068-00_Rev-A.pdf [ bare URL PDF ]
  3. https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2013/20130814_204C_Thomson.pdf [ bare URL PDF ]
  4. "Greenliant > Solid State Storage > eMMC NANDrive > GLS85VM1002E". www.greenliant.com. Retrieved 2024-07-08.