Approximate computing

Approximate computing is an emerging paradigm for energy-efficient and/or high-performance design. [1] It includes a plethora of computation techniques that return a possibly inaccurate result rather than a guaranteed accurate one, and that can be used for applications where an approximate result is sufficient. [2] One example of such a situation is a search engine, where no exact answer may exist for a given query and hence many answers may be acceptable. Similarly, occasional dropping of some frames in a video application can go undetected due to the perceptual limitations of humans. Approximate computing is based on the observation that in many scenarios, although performing exact computation requires large amounts of resources, allowing bounded approximation can provide disproportionate gains in performance and energy while still achieving acceptable result accuracy. For example, in the k-means clustering algorithm, allowing only a 5% loss in classification accuracy can provide a 50-fold energy saving compared to fully accurate classification.

The key requirement in approximate computing is that approximation be introduced only in non-critical data, since approximating critical data (e.g., control operations) can lead to disastrous consequences such as a program crash or erroneous output.

Strategies

Several strategies can be used for performing approximate computing.

Approximate circuits
Approximate arithmetic circuits: [3] adders, [4] [5] multipliers [6] and other logic circuits can reduce hardware overhead. [7] [8] [9] For example, an approximate multi-bit adder can ignore the carry chain and thus allow all of its sub-adders to perform their additions in parallel. [10] [11]
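The carry-ignoring adder described above can be sketched in software. This is a minimal illustrative model, not any particular published design: the operands are split into fixed-width blocks, each block is added independently, and the carry out of each block is simply discarded, which is what lets the sub-adders operate in parallel in hardware.

```python
def approximate_add(a: int, b: int, block: int = 4, width: int = 16) -> int:
    """Add two unsigned integers block-by-block, discarding the carry
    between blocks so that all sub-adders could run in parallel."""
    mask = (1 << block) - 1
    result = 0
    for shift in range(0, width, block):
        s = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (s & mask) << shift  # drop the carry out of this block
    return result

print(approximate_add(3, 4))              # 7: no carries cross a block boundary
print(approximate_add(0x00FF, 0x0001))    # 240 (0x00F0), not the exact 256
```

The second call shows the error mode: the carry produced by the lowest block is lost, so the result is wrong exactly when a carry would have crossed a block boundary, which is rare for random operands.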
Approximate storage and memory
Instead of storing data values exactly, they can be stored approximately, e.g., by truncating the lower-order bits of floating-point data. Another method is to accept less reliable memory: in DRAM [12] and eDRAM, refresh rates can be lowered or controlled, [13] and in SRAM, the supply voltage can be lowered [14] or controlled. [15] Approximate storage can also be applied to reduce MRAM's high write energy consumption. [16] In general, error detection and correction mechanisms can be relaxed or disabled, since occasional bit errors are tolerable.
Software-level approximation
There are several ways to approximate at the software level. Memoization or fuzzy memoization (approximate retrieval of previously computed results from a cache, e.g. using a vector database, i.e. fuzzy caching) can be applied. Some iterations of loops can be skipped (termed loop perforation) to achieve a result faster. Some tasks can also be skipped, for example when a run-time condition suggests that those tasks are not going to be useful (task skipping). Monte Carlo algorithms and randomized algorithms trade correctness for execution-time guarantees. [17] The computation can also be reformulated according to paradigms that are easily accelerated on specialized hardware, e.g. a neural processing unit. [18]
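Loop perforation is the easiest of these techniques to show in code. The sketch below (function and variable names are illustrative) estimates the mean of a large array while visiting only every fourth element, doing roughly a quarter of the work for a slightly noisier answer.

```python
import random

def mean_brightness(values, perforation=4):
    """Loop perforation: visit only every `perforation`-th element,
    trading a little accuracy for ~4x less work."""
    sampled = values[::perforation]
    return sum(sampled) / len(sampled)

random.seed(0)
pixels = [random.randint(0, 255) for _ in range(100_000)]

exact = sum(pixels) / len(pixels)
approx = mean_brightness(pixels)
print(exact, approx)  # the two means typically differ by well under 1%
```

For aggregate statistics like this, the perforated loop's error shrinks with the number of elements actually visited, which is why loop perforation works well on large, statistically regular data.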
Approximate system
In an approximate system, [19] [20] different subsystems such as the processor, memory, sensors, and communication modules are synergistically approximated to obtain a much better system-level quality-energy (Q-E) trade-off curve than approximating each subsystem individually.

Application areas

Approximate computing has been used in a variety of domains where the applications are error-tolerant, such as multimedia processing, machine learning, signal processing, and scientific computing. It is therefore mostly driven by applications that are related to human perception/cognition and have inherent error resilience. Many of these applications are based on statistical or probabilistic computation, so different approximations can be made to better suit the desired objectives. [21] One notable application in machine learning is Google's use of this approach in its Tensor Processing Units (TPUs, custom ASICs). [22]

Derived paradigms

The main issue in approximate computing is identifying the sections of an application that can be approximated. In large-scale applications, experts in approximate computing techniques commonly lack expertise in the application domain, and vice versa. To address this problem, programming paradigms [23] have been proposed. They all share a clear separation of roles between the application programmer and the application domain expert. These approaches allow the most common optimizations and approximate computing techniques to spread.

Related Research Articles

Fast Fourier transform: O(N log N) discrete Fourier transform algorithm

A fast Fourier transform (FFT) is an algorithm that computes the Discrete Fourier Transform (DFT) of a sequence, or its inverse (IDFT). A Fourier transform converts a signal from its original domain to a representation in the frequency domain and vice versa. The DFT is obtained by decomposing a sequence of values into components of different frequencies. This operation is useful in many fields, but computing it directly from the definition is often too slow to be practical. An FFT rapidly computes such transformations by factorizing the DFT matrix into a product of sparse factors. As a result, it manages to reduce the complexity of computing the DFT from O(n²), which arises if one simply applies the definition of DFT, to O(n log n), where n is the data size. The difference in speed can be enormous, especially for long data sets where n may be in the thousands or millions. In the presence of round-off error, many FFT algorithms are much more accurate than evaluating the DFT definition directly or indirectly. There are many different FFT algorithms based on a wide range of published theories, from simple complex-number arithmetic to group theory and number theory.
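The complexity gap can be made concrete with a minimal pure-Python sketch (function names are illustrative): a direct O(n²) DFT computed straight from the definition, checked against a radix-2 Cooley-Tukey FFT on the same input.

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT, straight from the definition."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft(x):
    """Radix-2 Cooley-Tukey FFT, O(n log n); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    half = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + half[k] for k in range(n // 2)] +
            [even[k] - half[k] for k in range(n // 2)])

x = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 0.5, 2.5]
print(max(abs(a - b) for a, b in zip(dft(x), fft(x))))  # agrees to round-off
```

Both routines return the same spectrum up to floating-point round-off; the FFT simply reuses the even/odd half-transforms instead of recomputing every term.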

Theoretical computer science: Subfield of computer science and mathematics

Theoretical computer science is a subfield of computer science and mathematics that focuses on the abstract and mathematical foundations of computation.

Checkpointing is a technique that provides fault tolerance for computing systems. It involves saving a snapshot of an application's state, so that it can restart from that point in case of failure. This is particularly important for long-running applications that are executed in failure-prone computing systems.

In computer science and mathematical optimization, a metaheuristic is a higher-level procedure or heuristic designed to find, generate, tune, or select a heuristic that may provide a sufficiently good solution to an optimization problem or a machine learning problem, especially with incomplete or imperfect information or limited computation capacity. Metaheuristics sample a subset of solutions which is otherwise too large to be completely enumerated or otherwise explored. Metaheuristics may make relatively few assumptions about the optimization problem being solved and so may be usable for a variety of problems. Their use is always of interest when exact or other (approximate) methods are not available or are not expedient, either because the calculation time is too long or because, for example, the solution provided is too imprecise.

Iterative reconstruction

Iterative reconstruction refers to iterative algorithms used to reconstruct 2D and 3D images in certain imaging techniques. For example, in computed tomography an image must be reconstructed from projections of an object. Here, iterative reconstruction techniques are usually a better but computationally more expensive alternative to the common filtered back projection (FBP) method, which calculates the image directly in a single reconstruction step. Recent research has shown that extremely fast computation and massive parallelism are possible for iterative reconstruction, which makes it practical for commercialization.

General-purpose computing on graphics processing units is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.

The expression computational intelligence (CI) usually refers to the ability of a computer to learn a specific task from data or experimental observation. Even though it is commonly considered a synonym of soft computing, there is still no commonly accepted definition of computational intelligence.

Hardware acceleration: Specialized computer hardware

Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently when compared to software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

In computational complexity theory, the unique games conjecture is a conjecture made by Subhash Khot in 2002. The conjecture postulates that the problem of determining the approximate value of a certain type of game, known as a unique game, has NP-hard computational complexity. It has broad applications in the theory of hardness of approximation. If the unique games conjecture is true and P ≠ NP, then for many important problems it is not only impossible to get an exact solution in polynomial time, but also impossible to get a good polynomial-time approximation. The problems for which such an inapproximability result would hold include constraint satisfaction problems, which crop up in a wide variety of disciplines.

Matching pursuit: Multidimensional data algorithm

Matching pursuit (MP) is a sparse approximation algorithm which finds the "best matching" projections of multidimensional data onto the span of an over-complete dictionary D. The basic idea is to approximately represent a signal f from a Hilbert space H as a weighted sum of finitely many functions (atoms) taken from D. An approximation with N atoms has the form f(t) ≈ Σ_{n=1}^{N} a_n g_{γ_n}(t), where each g_{γ_n} is an atom of D and a_n is its scalar weight.
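The greedy selection loop at the heart of MP can be sketched in a few lines of pure Python, assuming unit-norm atoms and real-valued signals (the function name is illustrative): at each step the atom best correlated with the residual is chosen, its contribution is added to the approximation, and the residual is reduced accordingly.

```python
def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy MP: repeatedly subtract the best-correlated unit-norm atom."""
    residual = list(signal)
    approx = [0.0] * len(signal)
    for _ in range(n_atoms):
        # Inner product of the residual with every atom in the dictionary.
        scores = [sum(r * a for r, a in zip(residual, atom))
                  for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        c = scores[best]
        for t, a in enumerate(dictionary[best]):
            approx[t] += c * a
            residual[t] -= c * a
    return approx

# With the standard basis as dictionary, three atoms recover the signal exactly.
print(matching_pursuit([3.0, 1.0, 2.0],
                       [[1, 0, 0], [0, 1, 0], [0, 0, 1]], 3))
```

Stopping after fewer atoms than needed for exact recovery gives precisely the kind of bounded approximation this article discusses: a sparser, cheaper representation at the cost of residual error.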

In data analysis, anomaly detection is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behavior. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.

Thread Level Speculation (TLS), also known as Speculative Multi-threading, or Speculative Parallelization, is a technique to speculatively execute a section of computer code that is anticipated to be executed later in parallel with the normal execution on a separate independent thread. Such a speculative thread may need to make assumptions about the values of input variables. If these prove to be invalid, then the portions of the speculative thread that rely on these input variables will need to be discarded and squashed. If the assumptions are correct the program can complete in a shorter time provided the thread was able to be scheduled efficiently.

Spiking neural network: Artificial neural network that mimics neurons

Spiking neural networks (SNNs) are artificial neural networks (ANN) that more closely mimic natural neural networks. These models leverage timing of discrete spikes as the main information carrier.

Reverse computation is a software application of the concept of reversible computing.

Stochastic computing is a collection of techniques that represent continuous values by streams of random bits; complex computations can then be performed by simple bit-wise operations on the streams. Stochastic computing is distinct from the study of randomized algorithms.
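The classic example of such a bit-wise operation is multiplication: if two independent streams encode probabilities p and q, AND-ing them bit by bit yields a stream encoding p·q. A minimal sketch (function names are illustrative):

```python
import random

def to_stream(p, n, rng):
    """Encode p in [0, 1] as an n-bit stream: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def stream_value(bits):
    """Decode a stream back to a value: the fraction of 1s."""
    return sum(bits) / len(bits)

rng = random.Random(42)
n = 100_000
a = to_stream(0.5, n, rng)
b = to_stream(0.4, n, rng)
product = [x & y for x, y in zip(a, b)]  # multiplication is a bit-wise AND
print(stream_value(product))  # close to 0.5 * 0.4 = 0.2
```

The accuracy depends only on the stream length, so precision can be traded for time and energy simply by shortening the streams, which is what makes the approach attractive for approximate hardware.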

SpiNNaker

SpiNNaker is a massively parallel, manycore supercomputer architecture designed by the Advanced Processor Technologies Research Group (APT) at the Department of Computer Science, University of Manchester. It is composed of 57,600 processing nodes, each with 18 ARM9 processors and 128 MB of mobile DDR SDRAM, totalling 1,036,800 cores and over 7 TB of RAM. The computing platform is based on spiking neural networks, useful in simulating the human brain.

An AI accelerator, deep learning processor or neural processing unit (NPU) is a class of specialized hardware accelerator or computer system designed to accelerate artificial intelligence and machine learning applications, including artificial neural networks and computer vision. Typical applications include algorithms for robotics, Internet of Things, and other data-intensive or sensor-driven tasks. They are often manycore designs and generally focus on low-precision arithmetic, novel dataflow architectures or in-memory computing capability. As of 2024, a typical AI integrated circuit chip contains tens of billions of MOSFETs.

David Atienza: Spanish/Swiss computer and electrical engineer

David Atienza Alonso is a Spanish/Swiss scientist in the disciplines of computer and electrical engineering. His research focuses on hardware‐software co‐design and management for energy‐efficient and thermal-aware computing systems, always starting from a system‐level perspective to the actual electronic design. He is a full professor of electrical and computer engineering at the Swiss Federal Institute of Technology in Lausanne (EPFL), Associate Vice President of Research Centers and Platforms, and the head of the Embedded Systems Laboratory (ESL). He is an IEEE Fellow (2016), and an ACM Fellow (2022).

Soft computing is an umbrella term for algorithms that produce approximate solutions to problems for which exact solutions are impractical. Traditional hard-computing algorithms, by contrast, rely heavily on concrete data and mathematical models to produce solutions. The term was coined in the late 20th century, a period in which revolutionary research in three fields greatly shaped the area. Fuzzy logic is a computational paradigm that handles uncertainty in data by using degrees of truth rather than the rigid 0s and 1s of binary logic. Neural networks are computational models inspired by the functioning of the human brain. Finally, evolutionary computation describes groups of algorithms that mimic natural processes such as evolution and natural selection.

Random flip-flop (RFF) is a theoretical concept of a sequential logic circuit capable of generating true randomness. It operates as an ordinary edge-triggered clocked flip-flop, except that its clock input acts randomly, with probability p = 1/2. Unlike Boolean circuits, which behave deterministically, a random flip-flop behaves non-deterministically while remaining electrically compatible with Boolean logic circuits. Together with them, RFFs make up a full set of logic circuits capable of performing arbitrary algorithms, namely realizing a probabilistic Turing machine.

References

  1. J. Han and M. Orshansky, "Approximate computing: An emerging paradigm for energy-efficient design", in the 18th IEEE European Test Symposium, pp. 1-6, 2013.
  2. A. Sampson, et al. "EnerJ: Approximate data types for safe and general low-power computation", In ACM SIGPLAN Notices, vol. 46, no. 6, 2011.
  3. Jiang et al., "Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications", the Proceedings of the IEEE, Vol. 108, No. 12, pp. 2108 - 2135, 2020.
  4. J. Echavarria, et al. "FAU: Fast and Error-Optimized Approximate Adder Units on LUT-Based FPGAs", FPT, 2016.
  5. J. Miao, et al. "Modeling and synthesis of quality-energy optimal approximate adders", ICCAD, 2012
  6. Rehman, Semeen; El-Harouni, Walaa; Shafique, Muhammad; Kumar, Akash; Henkel, Jörg (2016-11-07). Architectural-space exploration of approximate multipliers. ACM. p. 80. doi:10.1145/2966986.2967005. ISBN   9781450344661. S2CID   5326133.
  7. S. Venkataramani, et al. "SALSA: systematic logic synthesis of approximate circuits", DAC, 2012.
  8. J. Miao, et al. "Approximate logic synthesis under general error magnitude and frequency constraints", ICCAD, 2013
  9. R. Hegde et al. "Energy-efficient signal processing via algorithmic noise-tolerance", ISLPED, 1999.
  10. Camus, Vincent; Mei, Linyan; Enz, Christian; Verhelst, Marian (December 2019). "Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing". IEEE Journal on Emerging and Selected Topics in Circuits and Systems. 9 (4): 697–711. Bibcode:2019IJEST...9..697C. doi: 10.1109/JETCAS.2019.2950386 . ISSN   2156-3357. The implementation chosen in this study assumes a rightshifting sequential multiplier as it requires a smaller firststage adder than a left-shifting design, preventing long carry propagation and sign-bit extension.
  11. Nagornov, Nikolay N.; Lyakhov, Pavel A.; Bergerman, Maxim V.; Kalita, Diana I. (2024). "Modern Trends in Improving the Technical Characteristics of Devices and Systems for Digital Image Processing". IEEE Access. 12: 44659–44681. Bibcode:2024IEEEA..1244659N. doi: 10.1109/ACCESS.2024.3381493 . ISSN   2169-3536. Addition and accumulation of high order bits are not performed until the partial product reduction for the next multiplication in the proposed architecture.
  12. Raha, A.; Sutar, S.; Jayakumar, H.; Raghunathan, V. (July 2017). "Quality Configurable Approximate DRAM". IEEE Transactions on Computers. 66 (7): 1172–1187. doi: 10.1109/TC.2016.2640296 . ISSN   0018-9340.
  13. Kim, Yongjune; Choi, Won Ho; Guyot, Cyril; Cassuto, Yuval (December 2019). "On the Optimal Refresh Power Allocation for Energy-Efficient Memories". 2019 IEEE Global Communications Conference (GLOBECOM). Waikoloa, HI, USA: IEEE. pp. 1–6. arXiv: 1907.01112 . doi:10.1109/GLOBECOM38437.2019.9013465. ISBN   978-1-7281-0962-6. S2CID   195776538.
  14. Frustaci, Fabio; Blaauw, David; Sylvester, Dennis; Alioto, Massimo (June 2016). "Approximate SRAMs With Dynamic Energy-Quality Management". IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 24 (6): 2128–2141. doi:10.1109/TVLSI.2015.2503733. ISSN   1063-8210. S2CID   8051173.
  15. Kim, Yongjune; Kang, Mingu; Varshney, Lav R.; Shanbhag, Naresh R. (2018). "Generalized Water-filling for Source-aware Energy-efficient SRAMs". IEEE Transactions on Communications. 66 (10): 4826–4841. arXiv: 1710.07153 . doi:10.1109/TCOMM.2018.2841406. ISSN   0090-6778. S2CID   24512949.
  16. Kim, Yongjune; Jeon, Yoocharn; Choi, Hyeokjin; Guyot, Cyril; Cassuto, Yuval (2022). "Optimizing Write Fidelity of MRAMs by Alternating Water-filling Algorithm". IEEE Transactions on Communications. 70 (9): 5825–5836. doi:10.1109/TCOMM.2022.3190868. ISSN   0090-6778. S2CID   250565077.
  17. C.Alippi, Intelligence for Embedded Systems: a Methodological approach, Springer, 2014, pp. 283
  18. Esmaeilzadeh, Hadi; Sampson, Adrian; Ceze, Luis; Burger, Doug (2012). Neural acceleration for general-purpose approximate programs. 45th Annual IEEE/ACM International Symposium on Microarchitecture. Vancouver, BC: IEEE. pp. 449–460. doi:10.1109/MICRO.2012.48.
  19. Raha, Arnab; Raghunathan, Vijay (2017). "Towards Full-System Energy-Accuracy Tradeoffs". Proceedings of the 54th Annual Design Automation Conference 2017. DAC '17. New York, NY, USA: ACM. pp. 74:1–74:6. doi:10.1145/3061639.3062333. ISBN   9781450349277. S2CID   2503638.
  20. Ghosh, Soumendu Kumar; Raha, Arnab; Raghunathan, Vijay (2023-07-24). "Energy-Efficient Approximate Edge Inference Systems". ACM Transactions on Embedded Computing Systems. 22 (4): 77:1–77:50. doi:10.1145/3589766. ISSN   1539-9087.
  21. Liu, Weiqiang; Lombardi, Fabrizio; Schulte, Michael (Dec 2020). "Approximate Computing: From Circuits to Applications". Proceedings of the IEEE. 108 (12): 2103. doi: 10.1109/JPROC.2020.3033361 .
  22. Liu, Weiqiang; Lombardi, Fabrizio; Schulte, Michael (Dec 2020). "Approximate Computing: From Circuits to Applications". Proceedings of the IEEE. 108 (12): 2104. doi: 10.1109/JPROC.2020.3033361 .
  23. Nguyen, Donald; Lenharth, Andrew; Pingali, Keshav (2013). "A lightweight infrastructure for graph analytics". Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM. pp. 456–471. doi: 10.1145/2517349.2522739 . ISBN   9781450323888.