Master-checker

A master-checker is a hardware-supported fault tolerance method for multiprocessor systems in which two processors, referred to as the master and the checker, calculate the same functions in parallel in order to increase the probability that the result is correct. The checker CPU is synchronised at clock level with the master CPU and processes the same programs as the master. Whenever the master CPU generates an output, the checker CPU compares this output to its own calculation and, in the event of a difference, raises a warning.
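
The behaviour can be sketched in software. The following Python fragment is a minimal illustration only, not real lockstep hardware: a single function stands in for both processors, and a raised exception stands in for the hardware warning signal.

    # Minimal master-checker sketch: both "processors" run the same
    # function on the same inputs; the checker compares its result
    # against the master's output and flags any mismatch.
    def run_master_checker(fn, *args):
        master_result = fn(*args)   # master computes the output
        checker_result = fn(*args)  # checker repeats the computation
        if checker_result != master_result:
            raise RuntimeError("master/checker mismatch: "
                               f"{master_result!r} != {checker_result!r}")
        return master_result

    # Example: both units compute the same sum; a mismatch would raise.
    print(run_master_checker(sum, [1, 2, 3, 4]))  # -> 10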

The master-checker system generally gives more reliable answers, since each result is verified before being passed on to the application that requested the computation, and it allows for error handling when the results are inconsistent. Recurring discrepancies between the two processors can indicate a flaw in the software, hardware problems, or timing issues between the clock, the CPUs, and/or system memory. However, such redundant processing costs time and energy: if the master CPU is correct 95% or more of the time, the power and time used by the checker CPU to verify answers are largely wasted. Depending on the value of a guaranteed-correct answer, a checker CPU may or may not be warranted. To recover some of this cost, the checker CPU can instead be used to calculate another part of the same algorithm, increasing the speed and throughput of the system.
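
The alternative use mentioned above, where the second unit computes a different part of the algorithm instead of verifying, can be sketched as follows. The function names and the division of the workload are invented for the example; in CPython, true parallelism would require separate processes rather than threads, but the structure is the same.

    # The second unit does different work instead of re-checking:
    # each worker sums one half of the data, and the halves are combined.
    from concurrent.futures import ThreadPoolExecutor

    data = list(range(1_000_000))

    def master_part(xs):   # first half of the workload
        return sum(xs[: len(xs) // 2])

    def checker_part(xs):  # second half, computed instead of verifying
        return sum(xs[len(xs) // 2 :])

    with ThreadPoolExecutor(max_workers=2) as pool:
        a = pool.submit(master_part, data)
        b = pool.submit(checker_part, data)
        total = a.result() + b.result()  # combined, unverified result
    print(total)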

Related Research Articles

Central processing unit: central computer component which executes instructions

A central processing unit (CPU), also called a central processor, main processor or just processor, is the electronic circuitry that executes instructions comprising a computer program. The CPU performs basic arithmetic, logic, controlling, and input/output (I/O) operations specified by the instructions in the program. This contrasts with external components such as main memory and I/O circuitry, and specialized processors such as graphics processing units (GPUs).

The control unit (CU) is a component of a computer's central processing unit (CPU) that directs the operation of the processor. A CU typically uses a binary decoder to convert coded instructions into timing and control signals that direct the operation of the other units.
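
As a rough, software-only illustration of decoding coded instructions into control signals, the sketch below maps invented two-bit opcodes to invented signal names; a real control unit emits timing and control signals to hardware units, not dictionaries.

    # Toy decoder: a few opcode bit patterns are translated into
    # control signals for the other units. Opcodes and signal names
    # are invented for this sketch.
    CONTROL_SIGNALS = {
        0b00: {"alu_op": "add", "reg_write": True,  "mem_read": False},
        0b01: {"alu_op": "sub", "reg_write": True,  "mem_read": False},
        0b10: {"alu_op": None,  "reg_write": True,  "mem_read": True},   # load
        0b11: {"alu_op": None,  "reg_write": False, "mem_read": False},  # halt
    }

    def decode(opcode):
        return CONTROL_SIGNALS[opcode & 0b11]

    print(decode(0b01))  # -> signals that drive a subtraction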

Cache (computing): data storage that enables faster access

In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.
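
The hit/miss behaviour is easy to demonstrate in software. The sketch below puts a small cache in front of a stand-in for an expensive computation and counts hits and misses.

    # Toy cache in front of an expensive function, counting hits and misses.
    cache, hits, misses = {}, 0, 0

    def expensive(n):          # stands in for a slow computation or fetch
        return sum(range(n))

    def cached_expensive(n):
        global hits, misses
        if n in cache:          # cache hit: serve the stored copy
            hits += 1
            return cache[n]
        misses += 1             # cache miss: compute, then store
        cache[n] = expensive(n)
        return cache[n]

    for n in (10, 10, 20, 10):
        cached_expensive(n)
    print(hits, misses)  # -> 2 2 (two hits, two misses)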

Real-time computing (RTC) is the computer science term for hardware and software systems subject to a "real-time constraint", for example a bound on the time from an event to the system's response. Real-time programs must guarantee a response within specified time constraints, often referred to as "deadlines".

A real-time operating system (RTOS) is an operating system (OS) intended to serve real-time applications that process data and events with critically defined time constraints, so that the system under control performs as required. An RTOS is distinct from a time-sharing OS, which manages the sharing of system resources through a scheduler, data buffers, or fixed task prioritisation in a multitasking or multiprogramming environment. Processing time requirements need to be fully understood and bounded rather than just "kept to a minimum": all processing must be done within the defined constraints or the system will fail. An RTOS is event-driven and preemptive, meaning the OS can monitor the relative priorities of competing tasks dynamically and change task or ISR priorities. Event-driven systems switch between tasks based on their priorities, while time-sharing systems switch tasks based on clock interrupts. Most RTOSs use a pre-emptive scheduling algorithm.
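
A highly simplified sketch of preemptive, priority-based dispatch follows. The task names, priorities, and arrival times are invented, and a real RTOS scheduler must additionally handle blocking, interrupts, and issues such as priority inversion.

    # At each tick the highest-priority ready task runs; a newly arrived
    # higher-priority task preempts the current one (lower number = more urgent).
    import heapq

    ready = []  # min-heap of (priority, name, remaining_ticks)
    arrivals = {3: (0, "sensor_isr", 2)}  # urgent task arriving at tick 3

    heapq.heappush(ready, (2, "control_loop", 5))
    heapq.heappush(ready, (5, "logger", 3))

    for tick in range(12):
        if tick in arrivals:
            heapq.heappush(ready, arrivals[tick])  # may preempt current task
        if ready:
            prio, name, left = heapq.heappop(ready)
            print(f"tick {tick}: running {name} (priority {prio})")
            if left > 1:
                heapq.heappush(ready, (prio, name, left - 1))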

Parallel computing: programming paradigm in which many processes are executed simultaneously

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.
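
As an illustration of dividing a large problem into smaller ones solved simultaneously, the following sketch splits a summation across worker processes and combines the partial results.

    # Each worker process sums one chunk; the partial sums are combined.
    from multiprocessing import Pool

    def chunk_sum(bounds):
        lo, hi = bounds
        return sum(range(lo, hi))

    if __name__ == "__main__":
        n, workers = 10_000_000, 4
        step = n // workers
        chunks = [(i * step, (i + 1) * step) for i in range(workers)]
        with Pool(workers) as pool:
            print(sum(pool.map(chunk_sum, chunks)))  # same as sum(range(n))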

Overclocking: practice of increasing the clock rate of a computer to exceed that certified by the manufacturer

In computing, overclocking is the practice of increasing the clock rate of a computer to exceed that certified by the manufacturer. Commonly, operating voltage is also increased to maintain a component's operational stability at accelerated speeds. Semiconductor devices operated at higher frequencies and voltages draw more power and produce more heat. An overclocked device may be unreliable or fail completely if the additional heat load is not removed or power delivery components cannot meet increased power demands. Many device warranties state that overclocking or over-specification voids the warranty; however, a growing number of manufacturers allow overclocking as long as it is performed (relatively) safely.

Underclocking, also known as downclocking, is modifying a computer or electronic circuit's timing settings to run at a lower clock rate than specified. Underclocking is used to reduce a computer's power consumption, increase battery life, and reduce heat emission; it may also increase a system's stability, lifespan, reliability, and compatibility. Underclocking may be implemented by the factory, but many computers and components may also be underclocked by the end user.

In computer science, program optimization, code optimization, or software optimization is the process of modifying a software system to make some aspect of it work more efficiently or use fewer resources. In general, a computer program may be optimized so that it executes more rapidly, operates with less memory or other resources, or draws less power.
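
A small, self-contained example of optimizing for speed: the same function written naively, then rewritten to avoid redundant work through memoization. The timing numbers will vary by machine.

    import functools, time

    def fib_naive(n):                     # exponential: recomputes subproblems
        return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

    @functools.lru_cache(maxsize=None)    # memoized: each subproblem once
    def fib_fast(n):
        return n if n < 2 else fib_fast(n - 1) + fib_fast(n - 2)

    for fn in (fib_naive, fib_fast):
        t0 = time.perf_counter()
        fn(30)
        print(fn.__name__, f"{time.perf_counter() - t0:.4f}s")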

In the domain of central processing unit (CPU) design, hazards are problems with the instruction pipeline in CPU microarchitectures when the next instruction cannot execute in the following clock cycle, and can potentially lead to incorrect computation results. Three common types of hazards are data hazards, structural hazards, and control hazards.
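
A read-after-write (RAW) data hazard, for example, arises when an instruction reads a register that the immediately preceding instruction writes, before that result is available. The toy program below, using an invented three-register instruction encoding, shows how such a conflict can be detected.

    # Each instruction is (operation, destination, source registers).
    program = [
        ("add", "r1", ("r2", "r3")),  # r1 <- r2 + r3
        ("sub", "r4", ("r1", "r5")),  # reads r1: RAW hazard with the add
        ("mul", "r6", ("r2", "r7")),  # independent: no hazard
    ]

    for prev, cur in zip(program, program[1:]):
        _, prev_dest, _ = prev
        op, _, srcs = cur
        if prev_dest in srcs:
            print(f"RAW hazard: '{op}' reads {prev_dest} right after it is written")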

In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage is often inserted between elements.
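
In software, the same idea can be expressed with generators, where each stage consumes the previous stage's output so that elements flow through in series:

    def source():                # stage 1: produce
        yield from range(10)

    def square(items):           # stage 2: transform
        for x in items:
            yield x * x

    def keep_even(items):        # stage 3: filter
        for x in items:
            if x % 2 == 0:
                yield x

    pipeline = keep_even(square(source()))
    print(list(pipeline))  # -> [0, 4, 16, 36, 64]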

Physics engine: software for approximate simulation of physical systems

A physics engine is computer software that provides an approximate simulation of certain physical systems, such as rigid body dynamics, soft body dynamics, and fluid dynamics, for use in the domains of computer graphics, video games and film (CGI). Physics engines are mainly used in video games, in which case the simulations run in real time. The term is sometimes used more generally to describe any software system for simulating physical phenomena, such as high-performance scientific simulation.
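
The core loop of such an engine advances each body's state by integrating forces over a fixed time step. The sketch below simulates a single falling body with semi-implicit Euler integration, a common choice in game engines; the constants are illustrative.

    GRAVITY = -9.81   # m/s^2
    dt = 1.0 / 60.0   # 60 simulation steps per second

    pos, vel = 100.0, 0.0    # a body dropped from 100 m
    for step in range(120):  # simulate two seconds
        vel += GRAVITY * dt  # update velocity from acceleration
        pos += vel * dt      # then position from the new velocity
    print(f"after 2 s: position {pos:.2f} m, velocity {vel:.2f} m/s")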

Saturation arithmetic is a version of arithmetic in which all operations such as addition and multiplication are limited to a fixed range between a minimum and maximum value.
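
For example, saturating addition of signed 8-bit values clamps results to the representable range [-128, 127] instead of wrapping around:

    INT8_MIN, INT8_MAX = -128, 127

    def sat_add8(a, b):
        # Clamp the true sum into the fixed range.
        return max(INT8_MIN, min(INT8_MAX, a + b))

    print(sat_add8(100, 50))    # -> 127 (clamped, not -106 as with wraparound)
    print(sat_add8(-100, -50))  # -> -128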

Hardware acceleration is the use of computer hardware designed to perform specific functions more efficiently when compared to software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated in software running on a generic CPU can also be calculated in custom-made hardware, or in some mix of both.

In integrated circuit design, dynamic logic is a design methodology for combinational logic circuits, particularly those implemented in MOS technology. It is distinguished from so-called static logic by exploiting temporary storage of information in stray and gate capacitances. It was popular in the 1970s and has seen a recent resurgence in the design of high-speed digital electronics, particularly computer CPUs. Dynamic logic circuits are usually faster than their static counterparts and require less surface area, but are more difficult to design. Dynamic logic has a higher toggle rate than static logic, but the capacitive loads being toggled are smaller, so the overall power consumption of dynamic logic may be higher or lower depending on various tradeoffs. When referring to a particular logic family, the adjective dynamic usually suffices to distinguish the design methodology, e.g. dynamic CMOS or dynamic SOI design.

CPU time: time used by a computer

CPU time is the amount of time for which a central processing unit (CPU) was used for processing instructions of a computer program or operating system, as opposed to elapsed time, which includes, for example, waiting for input/output (I/O) operations or entering low-power (idle) mode. CPU time is measured in clock ticks or seconds. It is often useful to measure CPU time as a percentage of the CPU's capacity, which is called the CPU usage.
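
The distinction between CPU time and elapsed time can be demonstrated directly: in the sketch below, sleeping consumes elapsed (wall-clock) time but almost no CPU time.

    import time

    cpu0, wall0 = time.process_time(), time.perf_counter()
    sum(range(2_000_000))   # actual computation: uses CPU time
    time.sleep(0.5)         # waiting: elapsed time only
    cpu, wall = time.process_time() - cpu0, time.perf_counter() - wall0
    print(f"CPU time: {cpu:.3f}s, elapsed time: {wall:.3f}s")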

In computing, computer performance is the amount of useful work accomplished by a computer system. Outside of specific contexts, computer performance is estimated in terms of accuracy, efficiency and speed of executing computer program instructions. High computer performance may involve one or more factors such as short response time for a given piece of work, high throughput of processing work, low utilization of computing resources, and high availability of the computing system or application.

Computer hardware: physical components of a computer

Computer hardware includes the physical parts of a computer, such as the case, central processing unit (CPU), monitor, mouse, keyboard, computer data storage, graphics card, sound card, speakers and motherboard.

Arithmetic logic unit: combinational digital circuit that performs arithmetic and bitwise operations on integer binary numbers

In computing, an arithmetic logic unit (ALU) is a combinational digital circuit that performs arithmetic and bitwise operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on floating point numbers. It is a fundamental building block of many types of computing circuits, including the central processing unit (CPU) of computers, FPUs, and graphics processing units (GPUs).
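
A toy model of an ALU selects an arithmetic or bitwise operation on two integers according to a function-select opcode, much as the hardware's control inputs do; the opcodes below are invented for the sketch.

    OPS = {
        0b000: lambda a, b: a + b,   # add
        0b001: lambda a, b: a - b,   # subtract
        0b010: lambda a, b: a & b,   # bitwise AND
        0b011: lambda a, b: a | b,   # bitwise OR
        0b100: lambda a, b: a ^ b,   # bitwise XOR
    }

    def alu(opcode, a, b):
        return OPS[opcode](a, b)

    print(alu(0b000, 6, 3))  # -> 9
    print(alu(0b010, 6, 3))  # -> 2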

Zen+ is the codename for a computer processor microarchitecture by AMD. The successor to the first-generation Zen microarchitecture, it was first released in April 2018, powering the second generation of Ryzen processors: Ryzen 2000 for mainstream desktop systems, Threadripper 2000 for high-end desktop setups, and Ryzen 3000G for accelerated processing units (APUs).
