Watchdog timer

Last updated
A watchdog timer integrated circuit (Texas Instruments TPS3823). One pin receives the timer restart ("kick" ) signal from the computer; another pin outputs the timeout signal. Watchdog timer IC.jpg
A watchdog timer integrated circuit (Texas Instruments TPS3823). One pin receives the timer restart ("kick" ) signal from the computer; another pin outputs the timeout signal.

A watchdog timer (WDT, or simply a watchdog), sometimes called a computer operating properly timer (COP timer), [1] is an electronic or software timer that is used to detect and recover from computer malfunctions. Watchdog timers are widely used in computers to facilitate automatic correction of temporary hardware faults, and to prevent errant or malevolent software from disrupting system operation.

Contents

During normal operation, the computer regularly restarts the watchdog timer to prevent it from elapsing, or "timing out". If, due to a hardware fault or program error, the computer fails to restart the watchdog, the timer will elapse and generate a timeout signal. The timeout signal is used to initiate corrective actions. The corrective actions typically include placing the computer and associated hardware in a safe state and invoking a computer reboot.

Microcontrollers often include an integrated, on-chip watchdog. In other computers the watchdog may reside in a nearby chip that connects directly to the CPU, or it may be located on an external expansion card in the computer's chassis.

Applications

Watchdog timers are essential in remote, automated systems such as this Mars Exploration Rover NASA Mars Rover.jpg
Watchdog timers are essential in remote, automated systems such as this Mars Exploration Rover

Watchdog timers are commonly found in embedded systems and other computer-controlled equipment where humans cannot easily access the equipment or would be unable to react to faults in a timely manner. In such systems, the computer cannot depend on a human to invoke a reboot if it hangs; it must be self-reliant. For example, remote embedded systems such as space probes are not physically accessible to human operators; these could become permanently disabled if they were unable to autonomously recover from faults. In robots and other automated machines, a fault in the control computer could cause equipment damage or injuries before a human could react, even if the computer is easily accessed. A watchdog timer is usually employed in cases like these.

Watchdog timers are also used to monitor and limit software execution time on a normally functioning computer. For example, a watchdog timer may be used when running untrusted code in a sandbox, to limit the CPU time available to the code and thus prevent some types of denial-of-service attacks. [2] In real-time operating systems, a watchdog timer may be used to monitor a time-critical task to ensure it completes within its maximum allotted time and, if it fails to do so, to terminate the task and report the failure.

Architecture and operation

Restarting

Some watchdog timers only allow kicks during a time window. Kicks occurring outside the window have no effect on the timer and may be treated as faults. WatchdogWindow.png
Some watchdog timers only allow kicks during a time window. Kicks occurring outside the window have no effect on the timer and may be treated as faults.

The act of restarting a watchdog timer is commonly referred to as kicking [lower-alpha 1] the watchdog. [3] [4] Kicking is typically done by writing to a watchdog control port or by setting a particular bit in a register. Alternatively, some tightly coupled [lower-alpha 2] watchdog timers are kicked by executing a special machine language instruction. An example of this is the CLRWDT (clear watchdog timer) instruction found in the instruction set of some PIC microcontrollers.

In computers that are running operating systems, watchdog restarts are usually invoked through a device driver. For example, in the Linux operating system, a user space program will kick the watchdog by interacting with the watchdog device driver, typically by writing a zero character to /dev/watchdog or by calling a KEEPALIVE ioctl. [5] The device driver, which serves to abstract the watchdog hardware from user space programs, may also be used to configure the time-out period and start and stop the timer.

Some watchdog timers will only allow kicks during a specific time window. The window timing is usually relative to the previous kick or, if the watchdog has not yet been kicked, to the moment the watchdog was enabled. The window begins after a delay following the previous kick, and ends after a further delay. If the computer attempts to kick the watchdog before or after the window, the watchdog will not be restarted, and in some implementations this will be treated as a fault and trigger corrective action.

Enabling

Screenshot of wdctl, a program that shows watchdog status Wdctl screenshot.png
Screenshot of wdctl , a program that shows watchdog status

A watchdog timer is said to be enabled when operating and disabled when idle. Upon power-up, a watchdog may be unconditionally enabled or it may be initially disabled and require an external signal to enable it. In the latter case, the enabling signal may be automatically generated by hardware or it may be generated under software control.

When automatically generated, the enabling signal is typically derived from the computer reset signal. In some systems the reset signal is directly used to enable the watchdog. In others, the reset signal is delayed so that the watchdog will become enabled at some later time following the reset. This delay allows time for the computer to boot before the watchdog is enabled. Without this delay, the watchdog would timeout and invoke a subsequent reset before the computer can run its application software — the software which kicks the watchdog — and the system would become stuck in an endless cycle of incomplete reboots.

Single-stage watchdog

Watchdog timers come in many configurations, and many allow their configurations to be altered. For example, the watchdog and CPU may share a common clock signal as shown in the block diagram below, or they may have independent clock signals. A basic watchdog timer has a single timer stage which, upon timeout, typically will reset the CPU:

SimpleWatchdogTimer.gif

Multistage watchdog

Two or more timers are sometimes cascaded to form a multistage watchdog timer, where each timer is referred to as a timer stage, or simply a stage. For example, the block diagram below shows a three-stage watchdog. In a multistage watchdog, only the first stage is kicked by the processor. Upon first stage timeout, a corrective action is initiated and the next stage in the cascade is started. As each subsequent stage times out, it triggers a corrective action and starts the next stage. Upon final stage timeout, a corrective action is initiated, but no other stage is started because the end of the cascade has been reached. Typically, single-stage watchdog timers are used to simply restart the computer, whereas multistage watchdog timers will sequentially trigger a series of corrective actions, with the final stage triggering a computer restart. [4]

Watchdog3stage.gif

Time intervals

Watchdog timers may have either fixed or programmable time intervals. Some watchdog timers allow the time interval to be programmed by selecting from among a few selectable, discrete values. In others, the interval can be programmed to arbitrary values. Typically, watchdog time intervals range from ten milliseconds to a minute or more. In a multistage watchdog, each timer may have its own, unique time interval.

Corrective actions

A watchdog timer may initiate any of several types of corrective action, including maskable interrupt, non-maskable interrupt, hardware reset, fail-safe state activation, power cycling, or combinations of these. Depending on its architecture, the type of corrective action or actions that a watchdog can trigger may be fixed or programmable. Some computers (e.g., PC compatibles) require a pulsed signal to invoke a hardware reset. In such cases, the watchdog typically triggers a hardware reset by activating an internal or external pulse generator, which in turn creates the required reset pulses. [4]

In embedded systems and control systems, watchdog timers are often used to activate fail-safe circuitry. When activated, the fail-safe circuitry forces all control outputs to safe states (e.g., turns off motors, heaters, and high-voltages) to prevent injuries and equipment damage while the fault persists. In a two-stage watchdog, the first timer is often used to activate fail-safe outputs and start the second timer stage; the second stage will reset the computer if the fault cannot be corrected before the timer elapses.

Watchdog timers are sometimes used to trigger the recording of system state information—which may be useful during fault recovery [4] —or debug information (which may be useful for determining the cause of the fault) onto a persistent medium. In such cases, a second timer—which is started when the first timer elapses—is typically used to reset the computer later, after allowing sufficient time for data recording to complete. This allows time for the information to be saved, but ensures that the computer will be reset even if the recording process fails.

WatchdogNmiReset.gif

For example, the above diagram shows a likely configuration for a two-stage watchdog timer. During normal operation the computer regularly kicks Stage1 to prevent a timeout. If the computer fails to kick Stage1 (e.g., due to a hardware fault or programming error), Stage1 will eventually timeout. This event will start the Stage2 timer and, simultaneously, notify the computer (by means of a non-maskable interrupt) that a reset is imminent. Until Stage2 times out, the computer may attempt to record state information, debug information, or both. As a last resort, the computer will be reset upon Stage2 timeout.

Fault detection

A watchdog timer provides automatic detection of catastrophic malfunctions that prevent the computer from kicking it. However, computers often have other, less-severe types of faults which do not interfere with kicking, but which still require watchdog oversight. To support these, a computer system is typically designed so that its watchdog timer will be kicked only if the computer deems the system functional. The computer determines whether the system is functional by conducting one or more fault detection tests and will kick the watchdog only if all tests have passed.[ citation needed ]

In computers that are running an operating system and multiple processes, a single, simple test might be insufficient to guarantee normal operation, as it could fail to detect a subtle fault condition and therefore allow the watchdog to be kicked even though a fault condition exists. For example, in the case of the Linux operating system, a user-space watchdog daemon may simply kick the watchdog periodically without performing any tests. As long as the daemon runs normally, the system will be protected against serious system crashes such as a kernel panic. To detect less severe faults, the daemon [6] can be configured to perform tests that cover resource availability (e.g., sufficient memory and file handles, reasonable CPU time), evidence of expected process activity (e.g., system daemons running, specific files being present or updated), overheating, and network activity, and system-specific test scripts or programs can also be run. [7]

Upon discovery of a failed test, the computer may attempt to perform a sequence of corrective actions under software control, culminating with a software-initiated reboot. If the software fails to invoke a reboot, the watchdog timer will timeout and invoke a hardware reset. In effect, this is a multistage watchdog timer in which the software comprises the first and intermediate timer stages and the hardware reset the final stage. In a Linux system, for example, the watchdog daemon could attempt to perform a software-initiated restart, which can be preferable to a hardware reset as the file systems will be safely unmounted and fault information will be logged. It is essential, however, to have the insurance provided by a hardware timer, since a software restart can fail under a number of fault conditions.[ citation needed ]

See also

Notes

  1. 1 2 Various terms are used for the act of restarting a watchdog timer. Some (e.g. kick, pet, feed, tickle) draw a connection to guard dogs, whereas others (e.g. tag, ping, reset) do not. This article uses kick for consistency.
  2. A tightly coupled watchdog timer is effectively a built-in extension of the processor and, as such, may be accessed by special machine language instructions which are specific to it.

Related Research Articles

<span class="mw-page-title-main">Interrupt</span> Signal to a computer processor emitted by hardware or software

In digital computers, an interrupt is a request for the processor to interrupt currently executing code, so that the event can be processed in a timely manner. If the request is accepted, the processor will suspend its current activities, save its state, and execute a function called an interrupt handler to deal with the event. This interruption is often temporary, allowing the software to resume normal activities after the interrupt handler finishes, although the interrupt could instead indicate a fatal error.

<span class="mw-page-title-main">Embedded system</span> Computer system with a dedicated function

An embedded system is a computer system—a combination of a computer processor, computer memory, and input/output peripheral devices—that has a dedicated function within a larger mechanical or electronic system. It is embedded as part of a complete device often including electrical or electronic hardware and mechanical parts. Because an embedded system typically controls physical operations of the machine that it is embedded within, it often has real-time computing constraints. Embedded systems control many devices in common use. In 2009, it was estimated that ninety-eight percent of all microprocessors manufactured were used in embedded systems.

<span class="mw-page-title-main">Crash (computing)</span> When a computer program stops functioning properly and self-terminates

In computing, a crash, or system crash, occurs when a computer program such as a software application or an operating system stops functioning properly and exits. On some operating systems or individual applications, a crash reporting service will report the crash and any details relating to it, usually to the developer(s) of the application. If the program is a critical part of the operating system, the entire system may crash or hang, often resulting in a kernel panic or fatal system error.

In-circuit emulation (ICE) is the use of a hardware device or in-circuit emulator used to debug the software of an embedded system. It operates by using a processor with the additional ability to support debugging operations, as well as to carry out the main function of the system. Particularly for older systems, with limited processors, this usually involved replacing the processor temporarily with a hardware emulator: a more powerful although more expensive version. It was historically in the form of bond-out processor which has many internal signals brought out for the purpose of debugging. These signals provide information about the state of the processor.

<span class="mw-page-title-main">Dead man's switch</span> Equipment that activates or deactivates upon the incapacitation of operator

A dead man's switch is a switch that is designed to be activated or deactivated if the human operator becomes incapacitated, such as through death, loss of consciousness, or being bodily removed from control. Originally applied to switches on a vehicle or machine, it has since come to be used to describe other intangible uses, as in computer software.

On the x86 computer architecture, a triple fault is a special kind of exception generated by the CPU when an exception occurs while the CPU is trying to invoke the double fault exception handler, which itself handles exceptions occurring while trying to invoke a regular exception handler.

<span class="mw-page-title-main">Timer</span> Type of small clock

A timer or countdown timer is a type of clock that starts from a specified time duration and stops when reaching zero. A simple timer is an hourglass. Commonly, a timer would raise an alarm when it ends. It can be implemented as hardware or software. Stopwatches operate in the opposite direction, upwards from zero, measuring elapsed time since a given time instant. Time switches are timers that control an electric switch.

JTAG is an industry standard for verifying designs and testing printed circuit boards after manufacture.

Signals are standardized messages sent to a running program to trigger specific behavior, such as quitting or error handling. They are a limited form of inter-process communication (IPC), typically used in Unix, Unix-like, and other POSIX-compliant operating systems.

In computing, a non-maskable interrupt (NMI) is a hardware interrupt that standard interrupt-masking techniques in the system cannot ignore. It typically occurs to signal attention for non-recoverable hardware errors. Some NMIs may be masked, but only by using proprietary methods specific to the particular NMI. With regard to SPARC, the non-maskable interrupt (NMI), despite having the highest priority among interrupts, can be prevented from occurring through the use of an interrupt mask.

A built-in self-test (BIST) or built-in test (BIT) is a mechanism that permits a machine to test itself. Engineers design BISTs to meet requirements such as:

Cooperative multitasking, also known as non-preemptive multitasking, is a style of computer multitasking in which the operating system never initiates a context switch from a running process to another process. Instead, in order to run multiple applications concurrently, processes voluntarily yield control periodically or when idle or logically blocked. This type of multitasking is called cooperative because all programs must cooperate for the scheduling scheme to work.

In programming and software design, an event is an action or occurrence recognized by software, often originating asynchronously from the external environment, that may be handled by the software. Computer events can be generated or triggered by the system, by the user, or in other ways. Typically, events are handled synchronously with the program flow; that is, the software may have one or more dedicated places where events are handled, frequently an event loop.

The Linux booting process involves multiple stages and is in many ways similar to the BSD and other Unix-style boot processes, from which it derives. Although the Linux booting process depend very much on the computer architecture, those architectures share similar stages and software components, including system startup, bootloader execution, loading and startup of a Linux kernel image, and execution of various startup scripts and daemons. Those are grouped into 4 steps: system startup, bootloader stage, kernel stage, and init process. When a Linux system is powered up or reset, its processor will execute a specific firmware/program for system initialization, such as Power-on self-test, invoking the reset vector to start a program at a known address in flash/ROM, then load the bootloader into RAM for later execution. In personal computer (PC), not only limited to Linux-distro PC, this firmware/program is called BIOS, which is stored in the mainboard. In embedded Linux system, this firmware/program is called boot ROM. After being loaded into RAM, bootloader will execute to load the second-stage bootloader. The second-stage bootloader will load the kernel image into memory, decompress and initialize it then pass control to this kernel image. Second-stage bootloader also performs several operation on the system such as system hardware check, mounting the root device, loading the necessary kernel modules,... Finally, the very first user-space process starts, and other high-level system initializations are performed.

The Hardware Platform Interface (HPI) is an open specification that defines an application programming interface (API) for platform management of computer systems. The API supports tasks including reading temperature or voltage sensors built into a processor, configuring hardware registers, accessing system inventory information like model numbers and serial numbers, and performing more complex activities, such as upgrading system firmware or diagnosing system failures.

Command Loss Timer Resets are part of the CCSDS communications system to spacecraft either in Earth orbit or beyond Earth orbit.

Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults. Fault-tolerant software has the ability to satisfy requirements despite failures.

In computing, rebooting is the process by which a running computer system is restarted, either intentionally or unintentionally. Reboots can be either a cold reboot in which the power to the system is physically turned off and back on again ; or a warm reboot in which the system restarts while still powered up. The term restart is used to refer to a reboot when the operating system closes all programs and finalizes all pending input and output operations before initiating a soft reboot.

In computer science, a heartbeat is a periodic signal generated by hardware or software to indicate normal operation or to synchronize other parts of a computer system. Heartbeat mechanism is one of the common techniques in mission critical systems for providing high availability and fault tolerance of network services by detecting the network or systems failures of nodes or daemons which belongs to a network cluster—administered by a master server—for the purpose of automatic adaptation and rebalancing of the system by using the remaining redundant nodes on the cluster to take over the load of failed nodes for providing constant services. Usually a heartbeat is sent between machines at a regular interval in the order of seconds; a heartbeat message. If the endpoint does not receive a heartbeat for a time—usually a few heartbeat intervals—the machine that should have sent the heartbeat is assumed to have failed. Heartbeat messages are typically sent non-stop on a periodic or recurring basis from the originator's start-up until the originator's shutdown. When the destination identifies a lack of heartbeat messages during an anticipated arrival period, the destination may determine that the originator has failed, shutdown, or is generally no longer available.

A hardware reset or hard reset of a computer system is a hardware operation that re-initializes the core hardware components of the system, thus ending all current software operations in the system. This is typically, but not always, followed by booting of the system into firmware that re-initializes the rest of the system, and restarts the operating system.

References

  1. "4.11 Dual Staged Watchdog Timer". Kontron User's Guide - COMe-cBTi6R. Document Revision 1.0. Kontron. 2021. Archived from the original on 2023-09-23. Retrieved 2023-09-23. p. 39: A watchdog timer (or computer operating properly (COP) timer) is a computer hardware or software timer that triggers a system reset or other corrective action if the main program, due to some fault condition, such as a hang, neglects to regularly service the watchdog (writing a "service pulse" to it, also referred to as "kicking the dog", "petting the dog", "feeding the watchdog" or "triggering the watchdog"). The intention is to bring the system back from the nonresponsive state into normal operation.
  2. "The Grenade Timer: Fortifying the Watchdog Timer Against Malicious Mobile Code" by Frank Stajano and Ross Anderson (2000).
  3. Murphy, Niall & Barr, Michael (October 2001). "Watchdog Timers". Embedded Systems Programming. Retrieved 18 February 2013.
  4. 1 2 3 4 Lamberson, Jim. "Single and Multistage Watchdog Timers" (PDF). Sensoray. Retrieved 10 September 2013.
  5. Weingel, Christer. "The Linux Watchdog driver API" . Retrieved 20 January 2021.
  6. "Watchdog 'man' page" . Retrieved 2013-09-10.
  7. Crawford, Paul (2013-09-05). "Linux Watchdog - General Tests". Archived from the original on 2013-09-14. Retrieved 2013-09-10.

Further reading