Machine-check exception

Last updated

A machine check exception (MCE) is a type of computer error that occurs when a problem involving the computer's hardware is detected. With most mass-market personal computers, an MCE indicates faulty or misconfigured hardware.

Contents

The nature and causes of MCEs can vary by architecture and generation of system. In some designs, an MCE is always an unrecoverable error, that halts the machine, requiring a reboot. In other architectures, some MCEs may be non-fatal, such as for single-bit errors corrected by ECC memory. On some architectures, such as PowerPC, certain software bugs can cause MCEs, such as an invalid memory access. On other architectures, such as x86, MCEs typically originate from hardware only.

Reporting

IBM mainframe operating systems

IBM System/360 Operating System (OS/360) records input/output errors in a dataset called SYS1.LOGREC. Since then IBM has coined the term error recording data set (ERDS) for successor versions that allow the installation to choose the name and for operating systems not derived from OS/360. [1]

OS/360

In OS/360, the installation can choose several levels of support for handling machine checks. The most sophisticated, Machine Check Handler (MCH), records failure data on SYS1.LOGREC and attempts recovery. The installation can print those data using the Environmental Record Editing and Printing Program (EREP) service aid or the stand-alone version SEREP. The MCH can handle memory failures in refreshable nucleus control sections by reading a fresh copy from SYS1.ASRLIB and can handle memory errors in SVC transient areas by reading a fresh copy of the SVC module from SYS1.SVCLIB.

z/OS

In z/OS the installation can either use an ERDS or can define a z/OS System Logger log stream [2] to hold the error data. As with OS/360, the installation uses EREP to print those data; SEREP is no longer available. The MCH is no longer optional, and handles many more failure modes than the OS/360 MCH.

Microsoft Windows

On Microsoft Windows platforms, in the event of an unrecoverable MCEs, the system generates a BugCheck — also called a STOP error, or a Blue Screen of Death.

More recent versions of Windows use the Windows Hardware Error Architecture (WHEA), and generate STOP code 0x124, WHEA_UNCORRECTABLE_ERROR. The four parameters (in parentheses) will vary, but the first is always 0x0 for an MCE. [3] Example:

   STOP: 0x00000124 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)

Older versions of Windows use the Machine Check Architecture, with STOP code 0x9C, MACHINE_CHECK_EXCEPTION. [4] Example:

   STOP: 0x0000009C (0x00000030, 0x00000002, 0x00000001, 0x80003CBA)

Linux

On Linux, the kernel writes messages about MCEs to the kernel message log and the system console. When the MCEs are not fatal, they will also typically be copied to the system log and/or systemd journal. For some systems, ECC and other correctable errors may be reported through MCE facilities. [5]

Example:

   CPU 0: Machine Check Exception: 0000000000000004    Bank 2: f200200000000863    Kernel panic: CPU context corrupt

Problem types

Some of the main hardware problems that cause MCEs include:

Possible causes

Machine checks are a hardware problem, not a software problem. They are often the result of overclocking or overheating. In some cases, the CPU will shut itself off once passing a thermal limit to avoid permanent damage. But they can also be caused by bus errors introduced by other failing components, like memory or I/O devices. Possible causes include:

Cooling problems are usually obvious upon inspection. A failing motherboard or processor can be identified by swapping them with functioning parts. Memory can be checked by booting to a diagnostic tool, like memtest86. Non-essential failing I/O devices and controllers can be identified by unplugging them if possible or disabling the devices to see if the problem disappears. If the failures typically only occur fairly soon after the OS is booted or not at all or not for days, it may be suggestive of a power supply issue. With a power supply problem, the failure often occurs when power demand peaks as the OS starts up any external devices for use.

Decoding MCEs

For IA-32 and Intel 64 processors, consult the Intel 64 and IA-32 Architectures Software Developer's Manual [6] Chapter 15 (Machine-Check Architecture), or the Microsoft KB Article on Windows Exceptions. [7]

Programs to decode Intel and AMD MCEs

See also

Related Research Articles

<span class="mw-page-title-main">Interrupt</span> Signal to a computer processor emitted by hardware or software

In digital computers, an interrupt is a request for the processor to interrupt currently executing code, so that the event can be processed in a timely manner. If the request is accepted, the processor will suspend its current activities, save its state, and execute a function called an interrupt handler to deal with the event. This interruption is often temporary, allowing the software to resume normal activities after the interrupt handler finishes, although the interrupt could instead indicate a fatal error.

<span class="mw-page-title-main">Operating system</span> Software that manages computer hardware resources

An operating system (OS) is system software that manages computer hardware and software resources, and provides common services for computer programs.

<span class="mw-page-title-main">System call</span> Way for programs to access kernel services

In computing, a system call is the programmatic way in which a computer program requests a service from the operating system on which it is executed. This may include hardware-related services, creation and execution of new processes, and communication with integral kernel services such as process scheduling. System calls provide an essential interface between a process and the operating system.

In computer architecture, 64-bit integers, memory addresses, or other data units are those that are 64 bits wide. Also, 64-bit central processing units (CPU) and arithmetic logic units (ALU) are those that are based on processor registers, address buses, or data buses of that size. A computer that uses such a processor is a 64-bit computer.

x86-64 Type of instruction set which is a 64-bit version of the x86 instruction set

x86-64 is a 64-bit version of the x86 instruction set, first released in 1999. It introduced two new modes of operation, 64-bit mode and compatibility mode, along with a new 4-level paging mode.

<span class="mw-page-title-main">Crash (computing)</span> When a computer program stops functioning properly and self-terminates

In computing, a crash, or system crash, occurs when a computer program such as a software application or an operating system stops functioning properly and exits. On some operating systems or individual applications, a crash reporting service will report the crash and any details relating to it, usually to the developer(s) of the application. If the program is a critical part of the operating system, the entire system may crash or hang, often resulting in a kernel panic or fatal system error.

<span class="mw-page-title-main">General protection fault</span>

A general protection fault (GPF) in the x86 instruction set architectures (ISAs) is a fault initiated by ISA-defined protection mechanisms in response to an access violation caused by some running code, either in the kernel or a user program. The mechanism is first described in Intel manuals and datasheets for the Intel 80286 CPU, which was introduced in 1983; it is also described in section 9.8.13 in the Intel 80386 programmer's reference manual from 1986. A general protection fault is implemented as an interrupt. Some operating systems may also classify some exceptions not related to access violations, such as illegal opcode exceptions, as general protection faults, even though they have nothing to do with memory protection. If a CPU detects a protection violation, it stops executing the code and sends a GPF interrupt. In most cases, the operating system removes the failing process from the execution queue, signals the user, and continues executing other processes. If, however, the operating system fails to catch the general protection fault, i.e. another protection violation occurs before the operating system returns from the previous GPF interrupt, the CPU signals a double fault, stopping the operating system. If yet another failure occurs, the CPU is unable to recover; since 80286, the CPU enters a special halt state called "Shutdown", which can only be exited through a hardware reset. The IBM PC AT, the first PC-compatible system to contain an 80286, has hardware that detects the Shutdown state and automatically resets the CPU when it occurs. All descendants of the PC AT do the same, so in a PC, a triple fault causes an immediate system reset.

The High Precision Event Timer (HPET) is a hardware timer available in modern x86-compatible personal computers. Compared to older types of timers available in the x86 architecture, HPET allows more efficient processing of highly timing-sensitive applications, such as multimedia playback and OS task switching. It was developed jointly by Intel and Microsoft and has been incorporated in PC chipsets since 2005. Formerly referred to by Intel as a Multimedia Timer, the term HPET was selected to avoid confusion with the software multimedia timers introduced in the MultiMedia Extensions to Windows 3.0.

System Management Mode is an operating mode of x86 central processor units (CPUs) in which all normal execution, including the operating system, is suspended. An alternate software system which usually resides in the computer's firmware, or a hardware-assisted debugger, is then executed with high privileges.

<span class="mw-page-title-main">Protection ring</span> Layer of protection in computer systems

In computer science, hierarchical protection domains, often called protection rings, are mechanisms to protect data and functionality from faults and malicious behavior.

In computer security, executable-space protection marks memory regions as non-executable, such that an attempt to execute machine code in these regions will cause an exception. It makes use of hardware features such as the NX bit, or in some cases software emulation of those features. However, technologies that emulate or supply an NX bit will usually impose a measurable overhead while using a hardware-supplied NX bit imposes no measurable overhead.

ntoskrnl.exe, also known as the kernel image, contains the kernel and executive layers of the Microsoft Windows NT kernel, and is responsible for hardware abstraction, process handling, and memory management. In addition to the kernel and executive mentioned earlier, it contains the cache manager, security reference monitor, memory manager, scheduler (Dispatcher), and blue screen of death.

<span class="mw-page-title-main">ECC memory</span> Self-correcting computer data storage

Error correction code memory is a type of computer data storage that uses an error correction code (ECC) to detect and correct n-bit data corruption which occurs in memory. ECC memory is used in most computers where data corruption cannot be tolerated, like industrial control applications, critical databases, and infrastructural memory caches.

The following is a timeline of virtualization development. In computing, virtualization is the use of a computer to simulate another computer. Through virtualization, a host simulates a guest by exposing virtual hardware devices, which may be done through software or by allowing access to a physical device connected to the machine.

<span class="mw-page-title-main">VMware ESXi</span> Enterprise-class, type-1 hypervisor for deploying and serving virtual computers

VMware ESXi is an enterprise-class, type-1 hypervisor developed by VMware for deploying and serving virtual computers. As a type-1 hypervisor, ESXi is not a software application that is installed on an operating system (OS); instead, it includes and integrates vital OS components, such as a kernel.

In computing, Machine Check Architecture (MCA) is an Intel and AMD mechanism in which the CPU reports hardware errors to the operating system.

In computing, the term 3 GB barrier refers to a limitation of some 32-bit operating systems running on x86 microprocessors. It prevents the operating systems from using all of 4 GiB (4 × 10243 bytes) of main memory. The exact barrier varies by motherboard and I/O device configuration, particularly the size of video RAM; it may be in the range of 2.75 GB to 3.5 GB. The barrier is not present with a 64-bit processor and 64-bit operating system, or with certain x86 hardware and an operating system such as Linux or certain versions of Windows Server and macOS that allow use of Physical Address Extension (PAE) mode on x86 to access more than 4 GiB of RAM.

<span class="mw-page-title-main">GPU switching</span> Mechanism for computers with multiple graphic controllers

GPU switching is a mechanism used on computers with multiple graphic controllers. This mechanism allows the user to either maximize the graphic performance or prolong battery life by switching between the graphic cards. It is mostly used on gaming laptops which usually have an integrated graphic device and a discrete video card.

Windows Hardware Error Architecture (WHEA) is an operating system hardware error handling mechanism introduced with Windows Vista SP1 and Windows Server 2008 as a successor to Machine Check Architecture (MCA) on previous versions of Windows. The architecture consists of several software components that interact with the hardware and firmware of a given platform to handle and notify regarding hardware error conditions. Collectively, these components provide: a generic means of discovering errors, a common error report format for those errors, a way of preserving error records, and an error event model based up on Event Tracing for Windows (ETW).

<span class="mw-page-title-main">Meltdown (security vulnerability)</span> Microprocessor security vulnerability

Meltdown is one of the two original transient execution CPU vulnerabilities. Meltdown affects Intel x86 microprocessors, IBM POWER processors, and some ARM-based microprocessors. It allows a rogue process to read all memory, even when it is not authorized to do so.

References

  1. "Chapter 1. Introducing EREP" (PDF). Environmental Record Editing and Printing Program (EREP) 3.5 - User's Guide (PDF). IBM. September 30, 2021. p. 1. GC35-0151-50. Retrieved February 20, 2023.
  2. System Programmer's Guide to: z/OS System Logger (PDF). July 2007. SG24-6898-01. Retrieved February 20, 2023.{{cite book}}: |work= ignored (help)
  3. "Bug Check 0x124: WHEA_UNCORRECTABLE_ERROR". Microsoft. 2022-11-03. Retrieved 2022-12-11.
  4. "Bug Check 0x9C: MACHINE_CHECK_EXCEPTION". Microsoft. 2021-12-14. Retrieved 2022-12-11.
  5. "mcelog not working with AMD processor family 16 and above on SLES11 SP3". SuSE. 2022-09-27. Retrieved 2022-12-11.
  6. "Machine Check Architecture". Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide, Part 2. Intel Corporation. November 2018.
  7. "Stop error message in Windows XP that you may receive: "0x0000009C (0x00000004, 0x00000000, 0xb2000000, 0x00020151)"". MSDN. 2015-12-07. Retrieved 2017-07-13.
  8. Mauro Carvalho Chehab (mchehab) (2023-02-20). "rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool". github.com. Retrieved 2023-02-20.
  9. 1 2 "Machine-check exception". wiki.archlinux.org. 2021-05-08. Retrieved 2023-02-21.
  10. 1 2 "ECC RAM". wiki.gentoo.org. 2022-12-30. Retrieved 2023-02-21.
  11. 1 2 "x86/mce: Factor out and deprecate the /dev/mcelog driver". git.kernel.org. 2017-03-28. Retrieved 2023-02-21.
  12. 1 2 "x86/mce: Factor out and deprecate the /dev/mcelog driver". github.com/torvalds/linux/. 2017-03-28. Retrieved 2023-02-21.
  13. "mcelog: Advanced hardware error handling for x86 Linux". 2015-04-20. Retrieved 2017-07-13.
  14. "parsemce: Linux Machine check exception handler parser". 2003-07-22. Retrieved 2017-07-13.
  15. mcedaemon on GitHub