Machine-check Exception

A machine check exception (MCE) is a type of computer error that occurs when a problem involving the computer's hardware is detected. With most mass-market personal computers, an MCE indicates faulty or misconfigured hardware.

The nature and causes of MCEs can vary by architecture and generation of system. In some designs, an MCE is always an unrecoverable error that halts the machine, requiring a reboot. In other architectures, some MCEs may be non-fatal, such as single-bit errors corrected by ECC memory. On some architectures, such as PowerPC, certain software bugs, such as an invalid memory access, can cause MCEs. On other architectures, such as x86, MCEs typically originate from hardware only.


Reporting


Microsoft Windows

On Microsoft Windows platforms, an unrecoverable MCE causes the system to generate a BugCheck, also called a STOP error or a Blue Screen of Death. More recent versions of Windows use the Windows Hardware Error Architecture (WHEA) and generate STOP code 0x124, WHEA_UNCORRECTABLE_ERROR. The four parameters (in parentheses) will vary, but the first is always 0x0 for an MCE. Example:

    STOP: 0x00000124 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)

Older versions of Windows use the Machine Check Architecture, with STOP code 0x9C, MACHINE_CHECK_EXCEPTION. Example:

    STOP: 0x0000009C (0x00000030, 0x00000002, 0x00000001, 0x80003CBA)
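As an illustration of the STOP line format shown above, the following is a minimal Python sketch that splits such a line into its code and parameters and looks up the two MCE-related names mentioned in this section. The function name `parse_stop_line` and the lookup table are hypothetical helpers written for this example, not part of any Windows tool, and the pattern only assumes the textual layout used in the examples here.

```python
import re

# Matches lines shaped like "STOP: 0x0000009C (0x..., 0x..., 0x..., 0x...)".
STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")

# The two MCE-related STOP codes named in the text above.
MCE_STOP_CODES = {
    0x124: "WHEA_UNCORRECTABLE_ERROR",
    0x9C: "MACHINE_CHECK_EXCEPTION",
}


def parse_stop_line(line):
    """Return (code, params, name) for a STOP line, or None if it does not match.

    name is None when the code is not one of the two MCE-related codes.
    """
    match = STOP_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()]
    return code, params, MCE_STOP_CODES.get(code)


if __name__ == "__main__":
    example = "STOP: 0x0000009C (0x00000030, 0x00000002, 0x00000001, 0x80003CBA)"
    code, params, name = parse_stop_line(example)
    print(hex(code), name, [hex(p) for p in params])
```

For a 0x124 line, a caller could additionally check that the first parameter is zero, which the text above gives as the value for an MCE.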


Linux

On Linux, the kernel writes messages about MCEs to the kernel message log and the system console. When the MCEs are not fatal, they will also typically be copied to the system log and/or systemd journal. For some systems, ECC and other correctable errors may be reported through MCE facilities. Example:

    CPU 0: Machine Check Exception: 0000000000000004
    Bank 2: f200200000000863
    Kernel panic: CPU context corrupt
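The example message above can be pulled apart mechanically. The sketch below is a hedged Python illustration that extracts the CPU number, the first hexadecimal word, and the per-bank status words. It assumes the message layout shown in the example (newer kernels print MCEs differently), and it assumes the first word is the global machine-check status; `parse_mce_log` and the dictionary keys are hypothetical names chosen for this example.

```python
import re

# Patterns follow the layout of the example message only.
CPU_RE = re.compile(r"CPU (\d+): Machine Check Exception:\s*([0-9a-fA-F]+)")
BANK_RE = re.compile(r"Bank (\d+):\s*([0-9a-fA-F]+)")


def parse_mce_log(text):
    """Extract the CPU number, global status word, and bank status words."""
    result = {"cpu": None, "global_status": None, "banks": {}}
    cpu_match = CPU_RE.search(text)
    if cpu_match:
        result["cpu"] = int(cpu_match.group(1))
        # Assumption: this word is the global machine-check status.
        result["global_status"] = int(cpu_match.group(2), 16)
    for bank, status in BANK_RE.findall(text):
        result["banks"][int(bank)] = int(status, 16)
    return result


if __name__ == "__main__":
    message = (
        "CPU 0: Machine Check Exception: 0000000000000004\n"
        "Bank 2: f200200000000863\n"
        "Kernel panic: CPU context corrupt\n"
    )
    parsed = parse_mce_log(message)
    print(parsed["cpu"], hex(parsed["global_status"]))
    for bank, status in parsed["banks"].items():
        print("bank", bank, hex(status))
```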


Problem types

Most of these errors relate specifically to the Pentium processor family. Similar errors may occur on other processors and will cause similar problems. Some of the main hardware problems that cause MCEs include:

* System bus errors: errors in communication between the processor and the motherboard.
* Memory errors: parity checking detects when a memory error has occurred. Error correction code (ECC) can correct limited memory errors so that processing can continue.
* CPU cache errors in the processor.
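On x86 processors that implement the Machine Check Architecture, the low 16 bits of a bank status word carry an error code whose bit pattern indicates a broad family similar to the problem types listed above. The Python sketch below is an approximation of that mapping, based on the compound error code layouts described in Intel's manual as understood here; the bit tests are assumptions that should be checked against the manual for the processor in question, and `classify_mca_error_code` is a hypothetical helper name.

```python
def classify_mca_error_code(status):
    """Map the low 16 bits of a bank status word to a broad error family.

    The patterns are a simplified reading of the compound error code
    layouts and are assumptions, not an authoritative decoder.
    """
    code = status & 0xFFFF
    # Bit 12 is treated as a flag bit, not part of the family pattern.
    code &= ~0x1000
    if code & 0x0800:
        return "bus/interconnect"
    if code & 0x0100:
        return "cache hierarchy"
    if code & 0x0080:
        return "memory controller"
    if code & 0x0010:
        return "TLB"
    if (code & 0xFFFC) == 0x000C:
        return "generic cache hierarchy"
    return "other or simple error code"


if __name__ == "__main__":
    # Bank status word from a Linux MCE report (f200200000000863).
    print(classify_mca_error_code(0xF200200000000863))
```

Under these assumptions, the status word f200200000000863 falls into the bus/interconnect family, because bit 11 of its error code (0x0863) is set.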


Possible causes

Machine checks are a hardware problem, not a software problem. They are often the result of overclocking or overheating. In some cases, the CPU will shut itself off once it passes a thermal limit, to avoid permanent damage. Machine checks can also be caused by bus errors introduced by other failing components, like memory or I/O devices. Possible causes include:

* Poor CPU cooling due to a CPU heatsink and case fans (or filters) that are clogged with dust or have come loose.
* Overclocking beyond the highest clock rate at which the CPU is still reliable.
* Failing motherboard.
* Failing processor.
* Failing memory.
* Failing I/O controllers, on either the motherboard or separate cards.
* Failing I/O devices.
* Inadequate or failing power supply.

Cooling problems are usually obvious upon inspection. A failing motherboard or processor can be identified by swapping it with a functioning part. Memory can be checked by booting to a diagnostic tool, like memtest86. Non-essential failing I/O devices and controllers can be identified by unplugging or disabling them to see if the problem disappears. If failures tend to occur either fairly soon after the OS is booted or not at all for days, a power supply issue may be the cause. With a power supply problem, the failure often occurs when power demand peaks as the OS starts up any external devices for use.


Decoding MCEs

For IA-32 and Intel 64 processors, consult the Intel 64 and IA-32 Architectures Software Developer's Manual Chapter 15 (Machine-Check Architecture), or the Microsoft KB Article on Windows Exceptions.
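To give a feel for what decoding involves, here is a minimal Python sketch that splits a 64-bit bank status word into its flag bits and error code fields. The bit positions (VAL at bit 63 down to PCC at bit 57, the error code in bits 15:0, and the model-specific code in bits 31:16) follow the layout of the IA32_MCi_STATUS register as commonly documented; treat them as assumptions and confirm them against the manual for the processor at hand. `decode_mci_status` and the field names are hypothetical.

```python
# Flag bits of a bank status word, highest bit first (assumed layout).
STATUS_FLAGS = [
    (63, "VAL"),    # register contents are valid
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # additional information register is valid
    (58, "ADDRV"),  # address register is valid
    (57, "PCC"),    # processor context may be corrupt
]


def decode_mci_status(status):
    """Split a 64-bit bank status word into flags and error code fields."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    decoded = decode_mci_status(0xF200200000000863)
    print(decoded["flags"])
    print(hex(decoded["mca_error_code"]))
```

For the status word f200200000000863, this layout gives the flags VAL, OVER, UC, EN and PCC with error code 0x863. A set PCC flag would be consistent with a "CPU context corrupt" kernel panic message.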


Programs to decode Intel and AMD MCEs

* mcat: a Windows command-line program from AMD to decode MCEs from AMD K8, Family 0x10 and 0x11 processors.
* mcelog: a Linux daemon by Andi Kleen to handle MCEs for modern x86 processors. mcelog can also decode machine checks.
* parsemce: a Linux program by Dave Jones to decode MCEs from AMD K7 processors.
* mced: a Linux program by Tim Hockin to gather MCEs from the kernel and alert interested applications. It does not try to interpret the MCE data; it simply alerts other programs.


See also

* Machine check architecture
* Blue screen of death
* Kernel panic


External links


mcelog: Advanced hardware error handling for x86 Linux

parsemce: Linux Machine check exception handler parser