Hardware stress test

A stress test (sometimes called a torture test) of hardware is a form of deliberately intense and thorough testing used to determine the stability of a given system or entity. It involves testing beyond normal operational capacity, often to a breaking point, in order to observe the results.

Reasons for stress testing can include: determining breaking points and safe usage limits; confirming that the intended specifications are being met; searching for latent issues in a product; determining modes of failure (how exactly a system may fail); and testing stable operation of a part or system outside standard usage. Reliability engineers often test items under expected stress, or even under accelerated stress, in order to determine the operating life of the item or to determine modes of failure.[1]

The term stress test as it relates to hardware (including electronics, physical devices, nuclear power plants, etc.) is likely to have different refined meanings in specific contexts. One example is in materials; see fatigue (material).

Hardware stress test

Stress testing, in general, should put computer hardware under exaggerated levels of stress in order to ensure stability when used in a normal environment. These can include extremes of workload, type of task, memory use, thermal load (heat), clock speed, or voltages. Memory and CPU are two components that are commonly stress tested in this way.
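As a concrete illustration, a memory stress loop can be sketched in a few lines of Python. This is a simplified assumption of how such tools operate, not the method of any particular product (dedicated testers such as MemTest86 use many more patterns and address-level techniques): fill large buffers with a known bit pattern, read them back, and count mismatches.

```python
import itertools

CHUNK_MB = 64                          # per-buffer size; tune for the machine
PATTERN = bytes([0xAA, 0x55])          # alternating-bit fill pattern

def stress_pass(n_buffers: int) -> int:
    """Fill buffers with a known pattern, read them back, count mismatches."""
    chunk = bytes(itertools.islice(itertools.cycle(PATTERN),
                                   CHUNK_MB * 1024 * 1024))
    buffers = [bytearray(chunk) for _ in range(n_buffers)]
    errors = 0
    for i, buf in enumerate(buffers):
        if bytes(buf) != chunk:        # unstable RAM shows up as mismatches
            errors += 1
            print(f"verification error in buffer {i}")
    return errors

if __name__ == "__main__":
    passes = 3                         # real tools run for hours, not minutes
    total = sum(stress_pass(n_buffers=8) for _ in range(passes))
    print(f"{passes} passes complete, {total} errors")
```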

There is considerable overlap between stress testing software and benchmarking software, since both seek to assess and measure maximum performance. Of the two, stress testing software aims to test stability by trying to force a system to fail; benchmarking aims to measure and assess the maximum performance possible at a given task or function.

When modifying a CPU's operating parameters or conditions, such as temperature, humidity, overclocking, underclocking, overvolting, and undervolting, it may be necessary to verify whether the new parameters (usually CPU core voltage and frequency) are suitable for heavy CPU loads. This is done by running a CPU-intensive program for extended periods of time, to test whether the computer hangs or crashes. CPU stress testing is also referred to as torture testing. Software suitable for torture testing should typically run instructions that utilise the entire chip rather than only a few of its units. Stress testing a CPU over the course of 24 hours at 100% load is, in most cases, sufficient to determine that the CPU will function correctly in normal usage scenarios such as a desktop computer, where CPU usage typically fluctuates at low levels (50% and under).
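A torture loop of this kind can be sketched as follows. This is only an illustrative assumption: real torture testers such as Prime95 run hand-optimized FFT code that exercises wide vector units and verifies known results, which an interpreted loop cannot replicate; the sketch merely shows the structure of pinning every core at 100% load for a fixed duration.

```python
import multiprocessing as mp
import time

def burn(seconds: float) -> None:
    """Busy-loop on floating-point work to hold one core at 100% load."""
    deadline = time.monotonic() + seconds
    x = 1.0001
    while time.monotonic() < deadline:
        for _ in range(100_000):
            x = (x * x) % 1.7          # arbitrary FP work to keep the core busy
    # On an unstable system this tends to hang or crash before returning.

if __name__ == "__main__":
    duration = 60.0                    # seconds; a real torture test runs for hours
    workers = [mp.Process(target=burn, args=(duration,))
               for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("completed with no crash or hang")
```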

Hardware stress testing and stability are subjective and may vary according to how the system will be used. A stress test for a system that will run 24/7, or that will perform error-sensitive tasks such as distributed computing or "folding" projects, may differ from one for a system that needs to run a single game with a reasonable amount of reliability. For example, a comprehensive guide on overclocking Sandy Bridge found that:[2]

Even though in the past IntelBurnTest was just as good, it seems that something in the SB uArch [Sandy Bridge microarchitecture] is more heavily stressed with Prime95 ... IBT really does pull more power [make greater thermal demands]. But ... Prime95 failed first every time, and it failed when IBT would pass. So same as Sandy Bridge, Prime95 is a better stability tester for Sandy Bridge-E than IBT/LinX.

Stability is subjective; some might call stability enough to run their game, other like folders [folding projects] might need something that is just as stable as it was at stock, and ... would need to run Prime95 for at least 12 hours to a day or two to deem that stable ... There are [bench testers] who really don’t care for stability like that and will just say if it can [complete] a benchmark it is stable enough. No one is wrong and no one is right. Stability is subjective. [But] 24/7 stability is not subjective.

An engineer at ASUS advised, in a 2012 article on overclocking an Intel X79 system, that it is important to choose testing software carefully in order to obtain useful results:[3]

Unvalidated stress tests are not advised (such as Prime95 or LinX or other comparable applications). For high grade CPU/IMC and System Bus testing Aida64 is recommended along with general applications usage like PC Mark 7. Aida has an advantage as it is stability test has been designed for the Sandy Bridge E architecture and test specific functions like AES, AVX and other instruction sets that prime and like synthetics do not touch. As such not only does it load the CPU 100% but will also test other parts of CPU not used under applications like Prime 95. Other applications to consider are SiSoft 2012 or Passmark BurnIn. Be advised validation has not been completed using Prime 95 version 26 and LinX (10.3.7.012) and OCCT 4.1.0 beta 1 but once we have internally tested to ensure at least limited support and operation.

Software commonly used in hardware stress testing

Programs mentioned in this article include Prime95, IntelBurnTest, LinX, OCCT, AIDA64, SiSoft 2012, Passmark BurnIn, and Super PI.

Reliability

Hardware reliability verification includes temperature and humidity tests, mechanical vibration tests, shock tests, collision tests, drop tests, dustproof and waterproof tests, and other environmental reliability tests.[4][5]

Growth in safety-critical applications for automotive electronics significantly increases the IC design reliability challenge.[6][7]

In "Hardware Testing of Electric Hot Water Heaters Providing Energy Storage and Demand Response Through Model Predictive Control", published by the Institute of Electrical and Electronics Engineers, Halamay, Starrett, and Brekken first discuss how the classical steady-state model commonly used to simulate electric hot water heaters can be inaccurate. The paper then presents hardware test results demonstrating that systems of water heaters under model predictive control can be reliably dispatched to deliver set-point levels of power to within 2% error. These results suggest a promising pathway to controlling hot water heaters as energy storage systems capable of delivering flexible capacity and fast-acting ancillary services on a firm basis.[7]

"Advanced Circuit Reliability Verification for Robust Design" discusses the models used in circuit reliability verification and their applications. It first describes how the growth in safety-critical applications for automotive electronics significantly increases the IC design reliability challenge, and then presents Synopsys' AMS solution for robust design, focusing on how AMS strengthens reliability in full-chip mixed-signal verification. The article is a useful source for understanding why reliability verification deserves increased attention today.[6]

Related Research Articles

In software quality assurance, performance testing is a testing practice performed to determine how a system performs in terms of responsiveness and stability under a particular workload. It can also serve to investigate, measure, validate or verify other quality attributes of the system, such as scalability, reliability and resource usage.

Embedded system: Computer system with a dedicated function

An embedded system is a computer system—a combination of a computer processor, computer memory, and input/output peripheral devices—that has a dedicated function within a larger mechanical or electronic system. It is embedded as part of a complete device often including electrical or electronic hardware and mechanical parts. Because an embedded system typically controls physical operations of the machine that it is embedded within, it often has real-time computing constraints. Embedded systems control many devices in common use. In 2009, it was estimated that ninety-eight percent of all microprocessors manufactured were used in embedded systems.

Overclocking: Practice of increasing the clock rate of a computer to exceed that certified by the manufacturer

In computing, overclocking is the practice of increasing the clock rate of a computer to exceed that certified by the manufacturer. Commonly, operating voltage is also increased to maintain a component's operational stability at accelerated speeds. Semiconductor devices operated at higher frequencies and voltages increase power consumption and heat. An overclocked device may be unreliable or fail completely if the additional heat load is not removed or power delivery components cannot meet increased power demands. Many device warranties state that overclocking or over-specification voids any warranty, but some manufacturers allow overclocking as long as it is done (relatively) safely.

Underclocking, also known as downclocking, is modifying a computer or electronic circuit's timing settings to run at a lower clock rate than is specified. Underclocking is used to reduce a computer's power consumption, increase battery life, reduce heat emission, and it may also increase the system's stability, lifespan/reliability and compatibility. Underclocking may be implemented by the factory, but many computers and components may be underclocked by the end user.

In computing, stress testing can be applied to either hardware or software. It is used to determine the maximum capability of a computer system and is often used for purposes such as scaling for production use and ensuring reliability and stability. Stress tests typically involve running a large amount of resource-intensive processes until the system either crashes or nearly does so.

In computing, the clock rate or clock speed typically refers to the frequency at which the clock generator of a processor can generate pulses, which are used to synchronize the operations of its components, and is used as an indicator of the processor's speed. It is measured in the SI unit of frequency hertz (Hz).

AMD K6-III: Microprocessor series by AMD

The K6-III was an x86 microprocessor line manufactured by AMD that launched on February 22, 1999. The launch consisted of both 400 and 450 MHz models and was based on the preceding K6-2 architecture. Its improved 256 KB on-chip L2 cache gave it significant improvements in system performance over its predecessor, the K6-2. The K6-III was the last processor officially released for desktop Socket 7 systems; however, later mobile K6-III+ and K6-2+ processors could be run unofficially in certain Socket 7 motherboards if an updated BIOS was made available for a given board. The Pentium III processor from Intel launched six days later.

Load testing: Process of putting demand on a system and measuring its response

Load testing is the process of putting demand on a structure or system and measuring its response.

Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to function without failure. Reliability describes the ability of a system or component to function under stated conditions for a specified period of time. Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time.

In the fields of digital electronics and computer hardware, multi-channel memory architecture is a technology that increases the data transfer rate between the DRAM memory and the memory controller by adding more channels of communication between them. Theoretically, this multiplies the data rate by exactly the number of channels present. Dual-channel memory employs two channels. The technique goes back as far as the 1960s, having been used in the IBM System/360 Model 91 and the CDC 6600.
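The arithmetic behind that multiplication is straightforward; the sketch below computes theoretical peak bandwidth from transfer rate, bus width, and channel count (the DDR4-3200 figures are illustrative assumptions, not drawn from this article, and real sustained bandwidth is lower than the theoretical peak):

```python
def peak_bandwidth_gbs(mt_per_s: int, bus_width_bits: int, channels: int) -> float:
    """Theoretical peak DRAM transfer rate in GB/s."""
    bytes_per_transfer = bus_width_bits // 8   # 64-bit channel moves 8 bytes per transfer
    return mt_per_s * bytes_per_transfer * channels / 1000

# Hypothetical example: DDR4-3200 on a 64-bit channel.
print(peak_bandwidth_gbs(3200, 64, channels=1))  # 25.6 GB/s single channel
print(peak_bandwidth_gbs(3200, 64, channels=2))  # 51.2 GB/s dual channel
```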

Super PI

Super PI is a computer program that calculates pi to a specified number of digits after the decimal point, up to a maximum of 32 million. It uses the Gauss–Legendre algorithm and is a Windows port of the program used by Yasumasa Kanada in 1995 to compute pi to 2^32 digits.
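The Gauss–Legendre iteration converges quadratically, roughly doubling the number of correct digits at each step. The sketch below is an independent illustration of the algorithm using Python's decimal module, not Super PI's actual implementation:

```python
from decimal import Decimal, getcontext

def gauss_legendre_pi(digits: int) -> Decimal:
    """Approximate pi with the quadratically convergent Gauss-Legendre iteration."""
    getcontext().prec = digits + 10        # extra guard digits for rounding
    a, b = Decimal(1), Decimal(1) / Decimal(2).sqrt()
    t, p = Decimal(1) / 4, Decimal(1)
    # Correct digits roughly double per iteration, so ~log2(digits) steps suffice.
    for _ in range(digits.bit_length() + 1):
        a_next = (a + b) / 2
        b = (a * b).sqrt()
        t -= p * (a - a_next) ** 2
        p *= 2
        a = a_next
    pi = (a + b) ** 2 / (4 * t)
    getcontext().prec = digits             # trim the guard digits
    return +pi                             # unary plus rounds to current precision

print(gauss_legendre_pi(50))  # 3.1415926535897932384626433832795028841971693993751
```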

Sandy Bridge: Intel processor microarchitecture

Sandy Bridge is the codename for Intel's 32 nm microarchitecture used in the second generation of the Intel Core processors. The Sandy Bridge microarchitecture is the successor to the Nehalem and Westmere microarchitectures. Intel demonstrated a Sandy Bridge processor in 2009 and released the first products based on the architecture in January 2011 under the Core brand.

Stress testing is a software testing activity that determines the robustness of software by testing beyond the limits of normal operation. Stress testing is particularly important for "mission critical" software, but is used for all types of software. Stress tests commonly put a greater emphasis on robustness, availability, and error handling under a heavy load, than on what would be considered correct behavior under normal circumstances.

In computing, computer performance is the amount of useful work accomplished by a computer system. Outside of specific contexts, computer performance is estimated in terms of accuracy, efficiency and speed of executing computer program instructions. High computer performance may involve one or more factors such as short response time, high throughput, and low utilization of computing resources.

Haswell (microarchitecture): Intel processor microarchitecture

Haswell is the codename for a processor microarchitecture developed by Intel as the "fourth-generation core" successor to Ivy Bridge. Intel officially announced CPUs based on this microarchitecture on June 4, 2013, at Computex Taipei 2013, while a working Haswell chip was demonstrated at the 2011 Intel Developer Forum. With Haswell, which uses a 22 nm process, Intel also introduced low-power processors designed for convertible or "hybrid" ultrabooks, designated by the "U" suffix.

LGA 2011: CPU socket created by Intel

LGA 2011, also called Socket R, is a CPU socket by Intel released on November 14, 2011. It launched along with LGA 1356 to replace its predecessors, LGA 1366 and LGA 1567. While LGA 1356 was designed for dual-processor or low-end servers, LGA 2011 was designed for high-end desktops and high-performance servers. The socket has 2011 protruding pins that touch contact points on the underside of the processor.

The Intel X79 is a Platform Controller Hub (PCH) designed and manufactured by Intel for their LGA 2011 and LGA 2011-1 sockets.

Skylake (microarchitecture): CPU microarchitecture by Intel

Skylake is Intel's codename for its sixth generation Core microprocessor family that was launched on August 5, 2015, succeeding the Broadwell microarchitecture. Skylake is a microarchitecture redesign using the same 14 nm manufacturing process technology as its predecessor, serving as a tock in Intel's tick–tock manufacturing and design model. According to Intel, the redesign brings greater CPU and GPU performance and reduced power consumption. Skylake CPUs share their microarchitecture with Kaby Lake, Coffee Lake, Cannon Lake, Whiskey Lake, and Comet Lake CPUs.

Ivy Bridge (microarchitecture): CPU microarchitecture by Intel

Ivy Bridge is the codename for Intel's 22 nm microarchitecture used in the third generation of the Intel Core processors. Ivy Bridge is a die shrink of the previous generation's 32 nm Sandy Bridge microarchitecture to a 22 nm process based on FinFET ("3D") tri-gate transistors, in keeping with Intel's tick–tock model. The name is also applied more broadly to the Xeon and Core i7 Ivy Bridge-E series of processors released in 2013.

Reliability verification, or reliability testing, is a method of evaluating the reliability of a product in all environments, such as expected use, transportation, or storage, during its specified lifespan. The product is exposed to natural or artificial environmental conditions in order to evaluate its performance under the conditions of actual use, transportation, and storage, and to analyze the degree of influence of environmental factors and their mechanism of action. Environmental test equipment is used to simulate high temperature, low temperature, high humidity, and temperature changes in the climate environment, accelerating the product's response to its use environment and verifying whether it reaches the expected quality of its research, design, and manufacturing.

References

  1. Nelson, Wayne B. (2004). Accelerated Testing: Statistical Models, Test Plans, and Data Analysis. New York: John Wiley & Sons. ISBN 0-471-69736-2.
  2. Sin0822 (2011-12-24). "Sandy Bridge E Overclocking Guide: Walk through, Explanations, and Support for all X79". overclock.net. Retrieved 2 February 2013. (Some text condensed.)
  3. Juan Jose Guerrero III, ASUS (2012-03-29). "Intel X79 Motherboard Overclocking Guide". benchmarkreviews.com. Retrieved 2 February 2013.
  4. Weber, Wolfgang; Tondok, Heidemarie; Bachmayer, Michael (2003). Anderson, Stuart; Felici, Massimo; Littlewood, Bev (eds.). "Enhancing Software Safety by Fault Trees: Experiences from an Application to Flight Critical SW". Computer Safety, Reliability, and Security. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. 2788: 289–302. doi:10.1007/978-3-540-39878-3_23. ISBN 978-3-540-39878-3.
  5. Jung, Byung C.; Shin, Yun-Ho; Lee, Sang Hyuk; Huh, Young Cheol; Oh, Hyunseok (January 2020). "A Response-Adaptive Method for Design of Validation Experiments in Computational Mechanics". Applied Sciences. 10 (2): 647. doi:10.3390/app10020647.
  6. Fan, A.; Wang, J.; Aptekar, V. (March 2019). "Advanced Circuit Reliability Verification for Robust Design". 2019 IEEE International Reliability Physics Symposium (IRPS). pp. 1–8. doi:10.1109/IRPS.2019.8720531. ISBN 978-1-5386-9504-3. S2CID 169037244.
  7. Halamay, D. A.; Starrett, M.; Brekken, T. K. A. (2019). "Hardware Testing of Electric Hot Water Heaters Providing Energy Storage and Demand Response Through Model Predictive Control". IEEE Access. 7: 139047–139057. doi:10.1109/ACCESS.2019.2932978. ISSN 2169-3536.