Testing high-performance computing applications

Last updated December 08, 2024

High performance computing applications run on massively parallel supercomputers consist of concurrent programs designed using multi-threaded, multi-process models. The applications may consist of various constructs (threads, local processes, distributed processes, etc.) with varying degree of parallelism. Although high performance concurrent programs use similar design patterns, models and principles as that of sequential programs, unlike sequential programs, they typically demonstrate non-deterministic behavior. The probability of bugs increases with the number of interactions between the various parallel constructs. Race conditions, data races, deadlocks, missed signals and live lock are common error types.

Challenges

Parallel programs can be divided into two general categories: explicitly and implicitly parallel. Using parallel language constructs defined for process creation, communication and synchronization make an application explicitly parallel. Using a tool or parallelizing compiler to convert a serial program into a parallel one, makes it implicitly parallel. Both categories are equally bug-prone.

Heisenbugs

Concurrent applications should execute correctly on every possible thread schedule in the underlying operating system. However, traditional testing methods detect few bugs, chiefly because of the Heisenbugs ^[1] problem. A Heisenbug is an error that changes or disappears when an attempt is made to isolate and probe them via debugger, by adding some constructs such as synchronization requests or delay statements.

Non-repeatability

Another issue is caused due to the unpredictable behavior of the scheduler. Differences in system load influence scheduler behavior. This behavior cannot be changed manually. To counter this indeterminism, the program must be executed many times under various execution environments. Still, it is not guaranteed that a bug can be reproduced. Most of the time, the program runs correctly, and the bug is visible only when specific conditions are matched. As a result, non-repeatability of the concurrent programs is a major source of roadblock for detecting error. As an example, consider the following.

voidthread1(void*t){mutex_lock(a);mutex_lock(b);// do some work...mutex_unlock(b);mutex_unlock(a);}

voidthread2(void*t){mutex_lock(b);mutex_lock(a);// do some work...mutex_unlock(a);mutex_unlock(b);}

Clearly, this has a problem of causing deadlocks. Yet, it may cause deadlock in some runs of the program while in others, it may run successfully.

Probe effect

Probe effect is seen in parallel programs when delay-statements are inserted in parallel programs facing synchronization problems. This effect, like Heisenbugs, alters behavior changes that may obscure problems. Detecting the source of a probe effect is a great challenge in testing parallel applications.
The main difference between Probe effect and Heisenbugs is that Heisenbugs are seen when additional delay statements and/or synchronization requests are added to the concurrent application during testing, while probe effect is seen when the developer adds delay statements to concurrent applications with poor synchronization.

Testing strategies

The differences between sequential and concurrent programs lead to the differences in their testing strategies. Strategies for sequential programs can be modified to make them suitable for concurrent applications. Specialized strategies have also been developed. Conventionally, testing includes designing test cases and checking that the program produces the expected results. Thus, errors in specification, functionality, etc. are detected by running the application and subjecting it to testing methods such as Functional Testing, White Box, Black Box and Grey Box Testing.^[2] Static analysis is also used for detecting errors in high performance software using methods such as Data Flow Analysis, Control Flow Analysis, Cyclomatic Complexities, Thread Escape Analysis and Static Slicing Analysis to find problems. Using static analysis before functionality testing can save time. It can detect ‘what the error is’ find the error source. Static analysis techniques can detect problems like lack of synchronization, improper synchronizations, predict occurrence of deadlocks and post-wait errors in rendezvous requests.

Details:

Deterministic scheduling / reproducible testing

The indeterminacy of scheduling has two sources.^[1]

1. Time slicing

Scheduler switches contexts at equal intervals of time. This interval depends on the speed of individual processors, memory-cache hierarchy state and system load. Even on the same processor, under the same load, the interval varies slightly due to minor variations in frequency of the system clock.

2. Page Faults

Scheduler suspends a program if a page fault occurs, letting other threads proceed while the system fetches the page. As the occurrence of page faults depends upon what other processes are running, the timing of a context switch becomes indeterminate.

To make concurrent programs repeatable, an external scheduler is used. The program under test is instrumented to add calls to this scheduler. Such calls are made at the beginning and end of each thread as well as before every synchronization request. This scheduler selectively blocks threads of execution by maintaining a semaphore associated with each thread, such that only one thread is ready for execution at any given time. Thus, it converts parallel non-deterministic application into a serial execution sequence in order to achieve repeatability. The number of scheduling decisions made by the serializing scheduler is given by –

(N * K / P)*{(N + P)!}
 Where 
 N = number of threads
 K = potential context switch points
 P = budget of pre-emptive context switches

Feedback-directed testing

To obtain more accurate results using deterministic scheduling, an alternate approach can be chosen. A few properly-placed pre-emptions in the concurrent program can detect bugs related to data-races.^[1] Bugs are found in clusters. The existence of one bug establishes a high probability of more bugs in the same region of code. Thus each pass of the testing process identifies sections of code with bugs. The next pass more thoroughly scrutinizes those sections by adding scheduler calls around them. Allowing the problematic locations to execute in a different order can reveal unexpected behavior.

Timing-related testing

This strategy ensures that the application is not prone to the Probe Effect. Sources of errors that cause the Probe Effect can range from task creation issues to synchronization and communication problems. Requirements of timing related tests:^[3]

Delay duration must vary
Delay points must cover appropriate program locations
Delay statements must be inserted, removed or relocated to induce Probe Effect

The number of test cases per input data set is:

ⁿC₁ + ⁿC₁ + … + ⁿC₁ = 2ⁿ -1
 Where n = total number of synchronization, process creation and communication calls.

This equation has exponential order. In order to reduce the number of test cases, either Deterministic Execution Method (DET) or Multiple Execution Technique (MET) is used. Various issues must be handled:

Delayed execution

Addition of delays is a straightforward task. A typical sleep() statement can be used to insert delays.

Deciding where to insert delays

Insertion locations are known as delay-points. As the objective of timing related test cases is to detect synchronization, communication and thread creation errors, delay statements are added just in front of these statements.

Advantages

Easy to implement on multiple processors without any need of ordering the synchronization requests.
No need to generate concurrency graph
More effective for fault detection
Total number of test cases are less, yet code coverage is more, due to static analysis

All Du-Path testing

This method applies the concept of define-use pair, in order to determine the paths to be tested.

Verification strategies

Software verification is a process that proves that software is working correctly and is performing the intended task as designed.

Test calculations

Input is given to the system to generate a known result. This input-result pair can be obtained from previous empirical results and/or manual calculations.^[4] This is a system-level test that can be performed only when all relevant modules are integrated. Moreover, it only shows that bugs exist. It offers no detailed information regarding the number of bugs, their location or nature.

Symmetry tests

These tests are primarily used for scientific simulations. The output of the simulation often cannot be predicted. Since these simulations attempt to describe scientific laws, any symmetries in the theory must be honored by the simulation. Thus, by varying input conditions along the lines of symmetry and then comparing the obtained results with externally derived results, the existence of bugs can be detected.^[4]

In scientific computing most data lies in the central region of the simulation conditions. As a result, it is difficult to perform Boundary-value testing ^[2] with real time experimental data. Thus, center of simulation (for example, for data value of 10 in Figure 1) is shifted to one of the boundaries so as to test the boundary condition effectively.

Parallel implementation tests

Parallel implementation tests are usually used for applications that use distributed memory programming models such as message passing. These tests are often applied to programs that use regular grids of processors.^[4]^{[ clarification needed ]}

Global summation

Many parallel databases use distributed parallel processing to execute the queries. While executing an aggregate function such as sum, the following strategy is used:^[5]

Compute a partial sum locally and concurrently at each processor using the data present in the associated disk partition with it.
Add these local subtotals to get the final result.

The final result may contain some rounding error as each processor independently rounds-off local results. One test is to ensure that such errors do not occur. This requires showing that the aggregate sum is decomposition-independent. An alternate summation scheme is to send all of the individual values to one processor for summation. This result can be compared with the distributed result to ensure consistency.

Tools

Microsoft CHESS

This tool eliminates non-determinacy using deterministic scheduling. It tracks schedule paths executed previously and guarantees that each time a new schedule path is executed.^[1]^{[ clarification needed ]}

Related Research Articles

Real-time computing (RTC) is the computer science term for hardware and software systems subject to a "real-time constraint", for example from event to system response. Real-time programs must guarantee response within specified time constraints, often referred to as "deadlines".

In computing, a process is the instance of a computer program that is being executed by one or many threads. There are many different process models, some of which are light weight, but almost all processes are rooted in an operating system (OS) process which comprises the program code, assigned system resources, physical and logical access permissions, and data structures to initiate, control and coordinate execution activity. Depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently.

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. In many cases, a thread is a component of a process.

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

In computer science, a lock or mutex is a synchronization primitive that prevents state from being modified or accessed by multiple threads of execution at once. Locks enforce mutual exclusion concurrency control policies, and with a variety of possible methods there exist multiple unique implementations for different applications.

OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

A race condition or race hazard is the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events, leading to unexpected or inconsistent results. It becomes a bug when one or more of the possible behaviors is undesirable.

In concurrent programming, concurrent accesses to shared resources can lead to unexpected or erroneous behavior. Thus, the parts of the program where the shared resource is accessed need to be protected in ways that avoid the concurrent access. One way to do so is known as a critical section or critical region. This protected section cannot be entered by more than one process or thread at a time; others are suspended until the first leaves the critical section. Typically, the critical section accesses a shared resource, such as a data structure, peripheral device, or network connection, that would not operate correctly in the context of multiple concurrent accesses.

In computing, a memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This typically means that operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier.

Concurrency refers to the ability of a system to execute multiple tasks through simultaneous execution or time-sharing, sharing resources and managing interactions. Concurrency improves responsiveness, throughput, and scalability in modern computing, including:

Cilk, Cilk++, Cilk Plus and OpenCilk are general-purpose programming languages designed for multithreaded parallel computing. They are based on the C and C++ programming languages, which they extend with constructs to express parallel loops and the fork–join idiom.

Java Pathfinder (JPF) is a system to verify executable Java bytecode programs. JPF was developed at the NASA Ames Research Center and open sourced in 2005. The acronym JPF is not to be confused with the unrelated Java Plugin Framework project.

In software engineering, profiling is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization, and more specifically, performance engineering.

Automatic parallelization, also auto parallelization, or autoparallelization refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis and the best approach may depend upon parameter values that are not known at compilation time.

Concurrent computing is a form of computing in which several computations are executed concurrently—during overlapping time periods—instead of sequentially—with one completing before the next starts.

In computer programming jargon, a heisenbug is a software bug that seems to disappear or alter its behavior when one attempts to study it. The term is a pun on the name of Werner Heisenberg, the physicist who first asserted the observer effect of quantum mechanics, which states that the act of observing a system inevitably alters its state. In electronics, the traditional term is probe effect, where attaching a test probe to a device changes its behavior.

In computer science, synchronization is the task of coordinating multiple processes to join up or handshake at a certain point, in order to reach an agreement or commit to a certain sequence of action.

In engineering, debugging is the process of finding the root cause, workarounds and possible fixes for bugs.

Jinx was a concurrency debugger that deterministically controlled the interleaving of workloads across processor cores, focusing on shared memory interactions. Using this deterministic approach, Jinx aimed to increase the frequency of occurrence of elusive shared memory bugs, sometimes called Heisenbugs. Jinx is no longer available. Corensic, the company that was developing Jinx, was bought by F5 Networks and the Jinx project was cancelled.

Research and literature on concurrency testing and concurrent testing typically focuses on testing software and systems that use concurrent computing. The purpose is, as with most software testing, to understand the behaviour and performance of a software system that uses concurrent computing, particularly assessing the stability of a system or application during normal activity.

References

1 2 3 4 Thomas Ball; Sebastian Burckhardt; Peli de Halleux; Madanlal Musuvathi; Shaz Qadeer (May–June 2011). "Predictable and Progressive Testing of Multithreaded Code". IEEE Software. 28 (3): 75–83. doi:10.1109/MS.2010.64.
1 2 The Art of Software Testing, Second Edition. John Wiley and Sons. 2004. p. 256. ISBN 978-0-471-46912-4.
↑ Cheer-Sun Yang; Lori L. Pollock (1997). "The Challenges in Automated Testing of Multithreaded Programs". Proceedings of the 14th International Conference on Testing Computer Software: 157–166. CiteSeerX 10.1.1.52.8791 .
1 2 3 Stephen Booth; David Henty (2004). "Verification strategies for high performance computing software". "Software Engineering for High Performance Computing System (HPCS) Applications" W3S Workshop - 26th International Conference on Software Engineering. Vol. 2004. pp. 24–26. doi:10.1049/ic:20040413 (inactive 7 December 2024). ISBN 0-86341-418-4.{{cite book}}: CS1 maint: DOI inactive as of December 2024 (link)
↑ Korth, Henry. Database System Concepts. McGraw-Hill.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[predictable-1] 1 2 3 4 Thomas Ball; Sebastian Burckhardt; Peli de Halleux; Madanlal Musuvathi; Shaz Qadeer (May–June 2011). "Predictable and Progressive Testing of Multithreaded Code". IEEE Software. 28 (3): 75–83. doi:10.1109/MS.2010.64.

[artofswtesting-2] 1 2 The Art of Software Testing, Second Edition. John Wiley and Sons. 2004. p. 256. ISBN 978-0-471-46912-4.

[automated-3] Cheer-Sun Yang; Lori L. Pollock (1997). "The Challenges in Automated Testing of Multithreaded Programs". Proceedings of the 14th International Conference on Testing Computer Software: 157–166. CiteSeerX 10.1.1.52.8791 .

[verification-4] 1 2 3 Stephen Booth; David Henty (2004). "Verification strategies for high performance computing software". "Software Engineering for High Performance Computing System (HPCS) Applications" W3S Workshop - 26th International Conference on Software Engineering. Vol. 2004. pp. 24–26. doi:10.1049/ic:20040413 (inactive 7 December 2024). ISBN 0-86341-418-4.{{cite book}}: CS1 maint: DOI inactive as of December 2024 (link)

[5] Korth, Henry. Database System Concepts. McGraw-Hill.

[1]

[2]

[3]

[4]

[5]