Serviceability (computer)

Last updated January 16, 2023

In software engineering and hardware engineering, serviceability (also known as supportability) is one of the -ilities or aspects (from IBM's RAS(U) (Reliability, Availability, Serviceability, and Usability)). It refers to the ability of technical support personnel to install, configure, and monitor computer products, identify exceptions or faults, debug or isolate faults to root cause analysis, and provide hardware or software maintenance in pursuit of solving a problem and restoring the product into service. Incorporating serviceability facilitating features typically results in more efficient product maintenance and reduces operational costs and maintains business continuity.

Examples of features that facilitate serviceability include:

Help desk notification of exceptional events (e.g., by electronic mail or by sending text to a pager)
Network monitoring
Documentation
Event logging / Tracing (software)
Logging of program state, such as
- Execution path and/or local and global variables
- Procedure entry and exit, optionally with incoming and return variable values (see: subroutine)
- Exception block entry, optionally with local state (see: exception handling)
Software upgrade
Graceful degradation, where the product is designed to allow recovery from exceptional events without intervention by technical support staff
Hardware replacement or upgrade planning, where the product is designed to allow efficient hardware upgrades with minimal computer system downtime (e.g., hotswap components.)

Serviceability engineering may also incorporate some routine system maintenance related features (see: Operations, Administration and Maintenance (OA&M.))

A service tool is defined as a facility or feature, closely tied to a product, that provides capabilities and data so as to service (analyze, monitor, debug, repair, etc.) that product. Service tools can provide broad ranges of capabilities. Regarding diagnosis, a proposed taxonomy of service tools is as follows:

Level 1: Service tool that indicates if a product is functional or not functional. Describing computer servers, the states are often referred to as ‘up’ or ‘down’. This is a binary value.
Level 2: Service tool that provides some detailed diagnostic data. Often the diagnostic data is referred to as a problem ‘signature’, a representation of key values such as system environment, running program name, etc. This level of data is used to compare one problem’s signature to another problem’s signature: the ability to match the new problem to an old one allows one to use the solution already created for the prior problem. The ability to screen problems is valuable when a problem does match a pre-existing problem, but it is not sufficient to debug a new problem.
Level 3: Provides detailed diagnostic data sufficient to debug a new and unique problem.

As a rough rule of thumb for these taxonomies, there are multiple ‘orders of magnitude’ of diagnostic data in level 1 vs. level 2 vs. level 3 service tools.

Additional characteristics and capabilities that have been observed in service tools:

Time of data collection: some tools can collect data immediately, as soon as problem occurs, others are delayed in collecting data.
Pre-analyzed, or not-yet-analyzed data: some tools collect ‘external’ data, while others collect ‘internal’ data. This is seen when comparing system messages (natural language-like statements in the user’s native language) vs. ‘binary’ storage dumps.
Partial or full set of system state data: some tools collect a complete system state vs. a partial system state (user or partial ‘binary’ storage dump vs. complete system dump).
Raw or analyzed data: some tools display raw data, while others analyze it (examples storage dump formatters that format data, vs. ‘intelligent’ data formatters (“ANALYZE” is a common verb) that combine product knowledge with analysis of state variables to indicate the ‘meaning’ of the data.
Programmable tools vs. ‘fixed function’ tools. Some tools can be altered to get varying amounts of data, at varying times. Some tools have only a fixed function.
Automatic or manual? Some tools are built into a product, to automatically collect data when a fault or failure occurs. Other tools have to be specifically invoked to start the data collection process.
Repair or non-repair? Some tools collect data as a fore-runner to an automatic repair process (self-healing/fault tolerant). These tools have the challenge of quickly obtaining unaltered data before the desired repair process starts.

External links

Excellent example of Serviceability Feature Requirements:

Sun Gathering Debug Data (Sun GDD). This is a set of tools developed by the Sun's support guys aimed to provide the right approach to problem resolution by leveraging proactive actions and best practices to gather the debug data needed for further analysis.
"Carrier Grade Linux Serviceability Requirements Definition Version 4," Copyright (c) 2005-2007 by Open Source Development Labs, Inc. Beaverton, OR 97005 USA

Related Research Articles

Multiple Virtual Storage, more commonly called MVS, was the most commonly used operating system on the System/370 and System/390 IBM mainframe computers. IBM developed MVS, along with OS/VS1 and SVS, as a successor to OS/360. It is unrelated to IBM's other mainframe operating system lines, e.g., VSE, VM, TPF.

An embedded system is a computer system—a combination of a computer processor, computer memory, and input/output peripheral devices—that has a dedicated function within a larger mechanical or electronic system. It is embedded as part of a complete device often including electrical or electronic hardware and mechanical parts. Because an embedded system typically controls physical operations of the machine that it is embedded within, it often has real-time computing constraints. Embedded systems control many devices in common use today. In 2009, it was estimated that ninety-eight percent of all microprocessors manufactured were used in embedded systems.

In computing, a core dump, memory dump, crash dump, storage dump, system dump, or ABEND dump consists of the recorded state of the working memory of a computer program at a specific time, generally when the program has crashed or otherwise terminated abnormally. In practice, other key pieces of program state are usually dumped at the same time, including the processor registers, which may include the program counter and stack pointer, memory management information, and other processor and operating system flags and information. A snapshot dump is a memory dump requested by the computer operator or by the running program, after which the program is able to continue. Core dumps are often used to assist in diagnosing and debugging errors in computer programs.

<span class="mw-page-title-main">Debugger</span> Computer program used to test and debug other programs

A debugger or debugging tool is a computer program used to test and debug other programs. The main use of a debugger is to run the target program under controlled conditions that permit the programmer to track its execution and monitor changes in computer resources that may indicate malfunctioning code. Typical debugging facilities include the ability to run or halt the target program at specific points, display the contents of memory, CPU registers or storage devices, and modify memory or register contents in order to enter selected test data that might be a cause of faulty program execution.

JTAG is an industry standard for verifying designs and testing printed circuit boards after manufacture.

Technical support is a call centre type customer service provided by companies to advise and assist registered users with issues concerning their technical products. Traditionally done on the phone, technical support can now be conducted online or through chat. At present, most large and mid-size companies have outsourced their tech support operations. Many companies provide discussion boards for users of their products to interact; such forums allow companies to reduce their support costs without losing the benefit of customer feedback.

In software development, a breakpoint is an intentional stopping or pausing place in a program, put in place for debugging purposes. It is also sometimes simply referred to as a pause.

Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to function without failure. Reliability describes the ability of a system or component to function under stated conditions for a specified period of time. Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time.

Design for testing or design for testability (DFT) consists of IC design techniques that add testability features to a hardware product design. The added features make it easier to develop and apply manufacturing tests to the designed hardware. The purpose of manufacturing tests is to validate that the product hardware contains no manufacturing defects that could adversely affect the product's correct functioning.

<span class="mw-page-title-main">Bus analyzer</span>

A bus analyzer is a type of a protocol analysis tool, used for capturing and analyzing communication data across a specific interface bus, usually embedded in a hardware system. The bus analyzer functionality helps design, test and validation engineers to check, test, debug and validate their designs throughout the design cycles of a hardware-based product. It also helps in later phases of a product life cycle, in examining communication interoperability between systems and between components, and clarifying hardware support concerns.

<span class="mw-page-title-main">Hardware emulation</span> Emulating hardware devices in IC design

In integrated circuit design, hardware emulation is the process of imitating the behavior of one or more pieces of hardware with another piece of hardware, typically a special purpose emulation system. The emulation model is usually based on a hardware description language source code, which is compiled into the format used by emulation system. The goal is normally debugging and functional verification of the system being designed. Often an emulator is fast enough to be plugged into a working target system in place of a yet-to-be-built chip, so the whole system can be debugged with live data. This is a specific case of in-circuit emulation.

Windows Error Reporting (WER) is a crash reporting technology introduced by Microsoft with Windows XP and included in later Windows versions and Windows Mobile 5.0 and 6.0. Not to be confused with the Dr. Watson debugging tool which left the memory dump on the user's local machine, Windows Error Reporting collects and offers to send post-error debug information using the Internet to Microsoft when an application crashes or stops responding on a user's desktop. No data is sent without the user's consent. When a crash dump reaches the Microsoft server, it is analyzed, and information about a solution is sent back to the user if available. Solutions are served using Windows Error Reporting Responses. Windows Error Reporting runs as a Windows service. Kinshuman is the original architect of WER. WER was also included in the ACM hall of fame for its impact on the computing industry.

In computer science, fault injection is a testing technique for understanding how computing systems behave when stressed in unusual ways. This can be achieved using physical- or software-based means, or using a hybrid approach. Widely studied physical fault injections include the application of high voltages, extreme temperatures and electromagnetic pulses on electronic components, such as computer memory and central processing units. By exposing components to conditions beyond their intended operating limits, computing systems can be coerced into mis-executing instructions and corrupting critical data.

In the context of computer programming, instrumentation refers to the measure of a product's performance, in order to diagnose errors and to write trace information. Instrumentation can be of two types: source instrumentation and binary instrumentation.

Eclipse OpenJ9 is a high performance, scalable, Java virtual machine (JVM) implementation that is fully compliant with the Java Virtual Machine Specification.

ELinOS is a commercial development environment for embedded Linux. It consists of a Linux distribution for the target embedded system and development tools for a development host computer. The development host computer usually is a standard desktop computer running Linux or Windows. The Linux system and the application software for the target device are both created on the development host.

In computer programming and software development, debugging is the process of finding and resolving bugs within computer programs, software, or systems.

Process Control Daemon (PCD) is an open source, light-weight system level process manager/controller for Embedded Linux based projects.

An intelligent maintenance system (IMS) is a system that utilizes collected data from machinery in order to predict and prevent potential failures in them. The occurrence of failures in machinery can be costly and even catastrophic. In order to avoid failures, there needs to be a system which analyzes the behavior of the machine and provides alarms and instructions for preventive maintenance. Analyzing the behavior of the machines has become possible by means of advanced sensors, data collection systems, data storage/transfer capabilities and data analysis tools. These are the same set of tools developed for prognostics. The aggregation of data collection, storage, transformation, analysis and decision making for smart maintenance is called an intelligent maintenance system (IMS).

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

Serviceability (computer)

See also

External links

Related Research Articles