Data validation

In computer science, data validation is the process of ensuring data has undergone data cleansing to confirm it has data quality, that is, that it is both correct and useful. It uses routines, often called "validation rules", "validation constraints", or "check routines", that check for the correctness, meaningfulness, and security of data that are input to the system. The rules may be implemented through the automated facilities of a data dictionary, or by the inclusion of explicit validation logic in the application programs of the system.

This is distinct from formal verification, which attempts to prove or disprove the correctness of algorithms for implementing a specification or property.

Overview

Data validation is intended to provide certain well-defined guarantees for fitness and consistency of data in an application or automated system. Data validation rules can be defined and designed using various methodologies, and deployed in various contexts.[1] Their implementation can use declarative data integrity rules or procedure-based business rules.[2]

The guarantees of data validation do not necessarily include accuracy, and it is possible for data entry errors such as misspellings to be accepted as valid. Other clerical and/or computer controls may be applied to reduce inaccuracy within a system.

Different kinds

In evaluating the basics of data validation, generalizations can be made regarding the different kinds of validation according to their scope, complexity, and purpose.

For example:

Data-type check

Data type validation is customarily carried out on one or more simple data fields.

The simplest kind of data type validation verifies that the individual characters provided through user input are consistent with the expected characters of one or more known primitive data types as defined in a programming language or data storage and retrieval mechanism.

For example, an integer field may require input to use only characters 0 through 9.
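
As a minimal sketch of such a character-level check (the function name and accepted character set are illustrative, not taken from any particular library), this might look like:

```python
def is_valid_integer_field(text: str) -> bool:
    # Data-type check: every character must be one of the digits 0-9.
    return len(text) > 0 and all(ch in "0123456789" for ch in text)

print(is_valid_integer_field("1024"))   # True
print(is_valid_integer_field("10.5"))   # False: '.' is not an allowed character
```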

Simple range and constraint check

Simple range and constraint validation may examine input for consistency with a minimum/maximum range, or for consistency with a test that evaluates a sequence of characters, such as one or more tests against regular expressions. For example, a counter value may be required to be a non-negative integer, and a password may be required to meet a minimum length and contain characters from multiple categories.
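
A sketch of both checks follows; the password policy shown (at least eight characters, at least one letter and one digit) is an illustrative assumption, not drawn from any standard.

```python
import re

def counter_is_valid(value: int) -> bool:
    # Range check: the counter must be a non-negative integer.
    return isinstance(value, int) and value >= 0

# Constraint check against a regular expression: at least eight characters,
# containing at least one letter and one digit (an illustrative policy only).
PASSWORD_PATTERN = re.compile(r"^(?=.*[A-Za-z])(?=.*\d).{8,}$")

def password_is_valid(password: str) -> bool:
    return PASSWORD_PATTERN.match(password) is not None

print(counter_is_valid(-1))              # False
print(password_is_valid("s3cretpass"))   # True under this illustrative policy
```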

Code and cross-reference check

Code and cross-reference validation includes operations to verify that data is consistent with one or more possibly-external rules, requirements, or collections relevant to a particular organization, context or set of underlying assumptions. These additional validity constraints may involve cross-referencing supplied data with a known look-up table or directory information service such as LDAP.

For example, a user-provided country code might be required to identify a current geopolitical region.
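
A minimal sketch, assuming a small hard-coded allow-list stands in for the external look-up table or directory service:

```python
# An illustrative allow-list; a production system might instead consult a
# maintained reference table or a directory service such as LDAP.
KNOWN_COUNTRY_CODES = {"US", "DE", "FR", "JP", "BR", "ZA"}

def country_code_is_valid(code: str) -> bool:
    # Cross-reference check: the supplied code must appear in the look-up set.
    return code.strip().upper() in KNOWN_COUNTRY_CODES

print(country_code_is_valid("de"))   # True
print(country_code_is_valid("XX"))   # False: not a known code
```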

Structured check

Structured validation allows for the combination of other kinds of validation, along with more complex processing. Such complex processing may include the testing of conditional constraints for an entire complex data object or set of process operations within a system.
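
A brief sketch of a structured check on a hypothetical order object, combining a type/range check, a code check, and a conditional constraint over the whole object (all field names are assumptions for illustration):

```python
def validate_order(order: dict) -> list[str]:
    """Structured check: combine simpler validations with a conditional
    constraint evaluated over the whole object."""
    errors = []
    # Data-type and range check on a single field.
    if not isinstance(order.get("quantity"), int) or order["quantity"] <= 0:
        errors.append("quantity must be a positive integer")
    # Code check against an allowed set of status codes.
    if order.get("status") not in {"draft", "submitted", "shipped"}:
        errors.append("status must be a known status code")
    # Conditional constraint on the object as a whole.
    if order.get("status") == "shipped" and not order.get("shipment_date"):
        errors.append("shipped orders require a shipment_date")
    return errors

print(validate_order({"quantity": 3, "status": "shipped"}))
# ['shipped orders require a shipment_date']
```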

Consistency check

Consistency validation ensures that data is logical. For example, the delivery date of an order can be prohibited from preceding its shipment date.
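
A minimal sketch of that rule (function and parameter names are illustrative):

```python
from datetime import date

def delivery_is_consistent(shipment_date: date, delivery_date: date) -> bool:
    # Consistency check: a delivery date may not precede the shipment date.
    return delivery_date >= shipment_date

print(delivery_is_consistent(date(2024, 3, 1), date(2024, 3, 4)))   # True
print(delivery_is_consistent(date(2024, 3, 1), date(2024, 2, 27)))  # False
```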

Example

Multiple kinds of data validation are relevant to 10-digit pre-2007 ISBNs (the 2005 edition of ISO 2108 required ISBNs to have 13 digits from 2007 onwards[3]).
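
As a sketch of that validation, the standard ISBN-10 check-digit rule (weights 10 down to 1, weighted sum divisible by 11, with 'X' standing for 10 in the final position) can be combined with allowed-character and length checks; the helper name below is illustrative.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Check-digit validation for a 10-digit ISBN: with weights 10 down to 1,
    the weighted sum of all ten digits must be divisible by 11 ('X' = 10)."""
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) != 10:
        return False                       # length check
    total = 0
    for position, ch in enumerate(digits):
        if ch == "X" and position == 9:
            value = 10                     # 'X' is only legal as the check digit
        elif ch.isdigit():
            value = int(ch)
        else:
            return False                   # allowed-character check
        total += (10 - position) * value   # weights 10, 9, ..., 1
    return total % 11 == 0

print(isbn10_is_valid("0-306-40615-2"))    # True: check digit matches
print(isbn10_is_valid("0-306-40615-3"))    # False: check digit does not match
```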

Validation types

Allowed character checks
Checks to ascertain that only expected characters are present in a field. For example, a numeric field may allow only the digits 0–9, the decimal point, and perhaps a minus sign or commas. A text field such as a personal name might disallow characters used for markup. An e-mail address might require at least one @ sign and various other structural details. Regular expressions are an effective way to implement such checks.
Batch totals
Checks for missing records. Numerical fields may be added together for all records in a batch. The batch total is entered and the computer checks that it is correct, e.g., by adding the 'Total Cost' fields of a number of transactions together.
Cardinality check
Checks that a record has a valid number of related records. For example, if a contact record is classified as "customer" then it must have at least one associated order (cardinality > 0). This type of rule can be complicated by additional conditions. For example, if a contact record in a payroll database is classified as "former employee" then it must not have any associated salary payments after the separation date (cardinality = 0).
Check digits
Used for numerical data. To support error detection, an extra digit, calculated from the other digits, is appended to the number.
Consistency checks
Checks fields to ensure that the data in those fields correspond, e.g., if an expiration date is in the past then the status is not "active".
Cross-system consistency checks
Compares data in different systems to ensure it is consistent. Systems may represent the same data differently, in which case comparison requires transformation (e.g., one system may store customer name in a single Name field as 'Doe, John Q', while another uses First_Name 'John' and Last_Name 'Doe' and Middle_Name 'Quality').
Data type checks
Checks that input conforms to the expected data type. For example, an input box accepting numeric data may reject the letter 'O'.
File existence check
Checks that a file with a specified name exists. This check is essential for programs that use file handling.
Format check
Checks that the data is in a specified format (template), e.g., dates have to be in the format YYYY-MM-DD. Regular expressions may be used for this kind of validation (a sketch appears after this list).
Presence check
Checks that data is present, e.g., customers may be required to have an email address.
Range check
Checks that the data is within a specified range of values, e.g., a probability must be between 0 and 1.
Referential integrity
Values in two relational database tables can be linked through foreign key and primary key. If values in the foreign key field are not constrained by internal mechanisms, then they should be validated to ensure that the referencing table always refers to a row in the referenced table.
Spelling and grammar check
Looks for spelling and grammatical errors.
Uniqueness check
Checks that each value is unique. This can be applied to a combination of several fields (e.g., Address, First Name, Last Name).
Table look up check
A table look up check compares data to a collection of allowed values.
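
As referenced under "Format check" above, a regular expression gives a compact way to test a date template; a minimal sketch follows (the pattern and function name are illustrative):

```python
import re

# Format check: dates must match the YYYY-MM-DD template. The pattern tests
# the template only; impossible dates such as 2024-02-31 would still need a
# separate consistency check.
DATE_TEMPLATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def date_has_valid_format(text: str) -> bool:
    return DATE_TEMPLATE.match(text) is not None

print(date_has_valid_format("2024-03-01"))  # True
print(date_has_valid_format("01/03/2024"))  # False: wrong template
```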

Post-validation actions

Enforcement Action
Enforcement action typically rejects the data entry request and requires the input actor to make a change that brings the data into compliance. This is most suitable for interactive use, where a real person is sitting at the computer making the entry. It also works well for batch uploads, where a file may be rejected and a set of messages sent back to the input source explaining why the data was rejected.
Another form of enforcement action involves automatically changing the data and saving a conformant version instead of the original version. This is most suitable for cosmetic changes. For example, converting an all-caps entry to a Pascal case entry does not need user input. Automatic enforcement is inappropriate where it leads to loss of business information, for example saving a truncated comment when the input is longer than expected, since this may result in the loss of significant data. (A sketch contrasting enforcement and advisory actions appears after this list.)
Advisory Action
Advisory actions typically allow data to be entered unchanged but send a message to the source actor indicating the validation issues that were encountered. This is most suitable for non-interactive systems, for systems where the change is not business-critical, for cleansing steps applied to existing data, and for verification steps of an entry process.
Verification Action
Verification actions are special cases of advisory actions. In this case, the source actor is asked to verify that the data is really what they want to enter, in light of a suggestion to the contrary. Here, the check step suggests an alternative (e.g., a check of a mailing address returns a different way of formatting that address or suggests a different address altogether). In this case, the user should be given the option of accepting the suggestion or keeping their version. By design this is not a strict validation process, and it is useful for capturing addresses in a new location or in a location that is not yet supported by the validation databases.
Log of validation
Even in cases where data validation finds no issues, it is important to provide a log of the validations that were conducted and their results. This helps identify missing data validation checks when data issues later surface, and it supports improvement of the validation process.
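
A minimal sketch contrasting enforcement and advisory behaviour, as mentioned above; the function names, the username rules, and the 30-character threshold are all assumptions for illustration:

```python
def enforce_username(raw: str) -> str:
    """Enforcement action (illustrative): reject non-compliant input outright,
    but silently normalize purely cosmetic issues such as surrounding
    whitespace and letter case."""
    cleaned = raw.strip().lower()          # automatic, cosmetic correction
    if not cleaned.isalnum():
        raise ValueError("username may contain only letters and digits")
    return cleaned

def advise_on_username(raw: str) -> tuple:
    """Advisory action (illustrative): accept the data unchanged and return
    any validation messages alongside it for logging or display."""
    warnings = []
    if len(raw) > 30:
        warnings.append("username is unusually long")
    if not raw.isalnum():
        warnings.append("username contains non-alphanumeric characters")
    return raw, warnings

print(enforce_username("  ALICE42  "))     # 'alice42'
print(advise_on_username("alice!42"))      # ('alice!42', [warning message])
```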

Validation and security

Failures or omissions in data validation can lead to data corruption or a security vulnerability.[4] Data validation checks that data are fit for purpose,[5] valid, sensible, reasonable and secure before they are processed.

See also

Data cleansing
Data integrity
ACID
Type system
Extract, transform, load
Consistency model
Email address
Requirement
Software testing
Verification and validation
Forms processing
Referential integrity-related: Entity–attribute–value model
Real-time database
Magic string
Reasoning system
Decision Model and Notation
International Bank Account Number
Design rule
Executable UML
Modbus

References