Data validation

In computer science, data validation is the process of ensuring data has undergone data cleansing to confirm it has data quality, that is, that it is both correct and useful. It uses routines, often called "validation rules", "validation constraints", or "check routines", that check for the correctness, meaningfulness, and security of data that are input to the system. The rules may be implemented through the automated facilities of a data dictionary, or by the inclusion of explicit validation logic in the application programs of the system.

This is distinct from formal verification, which attempts to prove or disprove the correctness of algorithms for implementing a specification or property.

Overview

Data validation is intended to provide certain well-defined guarantees for fitness and consistency of data in an application or automated system. Data validation rules can be defined and designed using various methodologies, and deployed in various contexts.[1] Their implementation can use declarative data integrity rules or procedure-based business rules.[2]

The guarantees of data validation do not necessarily include accuracy, and it is possible for data entry errors such as misspellings to be accepted as valid. Other clerical and/or computer controls may be applied to reduce inaccuracy within a system.

Different kinds

In evaluating the basics of data validation, generalizations can be made regarding the different kinds of validation according to their scope, complexity, and purpose.

For example:

Data-type check

Data type validation is customarily carried out on one or more simple data fields.

The simplest kind of data type validation verifies that the individual characters provided through user input are consistent with the expected characters of one or more known primitive data types as defined in a programming language or data storage and retrieval mechanism.

For example, an integer field may require input to use only characters 0 through 9.
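
As a minimal sketch of such a character-level check (the function name and accepted character set are illustrative, not taken from any particular library), this might look like:

```python
def is_valid_integer_field(text: str) -> bool:
    # Data-type check: every character must be one of the digits 0-9.
    return len(text) > 0 and all(ch in "0123456789" for ch in text)

print(is_valid_integer_field("1024"))   # True
print(is_valid_integer_field("10.5"))   # False: '.' is not an allowed character
```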

Simple range and constraint check

Simple range and constraint validation may examine input for consistency with a minimum/maximum range, or for consistency with a test that evaluates a sequence of characters, such as one or more tests against regular expressions. For example, a counter value may be required to be a non-negative integer, and a password may be required to meet a minimum length and contain characters from multiple categories.
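
A sketch of both checks follows; the password policy shown (at least eight characters, at least one letter and one digit) is an illustrative assumption, not drawn from any standard.

```python
import re

def counter_is_valid(value: int) -> bool:
    # Range check: the counter must be a non-negative integer.
    return isinstance(value, int) and value >= 0

# Constraint check against a regular expression: at least eight characters,
# containing at least one letter and one digit (an illustrative policy only).
PASSWORD_PATTERN = re.compile(r"^(?=.*[A-Za-z])(?=.*\d).{8,}$")

def password_is_valid(password: str) -> bool:
    return PASSWORD_PATTERN.match(password) is not None

print(counter_is_valid(-1))              # False
print(password_is_valid("s3cretpass"))   # True under this illustrative policy
```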

Code and cross-reference check

Code and cross-reference validation includes operations to verify that data is consistent with one or more possibly-external rules, requirements, or collections relevant to a particular organization, context or set of underlying assumptions. These additional validity constraints may involve cross-referencing supplied data with a known look-up table or directory information service such as LDAP.

For example, a user-provided country code might be required to identify a current geopolitical region.
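
A minimal sketch, assuming a small hard-coded allow-list stands in for the external look-up table or directory service:

```python
# An illustrative allow-list; a production system might instead consult a
# maintained reference table or a directory service such as LDAP.
KNOWN_COUNTRY_CODES = {"US", "DE", "FR", "JP", "BR", "ZA"}

def country_code_is_valid(code: str) -> bool:
    # Cross-reference check: the supplied code must appear in the look-up set.
    return code.strip().upper() in KNOWN_COUNTRY_CODES

print(country_code_is_valid("de"))   # True
print(country_code_is_valid("XX"))   # False: not a known code
```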

Structured check

Structured validation allows for the combination of other kinds of validation, along with more complex processing. Such complex processing may include the testing of conditional constraints for an entire complex data object or set of process operations within a system.
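
A brief sketch of a structured check on a hypothetical order object, combining a type/range check, a code check, and a conditional constraint over the whole object (all field names are assumptions for illustration):

```python
def validate_order(order: dict) -> list[str]:
    """Structured check: combine simpler validations with a conditional
    constraint evaluated over the whole object."""
    errors = []
    # Data-type and range check on a single field.
    if not isinstance(order.get("quantity"), int) or order["quantity"] <= 0:
        errors.append("quantity must be a positive integer")
    # Code check against an allowed set of status codes.
    if order.get("status") not in {"draft", "submitted", "shipped"}:
        errors.append("status must be a known status code")
    # Conditional constraint on the object as a whole.
    if order.get("status") == "shipped" and not order.get("shipment_date"):
        errors.append("shipped orders require a shipment_date")
    return errors

print(validate_order({"quantity": 3, "status": "shipped"}))
# ['shipped orders require a shipment_date']
```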

Consistency check

Consistency validation ensures that data is logical. For example, the delivery date of an order can be prohibited from preceding its shipment date.
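
A minimal sketch of that rule (function and parameter names are illustrative):

```python
from datetime import date

def delivery_is_consistent(shipment_date: date, delivery_date: date) -> bool:
    # Consistency check: a delivery date may not precede the shipment date.
    return delivery_date >= shipment_date

print(delivery_is_consistent(date(2024, 3, 1), date(2024, 3, 4)))   # True
print(delivery_is_consistent(date(2024, 3, 1), date(2024, 2, 27)))  # False
```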

Example

Multiple kinds of data validation are relevant to 10-digit pre-2007 ISBNs (the 2005 edition of ISO 2108 required ISBNs to have 13 digits from 2007 onwards[3]).
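
As a sketch of that validation, the standard ISBN-10 check-digit rule (weights 10 down to 1, weighted sum divisible by 11, with 'X' standing for 10 in the final position) can be combined with allowed-character and length checks; the helper name below is illustrative.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Check-digit validation for a 10-digit ISBN: with weights 10 down to 1,
    the weighted sum of all ten digits must be divisible by 11 ('X' = 10)."""
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) != 10:
        return False                       # length check
    total = 0
    for position, ch in enumerate(digits):
        if ch == "X" and position == 9:
            value = 10                     # 'X' is only legal as the check digit
        elif ch.isdigit():
            value = int(ch)
        else:
            return False                   # allowed-character check
        total += (10 - position) * value   # weights 10, 9, ..., 1
    return total % 11 == 0

print(isbn10_is_valid("0-306-40615-2"))    # True: check digit matches
print(isbn10_is_valid("0-306-40615-3"))    # False: check digit does not match
```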

Validation types

Allowed character checks
Checks to ascertain that only expected characters are present in a field. For example, a numeric field may allow only the digits 0–9, the decimal point, and perhaps a minus sign or commas. A text field such as a personal name might disallow characters used for markup. An e-mail address might require at least one @ sign and various other structural details. Regular expressions are an effective way to implement such checks.
Batch totals
Checks for missing records. Numerical fields may be added together for all records in a batch. The batch total is entered and the computer checks that it is correct, e.g., by adding the 'Total Cost' fields of a number of transactions together.
Cardinality check
Checks that a record has a valid number of related records. For example, if a contact record is classified as "customer" then it must have at least one associated order (cardinality > 0). This type of rule can be complicated by additional conditions. For example, if a contact record in a payroll database is classified as "former employee" then it must not have any associated salary payments after the separation date (cardinality = 0).
Check digits
Used for numerical data. To support error detection, an extra digit, calculated from the other digits, is appended to the number.
Consistency checks
Checks fields to ensure that the data in those fields correspond, e.g., if an expiration date is in the past then the status is not "active".
Cross-system consistency checks
Compares data in different systems to ensure it is consistent. Systems may represent the same data differently, in which case comparison requires transformation (e.g., one system may store customer name in a single Name field as 'Doe, John Q', while another uses First_Name 'John' and Last_Name 'Doe' and Middle_Name 'Quality').
Data type checks
Checks that input conforms to the expected data type. For example, an input box accepting numeric data may reject the letter 'O'.
File existence check
Checks that a file with a specified name exists. This check is essential for programs that use file handling.
Format check
Checks that the data is in a specified format (template), e.g., dates have to be in the format YYYY-MM-DD. Regular expressions may be used for this kind of validation (a sketch appears after this list).
Presence check
Checks that data is present, e.g., customers may be required to have an email address.
Range check
Checks that the data is within a specified range of values, e.g., a probability must be between 0 and 1.
Referential integrity
Values in two relational database tables can be linked through foreign key and primary key. If values in the foreign key field are not constrained by internal mechanisms, then they should be validated to ensure that the referencing table always refers to a row in the referenced table.
Spelling and grammar check
Looks for spelling and grammatical errors.
Uniqueness check
Checks that each value is unique. This can be applied to a combination of several fields (e.g., Address, First Name, Last Name).
Table look up check
A table look up check compares data to a collection of allowed values.
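
As referenced under "Format check" above, a regular expression gives a compact way to test a date template; a minimal sketch follows (the pattern and function name are illustrative):

```python
import re

# Format check: dates must match the YYYY-MM-DD template. The pattern tests
# the template only; impossible dates such as 2024-02-31 would still need a
# separate consistency check.
DATE_TEMPLATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def date_has_valid_format(text: str) -> bool:
    return DATE_TEMPLATE.match(text) is not None

print(date_has_valid_format("2024-03-01"))  # True
print(date_has_valid_format("01/03/2024"))  # False: wrong template
```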

Post-validation actions

Enforcement Action
Enforcement action typically rejects the data entry request and requires the input actor to make a change that brings the data into compliance. This is most suitable for interactive use, where a real person is sitting at the computer making the entry. It also works well for batch uploads, where a file may be rejected and a set of messages sent back to the input source explaining why the data was rejected.
Another form of enforcement action involves automatically changing the data and saving a conformant version instead of the original version. This is most suitable for cosmetic changes. For example, converting an all-caps entry to a Pascal case entry does not need user input. Automatic enforcement is inappropriate where it leads to loss of business information, for example saving a truncated comment when the input is longer than expected, since this may result in the loss of significant data. (A sketch contrasting enforcement and advisory actions appears after this list.)
Advisory Action
Advisory actions typically allow data to be entered unchanged but send a message to the source actor indicating the validation issues that were encountered. This is most suitable for non-interactive systems, for systems where the change is not business-critical, for cleansing steps applied to existing data, and for verification steps of an entry process.
Verification Action
Verification actions are special cases of advisory actions. In this case, the source actor is asked to verify that the data is really what they want to enter, in light of a suggestion to the contrary. Here, the check step suggests an alternative (e.g., a check of a mailing address returns a different way of formatting that address or suggests a different address altogether). In this case, the user should be given the option of accepting the suggestion or keeping their version. By design this is not a strict validation process, and it is useful for capturing addresses in a new location or in a location that is not yet supported by the validation databases.
Log of validation
Even in cases where data validation finds no issues, it is important to provide a log of the validations that were conducted and their results. This helps identify missing data validation checks when data issues later surface, and it supports improvement of the validation process.
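
A minimal sketch contrasting enforcement and advisory behaviour, as mentioned above; the function names, the username rules, and the 30-character threshold are all assumptions for illustration:

```python
def enforce_username(raw: str) -> str:
    """Enforcement action (illustrative): reject non-compliant input outright,
    but silently normalize purely cosmetic issues such as surrounding
    whitespace and letter case."""
    cleaned = raw.strip().lower()          # automatic, cosmetic correction
    if not cleaned.isalnum():
        raise ValueError("username may contain only letters and digits")
    return cleaned

def advise_on_username(raw: str) -> tuple:
    """Advisory action (illustrative): accept the data unchanged and return
    any validation messages alongside it for logging or display."""
    warnings = []
    if len(raw) > 30:
        warnings.append("username is unusually long")
    if not raw.isalnum():
        warnings.append("username contains non-alphanumeric characters")
    return raw, warnings

print(enforce_username("  ALICE42  "))     # 'alice42'
print(advise_on_username("alice!42"))      # ('alice!42', [warning message])
```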

Validation and security

Failures or omissions in data validation can lead to data corruption or a security vulnerability.[4] Data validation checks that data are fit for purpose,[5] valid, sensible, reasonable and secure before they are processed.

See also

Data cleansing
Data integrity
ACID
Type system
Extract, transform, load
Consistency model
Email address
Requirement
Software testing
Verification and validation
Forms processing
Referential integrity-related: Entity–attribute–value model
Real-time database
Magic string
Reasoning system
Decision Model and Notation
International Bank Account Number
Design rule
Executable UML
Modbus

References