Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is "fit for [its] intended uses in operations, decision making and planning". [1] [2] [3] Moreover, data is deemed of high quality if it correctly represents the real-world construct to which it refers. Furthermore, apart from these definitions, as the number of data sources increases, the question of internal data consistency becomes significant, regardless of fitness for use for any particular external purpose. People's views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose. When this is the case, data governance is used to form agreed upon definitions and standards for data quality. In such cases, data cleansing, including standardization, may be required in order to ensure data quality. [4]
Defining data quality is difficult due to the many contexts data are used in, as well as the varying perspectives among end users, producers, and custodians of data. [5]
From a consumer perspective, data quality is: [5]
From a business perspective, data quality is:
From a standards-based perspective, data quality is:
Arguably, in all these cases, "data quality" is a comparison of the actual state of a particular set of data to a desired state, with the desired state being typically referred to as "fit for use," "to specification," "meeting consumer expectations," "free of defect," or "meeting requirements." These expectations, specifications, and requirements are usually defined by one or more individuals or groups, standards organizations, laws and regulations, business policies, or software development policies. [5]
Drilling down further, those expectations, specifications, and requirements are stated in terms of characteristics or dimensions of the data, such as: [5] [6] [7] [8] [11]
A systematic scoping review of the literature suggests that data quality dimensions and methods with real world data are not consistent in the literature, and as a result quality assessments are challenging due to the complex and heterogeneous nature of these data. [11]
Before the rise of the inexpensive computer data storage, massive mainframe computers were used to maintain name and address data for delivery services. This was so that mail could be properly routed to its destination. The mainframes used business rules to correct common misspellings and typographical errors in name and address data, as well as to track customers who had moved, died, gone to prison, married, divorced, or experienced other life-changing events. Government agencies began to make postal data available to a few service companies to cross-reference customer data with the National Change of Address registry (NCOA). This technology saved large companies millions of dollars in comparison to manual correction of customer data. Large companies saved on postage, as bills and direct marketing materials made their way to the intended customer more accurately. Initially sold as a service, data quality moved inside the walls of corporations, as low-cost and powerful server technology became available.[ citation needed ]
Companies with an emphasis on marketing often focused their quality efforts on name and address information, but data quality is recognized[ by whom? ] as an important property of all types of data. Principles of data quality can be applied to supply chain data, transactional data, and nearly every other category of data found. For example, making supply chain data conform to a certain standard has value to an organization by: 1) avoiding overstocking of similar but slightly different stock; 2) avoiding false stock-out; 3) improving the understanding of vendor purchases to negotiate volume discounts; and 4) avoiding logistics costs in stocking and shipping parts across a large organization.[ citation needed ]
For companies with significant research efforts, data quality can include developing protocols for research methods, reducing measurement error, bounds checking of data, cross tabulation, modeling and outlier detection, verifying data integrity, etc.[ citation needed ]
There are a number of theoretical frameworks for understanding data quality. A systems-theoretical approach influenced by American pragmatism expands the definition of data quality to include information quality, and emphasizes the inclusiveness of the fundamental dimensions of accuracy and precision on the basis of the theory of science (Ivanov, 1972). One framework, dubbed "Zero Defect Data" (Hansen, 1991) adapts the principles of statistical process control to data quality. Another framework seeks to integrate the product perspective (conformance to specifications) and the service perspective (meeting consumers' expectations) (Kahn et al. 2002). Another framework is based in semiotics to evaluate the quality of the form, meaning and use of the data (Price and Shanks, 2004). One highly theoretical approach analyzes the ontological nature of information systems to define data quality rigorously (Wand and Wang, 1996).
A considerable amount of data quality research involves investigating and describing various categories of desirable attributes (or dimensions) of data. Nearly 200 such terms have been identified and there is little agreement in their nature (are these concepts, goals or criteria?), their definitions or measures (Wang et al., 1993). Software engineers may recognize this as a similar problem to "ilities".
MIT has an Information Quality (MITIQ) Program, led by Professor Richard Wang, which produces a large number of publications and hosts a significant international conference in this field (International Conference on Information Quality, ICIQ). This program grew out of the work done by Hansen on the "Zero Defect Data" framework (Hansen, 1991).
In practice, data quality is a concern for professionals involved with a wide range of information systems, ranging from data warehousing and business intelligence to customer relationship management and supply chain management. One industry study estimated the total cost to the U.S. economy of data quality problems at over U.S. $600 billion per annum (Eckerson, 2002). Incorrect data – which includes invalid and outdated information – can originate from different data sources – through data entry, or data migration and conversion projects. [12]
In 2002, the USPS and PricewaterhouseCoopers released a report stating that 23.6 percent of all U.S. mail sent is incorrectly addressed. [13]
One reason contact data becomes stale very quickly in the average database – more than 45 million Americans change their address every year. [14]
In fact, the problem is such a concern that companies are beginning to set up a data governance team whose sole role in the corporation is to be responsible for data quality. In some[ who? ] organizations, this data governance function has been established as part of a larger Regulatory Compliance function - a recognition of the importance of Data/Information Quality to organizations.
Problems with data quality don't only arise from incorrect data; inconsistent data is a problem as well. Eliminating data shadow systems and centralizing data in a warehouse is one of the initiatives a company can take to ensure data consistency.
Enterprises, scientists, and researchers are starting to participate within data curation communities to improve the quality of their common data. [15]
The market is going some way to providing data quality assurance. A number of vendors make tools for analyzing and repairing poor quality data in situ, service providers can clean the data on a contract basis and consultants can advise on fixing processes or systems to avoid data quality problems in the first place. Most data quality tools offer a series of tools for improving data, which may include some or all of the following:
ISO 8000 is an international standard for data quality. [16]
Data quality assurance is the process of data profiling to discover inconsistencies and other anomalies in the data, as well as performing data cleansing [17] [18] activities (e.g. removing outliers, missing data interpolation) to improve the data quality.
These activities can be undertaken as part of data warehousing or as part of the database administration of an existing piece of application software. [19]
Data quality control is the process of controlling the usage of data for an application or a process. This process is performed both before and after a Data Quality Assurance (QA) process, which consists of discovery of data inconsistency and correction.
Before:
After QA process the following statistics are gathered to guide the Quality Control (QC) process:
The Data QC process uses the information from the QA process to decide to use the data for analysis or in an application or business process. General example: if a Data QC process finds that the data contains too many errors or inconsistencies, then it prevents that data from being used for its intended process which could cause disruption. Specific example: providing invalid measurements from several sensors to the automatic pilot feature on an aircraft could cause it to crash. Thus, establishing a QC process provides data usage protection.[ citation needed ]
Data Quality (DQ) is a niche area required for the integrity of the data management by covering gaps of data issues. This is one of the key functions that aid data governance by monitoring data to find exceptions undiscovered by current data management operations. Data Quality checks may be defined at attribute level to have full control on its remediation steps.[ citation needed ]
DQ checks and business rules may easily overlap if an organization is not attentive of its DQ scope. Business teams should understand the DQ scope thoroughly in order to avoid overlap. Data quality checks are redundant if business logic covers the same functionality and fulfills the same purpose as DQ. The DQ scope of an organization should be defined in DQ strategy and well implemented. Some data quality checks may be translated into business rules after repeated instances of exceptions in the past.[ citation needed ]
Below are a few areas of data flows that may need perennial DQ checks:
Completeness and precision DQ checks on all data may be performed at the point of entry for each mandatory attribute from each source system. Few attribute values are created way after the initial creation of the transaction; in such cases, administering these checks becomes tricky and should be done immediately after the defined event of that attribute's source and the transaction's other core attribute conditions are met.
All data having attributes referring to Reference Data in the organization may be validated against the set of well-defined valid values of Reference Data to discover new or discrepant values through the validity DQ check. Results may be used to update Reference Data administered under Master Data Management (MDM).
All data sourced from a third party to organization's internal teams may undergo accuracy (DQ) check against the third party data. These DQ check results are valuable when administered on data that made multiple hops after the point of entry of that data but before that data becomes authorized or stored for enterprise intelligence.
All data columns that refer to Master Data may be validated for its consistency check. A DQ check administered on the data at the point of entry discovers new data for the MDM process, but a DQ check administered after the point of entry discovers the failure (not exceptions) of consistency.
As data transforms, multiple timestamps and the positions of that timestamps are captured and may be compared against each other and its leeway to validate its value, decay, operational significance against a defined SLA (service level agreement). This timeliness DQ check can be utilized to decrease data value decay rate and optimize the policies of data movement timeline.
In an organization complex logic is usually segregated into simpler logic across multiple processes. Reasonableness DQ checks on such complex logic yielding to a logical result within a specific range of values or static interrelationships (aggregated business rules) may be validated to discover complicated but crucial business processes and outliers of the data, its drift from BAU (business as usual) expectations, and may provide possible exceptions eventually resulting into data issues. This check may be a simple generic aggregation rule engulfed by large chunk of data or it can be a complicated logic on a group of attributes of a transaction pertaining to the core business of the organization. This DQ check requires high degree of business knowledge and acumen. Discovery of reasonableness issues may aid for policy and strategy changes by either business or data governance or both.
Conformity checks and integrity checks need not covered in all business needs, it's strictly under the database architecture's discretion.
There are many places in the data movement where DQ checks may not be required. For instance, DQ check for completeness and precision on not–null columns is redundant for the data sourced from database. Similarly, data should be validated for its accuracy with respect to time when the data is stitched across disparate sources. However, that is a business rule and should not be in the DQ scope.[ citation needed ]
Regretfully, from a software development perspective, DQ is often seen as a nonfunctional requirement. And as such, key data quality checks/processes are not factored into the final software solution. Within Healthcare, wearable technologies or Body Area Networks, generate large volumes of data. [20] The level of detail required to ensure data quality is extremely high and is often underestimated. This is also true for the vast majority of mHealth apps, EHRs and other health related software solutions. However, some open source tools exist that examine data quality. [21] The primary reason for this, stems from the extra cost involved is added a higher degree of rigor within the software architecture.
The use of mobile devices in health, or mHealth, creates new challenges to health data security and privacy, in ways that directly affect data quality. [2] mHealth is an increasingly important strategy for delivery of health services in low- and middle-income countries. [22] Mobile phones and tablets are used for collection, reporting, and analysis of data in near real time. However, these mobile devices are commonly used for personal activities, as well, leaving them more vulnerable to security risks that could lead to data breaches. Without proper security safeguards, this personal use could jeopardize the quality, security, and confidentiality of health data. [23]
Data quality has become a major focus of public health programs in recent years, especially as demand for accountability increases. [24] Work towards ambitious goals related to the fight against diseases such as AIDS, Tuberculosis, and Malaria must be predicated on strong Monitoring and Evaluation systems that produce quality data related to program implementation. [25] These programs, and program auditors, increasingly seek tools to standardize and streamline the process of determining the quality of data, [26] verify the quality of reported data, and assess the underlying data management and reporting systems for indicators. [27] An example is WHO and MEASURE Evaluation's Data Quality Review Tool [28] WHO, the Global Fund, GAVI, and MEASURE Evaluation have collaborated to produce a harmonized approach to data quality assurance across different diseases and programs. [29]
There are a number of scientific works devoted to the analysis of the data quality in open data sources, such as Wikipedia, Wikidata, DBpedia and other. In the case of Wikipedia, quality analysis may relate to the whole article [30] Modeling of quality there is carried out by means of various methods. Some of them use machine learning algorithms, including Random Forest, [31] Support Vector Machine, [32] and others. Methods for assessing data quality in Wikidata, DBpedia and other LOD sources differ. [33]
The Electronic Commerce Code Management Association (ECCMA) is a member-based, international not-for-profit association committed to improving data quality through the implementation of international standards. ECCMA is the current project leader for the development of ISO 8000 and ISO 22745, which are the international standards for data quality and the exchange of material and service master data, respectively. ECCMA provides a platform for collaboration amongst subject experts on data quality and data governance around the world to build and maintain global, open standard dictionaries that are used to unambiguously label information. The existence of these dictionaries of labels allows information to be passed from one computer system to another without losing meaning. [35]
Risk management is the identification, evaluation, and prioritization of risks followed by coordinated and economical application of resources to minimize, monitor, and control the probability or impact of unfortunate events or to maximize the realization of opportunities.
The ISO 14000 family of standards by the International Organization for Standardization (ISO) relate to environmental management that exists to help organizations (a) minimize how their operations negatively affect the environment ; (b) comply with applicable laws, regulations, and other environmentally oriented requirements; and (c) continually improve in the above.
An audit is an "independent examination of financial information of any entity, whether profit oriented or not, irrespective of its size or legal form when such an examination is conducted with a view to express an opinion thereon." Auditing also attempts to ensure that the books of accounts are properly maintained by the concern as required by law. Auditors consider the propositions before them, obtain evidence, roll forward prior year working papers, and evaluate the propositions in their auditing report.
Health Level Seven, abbreviated to HL7, is a range of global standards for the transfer of clinical and administrative health data between applications with the aim to improve patient outcomes and health system performance. The HL7 standards focus on the application layer, which is "layer 7" in the Open Systems Interconnection model. The standards are produced by Health Level Seven International, an international standards organization, and are adopted by other standards issuing bodies such as American National Standards Institute and International Organization for Standardization. There are a range of primary standards that are commonly used across the industry, as well as secondary standards which are less frequently adopted.
ISO/IEC/IEEE 12207Systems and software engineering – Software life cycle processes is an international standard for software lifecycle processes. First introduced in 1995, it aims to be a primary standard that defines all the processes required for developing and maintaining software systems, including the outcomes and/or activities of each process.
Information technology (IT)governance is a subset discipline of corporate governance, focused on information technology (IT) and its performance and risk management. The interest in IT governance is due to the ongoing need within organizations to focus value creation efforts on an organization's strategic objectives and to better manage the performance of those responsible for creating this value in the best interest of all stakeholders. It has evolved from The Principles of Scientific Management, Total Quality Management and ISO 9001 Quality Management System.
Business process modeling (BPM), mainly used in business process management; software development, or systems engineering, is the action of capturing and representing processes of an enterprise, so that the current business processes may be analyzed, applied securely and consistently, improved, and automated. BPM is typically orchestrated by business analysts, leveraging their expertise in modeling practices. Subject matter experts, equipped with specialized knowledge of the processes being modeled, often collaborate within these teams. Alternatively, process models can be directly derived from digital traces within IT systems, such as event logs, utilizing process mining tools.
In the context of software engineering, software quality refers to two related but distinct notions:
In general, compliance means conforming to a rule, such as a specification, policy, standard or law. Compliance has traditionally been explained by reference to deterrence theory, according to which punishing a behavior will decrease the violations both by the wrongdoer and by others. This view has been supported by economic theory, which has framed punishment in terms of costs and has explained compliance in terms of a cost-benefit equilibrium. However, psychological research on motivation provides an alternative view: granting rewards or imposing fines for a certain behavior is a form of extrinsic motivation that weakens intrinsic motivation and ultimately undermines compliance.
Quality management ensures that an organization, product or service consistently functions well. It has four main components: quality planning, quality assurance, quality control, and quality improvement. Quality management is focused both on product and service quality and the means to achieve it. Quality management, therefore, uses quality assurance and control of processes as well as products to achieve more consistent quality. Quality control is also part of quality management. What a customer wants and is willing to pay for it, determines quality. It is a written or unwritten commitment to a known or unknown consumer in the market. Quality can be defined as how well the product performs its intended function.
Information quality (IQ) is the quality of the content of information systems. It is often pragmatically defined as: "The fitness for use of the information provided". IQ frameworks also provides a tangible approach to assess and measure DQ/IQ in a robust and rigorous manner.
Information technology controls are specific activities performed by persons or systems to ensure that computer systems operate in a way that minimises risk. They are a subset of an organisation's internal control. IT control objectives typically relate to assuring the confidentiality, integrity, and availability of data and the overall management of the IT function. IT controls are often described in two categories: IT general controls (ITGC) and IT application controls. ITGC includes controls over the hardware, system software, operational processes, access to programs and data, program development and program changes. IT application controls refer to controls to ensure the integrity of the information processed by the IT environment. Information technology controls have been given increased prominence in corporations listed in the United States by the Sarbanes-Oxley Act. The COBIT Framework is a widely used framework promulgated by the IT Governance Institute, which defines a variety of ITGC and application control objectives and recommended evaluation approaches.
Operational risk management (ORM) is defined as a continual recurring process that includes risk assessment, risk decision making, and the implementation of risk controls, resulting in the acceptance, mitigation, or avoidance of risk.
Internal auditing is an independent, objective assurance and consulting activity designed to add value and improve an organization's operations. It helps an organization accomplish its objectives by bringing a systematic, disciplined approach to evaluate and improve the effectiveness of risk management, control and governance processes. Internal auditing might achieve this goal by providing insight and recommendations based on analyses and assessments of data and business processes. With commitment to integrity and accountability, internal auditing provides value to governing bodies and senior management as an objective source of independent advice. Professionals called internal auditors are employed by organizations to perform the internal auditing activity.
A Publicly Available Specification or PAS is a standardization document that closely resembles a formal standard in structure and format but which has a different development model. The objective of a Publicly Available Specification is to speed up standardization. PASs are often produced in response to an urgent market need.
Security level management (SLM) comprises a quality assurance system for information system security.
eOTD is the acronym for the ECCMA Open Technical Dictionary. The dictionary is a language-independent database of concepts with associated terms, definitions and images used to unambiguously describe individuals, organizations, locations, goods, services, processes, rules, and regulations. The eOTD is maintained by the Electronic Commerce Code Management Association (ECCMA).
Software intelligence is insight into the inner workings and structural condition of software assets produced by software designed to analyze database structure, software framework and source code to better understand and control complex software systems in information technology environments. Similarly to business intelligence (BI), software intelligence is produced by a set of software tools and techniques for the mining of data and the software's inner-structure. Results are automatically produced and feed a knowledge base containing technical documentation and blueprints of the innerworking of applications, and make it available to all to be used by business and software stakeholders to make informed decisions, measure the efficiency of software development organizations, communicate about the software health, prevent software catastrophes.
Standardisation Testing and Quality Certification (STQC) Directorate, established in 1980, is an authoritative body offering quality assurance services to IT and Electronics domains.
Having a standardized data governance program in place means cleaning up corrupted or duplicated data and providing users with clean, accurate data as a basis for line-of-business software applications and for decision support analytics in business intelligence (BI) applications.
{{cite book}}
: CS1 maint: multiple names: authors list (link){{cite book}}
: CS1 maint: multiple names: authors list (link)Validity refers to the usefulness, accuracy, and correctness of data for its application. Traditionally, this has been referred to as data quality.