Dirty data


Dirty data, also known as rogue data, [1] are inaccurate, incomplete or inconsistent data, especially in a computer system or database. [2]


Dirty data can contain mistakes such as spelling or punctuation errors, incorrect data associated with a field, incomplete or outdated data, or data that have been duplicated in the database. They can be cleaned through a process known as data cleansing. [3]
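
As a rough illustration of two of the error types listed above (incomplete records and duplicates), the following Python sketch flags and removes them from a small in-memory record set. The records, field names, and function names are hypothetical and chosen only for this example. Real data cleansing usually needs domain-specific rules, so this should be read as a minimal sketch and not as a standard method.

```python
# Hypothetical sample records; the field names are illustrative assumptions.
RECORDS = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "city": "London"},
    {"name": "ada lovelace ", "email": "ADA@example.com", "city": "London"},  # duplicate once normalized
    {"name": "Alan Turing", "email": "", "city": "Wilmslow"},  # incomplete: empty email
    {"name": "Grace Hopper", "email": "grace@example.com", "city": "New York"},
]


def normalize(record):
    """Trim surrounding whitespace and lowercase every string field."""
    return {key: value.strip().lower() for key, value in record.items()}


def find_incomplete(records):
    """Return the indexes of records that have at least one empty field."""
    return [
        i for i, record in enumerate(records)
        if any(not value.strip() for value in record.values())
    ]


def find_duplicates(records):
    """Return the indexes of records that repeat an earlier record after normalization."""
    seen = set()
    duplicates = []
    for i, record in enumerate(records):
        key = tuple(sorted(normalize(record).items()))
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


def clean(records):
    """Drop incomplete and duplicate records, returning normalized copies of the rest."""
    dirty = set(find_incomplete(records)) | set(find_duplicates(records))
    return [normalize(r) for i, r in enumerate(records) if i not in dirty]


if __name__ == "__main__":
    print("incomplete:", find_incomplete(RECORDS))
    print("duplicates:", find_duplicates(RECORDS))
    print("clean rows:", len(clean(RECORDS)))
```

Normalizing before comparing is a design choice of this sketch: it treats records that differ only in letter case or surrounding whitespace as duplicates, which may be too aggressive for some data sets.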

Dirty Data (Social Science)

In sociology, dirty data refer to secretive data whose discovery is discrediting to those who kept the data secret. In the definition given by Gary T. Marx, Professor Emeritus at MIT, dirty data are one of four types of data. [4]


Related Research Articles

Bioinformatics Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, chemistry, physics, computer science, information engineering, mathematics and statistics to analyze and interpret the biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.

Database Organized collection of data

In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spans formal techniques and practical considerations including data modeling, efficient data representation and storage, query languages, security and privacy of sensitive data, and distributed computing issues including supporting concurrent access and fault tolerance.

Kerberos is a computer-network authentication protocol that works on the basis of tickets to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner. Its designers aimed it primarily at a client–server model, and it provides mutual authentication—both the user and the server verify each other's identity. Kerberos protocol messages are protected against eavesdropping and replay attacks.

Steganography is the practice of concealing a message within another message or a physical object. In computing/electronic contexts, a computer file, message, image, or video is concealed within another file, message, image, or video. The word steganography comes from Greek steganographia, which combines the words steganós, meaning "covered or concealed", and -graphia meaning "writing".

Data mining Process of extracting and discovering patterns in large data sets

Data mining is a process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Secrecy Practice of hiding information from certain individuals or groups for personal or interpersonal reasons

Secrecy is the practice of hiding information from certain individuals or groups who do not have the "need to know", perhaps while sharing it with other individuals. That which is kept hidden is known as the secret.

A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process (call it X) with unobservable ("hidden") states. As part of the definition, an HMM requires that there be an observable process Y whose outcomes are "influenced" by the outcomes of X in a known way. Since X cannot be observed directly, the goal is to learn about X by observing Y. An HMM has the additional requirement that the outcome of Y at time t = t_0 may be "influenced" exclusively by the outcome of X at t = t_0, and that the outcomes of X and Y at t < t_0 must not affect the outcome of Y at t = t_0.

Multiversion concurrency control (MVCC) is a concurrency control method commonly used by database management systems to provide concurrent access to the database and in programming languages to implement transactional memory.

Machine learning Study of algorithms that improve automatically through experience

Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

Easter egg (media) Intentional inside joke, hidden message or image, or secret feature of a work

An Easter egg is a message, image, or feature hidden in software, a video game, a film, or another, usually electronic, medium. The term used in this manner was coined around 1979 by Steve Wright, the then-Director of Software Development in the Atari Consumer Division, to describe a hidden message in the Atari video game Adventure, in reference to an Easter egg hunt. The earliest known video game Easter egg is in Moonlander (1973), in which the player tries to land a spaceship on the moon; if the player flies horizontally enough, they encounter a McDonald's restaurant and if they land next to it an astronaut will visit it instead of standing next to the ship. The earliest known Easter egg in software in general is one placed in the "make" command for PDP-6/PDP-10 computers sometime in October 1967–October 1968, wherein if the user attempts to create a file named "love" by typing "make love", the program responds "not war?" before proceeding.

Algorithmic learning theory is a mathematical framework for analyzing machine learning problems and algorithms. Synonyms include formal learning theory and algorithmic inductive inference. Algorithmic learning theory is different from statistical learning theory in that it does not make use of statistical assumptions and analysis. Both algorithmic and statistical learning theory are concerned with machine learning and can thus be viewed as branches of computational learning theory.

Time series Sequence of data points over time

In mathematics, a time series is a series of data points indexed in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.

Edward Norton Lorenz American mathematician

Edward Norton Lorenz was an American mathematician and meteorologist who established the theoretical basis of weather and climate predictability, as well as the basis for computer-aided atmospheric physics and meteorology. He is best known as the founder of modern chaos theory, a branch of mathematics focusing on the behavior of dynamical systems that are highly sensitive to initial conditions.

Operation Neptune (espionage)

Operation Neptune was a 1964 disinformation operation by the Czechoslovak secret service, the StB, involving Nazi-era documents.

Pan-STARRS Multi-telescope astronomical survey

The Panoramic Survey Telescope and Rapid Response System located at Haleakala Observatory, Hawaii, US, consists of astronomical cameras, telescopes and a computing facility that is surveying the sky for moving or variable objects on a continual basis, and also producing accurate astrometry and photometry of already-detected objects. In January 2019 the second Pan-STARRS data release was announced. At 1.6 petabytes, it is the largest volume of astronomical data ever released.

Data cleansing or data cleaning is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting.

Electronic discovery refers to discovery in legal proceedings such as litigation, government investigations, or Freedom of Information Act requests, where the information sought is in electronic format. Electronic discovery is subject to rules of civil procedure and agreed-upon processes, often involving review for privilege and relevance before data are turned over to the requesting party.

In computer science, an opaque data type is a data type whose concrete data structure is not defined in an interface. This enforces information hiding, since its values can only be manipulated by calling subroutines that have access to the missing information. The concrete representation of the type is hidden from its users, and the visible implementation is incomplete. A data type whose representation is visible is called transparent. Opaque data types are frequently used to implement abstract data types.

Top Secret America

Top Secret America is a series of investigative articles published on the post-9/11 growth of the United States Intelligence Community. The report was first published in The Washington Post on July 19, 2010, by Pulitzer Prize-winning author Dana Priest and William Arkin.

ChEMBL Chemical database of bioactive molecules with drug-like properties

ChEMBL or ChEMBLdb is a manually curated chemical database of bioactive molecules with drug-like properties. It is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK.

References

  1. Spotless version 12 out now
  2. Chu, Margaret (2004), "What Are Dirty Data?", Blissful Data, p. 71 et seq, ISBN 9780814407806
  3. Wu, S. (2013), "A review on coarse warranty data and analysis" (PDF), Reliability Engineering & System Safety, 114: 1–11, doi:10.1016/j.ress.2012.12.021
  4. "Notes on the discovery, collection, and assessment of hidden and". web.mit.edu. Retrieved 2017-02-17.