Data preparation

Data preparation is the act of manipulating (or pre-processing) raw data, which may come from disparate data sources, into a form that can be readily and accurately analysed, for example for business purposes. [1]

Data preparation is the first step in data analytics projects and can include many discrete tasks such as data loading (ingestion), data fusion, data cleaning, data augmentation, and data delivery. [2]
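As an illustration only, the sketch below strings these tasks together using pandas; the file names and column names are hypothetical and not taken from the article.

```python
# A minimal sketch of the discrete tasks named above; paths and columns are assumed.
import pandas as pd

# Data loading / ingestion: read raw records from two hypothetical sources.
orders_eu = pd.read_csv("orders_eu.csv")
orders_us = pd.read_csv("orders_us.csv")

# Data fusion: combine the sources into a single table.
orders = pd.concat([orders_eu, orders_us], ignore_index=True)

# Data cleaning: drop exact duplicates and rows missing a mandatory field.
orders = orders.drop_duplicates().dropna(subset=["order_id"])

# Data augmentation: derive a new field from existing ones.
orders["net_value"] = orders["quantity"] * orders["unit_price"]

# Data delivery: write the prepared dataset for downstream analysis.
orders.to_csv("orders_prepared.csv", index=False)
```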

The issues to be dealt with fall into two main categories: systematic errors involving large numbers of data records, probably because they have come from different sources, and individual errors affecting small numbers of data records, probably due to errors in the original data entry.

Data specification

The first step is to set out a full and detailed specification of the format of each data field and what the entries mean. This should take careful account of issues such as those illustrated in the example below.

See also data definition specification.
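As a minimal sketch (not part of the original article), such a field specification can be captured in machine-readable form and checked programmatically; the field names, allowed values, and meanings below are hypothetical.

```python
# A hypothetical machine-readable field specification: each entry records the
# expected format and the meaning of a field.
FIELD_SPEC = {
    "country_code": {
        "type": str,
        "length": 2,
        "allowed": {"DE", "EE", "FR", "GB"},  # unambiguous two-letter codes
        "meaning": "country of the delivery address",
    },
    "quantity": {
        "type": int,
        "min": 0,
        "meaning": "number of units ordered",
    },
}

def violates_spec(field, value):
    """Return True if a value does not conform to the specification."""
    spec = FIELD_SPEC[field]
    if not isinstance(value, spec["type"]):
        return True
    if "length" in spec and len(value) != spec["length"]:
        return True
    if "allowed" in spec and value not in spec["allowed"]:
        return True
    if "min" in spec and value < spec["min"]:
        return True
    return False

violates_spec("country_code", "555")  # True: wrong length, not in the allowed set
```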

Example

Suppose there is a two-character alphabetic field that indicates geographical location. It is possible that in one data source a code "EE" means "Europe" and in another data source the same code means "Estonia". One would need to devise an unambiguous set of codes and amend the code in one set of records accordingly.
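A minimal sketch of such a recoding step, with hypothetical source names and target codes, might look like this:

```python
# Records are recoded per source so that every value has a single, agreed meaning.
RECODE_BY_SOURCE = {
    # In this hypothetical source, "EE" meant the region "Europe"; map it to a region code.
    "source_a": {"EE": "EUR"},
    # In the other source, "EE" already meant Estonia, so it is left unchanged.
    "source_b": {},
}

def recode(record):
    mapping = RECODE_BY_SOURCE[record["source"]]
    record["geo_code"] = mapping.get(record["geo_code"], record["geo_code"])
    return record

recode({"source": "source_a", "geo_code": "EE"})  # geo_code becomes "EUR"
```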

Furthermore, the "geographical area" might refer to any of, for example, the delivery address, billing address, address from which goods were supplied, billing currency, or applicable national regulations. All these matters must be covered in the specification.

There could be some records with "X" or "555" in that field. Clearly, this is invalid data as it does not conform to the specification. If there are only small numbers of such records, one would either correct them manually or, if precision is not important, simply delete those records from the file. Another possibility would be to create a "not known" category.
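The three options can be sketched as follows; the valid codes, manual corrections, and "not known" marker are all illustrative assumptions rather than values from the article.

```python
VALID_CODES = {"DE", "EE", "FR", "GB"}
MANUAL_FIXES = {"X": "GB"}  # corrections supplied by hand after inspection

def handle_invalid(records, precision_matters=True):
    cleaned = []
    for rec in records:
        code = rec["geo_code"]
        if code in VALID_CODES:
            cleaned.append(rec)
        elif code in MANUAL_FIXES:        # option 1: correct manually
            rec["geo_code"] = MANUAL_FIXES[code]
            cleaned.append(rec)
        elif not precision_matters:       # option 2: delete the record
            continue
        else:                             # option 3: assign a "not known" category
            rec["geo_code"] = "ZZ"
            cleaned.append(rec)
    return cleaned
```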

Other examples of invalid data requiring correction

Where possible and economic, data should be verified against an authoritative source (e.g. business information is referenced against a D&B database to ensure accuracy). [3] [4]
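As a hedged sketch only, such verification might amount to looking each record up in a trusted reference extract; the reference file and its columns below are hypothetical and do not represent an actual D&B interface.

```python
import csv

def load_reference(path):
    """Load authoritative company records keyed by registration number (assumed column names)."""
    with open(path, newline="") as f:
        return {row["reg_no"]: row for row in csv.DictReader(f)}

def verify(record, reference):
    """Return True if the record matches the authoritative entry."""
    ref = reference.get(record["reg_no"])
    return ref is not None and ref["company_name"] == record["company_name"]
```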

Given the variety of data sources (e.g. databases, business applications) and the range of formats in which data can arrive, data preparation can be quite involved and complex. There are many tools and technologies [5] that are used for data preparation. The cost of cleaning the data should always be balanced against the value of the improved accuracy.

Self-service data preparation

Traditional tools and technologies, such as scripting languages, extract, transform, load (ETL) tools, and data quality tools, are not meant for business users. They typically require programming or IT skills that most business users do not have.[citation needed]

Several companies, such as Paxata, Trifacta, Alteryx, Talend, and Ataccama provide visual interfaces that display the data and allow the user to directly explore, structure, clean, augment, and update sample data provided by the user.

Once the preparation work is complete, the underlying steps can be run on other datasets to perform the same operations. This reuse provides a significant productivity boost when compared to more traditional manual and hand-coding methods for data preparation.
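The following sketch illustrates the general idea of reusable preparation steps in hand-written code; it is not how any of the named products implement it, and the step functions and file name are hypothetical.

```python
# The steps worked out on sample data are captured as an ordered recipe
# and replayed, unchanged, on another dataset.
import pandas as pd

def drop_blank_ids(df):
    return df.dropna(subset=["order_id"])

def recode_geo(df):
    return df.assign(geo_code=df["geo_code"].replace({"EE": "EUR"}))

PREP_RECIPE = [drop_blank_ids, recode_geo]  # order matters

def run_recipe(df, recipe=PREP_RECIPE):
    for step in recipe:
        df = step(df)
    return df

# The recipe prepared on a sample can be applied to a full extract.
full_data = pd.read_csv("orders_full.csv")  # hypothetical path
prepared = run_recipe(full_data)
```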


References

  1. Friedland, David (September 7, 2016). "A Fresh Look at Data Preparation". IRI (Blog Article). IRI, The CoSort Company.
  2. Pyle, Dorian (April 5, 1999). Data Preparation for Data Mining. Morgan Kaufmann. ISBN 9781558605299 – via Google Books.
  3. "salesify".
  4. Data Preparation Article. [permanent dead link]
  5. "Tools / Languages for Data Cleaning". www.kdnuggets.com (Poll).