Data wrangling

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to ensure the resulting data is both high-quality and useful. Data analysts typically spend the majority of their time wrangling data rather than actually analyzing it.

The process of data wrangling may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data wrangling typically follows a set of general steps which begin with extracting the data in a raw form from the data source, "munging" the raw data (e.g. sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use. [1] It is closely aligned with the ETL process.
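A minimal sketch of this general flow in Python, assuming the pandas library; the file and column names ("source_data.csv", "record_id", "recorded_at") are purely illustrative:

```python
import pandas as pd

# Extract the data in raw form from the data source.
raw = pd.read_csv("source_data.csv")

# "Munge" the raw data, e.g. by sorting, and parse a text column into a
# predefined (typed) structure.
munged = raw.sort_values("record_id")
munged["recorded_at"] = pd.to_datetime(munged["recorded_at"], errors="coerce")

# Deposit the resulting content into a data sink for storage and future use.
munged.to_csv("wrangled_records.csv", index=False)
```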

Background

The "wrangler" non-technical term is often said to derive from work done by the United States Library of Congress's National Digital Information Infrastructure and Preservation Program (NDIIPP) and their program partner the Emory University Libraries based MetaArchive Partnership. The term "mung" has roots in munging as described in the Jargon File. [2] The term "data wrangler" was also suggested as the best analogy to describe someone working with data. [3]

One of the first mentions of data wrangling in a scientific context was by Donald Cline during the NASA/NOAA Cold Lands Processes Experiment. [4] Cline stated the data wranglers "coordinate the acquisition of the entire collection of the experiment data." Cline also specifies duties typically handled by a storage administrator for working with large amounts of data. This can occur in areas like major research projects and the making of films with a large amount of complex computer-generated imagery. In research, this involves both data transfer from research instrument to storage grid or storage facility as well as data manipulation for re-analysis via high-performance computing instruments or access via cyberinfrastructure-based digital libraries.

With the rise of artificial intelligence in data science, it has become increasingly important that automated data wrangling be subject to very strict checks and balances, which is why the munging of data has not been fully automated by machine learning. Data munging requires more than an automated solution; it requires knowledge of what information should be removed, and artificial intelligence has not yet reached the point of understanding such things. [5]

Connection to data mining

Data wrangling is a superset of data mining and requires processes that some data mining uses, but not always. The process of data mining is to find patterns within large data sets, whereas data wrangling transforms data in order to deliver insights about that data. Even though data wrangling is a superset of data mining, that does not mean data mining does not use it; there are many use cases for data wrangling in data mining. Data wrangling can benefit data mining by removing data that does not benefit the overall set or is not formatted properly, which yields better results for the overall data mining process.

An example of data mining that is closely related to data wrangling is ignoring data from a set that is not connected to the goal: say a data set covers the state of Texas and the goal is to obtain statistics on the residents of Houston. The data in the set related to the residents of Dallas is not useful to that goal and can be removed before processing to improve the efficiency of the data mining process.
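A brief sketch of this kind of pre-processing filter in Python with pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical state-wide data set.
texas = pd.DataFrame({
    "resident_id": [1, 2, 3, 4],
    "city": ["Houston", "Dallas", "Houston", "Dallas"],
    "age": [34, 51, 28, 45],
})

# Remove rows that are not connected to the goal (statistics on Houston)
# before mining, so the mining step processes less irrelevant data.
houston = texas[texas["city"] == "Houston"]
```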

Benefits

With an increase in raw data comes an increase in data that is not inherently useful, which increases the time spent cleaning and organizing it before it can be analyzed; this is where data wrangling comes into play. The results of data wrangling can provide important metadata statistics for further insights about the data, though it is important to ensure that metadata is consistent, as inconsistencies can create roadblocks. Data wrangling allows analysts to analyze more complex data more quickly, achieve more accurate results, and, as a consequence, make better decisions. Many businesses have adopted data wrangling because of the success it has brought.

Core ideas

[Image: Turning messy data into useful statistics]

The main steps in data wrangling are as follows:

  1. Data discovery
    This all-encompassing term covers the first step of the process: understanding and becoming familiar with the data.
  2. Structuring
    The next step is to organize the data. Raw data is typically unorganized and much of it may not be useful for the end product. This step is important for easier computation and analysis in the later steps.
  3. Cleaning
    There are many forms of data cleaning: for example, catching dates formatted in different ways, removing outliers that would skew results, and formatting null values consistently (a sketch of these operations appears after this list). This step is important for assuring the overall quality of the data.
  4. Enriching
    At this step, determine whether additional data that could easily be added would benefit the data set.
  5. Validating
    This step is similar to structuring and cleaning. Apply repetitive sequences of validation rules to ensure data consistency as well as quality and security. An example of a validation rule is confirming the accuracy of fields by cross-checking data.
  6. Publishing
    Prepare the data set for use downstream, whether by users or software. Be sure to document any steps and logic used during wrangling.
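A minimal sketch of the cleaning step in Python, assuming the pandas library and a hypothetical table with "birth_date" and "income" columns; the plausibility threshold is likewise an assumption:

```python
import pandas as pd

# Hypothetical messy input.
df = pd.DataFrame({
    "birth_date": ["August 12, 1989", "11/12/1965", "not a date"],
    "income": [52_000.0, 61_000.0, 9_900_000.0],
})

# Catch dates written in different formats by parsing each value into a
# canonical datetime; anything unparseable becomes NaT (a null value).
df["birth_date"] = df["birth_date"].apply(
    lambda value: pd.to_datetime(value, errors="coerce")
)

# Remove outliers that would skew results (here, incomes outside an
# assumed plausible range for this data set).
df = df[df["income"].between(0, 1_000_000)]

# Format null values: in this sketch, incomplete rows are simply dropped.
df = df.dropna()
```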

These steps are an iterative process that should yield a clean and usable data set that can then be used for analysis. This process is tedious but rewarding as it allows analysts to get the information they need out of a large set of data that would otherwise be unreadable.

Starting data

Name         | Phone           | Birth date      | State
John, Smith  | 445-881-4478    | August 12, 1989 | Maine
Jennifer Tal | +1-189-456-4513 | 11/12/1965      | Tx
Gates, Bill  | (876)546-8165   | June 15, 72     | Kansas
Alan Fitch   | 5493156648      | 2-6-1985        | Oh
Jacob Alan   | 156-4896        | January 3       | Alabama

Result

Name         | Phone        | Birth date | State
John Smith   | 445-881-4478 | 1989-08-12 | Maine
Jennifer Tal | 189-456-4513 | 1965-11-12 | Texas
Bill Gates   | 876-546-8165 | 1972-06-15 | Kansas
Alan Fitch   | 549-315-6648 | 1985-02-06 | Ohio

Applying the data wrangling process to this small data set yields a result that is significantly easier to read. All names are now formatted the same way (first name, last name), phone numbers share one format (area code-XXX-XXXX), dates are formatted numerically (YYYY-MM-DD), and states are no longer abbreviated. The entry for Jacob Alan did not have fully formed data (the phone number is missing its area code and the birth date has no year), so it was discarded from the data set. Now that the resulting data set is clean and readable, it is ready to be either deployed or evaluated.
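A sketch of how these transformations might be written in Python, using pandas and the standard library; the date-format list and state map below cover only the values in this example:

```python
import re
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    "Name": ["John, Smith", "Jennifer Tal", "Gates, Bill", "Alan Fitch", "Jacob Alan"],
    "Phone": ["445-881-4478", "+1-189-456-4513", "(876)546-8165", "5493156648", "156-4896"],
    "Birth date": ["August 12, 1989", "11/12/1965", "June 15, 72", "2-6-1985", "January 3"],
    "State": ["Maine", "Tx", "Kansas", "Oh", "Alabama"],
})

def normalize_name(name):
    # Reorder "Last, First" into "First Last".
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {last}"
    return name

def normalize_phone(phone):
    # Keep digits only, drop a leading US country code, then require a
    # full ten-digit number; incomplete numbers become missing values.
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def normalize_date(text):
    # Try each known date format; dates with no year match none of them.
    for fmt in ("%B %d, %y", "%B %d, %Y", "%m/%d/%Y", "%m-%d-%Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

STATE_NAMES = {"Tx": "Texas", "Oh": "Ohio"}

df["Name"] = df["Name"].map(normalize_name)
df["Phone"] = df["Phone"].map(normalize_phone)
df["Birth date"] = df["Birth date"].map(normalize_date)
df["State"] = df["State"].map(lambda s: STATE_NAMES.get(s, s))

# Discard rows with incomplete data, such as the Jacob Alan entry.
df = df.dropna()
```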

Typical use

The data transformations are typically applied to distinct entities (e.g. fields, rows, columns, data values, etc.) within a data set, and could include such actions as extractions, parsing, joining, standardizing, augmenting, cleansing, consolidating, and filtering to create desired wrangling outputs that can be leveraged downstream.
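For instance, a join, a filter, and an augmentation might look as follows in Python with pandas; the tables and columns are hypothetical:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [50.0, 20.0, 300.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["East", "West"]})

# Joining: combine two entities on a shared key.
merged = orders.merge(customers, on="customer_id", how="left")

# Filtering: keep only the rows a downstream consumer needs.
large = merged[merged["amount"] >= 50.0]

# Augmenting: derive a new column from existing values.
large = large.assign(is_bulk=large["amount"] >= 200.0)
```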

The recipients could be individuals, such as data architects or data scientists who will investigate the data further, business users who will consume the data directly in reports, or systems that will further process the data and write it into targets such as data warehouses, data lakes, or downstream applications.

Modus operandi

Depending on the amount and format of the incoming data, data wrangling has traditionally been performed manually (e.g. via spreadsheets such as Excel), with tools like KNIME, or via scripts in languages such as Python or SQL. R, a language often used in data mining and statistical data analysis, is now also sometimes used for data wrangling. [6] Data wranglers typically have skill sets in R or Python, SQL, PHP, Scala, and other languages typically used for analyzing data.

Visual data wrangling systems were developed to make data wrangling accessible for non-programmers, and simpler for programmers. Some of these also include embedded AI recommenders and programming by example facilities to provide user assistance, and program synthesis techniques to autogenerate scalable dataflow code. Early prototypes of visual data wrangling tools include OpenRefine and the Stanford/Berkeley Wrangler research system; [7] the latter evolved into Trifacta.

Other terms for these processes have included data franchising, [8] data preparation, and data munging.

Example

Given a set of data that contains information on medical patients, suppose the goal is to find a correlation for a disease. Before iterating through the data, make sure the desired result is understood: are you looking for patients who have the disease? Could other diseases be the cause? Once an understanding of the outcome is achieved, the data wrangling process can begin.

Start by determining the structure of the outcome: what is important for understanding the disease diagnosis?

Once a final structure is determined, clean the data by removing any data points that are not helpful or are malformed; this could include patients who have not been diagnosed with any disease.

After cleaning, look at the data again: is there anything already known that could be added to the data set and would benefit it? An example could be the most common diseases in the area, since America and India, for instance, differ considerably in their most common diseases.

Now comes the validation step: determine validation rules for which data points need to be checked for validity; these could include dates of birth or checks for specific diseases.

After the validation step, the data should be organized and prepared for either deployment or evaluation. This process can be beneficial for determining correlations in disease diagnosis, as it reduces the vast amount of data to something that can easily be analyzed for an accurate result. A schematic version of these steps appears below.
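The following Python sketch, assuming pandas, illustrates the cleaning and validation steps; the patients table, column names, and validation rules are all hypothetical:

```python
import pandas as pd

# Hypothetical patient records.
patients = pd.DataFrame({
    "patient_id": [101, 102, 103, 104],
    "birth_date": ["1989-08-12", "1965-11-12", "2090-01-01", None],
    "diagnosis": ["disease_x", None, "disease_x", "disease_y"],
})

# Cleaning: remove patients who have not been diagnosed with any disease.
patients = patients.dropna(subset=["diagnosis"])

# Validation: parse dates of birth and keep only plausible ones.
patients["birth_date"] = pd.to_datetime(patients["birth_date"], errors="coerce")
valid = patients["birth_date"].notna() & (patients["birth_date"] <= pd.Timestamp.today())
patients = patients[valid]

# The reduced, validated set is now ready for deployment or evaluation.
print(patients)
```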

Related Research Articles

Data mining

Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Extract, transform, load

In computing, extract, transform, load (ETL) is a three-phase process where data is extracted, transformed and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations. ETL processing is typically executed using software applications but it can also be done manually by system operators. ETL software typically automates the entire process and can be run manually or on recurring schedules either as single jobs or aggregated into a batch of jobs.

Quantitative marketing research is the application of quantitative research techniques to the field of marketing research. It has roots in both the positivist view of the world, and the modern marketing viewpoint that marketing is an interactive process in which both the buyer and seller reach a satisfying agreement on the "four Ps" of marketing: Product, Price, Place (location) and Promotion.

In science and engineering, root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults or problems. It is widely used in IT operations, manufacturing, telecommunications, industrial process control, accident analysis, medicine, healthcare industry, etc. Root cause analysis is a form of inductive and deductive inference.

Systems development life cycle

In systems engineering, information systems and software engineering, the systems development life cycle (SDLC), also referred to as the application development life cycle, is a process for planning, creating, testing, and deploying an information system. The SDLC concept applies to a range of hardware and software configurations, as a system can be composed of hardware only, software only, or a combination of both. There are usually six stages in this cycle: requirement analysis, design, development and testing, implementation, documentation, and evaluation.

Wrangler (profession)

In North America, a wrangler is someone employed to professionally handle animals, especially horses and cattle, but sometimes other types of animals as well. The word "wrangler" is derived from the Low German "wrangeln" meaning "to dispute" or "to wrestle". It was first documented in 1377. Its use as a noun was first recorded in 1547. Its reference to a "person in charge of horses or cattle" or "herder" was first recorded in 1888.

In computer science, data validation is the process of ensuring data has undergone data cleansing to confirm they have data quality, that is, that they are both correct and useful. It uses routines, often called "validation rules", "validation constraints", or "check routines", that check for correctness, meaningfulness, and security of data that are input to the system. The rules may be implemented through the automated facilities of a data dictionary, or by the inclusion of explicit application program validation logic of the computer and its application.

A camera raw image file contains unprocessed or minimally processed data from the image sensor of either a digital camera, a motion picture film scanner, or other image scanner. Raw files are so named because they are not yet processed, and contain large amounts of potentially redundant data. Normally, the image is processed by a raw converter, in a wide-gamut internal color space where precise adjustments can be made before conversion to a viewable file format such as JPEG or PNG for storage, printing, or further manipulation. There are dozens of raw formats in use by different manufacturers of digital image capture equipment.

Data cleansing or data cleaning is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting or a data quality firewall.

DMAIC or define, measure, analyze, improve and control refers to a data-driven improvement cycle used for improving, optimizing and stabilizing business processes and designs. The DMAIC improvement cycle is the core tool used to drive Six Sigma projects. However, DMAIC is not exclusive to Six Sigma and can be used as the framework for other improvement applications.

Telecommunications mediation is a process that converts call data to pre-defined layouts that can be imported by a specific billing system or other OSS applications.

In computing, data transformation is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most data integration and data management tasks such as data wrangling, data warehousing, data integration and application integration.

Raw data

Raw data, also known as primary data, are data collected from a source. In the context of examinations, the raw data might be described as a raw score.

Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance, and is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data collection methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and missing values, amongst other issues.

Technology Intelligence (TI) is an activity that enables companies to identify the technological opportunities and threats that could affect the future growth and survival of their business. It aims to capture and disseminate the technological information needed for strategic planning and decision making. As technology life cycles shorten and businesses become more globalized, having effective TI capabilities is becoming increasingly important.

Trifacta

Trifacta is a privately owned software company headquartered in San Francisco with offices in Bengaluru, Boston, Berlin and London. The company was founded in October 2012 and primarily develops data wrangling software for data exploration and self-service data preparation on cloud and on-premises data platforms.

Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems. These characteristics can include size or amount of data, completeness of the data, correctness of the data, possible relationships amongst data elements or files/tables in the data.

Data lineage includes the data origin, what happens to it, and where it moves over time. Data lineage provides visibility and simplifies tracing errors back to the root cause in a data analytics process.

Data preparation is the act of manipulating raw data into a form that can readily and accurately be analysed, e.g. for business purposes.

Sean Kandel

Sean Kandel is Trifacta's Chief Technical Officer and Co-founder, along with Joseph M. Hellerstein and Jeffrey Heer. He is known for developing new tools for data transformation and discovery, and co-developed Data Wrangler, an interactive tool for data cleaning and transformation.

References

  1. "What Is Data Munging?". Archived from the original on 2013-08-18. Retrieved 2022-01-21.
  2. "mung". Mung. Jargon File. Archived from the original on 2012-09-18. Retrieved 2012-10-10.
  3. As coder is for code, X is for data Archived 2021-04-15 at the Wayback Machine, Open Knowledge Foundation blog post
  4. Parsons, M. A.; Brodzik, M. J.; Rutter, N. J. (2004). "Data management for the Cold Land Processes Experiment: improving hydrological science". Hydrological Processes. 18 (18): 3637–3653. Bibcode:2004HyPr...18.3637P. doi:10.1002/hyp.5801. S2CID 129774847.
  5. "What Is Data Wrangling? What are the steps in data wrangling?". Express Analytics. 2020-04-22. Archived from the original on 2020-11-01. Retrieved 2020-12-06.
  6. Wickham, Hadley; Grolemund, Garrett (2016). "Chapter 9: Data Wrangling Introduction". R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (First ed.). Sebastopol, CA. ISBN 978-1491910399. Archived from the original on 2021-10-11. Retrieved 2022-01-12.
  7. Kandel, Sean; Paepcke, Andreas (May 2011). "Wrangler: Interactive Visual Specification of Data Transformation Scripts". SIGCHI. doi:10.1145/1978942.1979444. S2CID 11133756.
  8. What is Data Franchising? (2003 and 2017 IRI) Archived 2021-04-15 at the Wayback Machine