Data exploration

Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than relying on traditional data management systems. [1] These characteristics can include the size or amount of data, the completeness of the data, the correctness of the data, and possible relationships amongst data elements or files/tables in the data.

Data exploration is typically conducted using a combination of automated and manual activities. [1] [2] [3] Automated activities can include data profiling, data visualization or tabular reports that give the analyst an initial view into the data and an understanding of its key characteristics. [1]
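
As a minimal sketch of such automated profiling, the following Python snippet (assuming the pandas library and a hypothetical file sales.csv, neither of which is mentioned in the article) summarizes the size, structure, completeness and basic statistics of a dataset:

    import pandas as pd

    # Load the dataset to be explored ("sales.csv" is a hypothetical example file).
    df = pd.read_csv("sales.csv")

    # Size: number of rows and columns.
    print("Rows, columns:", df.shape)

    # Structure: column names and inferred data types.
    print(df.dtypes)

    # Completeness: fraction of missing values in each column.
    print(df.isna().mean())

    # Basic descriptive statistics for the numeric columns.
    print(df.describe())

A dedicated profiling tool would report similar characteristics automatically; the sketch only illustrates the kind of metadata an analyst starts from.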

This is often followed by manual drill-down or filtering of the data to examine anomalies or patterns flagged by the automated activities. Data exploration can also require manual scripting and queries into the data (e.g. using languages such as SQL or R) or using spreadsheets or similar tools to view the raw data. [4]
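
As an illustration of such a manual query, the sketch below (assuming the data sits in a SQLite table named sales with order_id, customer_id and amount columns, and using an arbitrary threshold; all of these are hypothetical) drills down into unusually large values flagged during profiling:

    import sqlite3

    # Open the local database holding the data ("sales.db" is hypothetical).
    conn = sqlite3.connect("sales.db")

    # Drill down into a possible anomaly: orders far above the typical range.
    query = """
        SELECT order_id, customer_id, amount
        FROM sales
        WHERE amount > 10000
        ORDER BY amount DESC
        LIMIT 20;
    """

    for row in conn.execute(query):
        print(row)

    conn.close()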

All of these activities are aimed at building the analyst's mental model and understanding of the data, and at defining basic metadata (statistics, structure, relationships) for the data set that can be used in further analysis. [1]

Once this initial understanding of the data has been established, the data can be pruned or refined by removing unusable parts (data cleansing), correcting poorly formatted elements and defining relevant relationships across datasets. [2] This process is also known as determining data quality. [4]
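
A minimal pandas sketch of this pruning step (the files, column names and join key are illustrative assumptions) might remove duplicates and unusable rows, coerce poorly formatted fields and link a related dataset:

    import pandas as pd

    df = pd.read_csv("sales.csv")             # hypothetical raw data
    customers = pd.read_csv("customers.csv")  # hypothetical related dataset

    # Remove exact duplicates and rows that are entirely empty.
    df = df.drop_duplicates().dropna(how="all")

    # Correct poorly formatted elements: coerce dates and amounts,
    # turning unparseable values into missing values instead of failing.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

    # Drop rows whose key fields could not be repaired.
    df = df.dropna(subset=["order_date", "amount"])

    # Define a relationship across datasets by joining on a shared key.
    df = df.merge(customers, on="customer_id", how="left")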

Data exploration can also refer to the ad hoc querying or visualization of data to identify potential relationships or insights that may be hidden in the data, without requiring the analyst to formulate assumptions beforehand. [1]
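
As a sketch of such ad hoc visualization (assuming matplotlib and the same hypothetical columns as above), a quick scatter plot can reveal structure or outliers without any hypothesis being stated in advance:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("sales.csv")  # hypothetical dataset

    # Plot two variables against each other with no prior assumption,
    # simply to see whether any pattern or outlier stands out.
    plt.scatter(df["customer_age"], df["amount"], alpha=0.3)
    plt.xlabel("customer_age")
    plt.ylabel("amount")
    plt.title("Ad hoc exploration: amount vs. customer age")
    plt.show()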

Traditionally, this has been a key area of focus for statisticians, with John Tukey as a prominent evangelist in the field. [5] Today, data exploration is more widespread and is a focus of data analysts and data scientists, the latter being a relatively new role within enterprises and larger organizations.

Interactive data exploration

This area of data exploration has attracted interest in the field of machine learning. This is a relatively new field and is still evolving. [4] At its most basic level, a machine-learning algorithm can be fed a data set and used to identify whether a hypothesis is true based on that dataset. Common machine learning algorithms focus on identifying specific patterns in the data. [2] Common patterns include regression, classification and clustering, but there are many other possible patterns and algorithms that can be applied to data via machine learning.
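
For instance, a minimal scikit-learn sketch (the dataset, feature columns and choice of three clusters are illustrative assumptions, not part of the article) applies k-means clustering to look for groupings that were not hypothesized in advance:

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("sales.csv")  # hypothetical dataset

    # Select numeric features and put them on a comparable scale.
    features = df[["amount", "customer_age"]].dropna().copy()
    X = StandardScaler().fit_transform(features)

    # Fit k-means and attach the discovered cluster labels to the rows.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    features["cluster"] = kmeans.fit_predict(X)

    # Inspect the clusters to see whether they suggest a meaningful pattern.
    print(features.groupby("cluster").mean())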

By employing machine learning, it is possible to find patterns or relationships in the data that would be difficult or impossible to find via manual inspection, trial and error or traditional exploration techniques. [6]

Related Research Articles

Data mining

Data mining is a process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Computer science is the study of the theoretical foundations of information and computation and their implementation and application in computer systems. One well known subject classification system for computer science is the ACM Computing Classification System devised by the Association for Computing Machinery.

In statistics, exploratory data analysis is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling, and thereby contrasts with traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and on handling missing values and making transformations of variables as needed. EDA encompasses IDA.

Orange (software)

Orange is an open-source data visualization, machine learning and data mining toolkit. It features a visual programming front-end for explorative rapid qualitative data analysis and interactive data visualization.

Crime analysis

Crime analysis is a law enforcement function that involves systematic analysis for identifying and analyzing patterns and trends in crime and disorder. Information on patterns can help law enforcement agencies deploy resources in a more effective manner, and assist detectives in identifying and apprehending suspects. Crime analysis also plays a role in devising solutions to crime problems, and formulating crime prevention strategies. Quantitative social science data analysis methods are part of the crime analysis process, though qualitative methods such as examining police report narratives also play a role.

In computing, data transformation is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most data integration and data management tasks such as data wrangling, data warehousing, data integration and application integration.
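
A small illustrative sketch of such a conversion (the files and field names are hypothetical) reshapes nested JSON records into flat CSV rows using only the Python standard library:

    import csv
    import json

    # Read records in one structure: a JSON array of nested objects.
    with open("orders.json") as f:  # hypothetical input file
        records = json.load(f)

    # Write them out in another structure: flat rows in a CSV file.
    with open("orders.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "customer", "amount"])
        writer.writeheader()
        for rec in records:
            writer.writerow({
                "order_id": rec["id"],
                "customer": rec["customer"]["name"],  # flatten a nested field
                "amount": rec["amount"],
            })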

Visual analytics

Visual analytics is an outgrowth of the fields of information visualization and scientific visualization that focuses on analytical reasoning facilitated by interactive visual interfaces.

Cultural analytics refers to the use of computational, visualization, and big data methods for the exploration of contemporary and historical cultures. While digital humanities research has focused on text data, cultural analytics has a particular focus on massive cultural data sets of visual material – both digitized visual artifacts and contemporary visual and interactive media. Taking on the challenge of how to best explore large collections of rich cultural content, cultural analytics researchers developed new methods and intuitive visual techniques that rely on high-resolution visualization and digital image processing. These methods are used to address existing research questions in the humanities, to explore new questions, and to develop new theoretical concepts that fit the mega-scale of digital culture in the early 21st century.

In network theory, link analysis is a data-analysis technique used to evaluate relationships (connections) between nodes. Relationships may be identified among various types of nodes (objects), including organizations, people and transactions. Link analysis has been used for investigation of criminal activity, computer security analysis, search engine optimization, market research, medical research, and art.

Motion chart

A motion chart is a dynamic bubble chart which allows efficient and interactive exploration and visualization of longitudinal multivariate data. Motion charts provide mechanisms for mapping ordinal, nominal and quantitative variables onto time, 2D coordinate axes, size, colors, glyphs and appearance characteristics, which facilitate the interactive display of multidimensional and temporal data.

Interactive Visual Analysis (IVA) is a set of techniques for combining the computational power of computers with the perceptive and cognitive capabilities of humans, in order to extract knowledge from large and complex datasets. The techniques rely heavily on user interaction and the human visual system, and exist in the intersection between visual analytics and big data. It is a branch of data visualization. IVA is a suitable technique for analyzing high-dimensional data that has a large number of data points, where simple graphing and non-interactive techniques give an insufficient understanding of the information.

Trifacta

Trifacta is a privately owned software company headquartered in San Francisco with offices in Bengaluru, Boston, Berlin and London. The company was founded in October 2012 and primarily develops data wrangling software for data exploration and self-service data preparation on cloud and on-premises data platforms.

Flow cytometry bioinformatics is the application of bioinformatics to flow cytometry data, which involves storing, retrieving, organizing and analyzing flow cytometry data using extensive computational resources and tools. Flow cytometry bioinformatics requires extensive use of and contributes to the development of techniques from computational statistics and machine learning. Flow cytometry and related methods allow the quantification of multiple independent biomarkers on large numbers of single cells. The rapid growth in the multidimensionality and throughput of flow cytometry data, particularly in the 2000s, has led to the creation of a variety of computational analysis methods, data standards, and public databases for the sharing of results.

Paxata

Paxata is a privately owned software company headquartered in Redwood City, California. It develops self-service data preparation software that gets data ready for data analytics software. Paxata's software is intended for business analysts, as opposed to technical staff. It is used to combine data from different sources, then check it for data quality issues, such as duplicates and outliers. Algorithms and machine learning automate certain aspects of data preparation, and users work with the software through a user interface similar to Excel spreadsheets.

Feature engineering

Feature engineering or feature extraction is the process of using domain knowledge to extract features from raw data. The motivation is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to the machine learning process.
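
A minimal pandas sketch of this idea (the raw columns order_date, amount and customer_id are hypothetical) derives features that are not explicit in the raw fields:

    import pandas as pd

    df = pd.read_csv("sales.csv")  # hypothetical raw data
    df["order_date"] = pd.to_datetime(df["order_date"])

    # Domain knowledge: weekend orders and order recency may matter,
    # even though neither is an explicit column in the raw data.
    df["is_weekend"] = df["order_date"].dt.dayofweek >= 5
    df["days_since_order"] = (pd.Timestamp.today() - df["order_date"]).dt.days

    # Aggregate a per-customer feature from transaction-level rows.
    df["customer_total"] = df.groupby("customer_id")["amount"].transform("sum")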

Jeffrey Heer

Jeffrey Michael Heer is an American computer scientist best known for his work on information visualization and interactive data analysis. He is a professor of computer science & engineering at the University of Washington, where he directs the UW Interactive Data Lab. He co-founded Trifacta with Joe Hellerstein and Sean Kandel in 2012.

Sean Kandel

Sean Kandel is Trifacta's Chief Technical Officer and co-founder, along with Joseph M. Hellerstein and Jeffrey Heer. He is known for the development of new tools for data transformation and discovery, and is the co-developer of Data Wrangler, an interactive tool for data cleaning and transformation.

VALCRI

Visual Analytics for Sense-making in Criminal Intelligence Analysis (VALCRI) is a software tool that helps investigators to find related or relevant information in several criminal databases. The software uses big data processes to aggregate information from a wide array of different sources and formats and compiles it into visual and readable arrangements for users. It is used by various law enforcement agencies and aims to allow officials to utilize statistical information in their operations and strategy. The project is funded by the European Commission and is led by Professor William Wong at Middlesex University.

Guided analytics is a sub-field at the interface of visual analytics and predictive analytics focused on the development of interactive visual interfaces for business intelligence applications. Such interactive applications help the analyst make important decisions by easily extracting information from the data.

Augmented analytics is an approach to data analytics that employs machine learning and natural language processing to automate analysis processes normally done by a specialist or data scientist. The term was introduced in 2017 by Rita Sallam, Cindi Howson, and Carlie Idoine in a Gartner research paper.

References

  1. Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri. Overview of Data Exploration Techniques. FOSTER Open Science.
  2. Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer. Wrangler: Interactive Visual Specification of Data Transformation Scripts. Stanford.edu, 2011.
  3. Arnab Nandi; H. V. Jagadish. Guided Interaction: Rethinking the Query-Result Paradigm (PDF). International Conference on Very Large Data Bases (VLDB), 2011.
  4. Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer. Enterprise Data Analysis and Visualization: An Interview Study. Proc. IEEE Visual Analytics Science & Technology (VAST), Oct 2012. Stanford.edu.
  5. John Tukey. Exploratory Data Analysis. Pearson. ISBN 978-0201076165.
  6. Machine Learning for Data Exploration.