Data profiling

Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data. [1] The purpose of these statistics may be to:

  1. Find out whether existing data can be easily used for other purposes
  2. Improve the ability to search data by tagging it with keywords, descriptions, or assigning it to a category
  3. Assess data quality, including whether the data conforms to particular standards or patterns [2]
  4. Assess the risk involved in integrating data in new applications, including the challenges of joins
  5. Discover metadata of the source database, including value patterns and distributions, key candidates, foreign-key candidates, and functional dependencies
  6. Assess whether known metadata accurately describes the actual values in the source database
  7. Understand data challenges early in any data-intensive project, so that late surprises are avoided; finding data problems late in a project can lead to delays and cost overruns.
  8. Provide an enterprise-wide view of all data, for uses such as master data management, where key data is needed, or data governance, where data quality is to be improved.

Introduction

Data profiling refers to the analysis of information for use in a data warehouse in order to clarify the structure, content, relationships, and derivation rules of the data. [3] Profiling helps not only to understand anomalies and assess data quality, but also to discover, register, and assess enterprise metadata. [4] [5] The results of the analysis are used to determine the suitability of the candidate source systems, usually giving the basis for an early go/no-go decision, and also to identify problems for later solution design. [3]

How data profiling is conducted

Data profiling utilizes methods of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, variation, aggregates such as count and sum, and additional metadata information obtained during data profiling such as data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition. [4] [6] [7] The metadata can then be used to discover problems such as illegal values, misspellings, missing values, varying value representation, and duplicates.
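As a rough illustration, the sketch below collects a few of these column-level statistics and string-pattern summaries in Python. It assumes a pandas DataFrame loaded from a hypothetical customers.csv file; it is a minimal example, not a description of any particular profiling tool.

```python
import re
import pandas as pd

# Hypothetical input; any tabular source with named columns would do.
df = pd.read_csv("customers.csv")

def profile_column(series: pd.Series) -> dict:
    """Collect simple descriptive statistics and metadata for one column."""
    profile = {
        "dtype": str(series.dtype),         # inferred data type
        "count": int(series.count()),       # non-null values
        "nulls": int(series.isna().sum()),  # occurrence of null values
        "distinct": int(series.nunique()),  # discrete values / uniqueness
    }
    if pd.api.types.is_numeric_dtype(series):
        profile.update(
            minimum=series.min(),
            maximum=series.max(),
            mean=series.mean(),
            std_dev=series.std(),
        )
    else:
        # Typical string patterns: reduce each value to a letter/digit mask
        # (e.g. "AB-1234" -> "AA-9999") and count the most common shapes.
        masks = (
            series.dropna()
            .astype(str)
            .map(lambda v: re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)))
        )
        profile["top_patterns"] = masks.value_counts().head(3).to_dict()
    return profile

for name in df.columns:
    print(name, profile_column(df[name]))
```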

Different analyses are performed for different structural levels. For example, single columns could be profiled individually to gain an understanding of the frequency distribution of values, the type, and the use of each column. Embedded value dependencies can be exposed in a cross-column analysis. Finally, overlapping value sets, possibly representing foreign-key relationships between entities, can be explored in an inter-table analysis. [4]
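One way such an inter-table analysis can be approximated is to measure how completely the values of one column are contained in a column of another table. The following sketch uses hypothetical orders and customers data represented as plain Python dictionaries; real profiling tools use more scalable techniques for the same idea.

```python
def foreign_key_candidates(child: dict[str, list], parent: dict[str, list],
                           threshold: float = 0.95):
    """Yield (child_column, parent_column, containment) triples whose value
    overlap exceeds the threshold, marking them as foreign-key candidates."""
    for c_name, c_values in child.items():
        c_set = {v for v in c_values if v is not None}
        if not c_set:
            continue
        for p_name, p_values in parent.items():
            p_set = {v for v in p_values if v is not None}
            containment = len(c_set & p_set) / len(c_set)
            if containment >= threshold:
                yield c_name, p_name, containment

# Hypothetical tables, column-oriented for simplicity.
orders = {"order_id": [1, 2, 3], "customer_ref": [10, 11, 10]}
customers = {"customer_id": [10, 11, 12], "name": ["Ann", "Bo", "Cy"]}

for child_col, parent_col, score in foreign_key_candidates(orders, customers):
    print(f"orders.{child_col} -> customers.{parent_col} ({score:.0%} contained)")
```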

Normally, purpose-built tools are used for data profiling to ease the process. [3] [4] [6] [7] [8] [9] The computational complexity increases when going from single-column, to single-table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools. [5]

When data profiling is conducted

According to Kimball, [3] data profiling is performed several times and with varying intensity throughout the data warehouse development process. A light profiling assessment should be undertaken as soon as candidate source systems have been identified and the DW/BI business requirements have been gathered. The purpose of this initial analysis is to clarify at an early stage whether the right data is available at the appropriate level of detail, and whether anomalies can be handled subsequently. If this is not the case, the project may be terminated. [3]

Additionally, more in-depth profiling is done prior to the dimensional modeling process in order to assess what is required to convert data into a dimensional model. Detailed profiling extends into the ETL system design process in order to determine the appropriate data to extract and which filters to apply to the data set. [3]

Data profiling may also be conducted later in the data warehouse development process, after data has been loaded into staging, the data marts, etc. Conducting profiling at these stages helps ensure that data cleansing and transformations have been performed correctly and in compliance with requirements.
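As an illustration of this kind of post-load verification, the following sketch compares a few profile metrics (row counts and per-column null rates) between a source extract and a staging table. The metric names, column names, and example figures are assumptions for the sake of the example, not part of any standard.

```python
def check_load(source_profile: dict, staging_profile: dict) -> list[str]:
    """Return human-readable findings; an empty list means no issues detected."""
    findings = []
    # A drop in row count may indicate records silently lost during transformation.
    if staging_profile["row_count"] < source_profile["row_count"]:
        findings.append(
            f"row count dropped from {source_profile['row_count']} "
            f"to {staging_profile['row_count']}"
        )
    # An increased null rate may indicate values lost or mangled during cleansing.
    for column, null_rate in staging_profile["null_rates"].items():
        if null_rate > source_profile["null_rates"].get(column, 0.0):
            findings.append(f"null rate increased for column '{column}'")
    return findings

# Hypothetical profiles gathered before and after loading into staging.
source_profile = {"row_count": 1000, "null_rates": {"email": 0.02, "country": 0.00}}
staging_profile = {"row_count": 997, "null_rates": {"email": 0.02, "country": 0.05}}

for finding in check_load(source_profile, staging_profile):
    print("WARNING:", finding)
```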

Benefits and examples

The benefits of data profiling are to improve data quality, shorten the implementation cycle of major projects, and improve users' understanding of data. [9] Discovering business knowledge embedded in data itself is one of the significant benefits derived from data profiling. [5] Data profiling is one of the most effective technologies for improving data accuracy in corporate databases. [9]

See also

Data warehouse
Business intelligence
Extract, transform, load
Star schema
Fact table
Data quality
Data mapping
Dimension (data warehouse)
Data cleansing
Metadata discovery
Data domain
Slowly changing dimension
Entity–attribute–value model
Data steward
Dimensional modeling
Metadata
Data vault modeling
Enterprise bus matrix
Outline of databases
Common Operational Datasets

References

  1. Johnson, Theodore (2009). "Data Profiling". Encyclopedia of Database Systems. Springer, Heidelberg.
  2. Woodall, Philip; Oberhofer, Martin; Borek, Alexander (2014). "A classification of data quality assessment and improvement methods". International Journal of Information Quality. 3 (4): 298. doi:10.1504/ijiq.2014.068656.
  3. Kimball, Ralph; et al. (2008). The Data Warehouse Lifecycle Toolkit (Second ed.). Wiley. p. 376. ISBN 9780470149775.
  4. Loshin, David (2009). Master Data Management. Morgan Kaufmann. pp. 94–96. ISBN 9780123742254.
  5. Loshin, David (2003). Business Intelligence: The Savvy Manager's Guide, Getting Onboard with Emerging IT. Morgan Kaufmann. pp. 110–111. ISBN 9781558609167.
  6. Rahm, Erhard; Hai Do, Hong (December 2000). "Data Cleaning: Problems and Current Approaches". Bulletin of the Technical Committee on Data Engineering. IEEE Computer Society. 23 (4).
  7. Singh, Ranjit; Singh, Kawaljeet; et al. (May 2010). "A Descriptive Classification of Causes of Data Quality Problems in Data Warehousing". IJCSI International Journal of Computer Science Issues. 7 (3), No. 2.
  8. Kimball, Ralph (2004). "Kimball Design Tip #59: Surprising Value of Data Profiling" (PDF). Kimball Group.
  9. Olson, Jack E. (2003). Data Quality: The Accuracy Dimension. Morgan Kaufmann. pp. 140–142.