Data drilling

Data drilling (also drilldown) refers to any of various operations and transformations on tabular, relational, and multidimensional data. The term is used in many contexts, but is primarily associated with software designed for data analysis.

Common data drilling operations

There are certain operations that are common to applications that allow data drilling. Among them are:

Query operations:

Tabular query

Tabular query operations consist of standard operations on data tables.

Among these operations are:

Consider the following example:

Fred and Wilma table (Fig 001):

   gender, fname, lname, home
   male, fred, chopin, Poland
   male, fred, flintstone, bedrock
   male, fred, durst, usa
   female, wilma, flintstone, bedrock
   female, wilma, rudolph, usa
   female, wilma, webb, usa
   male, fred, johnson, usa

The preceding is an example of a simple flat file table formatted as comma-separated values. The table lists gender, first name, last name, and home for various people named fred or wilma. Although the example is formatted this way, tabular query operations (like all data drilling operations) can be applied to any data, regardless of the underlying format. The only requirement is that the data be readable by the software application in use.
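As a minimal sketch of a tabular query over the table in Fig 001, the following Python reads the comma-separated text and applies a filter, a sort, and a column selection. The helper names `load_rows` and `tabular_query` are hypothetical and chosen only for illustration; they do not come from any particular data drilling application.

```python
import csv
import io

# The Fred and Wilma table (Fig 001) as comma-separated text.
FIG_001 = """gender, fname, lname, home
male, fred, chopin, Poland
male, fred, flintstone, bedrock
male, fred, durst, usa
female, wilma, flintstone, bedrock
female, wilma, rudolph, usa
female, wilma, webb, usa
male, fred, johnson, usa
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return list(reader)


def tabular_query(rows, where=None, order_by=None, fields=None):
    """Filter, sort, and project rows (hypothetical helper)."""
    result = [r for r in rows if where is None or where(r)]
    if order_by:
        result.sort(key=lambda r: r[order_by])
    if fields:
        result = [{f: r[f] for f in fields} for r in result]
    return result


rows = load_rows(FIG_001)

# Everyone whose home is "usa", ordered by last name, names only.
usa_people = tabular_query(
    rows,
    where=lambda r: r["home"] == "usa",
    order_by="lname",
    fields=["fname", "lname"],
)

for person in usa_people:
    print(person["fname"], person["lname"])
```

The same three steps (select rows, order them, choose columns) would apply to any source the program can read; only `load_rows` depends on the comma-separated format.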

Pivot query

A pivot query allows multiple representations of data according to different dimensions. This query type is similar to tabular query, except that it also allows data to be represented in summary format, according to a flexible user-selected hierarchy. This class of data drilling operation is formally (and loosely) known by different names, including crosstab query, pivot table, data pilot, selective hierarchy, intertwingularity and others.

To illustrate the basics of pivot query operations, consider the Fred and Wilma table (Fig 001). A quick scan of the data reveals that the table has redundant information. This redundancy could be consolidated using an outline or a tree structure or in some other way. Moreover, once consolidated, the data could have many different alternate layouts.

Using a simple text outline as output, the following alternate layouts are all possible with a pivot query:

Summarize by gender (Fig 001):

   female
       flintstone, wilma
       rudolph, wilma
       webb, wilma
   male
       chopin, fred
       flintstone, fred
       durst, fred
       johnson, fred

   (Dimensions = gender; Tabular fields = lname, fname;)

Summarize by home, lname (Fig 001):

   bedrock
       flintstone
           fred
           wilma
   Poland
       chopin
           fred
   usa
       ...

   (Dimensions = home, lname; Tabular fields = fname;)
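One way such outlines could be produced is sketched below in Python. The functions `pivot` and `render` are hypothetical names used for illustration, and the case-insensitive ordering of group keys is an assumption made so that the groups appear in the same order as in the layouts above.

```python
# The Fred and Wilma table (Fig 001) as a list of records.
ROWS = [
    {"gender": "male", "fname": "fred", "lname": "chopin", "home": "Poland"},
    {"gender": "male", "fname": "fred", "lname": "flintstone", "home": "bedrock"},
    {"gender": "male", "fname": "fred", "lname": "durst", "home": "usa"},
    {"gender": "female", "fname": "wilma", "lname": "flintstone", "home": "bedrock"},
    {"gender": "female", "fname": "wilma", "lname": "rudolph", "home": "usa"},
    {"gender": "female", "fname": "wilma", "lname": "webb", "home": "usa"},
    {"gender": "male", "fname": "fred", "lname": "johnson", "home": "usa"},
]


def pivot(rows, dimensions, fields):
    """Group rows by each dimension in turn.

    Returns nested dicts, one level per dimension; each leaf is a list
    of strings built from the tabular fields.
    """
    if not dimensions:
        return [", ".join(r[f] for f in fields) for r in rows]
    head, rest = dimensions[0], dimensions[1:]
    groups = {}
    for r in rows:
        groups.setdefault(r[head], []).append(r)
    # Assumption: order group keys case-insensitively.
    ordered = sorted(groups.items(), key=lambda kv: kv[0].lower())
    return {key: pivot(group, rest, fields) for key, group in ordered}


def render(tree, indent=0):
    """Turn the nested structure into indented outline lines."""
    if isinstance(tree, list):
        return [" " * indent + leaf for leaf in tree]
    lines = []
    for key, subtree in tree.items():
        lines.append(" " * indent + key)
        lines.extend(render(subtree, indent + 4))
    return lines


by_gender = pivot(ROWS, ["gender"], ["lname", "fname"])
by_home = pivot(ROWS, ["home", "lname"], ["fname"])

print("\n".join(render(by_gender)))
print("\n".join(render(by_home)))
```

Changing the layout only requires passing a different list of dimensions; the underlying rows are never modified, which is the flexibility a user-selected hierarchy provides.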

Uses

Pivot query operations are useful for summarizing a corpus of data in multiple ways, thereby illustrating different representations of the same basic information. Although this type of operation appears prominently in spreadsheets and desktop database software, its flexibility is arguably under-utilized. There are many applications that allow only a 'fixed' hierarchy for representing data, and this represents a substantial limitation.

Drillup

Drillup is the opposite of drilldown. For example, if you drill down to see the revenue of one product, then you might want to drill up to see the revenue of all products. [1]
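The relationship between the two operations can be sketched in Python by treating the current view as a list of dimensions: drilling down appends a dimension, and drilling up removes the last one. This sketch counts rows of the Fred and Wilma table (Fig 001) in place of revenue, an assumption made so that no figures are invented; `summarize`, `drill_down`, and `drill_up` are hypothetical names.

```python
from collections import Counter

COLUMNS = ("gender", "fname", "lname", "home")

# The Fred and Wilma table (Fig 001).
ROWS = [
    ("male", "fred", "chopin", "Poland"),
    ("male", "fred", "flintstone", "bedrock"),
    ("male", "fred", "durst", "usa"),
    ("female", "wilma", "flintstone", "bedrock"),
    ("female", "wilma", "rudolph", "usa"),
    ("female", "wilma", "webb", "usa"),
    ("male", "fred", "johnson", "usa"),
]


def summarize(rows, dimensions):
    """Count rows for each combination of the given dimensions."""
    positions = [COLUMNS.index(d) for d in dimensions]
    return Counter(tuple(row[i] for i in positions) for row in rows)


def drill_down(dimensions, extra):
    """Move to a more detailed view by adding one dimension."""
    return dimensions + [extra]


def drill_up(dimensions):
    """Move to a more general view by dropping the last dimension."""
    return dimensions[:-1]


level = ["home"]
coarse = summarize(ROWS, level)      # counts per home

level = drill_down(level, "lname")
fine = summarize(ROWS, level)        # counts per home and last name

level = drill_up(level)
back = summarize(ROWS, level)        # same view as before drilling down

print(coarse)
print(fine)
```

Drilling up from a single dimension leaves an empty dimension list, in which case `summarize` returns one grand total for the whole table.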

References

  1. "Drilling up and drilling down". IBM. Retrieved 2020-05-05.