Data blending

Data blending is a process whereby big data from multiple sources [1] is merged into a single data warehouse or data set. [2]

Data blending helps business analysts cope with the expanding volume of data they need in order to make critical business decisions based on good-quality business intelligence. [3] It has been described as distinct from data integration because data analysts need to merge sources very quickly, too quickly for any practical intervention by data scientists. [4] A 2015 study by Forrester Consulting found that 52 percent of companies were blending 50 or more data sources and 12 percent were blending more than 1,000. [5]

Extract, transform, load

Data blending is similar to extract, transform, load (ETL). Both ETL and data blending take data from various sources and combine them. However, ETL is used to merge and structure data into a target database, [6] often a data warehouse, whereas data blending joins data for a specific use case at a specific time. [7] With some software, such as Google Data Studio, the blended data is never written to a database at all, a key difference from ETL. [8]
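
The distinction can be made concrete with a minimal sketch in Python using pandas; the file, database, and column names are hypothetical and not taken from any of the products above. The blend combines sources in memory for a single analysis, whereas an ETL job would load the combined result into a target warehouse.

```python
import sqlite3

import pandas as pd

# Two hypothetical sources: a CSV export of orders and a customer table in an
# operational SQLite database.
sales = pd.read_csv("sales_export.csv")  # columns: order_id, customer_id, amount
with sqlite3.connect("crm.db") as conn:
    customers = pd.read_sql("SELECT customer_id, region FROM customers", conn)

# Data blending: join the sources in memory for this specific analysis only;
# nothing is written back to a database.
blended = sales.merge(customers, on="customer_id", how="left")
print(blended.groupby("region")["amount"].sum())

# An ETL job would instead transform the combined data and load it into a
# target warehouse table, e.g.:
# blended.to_sql("sales_by_region", warehouse_conn, if_exists="replace")
```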

Software products

Reflecting the increased demand for analysts to combine data sources, multiple software companies have seen large growth and raised millions of dollars, [9] with some early entrants into the market now public companies. [10] Examples include AWS, Alteryx, Microsoft Power Query, [11] and Incorta, [12] which can combine data from many different sources, for example text files, databases, XML, JSON, and many other forms of structured and semi-structured data. [13] [14] [15] [16]

Tableau

In Tableau, data blending is a technique for combining data from multiple data sources in a data visualization. [17] A key differentiator is the granularity of the join: merging data into a single data set typically uses a SQL database join, which usually joins at the most granular level, using an ID field where possible, [18] whereas a data blend in Tableau should happen at the least granular (most aggregated) level, [19] as illustrated in the sketch below.
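
The following sketch illustrates the granularity difference using pandas and made-up data rather than Tableau itself: a row-level join can duplicate rows of the primary source, while a blend-style combination aggregates the secondary source to the linking field before combining.

```python
import pandas as pd

# Hypothetical primary source (row-level orders) and secondary source
# (several target rows per region).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region":   ["East", "East", "West", "West"],
    "amount":   [100, 150, 200, 50],
})
targets = pd.DataFrame({
    "region": ["East", "East", "West"],
    "target": [120, 130, 240],
})

# Row-level SQL-style join: every matching pair of rows is combined, so East
# orders are duplicated because the secondary source has two East rows.
joined = orders.merge(targets, on="region", how="left")
print(len(joined))  # 6 rows

# Blend-style combination: aggregate the secondary source to the linking field
# first, then combine, so the primary source keeps one row per order.
agg_targets = targets.groupby("region", as_index=False)["target"].sum()
blended = orders.merge(agg_targets, on="region", how="left")
print(len(blended))  # 4 rows
```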

Looker Studio

In Google's Looker Studio, data sources are combined by joining the records of one data source with the records of up to four other data sources. As in Tableau, the blend happens only at the reporting layer; the blended data is never stored as a separate combined data source. [20]

Challenges with data blending

According to Principles of Data Wrangling, the most common custom metadata question is: "How can this dataset blend with (join or union to) my other datasets?" [21]
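
One way to make that question concrete is a simple schema check. The sketch below is purely illustrative, with a hypothetical helper and data that are not drawn from the cited book: two datasets are join candidates if they share key columns of compatible types, and union candidates if their schemas match.

```python
import pandas as pd

def blend_compatibility(a: pd.DataFrame, b: pd.DataFrame) -> dict:
    """Rough metadata check: which shared columns could serve as join keys,
    and whether the two frames could be unioned (same columns and dtypes)."""
    shared = [c for c in a.columns if c in b.columns]
    join_keys = [c for c in shared if a[c].dtype == b[c].dtype]
    unionable = list(a.columns) == list(b.columns) and all(
        a[c].dtype == b[c].dtype for c in a.columns
    )
    return {"candidate_join_keys": join_keys, "unionable": unionable}

web_visits = pd.DataFrame({"customer_id": [1, 2], "page": ["home", "pricing"]})
purchases = pd.DataFrame({"customer_id": [2, 3], "amount": [9.99, 25.00]})
print(blend_compatibility(web_visits, purchases))
# {'candidate_join_keys': ['customer_id'], 'unionable': False}
```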

References

  1. "Alteryx Analytics Brings Power of Predictive and Big Data to Market".
  2. Data blending is the process of combining data from multiple sources into a functioning data set
  3. "Data Blending". Trifacta.com. 24 August 2017.
  4. "What Is Data Blending, and Which Tools Make It Easier?".
  5. "Data Mashups for Analytics". Pentaho.
  6. "How ETL Works" (in German). Databricks. Retrieved 2021-02-27.
  7. "What Is Data Blending, and Which Tools Make It Easier?". Software Advice. 2016-08-25. Retrieved 2021-02-27.
  8. "Google Data Studio Overview". datastudio.google.com. Retrieved 2021-02-27.
  9. "Incorta raises $30M Series C for ETL-free data processing solution". TechCrunch. Retrieved 2021-02-27.
  10. "Alteryx Announces Pricing of Initial Public Offering". Alteryx. Retrieved 2021-02-27.
  11. Microsoft Corporation. "Microsoft Power Query". powerquery.microsoft.com. Retrieved 2021-02-27.
  12. "Direct Data Analytics Software". Incorta. Retrieved 2021-02-27.
  13. "Data Sources". docs.incorta.com. Retrieved 2021-02-27.
  14. davidiseminger. "Shape and combine data from multiple sources using Power Query". docs.microsoft.com. Retrieved 2021-02-27.
  15. "Supported Data Sources: Amazon QuickSight". docs.aws.amazon.com. Retrieved 2021-02-27.
  16. "Data Sources". Alteryx Help. Retrieved 2021-02-27.
  17. "Blend Your Data". help.tableau.com. Retrieved 2021-02-27.
  18. "SQL Joins Explained". SQL Joins Explained. Retrieved 2021-02-27.
  19. TAR Solutions (2021-01-20). "Data Blending in Tableau". TAR Solutions. Retrieved 2021-02-27.
  20. "About data blending - Data Studio Help". support.google.com. Retrieved 2021-02-27.
  21. Heer, Jeffrey; Hellerstein, Joseph; Kandel, Sean; Rattenbury, Tye (July 2017). Principles of Data Wrangling. O'Reilly Media.