Data loading

Data loading, or simply loading, is a part of data processing where data is moved between two systems so that it ends up in a staging area on the target system.

With the traditional extract, transform and load (ETL) method, the load job is the last step, and the data that is loaded has already been transformed. With the alternative extract, load and transform (ELT) method, the load job is the middle step: the data is loaded in its original format and transformed afterwards in the target system.
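The difference in step order can be illustrated with a few lines of Python. This is only a minimal sketch; the extract, transform and load functions below are hypothetical placeholders, not any particular tool's API.

```python
# Minimal sketch contrasting the step order of ETL and ELT.
# extract(), transform() and load() are hypothetical placeholders.

def extract():
    # Pull raw rows from the source system.
    return [{"id": 1, "amount": "12.50"}, {"id": 2, "amount": "7.00"}]

def transform(rows):
    # Convert to the target format (here: parse amount strings to floats).
    return [{**row, "amount": float(row["amount"])} for row in rows]

def load(rows, target):
    # Write rows into the target system (here: a plain list).
    target.extend(rows)

warehouse, lake = [], []

# ETL: transform first, then load finished data into the target.
load(transform(extract()), warehouse)

# ELT: load the raw data first, then transform it inside the target.
load(extract(), lake)
lake[:] = transform(lake)
```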

Traditionally, loading jobs on large systems have taken a long time and have typically been run at night, outside business hours.

Purpose

Two main goals of data loading are that the data in the target system is as fresh as possible after loading, and that the loading itself is fast so that the data can be updated frequently. For a full data refresh, faster loading can be achieved by turning off referential integrity checks, secondary indexes and logging, but this is usually not possible with incremental update or trickle feed.
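As an illustration, the sketch below uses Python's built-in sqlite3 module to switch off foreign-key checking and journaling around a bulk insert. The database file, table and data are made up, and the PRAGMA statements are SQLite-specific; other database systems expose similar switches under different names.

```python
import sqlite3

# Sketch: relax referential integrity checks and logging around a bulk load.
# The table and data are invented; the PRAGMA statements are SQLite-specific.
con = sqlite3.connect("sales_target.db")
con.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)")

con.execute("PRAGMA foreign_keys = OFF")   # skip referential integrity checks
con.execute("PRAGMA journal_mode = OFF")   # skip rollback-journal logging

rows = [(i, i * 1.5) for i in range(100_000)]
with con:  # one transaction for the whole load
    con.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)

# Restore normal settings once the load has finished.
con.execute("PRAGMA foreign_keys = ON")
con.execute("PRAGMA journal_mode = DELETE")
con.close()
```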

Types

Data loading can be done by full refresh, by incremental loading and updating, or by trickle feed. Full refresh and incremental update are typically run as batch jobs, while trickle feed applies changes almost immediately. The choice of technique may depend on the amount of data that is updated, changed or added, and on how up to date the data must be. The type of data delivered by the source system, and whether historical data from the source system can be trusted, are also important factors.

Full refresh

Full data refresh means that existing data in the target table is deleted first. All data from the source is then loaded into the target table, new indexes are created in the target table, and new measures are calculated for the updated table.

Full refresh is easy to implement, but it involves moving large amounts of data, which can take a long time, and it can make it challenging to keep historical data. [1]
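A minimal full-refresh sketch, again with Python's sqlite3 module, might look as follows. The source and target databases, the customers table, its index and the sample rows are all invented for the example.

```python
import sqlite3

# Full-refresh sketch: empty the target table, reload everything from the
# source, rebuild the secondary index and refresh optimizer statistics.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   [(1, "Ada", "UK"), (2, "Grace", "US"), (3, "Kristen", "NO")])

target = sqlite3.connect("dw_full.db")
target.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")

rows = source.execute("SELECT id, name, country FROM customers").fetchall()

with target:
    target.execute("DROP INDEX IF EXISTS idx_customers_country")
    target.execute("DELETE FROM customers")                # 1. delete existing data
    target.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)  # 2. reload from source
    target.execute("CREATE INDEX idx_customers_country ON customers (country)")  # 3. recreate index
target.execute("ANALYZE")                                  # 4. recompute statistics
```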

Incremental update

Incremental update or incremental refresh means that only new or updated data is retrieved from the source system. [2] [3] New rows are then added to the existing data in the target system, and rows that already exist there are updated. Indexes and statistics are updated accordingly. Incremental update can make loading faster and make it easier to keep track of history, but it can be demanding to set up and maintain. [1]
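One common way to implement incremental update is to keep a watermark of the last loaded change and to upsert only rows newer than it. The sketch below assumes a SQLite version with UPSERT support (3.24 or later); the tables, columns and watermark mechanism are made up for illustration.

```python
import sqlite3

# Incremental-update sketch: fetch only rows changed since the last load
# (tracked by a watermark) and upsert them into the target.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   [(1, "Ada", "2023-02-17T10:00:00"), (2, "Grace", "2023-02-18T09:30:00")])

target = sqlite3.connect("dw_incremental.db")
target.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
target.execute("CREATE TABLE IF NOT EXISTS load_watermark (last_loaded TEXT)")

row = target.execute("SELECT last_loaded FROM load_watermark").fetchone()
last_loaded = row[0] if row else "1970-01-01T00:00:00"

changed = source.execute(
    "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
    (last_loaded,),
).fetchall()

with target:
    target.executemany(
        "INSERT INTO customers (id, name, updated_at) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name, updated_at = excluded.updated_at",
        changed,
    )  # add new rows and update changed ones in place
    if changed:
        target.execute("DELETE FROM load_watermark")
        target.execute("INSERT INTO load_watermark VALUES (?)",
                       (max(r[2] for r in changed),))
target.execute("ANALYZE")  # keep statistics in step with the new data
```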

Trickle feed

Trickle feed or trickle loading means that when the source system is updated, the corresponding changes appear in the target system almost immediately. [4] [5]
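A simple, if naive, way to approximate trickle feed is to poll a change log on the source and apply each change to the target as soon as it is seen, as in the sketch below. The change_log table, its columns and the sample changes are hypothetical; real systems typically rely on database-specific change data capture rather than polling.

```python
import sqlite3
import time

# Trickle-feed sketch: poll a change log on the source and apply each change
# to the target almost as soon as it appears.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE change_log (seq INTEGER PRIMARY KEY, op TEXT, id INTEGER, name TEXT)")
source.executemany("INSERT INTO change_log VALUES (?, ?, ?, ?)",
                   [(1, "I", 1, "Ada"), (2, "U", 1, "Ada L."), (3, "D", 1, None)])

target = sqlite3.connect("dw_trickle.db")
target.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)")

last_seq = 0
while True:
    changes = source.execute(
        "SELECT seq, op, id, name FROM change_log WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    with target:
        for seq, op, cid, name in changes:
            if op == "D":
                target.execute("DELETE FROM customers WHERE id = ?", (cid,))
            else:  # "I" or "U": insert new rows, update existing ones
                target.execute(
                    "INSERT INTO customers (id, name) VALUES (?, ?) "
                    "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
                    (cid, name),
                )
            last_seq = seq
    time.sleep(1)  # near-real-time: check for new changes about once a second
```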

Loading to systems that are in use

When loading data into a system that is currently in use by users or other systems, one must decide when the system should be updated and what should happen to tables that are in use while the update takes place. One possible solution is to make use of shadow tables. [6] [7]
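The shadow-table approach can be sketched with Python's sqlite3 module: the long-running load goes into a shadow table while queries keep hitting the current table, and the final swap is a pair of quick renames. Table names and data are invented for the example.

```python
import sqlite3

# Shadow-table sketch: load into a shadow table, then swap it in with renames.
con = sqlite3.connect("dw_shadow.db")
con.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("DROP TABLE IF EXISTS customers_shadow")
con.execute("CREATE TABLE customers_shadow (id INTEGER PRIMARY KEY, name TEXT)")

# Slow part: load the new data into the shadow table while readers
# keep using the current `customers` table.
con.executemany("INSERT INTO customers_shadow (id, name) VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace")])

# Fast part: swap the tables so readers see the new data after a brief cut-over.
with con:
    con.execute("ALTER TABLE customers RENAME TO customers_old")
    con.execute("ALTER TABLE customers_shadow RENAME TO customers")
    con.execute("DROP TABLE customers_old")
con.close()
```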

Related Research Articles

<span class="mw-page-title-main">Data warehouse</span> Centralized storage of knowledge

In computing, a data warehouse, also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is a core component of business intelligence. Data warehouses are central repositories of data integrated from disparate sources. They store current and historical data organized so as to make it easy to create reports, query and get insights from the data. Unlike operational databases, they are intended to be used by analysts and managers to help make organizational decisions.

In computing, online analytical processing, or OLAP, is an approach to quickly answer multi-dimensional analytical (MDA) queries. The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP). OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

<span class="mw-page-title-main">Extract, transform, load</span> Procedure in computing

Extract, transform, load (ETL) is a three-phase computing process where data is extracted from an input source, transformed, and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations. ETL processing is typically executed using software applications but it can also be done manually by system operators. ETL software typically automates the entire process and can be run manually or on recurring schedules either as single jobs or aggregated into a batch of jobs.

In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. The result is a delta-driven dataset.

In computing, a materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, or may be a subset of the rows and/or columns of a table or join result, or may be a summary using an aggregate function.

In computing, data transformation is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most data integration and data management tasks such as data wrangling, data warehousing, data integration and application integration.

Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, which include both commercial and scientific domains. Data integration appears with increasing frequency as the volume and complexity of data, and the need to share existing data, explode. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from a heterogeneous database system and transformed to a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining, when analyzing and extracting information from existing databases that can be useful for business information.

Real-time business intelligence (RTBI) is a concept describing the process of delivering business intelligence (BI) or information about business operations as they occur. Real time means near to zero latency and access to information whenever it is required.

Spatial extract, transform, load, also known as geospatial transformation and load (GTL), is a process for managing and manipulating geospatial data, for example map data. It is a type of extract, transform, load (ETL) process, with software tools and libraries specialised for geographical information.

SAP IQ is a column-based, petabyte scale, relational database software system used for business intelligence, data warehousing, and data marts. Produced by Sybase Inc., now an SAP company, its primary function is to analyze large amounts of data in a low-cost, highly available environment. SAP IQ is often credited with pioneering the commercialization of column-store technology.

In the data warehouse practice of extract, transform, load (ETL), an early fact or early-arriving fact, also known as late-arriving dimension or late-arriving data, denotes the detection of a dimensional natural key during fact table source loading, prior to the assignment of a corresponding primary key or surrogate key in the dimension table. Hence, the fact which cites the dimension arrives early, relative to the definition of the dimension value. An example could be backdating or making corrections to data.

<span class="mw-page-title-main">Data vault modeling</span> Database modeling method

Datavault or data vault modeling is a database modeling method that is designed to provide long-term historical storage of data coming in from multiple operational systems. It is also a method of looking at historical data that deals with issues such as auditing, tracing of data, loading speed and resilience to change as well as emphasizing the need to trace where all the data in the database came from. This means that every row in a data vault must be accompanied by record source and load date attributes, enabling an auditor to trace values back to the source. The concept was published in 2000 by Dan Linstedt.

A staging area, or landing zone, is an intermediate storage area used for data processing during the extract, transform and load (ETL) process. The data staging area sits between the data source(s) and the data target(s), which are often data warehouses, data marts, or other data repositories.

Oracle Data Integrator (ODI) is an extract, load, transform (ELT) tool produced by Oracle that offers a graphical environment to build, manage and maintain data integration processes in business intelligence systems.

Innovative Routines International (IRI), Inc. is an American software company first known for bringing mainframe sort merge functionality into open systems. IRI was the first vendor to develop a commercial replacement for the Unix sort command, and combine data transformation and reporting in Unix batch processing environments. In 2007, IRI's coroutine sort ("CoSort") became the first product to collate and convert multi-gigabyte XML and LDIF files, join and lookup across multiple files, and apply role-based data privacy functions for fields within sensitive files.

Data virtualization is an approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located, and can provide a single customer view of the overall data.

The term in-memory processing is used for two different things:

  1. In computer science, in-memory processing (PIM) is a computer architecture in which data operations are available directly on the data memory, rather than having to be transferred to CPU registers first. This may improve the power usage and performance of moving data between the processor and the main memory.
  2. In software engineering, in-memory processing is a software architecture where a database is kept entirely in random-access memory (RAM) or flash memory so that usual accesses, in particular read or query operations, do not require access to disk storage. This may allow faster data operations such as "joins", and faster reporting and decision-making in business.

Agile Business Intelligence (ABI) refers to the use of Agile Software Development for business intelligence projects. The primary goal of ABI is to accelerate the delivery of valuable insights to the organization, enabling faster decision-making compared to traditional BI approaches. ABI attempts to enable the business intelligence team, businesspeople or stakeholders to make business decisions more quickly.

Extract, load, transform (ELT) is an alternative to extract, transform, load (ETL) used with data lake implementations. In contrast to ETL, in ELT models the data is not transformed on entry to the data lake, but stored in its original raw format. This enables faster loading times. However, ELT requires sufficient processing power within the data processing engine to carry out the transformation on demand, to return the results in a timely manner. Since the data is not processed on entry to the data lake, the query and schema do not need to be defined a priori. ELT is a data pipeline model.

Data blending is a process whereby big data from multiple sources are merged into a single data warehouse or data set.

References

  1. "Incremental Data Load vs Full Load ETL: 4 Critical Differences - Learn | Hevo". 2022-04-14. Retrieved 2023-02-18.
  2. "Incremental Loading". Retrieved 2023-02-18.
  3. Mitchell, Tim (2020-07-23). "The What, Why, When, and How of Incremental Loads". Retrieved 2023-02-18.
  4. Zuters, Janis (2011). "Near Real-Time Data Warehousing with Multi-stage Trickle and Flip". In Grabis, Janis; Kirikova, Marite (eds.). Perspectives in Business Informatics Research. Lecture Notes in Business Information Processing. Vol. 90. Springer Berlin Heidelberg. pp. 73–82. doi:10.1007/978-3-642-24511-4_6. ISBN 978-3-642-24510-7. "A data warehouse typically is a collection of historical data designed for decision support, so it is updated from the sources periodically, mostly on a daily basis. Today's business however asks for fresher data. Real-time warehousing is one of the trends to accomplish this, but there are a number of challenges to move towards true real-time. This paper proposes 'multi-stage trickle and flip' methodology for data warehouse refreshment. It is based on the 'trickle and flip' principle and extended in order to further insulate loading and querying activities, thus enabling both of them to be more efficient."
  5. "Trickle Loading Data". Retrieved 2023-02-18.
  6. "Create shadow tables for synchronization - Data Management - Alibaba Cloud Documentation Center". Retrieved 2023-02-18.
  7. "Shadow tables". IBM. 2015-08-10. Retrieved 2023-02-18.