Data transformation (computing)

Last updated

In computing, data transformation is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most data integration [1] and data management tasks such as data wrangling, data warehousing, data integration and application integration.

Contents

Data transformation can be simple or complex based on the required changes to the data between the source (initial) data and the target (final) data. Data transformation is typically performed via a mixture of manual and automated steps. [2] Tools and technologies used for data transformation can vary widely based on the format, structure, complexity, and volume of the data being transformed.

A master data recast is another form of data transformation where the entire database of data values is transformed or recast without extracting the data from the database. All data in a well designed database is directly or indirectly related to a limited set of master database tables by a network of foreign key constraints. Each foreign key constraint is dependent upon a unique database index from the parent database table. Therefore, when the proper master database table is recast with a different unique index, the directly and indirectly related data are also recast or restated. The directly and indirectly related data may also still be viewed in the original form since the original unique index still exists with the master data. Also, the database recast must be done in such a way as to not impact the applications architecture software.

When the data mapping is indirect via a mediating data model, the process is also called data mediation.

Data transformation process

Data transformation can be divided into the following steps, each applicable as needed based on the complexity of the transformation required.

These steps are often the focus of developers or technical data analysts who may use multiple specialized tools to perform their tasks.

The steps can be described as follows:

Data discovery is the first step in the data transformation process. Typically the data is profiled using profiling tools or sometimes using manually written profiling scripts to better understand the structure and characteristics of the data and decide how it needs to be transformed.

Data mapping is the process of defining how individual fields are mapped, modified, joined, filtered, aggregated etc. to produce the final desired output. Developers or technical data analysts traditionally perform data mapping since they work in the specific technologies to define the transformation rules (e.g. visual ETL tools, [3] transformation languages).

Code generation is the process of generating executable code (e.g. SQL, Python, R, or other executable instructions) that will transform the data based on the desired and defined data mapping rules. [4] Typically, the data transformation technologies generate this code [5] based on the definitions or metadata defined by the developers.

Code execution is the step whereby the generated code is executed against the data to create the desired output. The executed code may be tightly integrated into the transformation tool, or it may require separate steps by the developer to manually execute the generated code.

Data review is the final step in the process, which focuses on ensuring the output data meets the transformation requirements. It is typically the business user or final end-user of the data that performs this step. Any anomalies or errors in the data that are found and communicated back to the developer or data analyst as new requirements to be implemented in the transformation process. [1]

Types of data transformation

Batch data transformation

Traditionally, data transformation has been a bulk or batch process, [6] whereby developers write code or implement transformation rules in a data integration tool, and then execute that code or those rules on large volumes of data. [7] This process can follow the linear set of steps as described in the data transformation process above.

Batch data transformation is the cornerstone of virtually all data integration technologies such as data warehousing, data migration and application integration. [1]

When data must be transformed and delivered with low latency, the term "microbatch" is often used. [6] This refers to small batches of data (e.g. a small number of rows or small set of data objects) that can be processed very quickly and delivered to the target system when needed.

Benefits of batch data transformation

Traditional data transformation processes have served companies well for decades. The various tools and technologies (data profiling, data visualization, data cleansing, data integration etc.) have matured and most (if not all) enterprises transform enormous volumes of data that feed internal and external applications, data warehouses and other data stores. [8]

Limitations of traditional data transformation

This traditional process also has limitations that hamper its overall efficiency and effectiveness. [1] [2] [7]

The people who need to use the data (e.g. business users) do not play a direct role in the data transformation process. [9] Typically, users hand over the data transformation task to developers who have the necessary coding or technical skills to define the transformations and execute them on the data. [8]

This process leaves the bulk of the work of defining the required transformations to the developer, which often in turn do not have the same domain knowledge as the business user. The developer interprets the business user requirements and implements the related code/logic. This has the potential of introducing errors into the process (through misinterpreted requirements), and also increases the time to arrive at a solution. [9] [10]

This problem has given rise to the need for agility and self-service in data integration (i.e. empowering the user of the data and enabling them to transform the data themselves interactively). [7] [10]

There are companies that provide self-service data transformation tools. They are aiming to efficiently analyze, map and transform large volumes of data without the technical knowledge and process complexity that currently exists. While these companies use traditional batch transformation, their tools enable more interactivity for users through visual platforms and easily repeated scripts. [11]

Still, there might be some compatibility issues (e.g. new data sources like IoT may not work correctly with older tools) and compliance limitations due to the difference in data governance, preparation and audit practices. [12]

Interactive data transformation

Interactive data transformation (IDT) [13] is an emerging capability that allows business analysts and business users the ability to directly interact with large datasets through a visual interface, [9] understand the characteristics of the data (via automated data profiling or visualization), and change or correct the data through simple interactions such as clicking or selecting certain elements of the data. [2]

Although interactive data transformation follows the same data integration process steps as batch data integration, the key difference is that the steps are not necessarily followed in a linear fashion and typically don't require significant technical skills for completion. [14]

There are a number of companies that provide interactive data transformation tools, including Trifacta, Alteryx and Paxata. They are aiming to efficiently analyze, map and transform large volumes of data while at the same time abstracting away some of the technical complexity and processes which take place under the hood.

Interactive data transformation solutions provide an integrated visual interface that combines the previously disparate steps of data analysis, data mapping and code generation/execution and data inspection. [8] That is, if changes are made at one step (like for example renaming), the software automatically updates the preceding or following steps accordingly. Interfaces for interactive data transformation incorporate visualizations to show the user patterns and anomalies in the data so they can identify erroneous or outlying values. [9]

Once they've finished transforming the data, the system can generate executable code/logic, which can be executed or applied to subsequent similar data sets.

By removing the developer from the process, interactive data transformation systems shorten the time needed to prepare and transform the data, eliminate costly errors in interpretation of user requirements and empower business users and analysts to control their data and interact with it as needed. [10]

Transformational languages

There are numerous languages available for performing data transformation. Many transformation languages require a grammar to be provided. In many cases, the grammar is structured using something closely resembling Backus–Naur form (BNF). There are numerous languages available for such purposes varying in their accessibility (cost) and general usefulness. [15] Examples of such languages include:

Additionally, companies such as Trifacta and Paxata have developed domain-specific transformational languages (DSL) for servicing and transforming datasets. The development of domain-specific languages has been linked to increased productivity and accessibility for non-technical users. [16] Trifacta's “Wrangle” is an example of such a domain specific language. [17]

Another advantage of the recent domain-specific transformational languages trend is that a domain-specific transformational language can abstract the underlying execution of the logic defined in the domain-specific transformational language. They can also utilize that same logic in various processing engines, such as Spark, MapReduce, and Dataflow. In other words, with a domain-specific transformational language, the transformation language is not tied to the underlying engine. [17]

Although transformational languages are typically best suited for transformation, something as simple as regular expressions can be used to achieve useful transformation. A text editor like vim, emacs or TextPad supports the use of regular expressions with arguments. This would allow all instances of a particular pattern to be replaced with another pattern using parts of the original pattern. For example:

foo ("some string", 42, gCommon); bar (someObj, anotherObj);  foo ("another string", 24, gCommon); bar (myObj, myOtherObj); 

could both be transformed into a more compact form like:

foobar("some string", 42, someObj, anotherObj); foobar("another string", 24, myObj, myOtherObj);

In other words, all instances of a function invocation of foo with three arguments, followed by a function invocation with two arguments would be replaced with a single function invocation using some or all of the original set of arguments.

Another advantage to using regular expressions is that they will not fail the null transform test. That is, using your transformational language of choice, run a sample program through a transformation that doesn't perform any transformations. Many transformational languages will fail this test.

See also

Related Research Articles

In computing, Common Gateway Interface (CGI) is an interface specification that enables web servers to execute an external program to process HTTP or HTTPS user requests.

<span class="mw-page-title-main">Extract, transform, load</span> Procedure in computing

In computing, extract, transform, load (ETL) is a three-phase process where data is extracted from an input source, transformed, and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations. ETL processing is typically executed using software applications but it can also be done manually by system operators. ETL software typically automates the entire process and can be run manually or on reccurring schedules either as single jobs or aggregated into a batch of jobs.

Model Driven Architecture (MDA) is a software design approach for the development of software systems. It provides a set of guidelines for the structuring of specifications, which are expressed as models. Model Driven Architecture is a kind of domain engineering, and supports model-driven engineering of software systems. It was launched by the Object Management Group (OMG) in 2001.

<span class="mw-page-title-main">IDEF</span> Family of modeling languages

IDEF, initially an abbreviation of ICAM Definition and renamed in 1999 as Integration Definition, is a family of modeling languages in the field of systems and software engineering. They cover a wide range of uses from functional modeling to data, simulation, object-oriented analysis and design, and knowledge acquisition. These definition languages were developed under funding from U.S. Air Force and, although still most commonly used by them and other military and United States Department of Defense (DoD) agencies, are in the public domain.

<span class="mw-page-title-main">Visual programming language</span> Programming language written graphically by a user

In computing, a visual programming language, also known as diagrammatic programming, graphical programming or block coding, is a programming language that lets users create programs by manipulating program elements graphically rather than by specifying them textually. A VPL allows programming with visual expressions, spatial arrangements of text and graphic symbols, used either as elements of syntax or secondary notation. For example, many VPLs are based on the idea of "boxes and arrows", where boxes or other screen objects are treated as entities, connected by arrows, lines or arcs which represent relations.

A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains. There are a wide variety of DSLs, ranging from widely used languages for common domains, such as HTML for web pages, down to languages used by only one or a few pieces of software, such as MUSH soft code. DSLs can be further subdivided by the kind of language, and include domain-specific markup languages, domain-specific modeling languages, and domain-specific programming languages. Special-purpose computer languages have always existed in the computer age, but the term "domain-specific language" has become more popular due to the rise of domain-specific modeling. Simpler DSLs, particularly ones used by a single application, are sometimes informally called mini-languages.

Web development is the work involved in developing a website for the Internet or an intranet. Web development can range from developing a simple single static page of plain text to complex web applications, electronic businesses, and social network services. A more comprehensive list of tasks to which Web development commonly refers, may include Web engineering, Web design, Web content development, client liaison, client-side/server-side scripting, Web server and network security configuration, and e-commerce development.

In software testing, test automation is the use of software separate from the software being tested to control the execution of tests and the comparison of actual outcomes with predicted outcomes. Test automation can automate some repetitive but necessary tasks in a formalized testing process already in place, or perform additional testing that would be difficult to do manually. Test automation is critical for continuous delivery and continuous testing.

<span class="mw-page-title-main">Data Transformation Services</span> Microsoft database tool

Data Transformation Services (DTS) is a Microsoft database tool with a set of objects and utilities to allow the automation of extract, transform and load operations to or from a database. The objects are DTS packages and their components, and the utilities are called DTS tools. DTS was included with earlier versions of Microsoft SQL Server, and was almost always used with SQL Server databases, although it could be used independently with other databases.

<span class="mw-page-title-main">Metamodeling</span> Concept of software engineering

A metamodel is a model of a model, and metamodeling is the process of generating such metamodels. Thus metamodeling or meta-modeling is the analysis, construction and development of the frames, rules, constraints, models and theories applicable and useful for modeling a predefined class of problems. As its name implies, this concept applies the notions of meta- and modeling in software engineering and systems engineering. Metamodels are of many types and have diverse applications.

In computing and data management, data mapping is the process of creating data element mappings between two distinct data models. Data mapping is used as a first step for a wide variety of data integration tasks, including:

Microsoft SQL Server Integration Services (SSIS) is a component of the Microsoft SQL Server database software that can be used to perform a broad range of data migration tasks.

Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, which include both commercial and scientific domains. Data integration appears with increasing frequency as the volume, complexity and the need to share existing data explodes. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from a heterogeneous database system and transformed to a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining when analyzing and extracting information from existing databases that can be useful for Business information.

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. The goal of data wrangling is to assure quality and useful data. Data analysts typically spend the majority of their time in the process of data wrangling compared to the actual analysis of the data.

KNIME, the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining "Building Blocks of Analytics" concept. A graphical user interface and use of JDBC allows assembly of nodes blending different data sources, including preprocessing, for modeling, data analysis and visualization without, or with only minimal, programming.

<span class="mw-page-title-main">Guaraná DSL</span>

Guaraná DSL is a domain-specific language (DSL) to design enterprise application integration (EAI) solutions at a high level of abstraction. The resulting models are platform-independent, so engineers do not need to have skills on a low-level integration technology when designing their solutions. Furthermore, this design can be re-used to automatically generate executable EAI solutions for different target technologies.

Knowledge extraction is the creation of knowledge from structured and unstructured sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge or the generation of a schema based on the source data.

For several years parallel hardware was only available for distributed computing but recently it is becoming available for the low end computers as well. Hence it has become inevitable for software programmers to start writing parallel applications. It is quite natural for programmers to think sequentially and hence they are less acquainted with writing multi-threaded or parallel processing applications. Parallel programming requires handling various issues such as synchronization and deadlock avoidance. Programmers require added expertise for writing such applications apart from their expertise in the application domain. Hence programmers prefer to write sequential code and most of the popular programming languages support it. This allows them to concentrate more on the application. Therefore, there is a need to convert such sequential applications to parallel applications with the help of automated tools. The need is also non-trivial because large amount of legacy code written over the past few decades needs to be reused and parallelized.

Data lineage includes the data origin, what happens to it, and where it moves over time. Data lineage provides visibility and simplifies tracing errors back to the root cause in a data analytics process.

Oracle TopLink is a mapping and persistence framework for Java developers. TopLink is produced by Oracle and is a part of Oracle's OracleAS, WebLogic, and OC4J servers. It is an object-persistence and object-transformation framework. TopLink provides development tools and run-time functionalities that ease the development process and help increase functionality. Persistent object-oriented data is stored in relational databases which helps build high-performance applications. Storing data in either XML or relational databases is made possible by transforming it from object-oriented data.

References

  1. 1 2 3 4 CIO.com. Agile Comes to Data Integration. Retrieved from: https://www.cio.com/article/2378615/data-management/agile-comes-to-data-integration.html Archived 2017-08-29 at the Wayback Machine
  2. 1 2 3 DataXFormer. Morcos, Abedjan, Ilyas, Ouzzani, Papotti, Stonebraker. An interactive data transformation tool. Retrieved from: http://livinglab.mit.edu/wp-content/uploads/2015/12/DataXFormer-An-Interactive-Data-Transformation-Tool.pdf Archived 2019-08-05 at the Wayback Machine
  3. DWBIMASTER. Top 10 ETL Tools. Retrieved from: http://dwbimaster.com/top-10-etl-tools/ Archived 2017-08-29 at the Wayback Machine
  4. Petr Aubrecht, Zdenek Kouba. Metadata driven data transformation. Retrieved from: http://labe.felk.cvut.cz/~aubrech/bin/Sumatra.pdf Archived 2021-04-16 at the Wayback Machine
  5. LearnDataModeling.com. Code Generators. Retrieved from: http://www.learndatamodeling.com/tm_code_generator.php Archived 2017-08-02 at the Wayback Machine
  6. 1 2 TDWI. 10 Rules for Real-Time Data Integration. Retrieved from: https://tdwi.org/Articles/2012/12/11/10-Rules-Real-Time-Data-Integration.aspx?Page=1 Archived 2017-08-29 at the Wayback Machine
  7. 1 2 3 Tope Omitola, Andr´e Freitas, Edward Curry, Sean O'Riain, Nicholas Gibbins, and Nigel Shadbolt. Capturing Interactive Data Transformation Operations using Provenance Workflows Retrieved from: http://andrefreitas.org/papers/preprint_capturing%20interactive_data_transformation_eswc_highlights.pdf Archived 2016-01-31 at the Wayback Machine
  8. 1 2 3 The Value of Data Transformation
  9. 1 2 3 4 Morton, Kristi -- Interactive Data Integration and Entity Resolution for Exploratory Visual Data Analytics. Retrieved from: https://digital.lib.washington.edu/researchworks/handle/1773/35165 Archived 2017-09-07 at the Wayback Machine
  10. 1 2 3 McKinsey.com. Using Agile to Accelerate Data Transformation
  11. "Why Self-Service Prep Is a Killer App for Big Data". Datanami. 2016-05-31. Archived from the original on 2017-09-21. Retrieved 2017-09-20.
  12. Sergio, Pablo (2022-05-27). "Your Practical Guide to Data Transformation". Coupler.io Blog. Archived from the original on 2022-05-17. Retrieved 2022-07-08.
  13. Tope Omitola , Andr´e Freitas , Edward Curry , Sean O’Riain , Nicholas Gibbins , and Nigel Shadbolt. Capturing Interactive Data Transformation Operations using Provenance Workflows Retrieved from: http://andrefreitas.org/papers/preprint_capturing%20interactive_data_transformation_eswc_highlights.pdf Archived 2016-01-31 at the Wayback Machine
  14. Peng Cong, Zhang Xiaoyi. Research and Design of Interactive Data Transformation and Migration System for Heterogeneous Data Sources. Retrieved from: https://ieeexplore.ieee.org/document/5211525/ Archived 2018-06-07 at the Wayback Machine
  15. DMOZ. Extraction and Transformation. Retrieved from: https://dmoztools.net/Computers/Software/Databases/Data_Warehousing/Extraction_and_Transformation/ Archived 2017-08-29 at the Wayback Machine
  16. "Wrangle Language - Trifacta Wrangler - Trifacta Documentation". docs.trifacta.com. Archived from the original on 2017-09-21. Retrieved 2017-09-20.
  17. 1 2 Kandel, Joe Hellerstein, Sean. "Advantages of a Domain-Specific Language Approach to Data Transformation - Strata + Hadoop World in New York 2014". conferences.oreilly.com. Archived from the original on 2017-09-21. Retrieved 2017-09-20.{{cite web}}: CS1 maint: multiple names: authors list (link)