Data virtualization

Last updated

Data virtualization is an approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located, [1] and can provide a single customer view (or single view of any other entity) of the overall data. [2]

Contents

Unlike the traditional extract, transform, load ("ETL") process, the data remains in place, and real-time access is given to the source system for the data. This reduces the risk of data errors, of the workload moving data around that may never be used, and it does not attempt to impose a single data model on the data (an example of heterogeneous data is a federated database system). The technology also supports the writing of transaction data updates back to the source systems. [3] To resolve differences in source and consumer formats and semantics, various abstraction and transformation techniques are used. This concept and software is a subset of data integration and is commonly used within business intelligence, service-oriented architecture data services, cloud computing, enterprise search, and master data management.

Applications, benefits and drawbacks

The defining feature of data virtualization is that the data used remains in its original locations and real-time access is established to allow analytics across multiple sources. This aids in resolving some technical difficulties such as compatibility problems when combining data from various platforms, lowering the risk of error caused by faulty data, and guaranteeing that the newest data is used. Furthermore, avoiding the creation of a new database containing personal information can make it easier to comply with privacy regulations. As a result, data virtualization creates new possibilities for data use. [4]

Building on this, data virtualization's real value, particularly for users, is its declarative approach. Unlike traditional data integration methods that require specifying every step of integration, this approach can be less error-prone and more efficient. Traditional methods are tedious, especially when adapting to changing requirements, involving changes at multiple steps. Data virtualization, in contrast, allows users to simply describe the desired outcome. The software then automatically generates the necessary steps to achieve this result. If the desired outcome changes, updating the description suffices, and the software adjusts the intermediate steps accordingly. This flexibility can accelerate processes by up to five times, underscoring the primary advantage of data virtualization. [5]

However, with data virtualization, the connection to all necessary data sources must be operational as there is no local copy of the data, which is one of the main drawbacks of the approach. Connection problems occur more often in complex systems where one or more crucial sources will occasionally be unavailable. Smart data buffering, such as keeping the data from the most recent few requests in the virtualization system buffer can help to mitigate this issue. [4]

Moreover, because data virtualization solutions may use large numbers of network connections to read the original data and server virtualised tables to other solutions over the network, system security requires more consideration than it does with traditional data lakes. In a conventional data lake system, data can be imported into the lake by following specific procedures in a single environment. When using a virtualization system, the environment must separately establish secure connections with each data source, which is typically located in a different environment from the virtualization system itself. [4]

Security of personal data and compliance with regulations can be a major issue when introducing new services or attempting to combine various data sources. When data is delivered for analysis, data virtualisation can help to resolve privacy-related problems. Virtualization makes it possible to combine personal data from different sources without physically copying them to another location while also limiting the view to all other collected variables. However, virtualization does not eliminate the requirement to confirm the security and privacy of the analysis results before making them more widely available. Regardless of the chosen data integration method, all results based on personal level data should be protected with the appropriate privacy requirements. [4]

Data virtualization and data warehousing

Some enterprise landscapes are filled with disparate data sources including multiple data warehouses, data marts, and/or data lakes, even though a Data Warehouse, if implemented correctly, should be unique and a single source of truth. Data virtualization can efficiently bridge data across data warehouses, data marts, and data lakes without having to create a whole new integrated physical data platform. Existing data infrastructure can continue performing their core functions while the data virtualization layer just leverages the data from those sources. This aspect of data virtualization makes it complementary to all existing data sources and increases the availability and usage of enterprise data.[ citation needed ]

Data virtualization may also be considered as an alternative to ETL and data warehousing but for performance considerations it's not really recommended for a very large data warehouse. Data virtualization is inherently aimed at producing quick and timely insights from multiple sources without having to embark on a major data project with extensive ETL and data storage. However, data virtualization may be extended and adapted to serve data warehousing requirements also. This will require an understanding of the data storage and history requirements along with planning and design to incorporate the right type of data virtualization, integration, and storage strategies, and infrastructure/performance optimizations (e.g., streaming, in-memory, hybrid storage).[ citation needed ]

Examples

Functionality

Data Virtualization software provides some or all of the following capabilities: [7]

Data virtualization software may include functions for development, operation, and/or management.[ citation needed ]

A metadata engine collects, stores and analyzes information about data and metadata (data about data) in use within a domain. [8] [ clarification needed ]

Benefits include:

Drawbacks include:

Avoid usage:

History

Enterprise information integration (EII) (first coined by Metamatrix), now known as Red Hat JBoss Data Virtualization, and federated database systems are terms used by some vendors to describe a core element of data virtualization: the capability to create relational JOINs in a federated VIEW.[ citation needed ][ clarification needed ]

Technology

Some data virtualization solutions and vendors:

Another more up-to-date list with user rankings is compiled by Gartner. [28]

See also

Related Research Articles

<span class="mw-page-title-main">Data warehouse</span> Centralized storage of knowledge

In computing, a data warehouse, also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating reports. This is beneficial for companies as it enables them to interrogate and draw insights from their data and make decisions.

<span class="mw-page-title-main">IBM Db2</span> Relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB2 until 2017, when it changed to its present form.

<span class="mw-page-title-main">Extract, transform, load</span> Procedure in computing

In computing, extract, transform, load (ETL) is a three-phase process where data is extracted from an input source, transformed, and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations. ETL processing is typically executed using software applications but it can also be done manually by system operators. ETL software typically automates the entire process and can be run manually or on recurring schedules either as single jobs or aggregated into a batch of jobs.

Enterprise information integration (EII) is the ability to support a unified view of data and information for an entire organization. In a data virtualization application of EII, a process of information integration, using data abstraction to provide a unified interface for viewing all the data within an organization, and a single set of structures and naming conventions to represent this data; the goal of EII is to get a large set of heterogeneous data sources to appear to a user or system as a single, homogeneous data source.

Enterprise content management (ECM) extends the concept of content management by adding a timeline for each content item and, possibly, enforcing processes for its creation, approval, and distribution. Systems using ECM generally provide a secure repository for managed items, analog or digital. They also include one methods for importing content to manage new items, and several presentation methods to make items available for use. Although ECM content may be protected by digital rights management (DRM), it is not required. ECM is distinguished from general content management by its cognizance of the processes and procedures of the enterprise for which it is created.

Enterprise software, also known as enterprise application software (EAS), is computer software used to satisfy the needs of an organization rather than its individual users. Enterprise software is an integral part of a computer-based information system, handling a number of business operations, for example to enhance business and management reporting tasks, or support production operations and back office functions. Enterprise systems must process information at a relatively high speed.

Veritas Backup Exec is a data protection software product designed for customers with mixed physical and virtual environments, and who are moving to public cloud services. Supported platforms include VMware and Hyper-V virtualization, Windows and Linux operating systems, Amazon S3, Microsoft Azure and Google Cloud Storage, among others. All management and configuration operations are performed with a single user interface. Backup Exec also provides integrated deduplication, replication, and disaster recovery capabilities and helps to manage multiple backup servers or multi-drive tape loaders.

Data integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, which include both commercial and scientific domains. Data integration appears with increasing frequency as the volume, complexity and the need to share existing data explodes. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from a heterogeneous database system and transformed to a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining when analyzing and extracting information from existing databases that can be useful for Business information.

<span class="mw-page-title-main">Apache OFBiz</span> Open-source enterprise resource planning software

Apache OFBiz is an open source enterprise resource planning (ERP) system. It provides a suite of enterprise applications that integrate and automate many of the business processes of an enterprise.

<span class="mw-page-title-main">WaveMaker</span> Low-code programming platform

WaveMaker is a Java-based low-code development platform designed for building software applications and platforms. The company, WaveMaker Inc., is based in Mountain View, California. The platform is intended to assist enterprises in speeding up their application development and IT modernization initiatives through low-code capabilities. Additionally, for independent software vendors (ISVs), WaveMaker serves as a customizable low-code component that integrates into their products.

oVirt Free, open-source virtualization management platform

oVirt is a free, open-source virtualization management platform. It was founded by Red Hat as a community project on which Red Hat Virtualization is based. It allows centralized management of virtual machines, compute, storage and networking resources, from an easy-to-use web-based front-end with platform independent access. KVM on x86-64, PowerPC64 and s390x architecture are the only hypervisors supported, but there is an ongoing effort to support ARM architecture in a future releases.

<span class="mw-page-title-main">Vertica</span> Software company

Vertica is an analytic database management software company. Vertica was founded in 2005 by the database researcher Michael Stonebraker with Andrew Palmer as the founding CEO. Ralph Breslauer and Christopher P. Lynch served as CEOs later on.

Innovative Routines International (IRI), Inc. is an American software company first known for bringing mainframe sort merge functionality into open systems. IRI was the first vendor to develop a commercial replacement for the Unix sort command, and combine data transformation and reporting in Unix batch processing environments. In 2007, IRI's coroutine sort ("CoSort") became the first product to collate and convert multi-gigabyte XML and LDIF files, join and lookup across multiple files, and apply role-based data privacy functions for fields within sensitive files.

<span class="mw-page-title-main">Pentaho</span> Business intelligence software

Pentaho is the brand name for several Data Management software products that make up the Pentaho+ Data Platform. These include Pentaho Data Integration, Pentaho Business AnalyticsPentaho Data Catalog, and Pentaho Data Optimiser. The Pentaho+ Platform helps organisations to become “data-fit” prior to operationalising AI.

The JBoss Enterprise Application Platform is a subscription-based/open-source Java EE-based application server runtime platform used for building, deploying, and hosting highly-transactional Java applications and services developed and maintained by Red Hat. The JBoss Enterprise Application Platform is part of Red Hat's Enterprise Middleware portfolio of software. Because it is Java-based, the JBoss application server operates across platforms; it is usable on any operating system that supports Java. JBoss Enterprise Application Platform was originally called JBoss and was developed by the eponymous company JBoss, acquired by Red Hat in 2006.

HP ConvergedSystem is a portfolio of system-based products from Hewlett-Packard (HP) that integrates preconfigured IT components into systems for virtualization, cloud computing, big data, collaboration, converged management, and client virtualization. Composed of servers, storage, networking, and integrated software and services, the systems are designed to address the cost and complexity of data center operations and maintenance by pulling the IT components together into a single resource pool so they are easier to manage and faster to deploy. Where previously it would take three to six months from the time of order to get a system up and running, it now reportedly takes as few as 20 days with the HP ConvergedSystem.

<span class="mw-page-title-main">SAP HANA</span> Database management system by SAP

SAP HANA is an in-memory, column-oriented, relational database management system developed and marketed by SAP SE. Its primary function as the software running a database server is to store and retrieve data as requested by the applications. In addition, it performs advanced analytics and includes extract, transform, load (ETL) capabilities as well as an application server.

<span class="mw-page-title-main">Oracle Cloud</span> Cloud computing service

Oracle Cloud is a cloud computing service offered by Oracle Corporation providing servers, storage, network, applications and services through a global network of Oracle Corporation managed data centers. The company allows these services to be provisioned on demand over the Internet.

References

  1. "What is Data Virtualization?", Margaret Rouse, TechTarget.com, retrieved 19 August 2013
  2. Streamlining Customer Data
  3. 1 2 3 "Data virtualisation on rise as ETL alternative for data integration" Gareth Morgan, Computer Weekly, retrieved 19 August 2013
  4. 1 2 3 4 Paiho, Satu; Tuominen, Pekka; Rökman, Jyri; Ylikerälä, Markus; Pajula, Juha; Siikavirta, Hanne (2022). "Opportunities of collected city data for smart cities". IET Smart Cities. 4 (4): 275–291. doi: 10.1049/smc2.12044 . S2CID   253467923.
  5. 1 2 "The True Value of Data Virtualization: Beyond Marketing Buzzwords", Nick Golovin, medium.com, retrieved 14 November 2023
  6. "Hammerspace - A True Global File System". Hammerspace. Retrieved 2021-10-31.
  7. Summan, Jesse; Handmaker, Leslie (2022-12-20). "Data Federation vs. Data Virtualization". StreamSets. Retrieved 2024-02-08.
  8. Kendall, Aaron. "Metadata-Driven Design: Designing a Flexible Engine for API Data Retrieval". InfoQ. Retrieved 25 April 2017.
  9. "Rapid Access to Disparate Data Across Projects Without Rework" Informatica, retrieved 19 August 2013
  10. Data virtualization: 6 best practices to help the business 'get it' Joe McKendrick, ZDNet, 27 October 2011
  11. |IT pros reveal benefits, drawbacks of data virtualization software" Mark Brunelli, SearchDataManagement, 11 October 2012
  12. 1 2 3 "The Pros and Cons of Data Virtualization" Archived 2014-08-05 at the Wayback Machine Loraine Lawson, BusinessEdge, 7 October 2011
  13. "IBM Data Virtualization". www.ibm.com. Retrieved 2024-04-09.
  14. https://www.actifio.com/company/blog/post/enterprise-data-service-new-copy-data-virtualization/ [ bare URL ]
  15. "Ultrawrap - Semantic Web Standards". www.w3.org. Retrieved 2024-04-09.
  16. "Data Virtuality - Integrate data for better-informed decisions". Data Virtuality. Retrieved 2024-04-09.
  17. "My Blog – My WordPress Blog". 2023-09-19. Retrieved 2024-04-09.
  18. "The industry leading data company for DevOps". Delphix. Retrieved 2024-04-09.
  19. "Denodo is a leader in data management". Denodo. 2014-09-03. Retrieved 2024-04-09.
  20. https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RWJFdq [ bare URL ]
  21. "Home". Querona Data Virtualization. Retrieved 2024-04-09.
  22. "Getting Started Guide Red Hat JBoss Data Virtualization 6.4 | Red Hat Customer Portal". access.redhat.com. Retrieved 2024-04-09.
  23. "Stone Bond Technologies | Advanced Data Integration Platform Solution". Stone Bond Technologies. Retrieved 2024-04-09.
  24. "Stratio Business Semantic Data Layer delivers 99% answer accuracy for LLMs". Stratio. 2024-01-15. Retrieved 2024-04-09.
  25. "Teiid". teiid.io. Retrieved 2024-04-09.
  26. "Managing the Veritas provisioning file system (VPFS) configuration parameters | Managing NetBackup services from the deduplication shell | Accessing NetBackup WORM storage server instances for management tasks | Managing NetBackup application instances | NetBackup™ 10.2.0.1 Application Guide | Veritas™". www.veritas.com. Retrieved 2024-04-09.
  27. "XAware Data Integration Project". SourceForge. 2016-04-06. Retrieved 2024-04-09.
  28. "Best Data Virtualization Reviews". Gartner . 2024. Retrieved 2024-02-07.

Further reading