Data virtualization
Data virtualization is an approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data. Unlike the traditional extract, transform, load ("ETL") process, the data remains in place and real-time access is given to the source system for the data. This reduces the risk of data errors and the workload of moving data that may never be used, and it does not attempt to impose a single data model on the data (an example of heterogeneous data is a federated database system). The technology also supports the writing of transaction data updates back to the source systems."Data virtualisation on rise as ETL alternative for data integration", Gareth Morgan, Computer Weekly, retrieved 19 August 2013.

To resolve differences in source and consumer formats and semantics, various abstraction and transformation techniques are used. This concept and software is a subset of data integration and is commonly used within business intelligence, service-oriented architecture data services, cloud computing, enterprise search, and master data management.
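The abstraction described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation; all class and source names (CsvSource, DictSource, VirtualLayer) are hypothetical. A caller queries a logical table name and never sees where the rows live or how they are formatted:

```python
import csv
import io

# Hypothetical adapters: each hides the format and location of one source.
class CsvSource:
    """Serves rows from CSV text (e.g. a file export) as dicts."""
    def __init__(self, csv_text):
        self._text = csv_text

    def rows(self):
        return list(csv.DictReader(io.StringIO(self._text)))

class DictSource:
    """Serves rows already held as Python dicts (e.g. from an API)."""
    def __init__(self, records):
        self._records = records

    def rows(self):
        return list(self._records)

class VirtualLayer:
    """Maps logical table names to sources; callers never see formats."""
    def __init__(self):
        self._tables = {}

    def register(self, name, source):
        self._tables[name] = source

    def query(self, name, predicate=lambda row: True):
        # Data stays at the source; rows are fetched on demand.
        return [r for r in self._tables[name].rows() if predicate(r)]

layer = VirtualLayer()
layer.register("customers", CsvSource("id,name\n1,Ada\n2,Grace\n"))
layer.register("orders", DictSource([{"id": "10", "customer_id": "1"}]))

ada = layer.query("customers", lambda r: r["name"] == "Ada")
print(ada)  # [{'id': '1', 'name': 'Ada'}]
```

Because both sources answer through the same `rows()` interface, adding a new backend (a REST feed, a warehouse table) only requires another adapter, not changes to consumers.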


Applications, benefits and drawbacks

The defining feature of data virtualization is that the data used remains in its original locations while real-time access is established to allow analytics across multiple sources. This helps resolve technical difficulties such as compatibility problems when combining data from various platforms, lowers the risk of error caused by faulty data, and guarantees that the newest data is used. Furthermore, avoiding the creation of a new database containing personal information can make it easier to comply with privacy regulations. As a result, data virtualization creates new possibilities for data use.

However, because there is no local copy of the data, the connection to all necessary data sources must be operational, which is one of the main drawbacks of the approach. Connection problems occur more often in complex systems where one or more crucial sources will occasionally be unavailable. Smart data buffering, such as keeping the data from the most recent few requests in the virtualization system's buffer, can help to mitigate this issue.

Moreover, because data virtualization solutions may use large numbers of network connections to read the original data and serve virtualized tables to other solutions over the network, system security requires more consideration than it does with traditional data lakes. In a conventional data lake system, data can be imported into the lake by following specific procedures in a single environment. When using a virtualization system, the environment must separately establish secure connections with each data source, which is typically located in a different environment from the virtualization system itself.

Security of personal data and compliance with regulations can be a major issue when introducing new services or attempting to combine various data sources. When data is delivered for analysis, data virtualization can help to resolve privacy-related problems. Virtualization makes it possible to combine personal data from different sources without physically copying it to another location, while also restricting the view of all other collected variables. However, virtualization does not eliminate the requirement to confirm the security and privacy of analysis results before making them more widely available. Regardless of the chosen data integration method, all results based on personal-level data should be protected with the appropriate privacy requirements.
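The smart data buffering mentioned above can be sketched as a small cache in front of the live connection. This is a hedged illustration, assuming a hypothetical `fetch` callable standing in for a real source connection; it keeps the results of the most recent few requests so a short outage does not immediately break every query:

```python
from collections import OrderedDict

class RequestBuffer:
    """Keeps results of the most recent few requests so a brief
    source outage can be bridged with slightly stale data."""
    def __init__(self, fetch, capacity=3):
        self._fetch = fetch        # callable hitting the live source
        self._cache = OrderedDict()
        self._capacity = capacity

    def query(self, key):
        try:
            result = self._fetch(key)          # try the live source first
        except ConnectionError:
            if key in self._cache:             # fall back to buffered copy
                return self._cache[key]
            raise                              # never seen and unreachable
        self._cache[key] = result
        self._cache.move_to_end(key)
        if len(self._cache) > self._capacity:  # evict the oldest request
            self._cache.popitem(last=False)
        return result

# Hypothetical source that can be switched off to simulate an outage.
available = True
def fetch(key):
    if not available:
        raise ConnectionError("source unreachable")
    return f"rows for {key}"

buf = RequestBuffer(fetch)
print(buf.query("sales"))   # live read while the source is up
available = False
print(buf.query("sales"))   # served from the buffer during the outage
```

A query for data that was never buffered still fails during an outage, which is why buffering mitigates, but does not remove, the dependence on source availability.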


Data virtualization and data warehousing

Some enterprise landscapes are filled with disparate data sources, including multiple data warehouses, data marts, and/or data lakes, even though a data warehouse, if implemented correctly, should be unique and a single source of truth. Data virtualization can efficiently bridge data across data warehouses, data marts, and data lakes without having to create a whole new integrated physical data platform. Existing data infrastructure can continue performing its core functions while the data virtualization layer leverages the data from those sources. This aspect of data virtualization makes it complementary to existing data sources and increases the availability and usage of enterprise data.

Data virtualization may also be considered an alternative to ETL and data warehousing, but for performance reasons it is not generally recommended for a very large data warehouse. Data virtualization is inherently aimed at producing quick and timely insights from multiple sources without having to embark on a major data project with extensive ETL and data storage. However, data virtualization may be extended and adapted to serve data warehousing requirements as well. This requires an understanding of the data storage and history requirements, along with planning and design to incorporate the right type of data virtualization, integration, and storage strategies, and infrastructure/performance optimizations (e.g., streaming, in-memory, hybrid storage).
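As a rough sketch of how a layer can bridge stores in place rather than building a merged platform, the following uses SQLite's ATTACH to join two separate database files (standing in for a warehouse and a mart; the table names and data are hypothetical) through a single query interface:

```python
import sqlite3
import os
import tempfile

# Create two independent databases standing in for existing stores.
tmp = tempfile.mkdtemp()
wh_path = os.path.join(tmp, "warehouse.db")
mart_path = os.path.join(tmp, "mart.db")

wh = sqlite3.connect(wh_path)
wh.execute("CREATE TABLE sales (product_id INTEGER, amount REAL)")
wh.execute("INSERT INTO sales VALUES (1, 100.0), (2, 50.0)")
wh.commit()
wh.close()

mart = sqlite3.connect(mart_path)
mart.execute("CREATE TABLE products (id INTEGER, name TEXT)")
mart.execute("INSERT INTO products VALUES (1, 'widget'), (2, 'gadget')")
mart.commit()
mart.close()

# The "virtual" layer: one connection attaches both stores and joins
# across them where they live; no combined physical copy is built.
virtual = sqlite3.connect(wh_path)
virtual.execute("ATTACH DATABASE ? AS mart", (mart_path,))
rows = virtual.execute(
    "SELECT p.name, s.amount FROM sales s "
    "JOIN mart.products p ON p.id = s.product_id "
    "ORDER BY s.amount DESC"
).fetchall()
print(rows)  # [('widget', 100.0), ('gadget', 50.0)]
```

Commercial virtualization products generalize this idea across heterogeneous engines and network boundaries, but the principle is the same: the query spans the sources while each source keeps its own data.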


Examples

* The Phone House—the trading name for the European operations of UK-based mobile phone retail chain Carphone Warehouse—implemented Denodo's data virtualization technology between its Spanish subsidiary's transactional systems and the Web-based systems of mobile operators.
* Novartis implemented TIBCO's data virtualization tool to enable its researchers to quickly combine data from both internal and external sources into a searchable virtual data store.
* The storage-agnostic Primary Data (defunct, reincarnated as Hammerspace) was a data virtualization platform that enabled applications, servers, and clients to transparently access data while it was migrated between direct-attached, network-attached, private and public cloud storage.
* Linked Data can use a single hyperlink-based Data Source Name (DSN) to provide a connection to a virtual database layer that is internally connected to a variety of back-end data sources using ODBC, JDBC, OLE DB, ADO.NET, SOA-style services, and/or REST patterns.
* Database virtualization may use a single ODBC-based DSN to provide a connection to a similar virtual database layer.
* Alluxio, an open-source virtual distributed file system (VDFS), started at the University of California, Berkeley's AMPLab. The system abstracts data from various file systems and object stores.


Functionality

Data virtualization software provides some or all of the following capabilities:

* Abstraction – abstract the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.
* Virtualized data access – connect to different data sources and make them accessible from a common logical data access point.
* Transformation – transform, improve the quality of, reformat, and aggregate source data for consumer use.
* Data federation – combine result sets from across multiple source systems.
* Data delivery – publish result sets as views and/or data services executed by client applications or users when requested.

Data virtualization software may include functions for development, operation, and/or management. A metadata engine collects, stores and analyzes information about data and metadata (data about data) in use within a domain.

Benefits include:

* Reduced risk of data errors
* Reduced system workload, since data is not moved around
* Increased speed of access to data on a real-time basis
* Query processing can be pushed down to the data source instead of running in the middle tier
* Most systems enable self-service creation of virtual databases by end users with access to source systems
* Increased governance and reduced risk through the use of policies
* Reduced data storage requirements

Drawbacks include:

* May impact operational system response time, particularly if under-scaled to cope with unanticipated user queries or not tuned early on."IT pros reveal benefits, drawbacks of data virtualization software", Mark Brunelli, SearchDataManagement, 11 October 2012
* Does not impose a heterogeneous data model, meaning the user has to interpret the data, unless combined with data federation and business understanding of the data."The Pros and Cons of Data Virtualization", Loraine Lawson, BusinessEdge, 7 October 2011
* Requires a defined governance approach to avoid budgeting issues with the shared services
* Not suitable for recording historic snapshots of data; a data warehouse is better for this
* Change management "is a huge overhead, as any changes need to be accepted by all applications and users sharing the same virtualization kit"
* Designers should always keep performance considerations in mind

Usage to avoid:

* Accessing operational data systems (performance and operational integrity issues)
* Federating or centralizing all data of the organization (security and hacking issues)
* Building a very large virtual data warehouse (performance issues)
* As an ETL process (governance and performance issues)
* When there are only one or two data sources to virtualize


History

Enterprise information integration (EII), a term first coined by Metamatrix (whose technology is now known as Red Hat JBoss Data Virtualization), and federated database systems are terms used by some vendors to describe a core element of data virtualization: the capability to create relational JOINs in a federated VIEW.


See also

* Data integration
* Enterprise information integration (EII)
* Master data management
* Data federation
* Disparate system


References

{{reflist}}


Further reading

* Data Virtualization: Going Beyond Traditional Data Integration to Achieve Business Agility, Judith R. Davis and Robert Eve
* Data Virtualization for Business Intelligence Systems: Revolutionizing Data Integration for Data Warehouses, Rick van der Lans
* Data Integration Blueprint and Modeling: Techniques for a Scalable and Sustainable Architecture, Anthony Giordano