So far, I have assumed that it is practical or permissible to extract data from source systems and to load it into a centralised data warehouse. What, however, if this is not the case? Even if it is, it can take many months to incorporate a new data source into a data warehouse.
An additional tool in the data-driven organisation’s armoury is data virtualization, which can address these and a number of other potential problems.
Data virtualization is a middleware layer that sits between sources of data and consumers of that data. Its primary objective is to provide governed access, through a single interface, to the data needed, irrespective of where or how it is stored.
Data Virtualization – An Overview
Let’s take a look at the main functional blocks of data virtualization first, then consider its applications:
Connecting to and interacting with data sources – every data source has its own means of communicating with the outside world, and the main problem with standards is that they are usually only standard to a point. Even SQL, beyond a core set of agreed syntax, varies significantly between different database systems. Add to that an array of specialist databases, Hadoop systems, file access and application APIs, and the task of creating and maintaining connections with them, and handling variances in their SQL or other API syntax, can be daunting. This is, however, one of the tasks taken care of by a data virtualization solution, and a good one, such as TIBCO Data Virtualization, allows mapping of new data sources.
Connecting to and interacting with data consumers – by data consumers I mean any program that needs access to the underlying data sources, such as business intelligence or analytics tools, business applications and other systems, or even desktop tools such as Excel. A data virtualization solution will present common interfaces to these consumers, such as ODBC, JDBC, REST, SOAP and others.
Providing Semantic Layers and Metadata Support – generally speaking, the way data is described in a database is not the way business users represent it. Semantic layers enable data to be presented to users in business, rather than technical, terms. TIBCO Data Virtualization can support multiple semantic layers, enabling multi-domain or multi-lingual labelling of data fields.
Similarly, metadata can provide useful context to data, making it more easily governed and consumed.
Managing Security – this is a very important function of data virtualization. If all of your data resides in a single, centralised data warehouse it is relatively easy to apply robust and consistent security mechanisms. In an environment where this is not the case, however, synchronising security policies across all data sources can be a mammoth task. By removing direct access to the source systems and directing query requests via the data virtualization layer, a robust set of security rules can be applied and managed centrally, irrespective of the system providing the data or consuming it.
Query Federation – imagine that you are a global organisation, with significant in-country data warehouses around the globe. It may not be practical, or permissible, to move all of that data to a central consolidated data warehouse and keep it synchronised. One answer to this is to move only highly aggregated data to a central location. What, however, if you need to analyse detailed information, or an aggregation that has not previously been allowed for? A data virtualization solution can take a query submitted by a consumer tool, e.g. a dashboard, identify where the data to service that query resides (possibly several remote systems), create multiple queries to obtain calculated results from the various source systems involved and combine those results into a single consolidated response for the consumer. The source systems do not even have to be of the same type: a single consumer query could be serviced by, for example, several databases, SAP and a Hadoop data lake. The data virtualization system can also cache data in a local data store to reduce repetitive queries or to speed up access to particularly slow data sources.
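To make the federation idea a little more concrete, here is a minimal Python sketch. It is not how any particular product works: two in-memory SQLite databases stand in for regional warehouses, and the table name, region names, figures and the function federated_sales_by_product are all invented for illustration. The point is simply that each source is asked for a calculated result (a subtotal per product) and only those subtotals are merged centrally.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory SQLite database standing in for one regional warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


# Hypothetical regional warehouses; names and figures are made up.
SOURCES = {
    "uk": make_source([("widget", 100.0), ("gadget", 50.0)]),
    "us": make_source([("widget", 200.0), ("gadget", 75.0), ("gizmo", 10.0)]),
}


def federated_sales_by_product(sources):
    """Push a partial aggregation down to each source, then merge the results."""
    totals = {}
    for conn in sources.values():
        # Each source returns only aggregated rows, so detail data stays where it is.
        query = "SELECT product, SUM(amount) FROM sales GROUP BY product"
        for product, subtotal in conn.execute(query):
            totals[product] = totals.get(product, 0.0) + subtotal
    return totals


result = federated_sales_by_product(SOURCES)
print(result)
```

A real solution would also have to translate the query into each source's own dialect or API, and might cache the merged result. Neither is shown here.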
Typical Applications of Data Virtualization
I don’t see the discussion as “should I use an enterprise data warehouse or data virtualization?” as the two are generally complementary. Typical applications of data virtualization include:
Distributed “virtual” data warehousing – as already mentioned, a common application for data virtualization is to federate queries across source systems distributed geographically.
Heterogeneous system access – to permit querying across diverse technologies, such as databases, Hadoop, application APIs and files.
Data source prototyping – as mentioned in the opening paragraph, it can take months to incorporate a new data source into a data warehouse. Data sources with unproven value are unlikely to make it onto the IT to-do list, and the useful lifetime of the source may be shorter than the time to incorporate it. Using data virtualization, new data sources can be integrated very quickly and then either discarded at the end of their useful life or, once their value is proven, put through a business case for incorporation into the corporate data warehouse.
System abstraction to support architecture change – as has already been mentioned, data virtualization presents an interface to the consumer application that is independent of the source systems. Implementing data virtualization, therefore, permits these consumer systems to be insulated from (even significant) changes to source systems, such as might be experienced during a technical transformation.
Providing governed data services – as I mentioned in my last blog, the requirements to share data securely are growing and data virtualization can enable this in a heterogeneous environment. Using data virtualization, data can be securely shared through mechanisms such as REST APIs without having to make modifications to the source systems.
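The semantic layer, central security and system abstraction ideas above can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not a description of any product: the VirtualView class, the two pretend CRM sources, the column names and the "analyst" role are all hypothetical. The consumer only ever sees business labels, a restricted field is filtered in one central place, and the source behind the view can be swapped without the consumer noticing.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class VirtualView:
    """A consumer-facing view: business labels out, source details hidden."""
    fetch_rows: Callable[[], list]   # hypothetical callable returning dict rows from a source
    semantic_map: dict               # business label -> source column name
    restricted: set = field(default_factory=set)  # labels visible only to the 'analyst' role

    def query(self, role):
        results = []
        for row in self.fetch_rows():
            record = {}
            for label, column in self.semantic_map.items():
                if label in self.restricted and role != "analyst":
                    continue  # rule enforced in the virtual layer, not in the source
                record[label] = row[column]
            results.append(record)
        return results


def legacy_crm():
    # Pretend legacy system with terse technical column names.
    return [{"CUST_NM": "Acme Ltd", "CR_LIM": 5000}]


def new_crm():
    # Pretend replacement system with different column names.
    return [{"customer_name": "Acme Ltd", "credit_limit": 5000}]


before = VirtualView(legacy_crm, {"Customer": "CUST_NM", "Credit limit": "CR_LIM"}, {"Credit limit"})
after = VirtualView(new_crm, {"Customer": "customer_name", "Credit limit": "credit_limit"}, {"Credit limit"})

print(before.query("analyst"))
print(before.query("guest"))
print(after.query("analyst"))
```

Only the mapping held in the virtual layer changes when the legacy system is replaced; the labels and the security rule seen by the consumer stay the same.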
Data virtualization is not for everyone. However, if any of the scenarios outlined in this blog apply to your company, it is a powerful component of a modern data architecture, bringing unique solutions to seemingly intractable problems.
I feel compelled to apologise to my fellow countrymen and women for the use of the American spelling of virtualization. The truth is that all literature refers to it in this manner and so, King Canute-like, I have decided not to try to hold back the waves here.