Data catalog & data virtualization - why do you need both?
Data-driven business is a hot topic. But how can the existing data be made to serve business and what kinds of technologies are actually needed for it? And what should you require from the solutions on the market right now?
Everything starts with a need for data. A user may know exactly what data they need, or they may have only a rough idea, in which case they want to browse the existing data in a directory. The mere existence of data is not enough: one must also be able to access it. Data catalogs describe data held in various locations, making it easier to find. Data virtualization, on the other hand, offers real-time access to scattered data. The two complement each other well.
Data catalogs – how do they actually work?
Data catalogs contain automatic functions that collect descriptive metadata from various sources. In modern tools, this collection is made more efficient by artificial intelligence, which automatically classifies the data it finds. Automation can readily identify names, addresses, credit card numbers, country codes and similar items in data sets, and the tools can be taught to recognize other kinds of information as well. Even images embedded in the data can be analyzed to identify people and objects, which are then recorded as metadata that makes the information easier to find.
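The classification step can be illustrated with a small sketch. It is a rule-based stand-in, not the AI-driven classification that commercial catalogs use: the `PATTERNS` table, the `classify_column` and `build_metadata` functions and the 80 % threshold are all assumptions made for illustration.

```python
import re

# Hypothetical rules: a value must match the whole pattern to count as a hit.
PATTERNS = {
    # 16 digits in groups of four, with a consistent separator (space, hyphen or none).
    "credit_card": re.compile(r"\d{4}([ -]?)\d{4}\1\d{4}\1\d{4}"),
    # Two uppercase letters, e.g. an ISO-style country code.
    "country_code": re.compile(r"[A-Z]{2}"),
}


def classify_column(values, threshold=0.8):
    """Return the first label matching at least `threshold` of the non-empty values, else None."""
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in sample if pattern.fullmatch(v))
        if hits / len(sample) >= threshold:
            return label
    return None


def build_metadata(table):
    """Map each column name to its detected label; `table` is {column_name: [values]}."""
    return {name: classify_column(values) for name, values in table.items()}
```

Calling `build_metadata` on a table given as a dictionary of column names to value lists returns one label per column, which a catalog could store as searchable metadata alongside user comments.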
In modern data catalogs, users can also produce more metadata themselves by, for example, adding comments to the data in the catalog. The keywords for data catalogs are data discoverability and the metadata describing data content.
Data virtualization – making data available in a controlled way
Data virtualization makes it possible to access the data you need without first transferring it from the original source to a centralized data warehouse. The transfers and transformations defined in the data integration are performed only when the data is actually needed. The more data an organization has, the more difficult it is to gather it in one place for utilization. Fairly quickly one reaches the point where copying all data to the data warehouse for potential secondary use is no longer cost-effective or sensible. A better option is to manage metadata about where the data can be found and how it can be accessed when needed.
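The idea of deferring transfer and transformation until query time can be sketched as follows. This is a minimal illustration, not how any particular virtualization product works: `VirtualView`, `fetch_crm` and the in-memory rows are hypothetical stand-ins for a real source system.

```python
class VirtualView:
    """A named view that stores how to get the data, not the data itself."""

    def __init__(self, name, fetch, transform=lambda row: row):
        self.name = name
        self._fetch = fetch          # callable returning an iterable of rows
        self._transform = transform  # applied to each row at query time

    def query(self, predicate=lambda row: True):
        # Nothing is read from the source until this generator is consumed.
        for row in self._fetch():
            shaped = self._transform(row)
            if predicate(shaped):
                yield shaped


reads = []  # records every access to the (simulated) source system


def fetch_crm():
    """Hypothetical source: stands in for a remote database or API call."""
    reads.append("crm")
    return [
        {"id": 1, "name": "Ada", "country": "fi"},
        {"id": 2, "name": "Bo", "country": "se"},
    ]


# Defining the view copies nothing; it only records the fetch and transform steps.
customers = VirtualView(
    "customers",
    fetch_crm,
    transform=lambda r: {**r, "country": r["country"].upper()},
)
```

The source is touched only when a consumer iterates over `customers.query(...)`, so data that nobody asks for is never copied.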
With data virtualization, organizations can better manage who has access to data sets and monitor the use of data more efficiently. When access control is handled in one place, access rules and usage monitoring can be made as strict as needed. Controlled accessibility of data is the key idea in virtualization.
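A single point of access control and usage monitoring might look roughly like this. The `AccessGateway` class, its grant table and its audit record layout are assumptions for illustration only; real products offer far richer policy models.

```python
from datetime import datetime, timezone


class AccessGateway:
    """Hypothetical single entry point: every read is checked and logged here."""

    def __init__(self, grants):
        self._grants = grants  # {user: set of dataset names the user may read}
        self.audit_log = []    # one record per attempted read, allowed or not

    def read(self, user, dataset, loader):
        allowed = dataset in self._grants.get(user, set())
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "dataset": dataset,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} may not read {dataset}")
        # The data is loaded only after the access check has passed.
        return loader()
```

Because every request passes through `read`, denied attempts are logged as well as successful ones, which is what makes strict monitoring possible.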
Figure: A summary of typical functions related to data catalogs and data virtualization.
Why do you need both?
In order to work, virtualization requires a list of data sources that contains the connection parameters for each source, i.e. the information about how to connect to it. The data catalog also needs to connect to the data sources, even though it retrieves only metadata from them. Data traceability, i.e. where data is located and how it was created and transferred to its current location, is important for both technologies. Both also efficiently support self-service use of data, which is popular nowadays.
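The shared dependency on connection parameters can be shown with a small sketch in which a catalog function and a virtualization function use the same source registry. SQLite from the Python standard library stands in for a real source system; the `SOURCES` registry, the function names and the demo table are hypothetical.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

# Shared registry: one set of connection parameters per source (hypothetical layout).
SOURCES = {}


def register_source(name, path):
    SOURCES[name] = {"driver": "sqlite", "path": path}


def connect(name):
    return sqlite3.connect(SOURCES[name]["path"])


def harvest_metadata(name):
    """Catalog side: read table and column names only, never the rows."""
    with closing(connect(name)) as conn:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        # PRAGMA table_info returns (cid, name, type, ...); index 1 is the column name.
        return {t: [col[1] for col in conn.execute(f'PRAGMA table_info("{t}")')]
                for t in tables}


def fetch_rows(name, table):
    """Virtualization side: read the rows on demand via the same parameters."""
    with closing(connect(name)) as conn:
        return conn.execute(f'SELECT * FROM "{table}"').fetchall()


def make_demo_source():
    """Create a throwaway SQLite file and register it as the source 'crm'."""
    path = os.path.join(tempfile.mkdtemp(), "crm.db")
    with closing(sqlite3.connect(path)) as conn:
        conn.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
        conn.execute("INSERT INTO customers VALUES (1, 'FI')")
        conn.commit()
    register_source("crm", path)
```

After `make_demo_source()`, `harvest_metadata("crm")` describes the source without reading any rows, while `fetch_rows("crm", "customers")` retrieves the rows on demand. Both rely on the single registry entry, which is one reason the two technologies integrate so naturally.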
Some technology suppliers on the market specialize in either data catalogs or data virtualization. In these cases, the complementary functionality is commonly provided through partnerships. However, the data catalog and data virtualization are so closely linked that a well-integrated solution from a single supplier is often better than separate solutions.
The IBM Cloud Pak for Data product package contains the leading products on the market for both areas as a ready-to-use integrated solution. I strongly agree with Forrester’s* opinion that the solution offers undeniable benefits and savings in customers’ data architecture.
Is your business data-driven?
Mika Naatula is the CTO of Enfo’s Data and Analytics business area
*A Forrester New Technology: The Projected Total Economic Impact™ Of IBM Cloud Pak For Data, December 2020.