The Data Catalog - A Critical Component to Big Data Success

Critical Risk in Big Data Projects

As Data Professionals, it is our responsibility to alert our organizations to a major risk to the success of new advanced analytics projects.  We have the expertise to understand the capabilities that are missing from the vast majority of strategic Data projects in our organizations.  “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient” is one of the strategic planning assumptions by Gartner Group in their research published last month.

All of the analyst organizations, including Gartner Group and Forrester, have predicted that the lack of built-in metadata repositories in Big Data technology (such as Hadoop) will be one of the biggest risks to the success of Big Data projects.  However, many Big Data solution architects do not realize that these capabilities are not built in to their software distributions, and few have the expertise to understand which functional capabilities they are missing: data access security (usually by role and asset), audit trails of data update and access (operational metadata), and an inventory of data assets (technical and business metadata).

So here we are talking about metadata again.  And it wasn’t a particularly interesting topic last time.  But every data industry analyst, expert, and thought leader is saying that this has suddenly become a critical concern in the Big Data space.

Critical to Enterprise Data Management: an Enterprise Data Inventory

The first thing to do when managing assets is to inventory them.  Can you imagine trying to manage the PCs owned by your company without having an inventory?  We do the same thing with computer applications: projects such as Year 2000 and Risk Management started with an inventory of the applications in the organization, although not all organizations have maintained their application inventory after the strategic projects that required it were completed.

Inventories of Data in the organization tend to be separated by technology: each relational database management system, or even each database instance, maintains a separate technical metadata repository; email is usually coordinated centrally; most organizations have an enterprise document management system.  The rise of Big Data technology is making Data Management even harder: distributed data technology (e.g., cloud) means data is no longer in a single location (the data center); volumes of Data make manual management impossible; and, most significantly, many Big Data technologies (e.g., Hadoop) do not have built-in Governance solutions for business, technical, and operational metadata, or for data security.


An Overview of the Data Catalog

The critical business process that needs improvement in all organizations is getting Data to the Data Consumers in the organization, whether they are Data Scientists, Business Analysts, Report Writers, or any individual in the organization who needs data.  Processes need to be in place for someone in the organization to request access to data, have the access approved by the correct responsible party, and be given access to the data.  These processes need to be largely automated and take minutes or (at most) a few hours.  This capability is sometimes called “Data as a Service”, and is part of the Service Management and Process Improvement initiatives.  The involvement of the IT organization should be minimal; they have much more significant work to do than to be a bottleneck in the access to Data.  This optimized process rests on an initial assumption: there must be a Data Catalog available to the Data Consumer to tell them what Data is available in the organization.  There should also be a business process in place for a computer system to get access to the Data or functions of another computer system, sometimes called Data Access Agreements.  This may be a more complicated or risky process, but it still needs to take, at most, a week or so to complete.  The point is that the lack of swift access to data should not be the measure by which the enterprise finds its IT organization failing it.
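The request, approve, and provision steps described above can be sketched as a small state machine.  This is only an illustration under stated assumptions: the names AccessDesk, AccessRequest, and Status are hypothetical, the approver mapping stands in for whatever the Data Catalog records as the responsible party, and a real system would call out to the underlying technology to grant access.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    PROVISIONED = "provisioned"


@dataclass
class AccessRequest:
    requester: str
    data_store: str
    status: Status = Status.PENDING


@dataclass
class AccessDesk:
    # Hypothetical mapping: data store name -> responsible approver.
    approvers: dict
    # (requester, data_store) pairs that have been granted access.
    grants: set = field(default_factory=set)

    def request(self, requester, data_store):
        # Only data stores listed in the catalog can be requested.
        if data_store not in self.approvers:
            raise KeyError(f"{data_store!r} is not in the catalog")
        return AccessRequest(requester, data_store)

    def decide(self, req, approver, approve):
        # Only the responsible party for this data store may decide.
        if approver != self.approvers[req.data_store]:
            raise PermissionError(f"{approver!r} cannot approve {req.data_store!r}")
        req.status = Status.APPROVED if approve else Status.REJECTED
        return req

    def provision(self, req):
        # Provisioning is a separate step and requires prior approval.
        if req.status is not Status.APPROVED:
            raise ValueError("only approved requests can be provisioned")
        self.grants.add((req.requester, req.data_store))
        req.status = Status.PROVISIONED
        return req
```

Keeping approval and provisioning as separate steps mirrors the point that provisioning may have to remain a distributed function even when approval is handled centrally.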

The scope for a Data Catalog may be a single application focused on use by Data Consumers, such as a Data Warehouse and/or a Data Lake, but there are very strong benefits when the Data Catalog spans the Enterprise.  The step where technical access to data is provided to the Data Consumer (“provisioning”) may still need to be a distributed function when the Data Catalog spans multiple technologies.

There are a few important dimensions to the Data Catalog.  First of all, the primary audience of the Data Catalog is individuals in the organization, so what they should be presented with is a “menu” with a brief description of the business meaning of the data available.  I believe this information should be provided at the data store (table, file, database, schema, server, or directory) level.  If business meaning is available at the attribute level, that is a benefit, but few organizations feel that level of detail is worthwhile.  Since the business meaning of the data in a data store probably needs to be written by a person, we want to establish a reasonable minimum requirement that is achievable.  Links need to be created from the Business Metadata (business meaning) to the physical Data Inventory (technical metadata).

The central component of the Data Catalog is the Inventory, the technical description of the structure and physical location of the data store (data asset).  The Inventory should be generated and maintained in an automated fashion because otherwise it is probably doomed to be incorrect.  Once again, my experience is that people want to request access to data at the data store level (table, file, database, schema, server, or directory), not by the individual field.  Technical metadata can usually be captured automatically at the individual field level, and this is very useful to the Data Consumer, although this is more challenging for some Big Data technologies (such as Hadoop).  Inventory information that provides the physical location of the data may be available at different granularity: the entire set of data of a specific source or structure versus the individual instances of the data (records, messages, particular timestamps and dates, particular data values).
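As a rough illustration of automated capture, the sketch below builds Inventory entries from a relational database's own metadata.  It uses SQLite from the Python standard library purely as a stand-in for any relational source; InventoryEntry and harvest_sqlite are hypothetical names, and the business_meaning field is left empty because that description has to come from a person.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class InventoryEntry:
    location: str                # where the data store physically lives
    store: str                   # data store name (here, a table)
    columns: list                # (column name, declared type) pairs
    business_meaning: Optional[str] = None  # written later by a person


def harvest_sqlite(conn, location):
    """Build one inventory entry per table from SQLite's own metadata."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    entries = []
    for (table,) in tables:
        # PRAGMA statements do not accept bound parameters; the table name
        # comes from sqlite_master itself, not from user input.
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        entries.append(
            InventoryEntry(location, table, [(c[1], c[2]) for c in cols])
        )
    return entries
```

The entry is keyed at the data store level, with field-level detail carried along as an attribute, which matches the granularity at which people tend to request access.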

If the Data Catalog spans multiple technologies, then there is an additional technical challenge: establishing an operational process that integrates the technical metadata from the distributed technical metadata repositories into the central Data Catalog.  In other words, at a regular frequency (daily or weekly), technical metadata (inventory information) needs to be collected automatically from the various metadata repositories and stored in the central Data Catalog.  Information (metadata) on any new data stores found needs to be presented to an appropriate person (Data Steward) to create a link to the business reference in the Data Catalog.  The data stores represented in the Data Catalog may be relational databases, NoSQL databases, Hadoop clusters, document repositories, file directories, etc.
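The periodic integration step can be sketched as a merge that refreshes technical metadata and returns the newly discovered data stores for a Data Steward to review.  This is a minimal sketch: sync_catalog is a hypothetical function, and the plain dictionaries keyed by (source, data store) are an assumption standing in for whatever storage the central Data Catalog actually uses.

```python
def sync_catalog(catalog, harvested):
    """Merge harvested technical metadata into the central catalog.

    Both arguments map a (source, data_store) key to metadata.
    Returns the keys of newly found data stores, which still need a
    Data Steward to link them to a business reference.
    """
    needs_steward = []
    for key, technical in harvested.items():
        entry = catalog.get(key)
        if entry is None:
            # New data store: record it with no business link yet.
            catalog[key] = {"technical": technical, "business_ref": None}
            needs_steward.append(key)
        else:
            # Known data store: refresh technical metadata only and
            # leave the steward's business reference untouched.
            entry["technical"] = technical
    return sorted(needs_steward)
```

Run daily or weekly against each repository's harvest, the returned list becomes the steward's work queue.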

There is a side requirement that all data in the Data Catalog needs to be classified by required security and privacy levels, etc.  The Risk Management (Regulatory and Compliance) part of every organization has been struggling with this requirement for a very long time and very few have been able to implement a complete solution.  This is important to IT because their involvement may be required to develop a process that separates data of different security classifications or masks private data.  Masking may be needed on a one-time or ongoing basis, for different users with different access authority, or for use during application development and testing.
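Masking by access authority might look something like the sketch below.  The three levels, the mask_record function, and the choice to treat unclassified fields as the most restrictive level are all assumptions made for illustration, not a classification scheme taken from any standard.

```python
from enum import IntEnum


class Level(IntEnum):
    # Hypothetical classification levels, ordered least to most restrictive.
    PUBLIC = 0
    INTERNAL = 1
    PRIVATE = 2


def mask_record(record, classification, clearance):
    """Return a copy of record with fields above the clearance masked."""
    masked = {}
    for name, value in record.items():
        # Assumption: a field with no classification is treated as PRIVATE.
        level = classification.get(name, Level.PRIVATE)
        masked[name] = value if clearance >= level else "***"
    return masked
```

Defaulting unclassified fields to the most restrictive level means that a gap in the classification work hides data rather than exposing it.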


New Data to an Organization

To me, the number one problem with Data Warehouses is the amount of time required to incorporate a new data source or type.  I’ve seen it typically take six months to a year to add new data to a Data Warehouse, and sometimes it never happens.  The resolution of this problem is not to create a Data Lake, but to work on improving the process associated with the Data Warehouse.

However, before the request to add new data to the Data Warehouse or to a Data Lake, someone needs to determine that this data is needed by the organization for some business purpose.  Let’s call this person the “Data Scientist”.  The job of the Data Scientist is to create analytical models and insight for the organization using large amounts of historical data.  During the process of creating these models the Data Scientist will want to include data that is not currently available to data consumers in the organization, to determine if this additional data is helpful in creating business insight.  This additional data may or may not be useful, but the Data Scientist will want to try lots of different data in their models to make this assessment.  Data Scientists in the organization need a sandbox (possibly a segregated area of the Data Lake) for their data, advance approval to access most or all of the data in the organization, and approval to integrate data from external data sources into their sandbox environment.  IT should not be delaying this important work with some misguided idea that they need to pre-analyze or document the data in question, unless requested by the Data Scientist.

Once the Data Scientist, along with their business sponsors, has determined that a new data source needs to be added to the Data Warehouse, Data Lake, or other operational system, the optimized process of adding the new data to the application can begin, which will include adding the data to the Data Catalog and Inventory.
