The Data Catalog - A Critical Component to Big Data Success

Critical Risk in Big Data Projects

As Data Professionals, it is our responsibility to alert our organizations to a major risk to the success of new advanced analytics projects. We have the expertise to understand the capabilities that are missing from the vast majority of strategic Data projects in our organizations. “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient” is one of the strategic planning assumptions published by Gartner Group in research released last month.

All of the analyst organizations, including Gartner Group and Forrester, have predicted that the lack of built-in metadata repositories in Big Data technology (such as Hadoop) will be one of the biggest risks to the success of Big Data projects. However, not only do many Big Data solution architects fail to realize that these capabilities are not built into their software distributions, few have the expertise to understand which functional capabilities they are missing: data access security (usually by role and asset), audit trails of data update and access (operational metadata), and an inventory of data assets (technical and business metadata).

So here we are talking about metadata again.  And it wasn’t a particularly interesting topic last time.  But every data industry analyst, expert, and thought leader is saying that this has suddenly become a critical concern in the Big Data space.

Critical to Enterprise Data Management: an Enterprise Data Inventory

The first thing to do when managing assets is inventory them.  Can you imagine trying to manage the PCs owned by your company without having an inventory?  We do the same thing with computer applications: projects such as Year 2000 and Risk Management started with an inventory of the applications in the organization, although not all organizations have maintained their application inventory following the strategic projects that required them.

Inventories of Data in the organization tend to be separated by technology: each relational database management system, or even each database instance, maintains its own technical metadata repository; email is usually coordinated centrally; most organizations have an enterprise document management system. The rise of Big Data technology is making Data Management even harder: distributed data technology (e.g. cloud) means data is no longer in a single location (the data center); the volumes of Data make manual management impossible; and, most significantly, many Big Data technologies (e.g. Hadoop) do not have built-in Governance solutions for business, technical, and operational metadata, or for data security.


An Overview of the Data Catalog

The critical business process that needs improvement in all organizations is getting Data to the Data Consumers in the organization, whether they are Data Scientists, Business Analysts, Report Writers, or any individual in the organization who needs data. Processes need to be in place for someone in the organization to request access to data, have the access approved by the correct responsible party, and then be provisioned with that access. These processes need to be highly automated and take minutes or (at most) a few hours. This capability is sometimes called “Data as a Service”, and is part of the Service Management and Process Improvement initiatives. The involvement of the IT organization should be minimal; they have much more significant work to do than to be a bottleneck in the access to Data.

There is an initial assumption in this optimized process: there must be a Data Catalog available to the Data Consumer to tell them what Data is available in the organization. There should also be a business process in place for a computer system to get access to the Data or functions of another computer system, sometimes governed by “Data Access Agreements”. This may be a more complicated or risky process, but it still needs to take, at most, a week or so to complete. The point is that the lack of swift access to data should not be the measure by which the enterprise finds its IT organization failing it.
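To make this concrete, the request–approve–provision flow described above can be sketched as a minimal state machine. This is only an illustrative sketch: the names (AccessRequest, RequestState) and the idea of recording provisioning as a state change are my own assumptions, not any particular product.

```python
from dataclasses import dataclass
from enum import Enum

class RequestState(Enum):
    SUBMITTED = "submitted"
    APPROVED = "approved"
    PROVISIONED = "provisioned"

@dataclass
class AccessRequest:
    requester: str
    data_store: str   # the catalog entry the Data Consumer found
    approver: str     # the responsible party who must approve access
    state: RequestState = RequestState.SUBMITTED

    def approve(self) -> None:
        if self.state is not RequestState.SUBMITTED:
            raise ValueError("only submitted requests can be approved")
        self.state = RequestState.APPROVED

    def provision(self) -> None:
        # A real system would call the platform's grant mechanism here
        # (a SQL GRANT, an IAM policy update, etc.); this only records state.
        if self.state is not RequestState.APPROVED:
            raise ValueError("access must be approved before provisioning")
        self.state = RequestState.PROVISIONED
```

The value of even this trivial structure is that every request carries its approver with it, so the approval step can be routed automatically rather than through IT.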

The scope of a Data Catalog may be a single application focused on use by Data Consumers, such as a Data Warehouse and/or a Data Lake, but there are very strong benefits when the Data Catalog spans the Enterprise. The step where technical access to data is provided to the Data Consumer (“provisioning”) may still need to be a distributed function when the Data Catalog spans multiple technologies.

There are a few important dimensions to the Data Catalog. First of all, the primary audience of the Data Catalog is individuals in the organization, so what they should be presented with is a “menu” with a brief description of the business meaning of the data available. I believe the level of this information should be at the data store (table, file, database, schema, server, or directory) level. If business meaning is available at the attribute level, that is a benefit, but few organizations feel that level of detail is worthwhile. Since the business meaning of the data in a data store probably needs to be written by a person, we want to establish a reasonable minimal requirement which is achievable. Links then need to be created from the Business Metadata (business meaning) to the physical Data Inventory (technical metadata).
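As a sketch of what that link might look like, here is a hypothetical pair of record structures: the business-facing “menu” entry on top, tied to the technical inventory record underneath. All of the names and fields here are illustrative assumptions, not a reference model.

```python
from dataclasses import dataclass

@dataclass
class TechnicalMetadata:
    technology: str      # e.g. "Oracle", "HDFS", "MongoDB"
    location: str        # server/database/schema, or a directory path
    fields: list[str]    # attribute names, where they can be harvested

@dataclass
class CatalogEntry:
    name: str                      # the "menu" name shown to Data Consumers
    business_meaning: str          # brief description, written by a person
    steward: str                   # who maintains the business metadata
    inventory: TechnicalMetadata   # link to the physical Data Inventory

entry = CatalogEntry(
    name="Customer Orders",
    business_meaning="All retail orders placed through the web channel",
    steward="jane.doe",
    inventory=TechnicalMetadata("Oracle", "sales.orders", ["id", "total"]),
)
```

Note that only `business_meaning` and `steward` require human effort; everything in `TechnicalMetadata` can, in principle, be captured automatically.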

The central component of the Data Catalog is the Inventory: the technical description of the structure and physical location of each data store (data asset). The Inventory should be generated and maintained in an automated fashion, because otherwise it is probably doomed to be incorrect. Once again, my experience is that people want to request access to data at the data store level (table, file, database, schema, server, or directory), not by individual field. Technical metadata can usually be captured automatically at the individual field level, and this is very useful to the Data Consumer, although it is more challenging for some Big Data technologies (such as Hadoop). Inventory information that provides the physical location of the data may be available at different granularities: the entire set of data of a specific source or structure, versus individual instances of the data (records, messages, particular timestamps and dates, particular data values).
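As an illustration of automated capture, here is a minimal sketch that harvests table and column (field-level) metadata from a relational database. It uses SQLite purely because it is self-contained; a real harvester would query each platform's own system catalog (e.g. information_schema), and Hadoop-based stores would need different tooling.

```python
import sqlite3

def harvest_inventory(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Automatically capture technical metadata: tables and their fields."""
    inventory: dict[str, list[str]] = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        inventory[table] = [c[1] for c in cols]
    return inventory

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
print(harvest_inventory(conn))  # {'customers': ['id', 'name', 'email']}
```

The same loop run on a schedule is what keeps the Inventory from drifting out of date, which is exactly what dooms manually maintained inventories.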

If the Data Catalog spans multiple technologies, then there is an additional technical challenge: establishing an operational integration process that moves technical metadata from the distributed metadata repositories to the central Data Catalog. In other words, on a regular frequency (daily or weekly), technical metadata (inventory information) needs to be collected automatically from the various metadata repositories and stored in the central Data Catalog. Information (metadata) on any new data stores found needs to be presented to an appropriate person (a Data Steward) to create a link to the business reference in the Data Catalog. The data stores represented in the Data Catalog may be relational databases, NoSQL databases, Hadoop clusters, document repositories, file directories, etc.
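A sketch of that regular collection run, with plain dictionaries standing in for the central Data Catalog and the harvested metadata. The structures are illustrative assumptions; the key point is that any newly discovered data store is queued for a Data Steward rather than silently added without business meaning.

```python
def sync_catalog(central: dict[str, dict], harvested: dict[str, dict]) -> list[str]:
    """Merge one collection run of technical metadata into the central catalog.

    `harvested` maps data store names to their captured technical metadata.
    Returns the stores that still need a Data Steward to attach business meaning.
    """
    needs_steward_review = []
    for store, meta in harvested.items():
        if store not in central:
            # New data store discovered: record it, but flag it for review.
            central[store] = {"technical": meta, "business_meaning": None}
            needs_steward_review.append(store)
        else:
            # Known store: refresh the technical metadata on every run.
            central[store]["technical"] = meta
    return needs_steward_review
```

In practice each source (a relational catalog, a Hadoop cluster, a document repository) would feed this merge through its own harvester, on a daily or weekly schedule.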

There is a side requirement that all data in the Data Catalog needs to be classified by required security and privacy levels. The Risk Management (Regulatory and Compliance) part of every organization has been struggling with this requirement for a very long time, and very few have been able to implement a complete solution. This is important to IT because their involvement may be required to develop a process that separates data of different security classifications, or that masks private data, either on a one-time or ongoing basis, for users with different access authority or for use during application development and testing.
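As a small illustration of classification-driven masking, the sketch below hashes fields classified as private so they remain consistent (and therefore joinable) without revealing their values, while other fields pass through unchanged. The classification table and field names are hypothetical.

```python
import hashlib

# Hypothetical classification, as it might be recorded in the Data Catalog.
CLASSIFICATION = {"email": "private", "order_total": "internal"}

def mask_value(field: str, value: str) -> str:
    """Mask private fields for users without the required access authority,
    or when copying data into development and test environments."""
    if CLASSIFICATION.get(field) == "private":
        # A one-way hash: the same input always masks to the same token,
        # so masked data can still be joined across tables.
        return hashlib.sha256(value.encode()).hexdigest()[:12]
    return value
```

The same function could be applied one-time (when building a test copy) or ongoing (in a view presented to less-privileged users).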


New Data to an Organization

To me, the number one problem with Data Warehouses is the amount of time required to incorporate a new data source or type. I’ve seen six months to a year, or never, as the typical time to add new data to a Data Warehouse. The resolution of this problem is not to create a Data Lake, but to improve the process associated with the Data Warehouse.

However, before requesting that new data be added to the Data Warehouse or to a Data Lake, someone needs to determine that this data is needed by the organization for some business purpose. Let’s call this person the “Data Scientist”. The job of the Data Scientist is to create analytical models and insight for the organization using large amounts of historical data. During the process of creating these models, the Data Scientist will want to include data that is not currently available to data consumers in the organization, to determine if this additional data is helpful in creating business insight. This additional data may or may not be useful, but the Data Scientist will want to try lots of different data in their models to make this assessment. Data Scientists in the organization need a sandbox (possibly a segregated area of the Data Lake) for their data, advance approval to access most or all of the data in the organization, and approval to integrate data from external data sources into their sandbox environment. IT should not be delaying this important work with some misguided idea that they need to pre-analyze or document the data in question unless requested by the Data Scientist.

Once the Data Scientist, along with their business sponsors, has determined that a new data source needs to be added to the Data Warehouse, Data Lake, or other operational system, the optimized process of adding the new data to the application can begin, which will include adding the data to the Data Catalog and Inventory.


When to use NoSQL Databases? – Systems of Experience

By April Reeve

The basic question that arises when first learning about NoSQL databases is: “when would I use these?” and there really is a very simple answer: “probably in developing web applications and internet sites”.

Now, obviously, the full answer is much more complex and is determined by a broad range of factors including what development tools are currently available in your organization, the skills of your support and development resources, and the specific needs and functionality of the applications being developed. Also, the appropriate use of NoSQL databases isn’t limited to web applications.  However, many of these databases were created because of limitations in relational databases in supporting the emerging area (at that time) of web applications development. 

One way of viewing the differentiation between appropriate uses of NoSQL versus relational databases is that relational databases are most appropriate for “systems of record”: applications that store the definitive view of a piece of information for an organization, such as a financial balance or the master information about a customer, where the information may be updated by multiple sources and the focus is on accuracy. Relational databases are perfect for transaction processing systems, with sophisticated record-locking capabilities that ensure the integrity of data that might be updated by multiple users, but those features carry overhead that is unnecessary for applications focused on reading data.
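The all-or-nothing integrity that makes relational databases suit systems of record can be shown with a small transaction sketch (using SQLite only because it is self-contained): an update that would drive a balance negative is rolled back, leaving the definitive value intact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = 1").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
except ValueError:
    pass

(balance,) = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
print(balance)  # 100: the rollback preserved the definitive balance
```

It is exactly this locking and rollback machinery that read-heavy “systems of experience” pay for without needing.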

NoSQL databases may be better at supporting “systems of experience” or the presentation of information quickly and easily, focused on the best experience for the user. NoSQL databases are better for analyzing large volumes of distributed data and managing very large volumes of on-line users while presenting data extremely quickly.

In addition to web applications, NoSQL databases are frequently used for advanced analytics applications, which are focused on accessing large amounts of potentially physically distributed data. Most organizations are building modern analytics environments using NoSQL solutions such as Hadoop, key-value stores, and document databases. Specialized databases such as graph databases are especially good at analyzing data relationships, such as how close the connections are between people, which is similar to the logistical problem of how to deliver supplies most efficiently.
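The “how close are the connections between people” question is, at its core, a shortest-path problem. A graph database would answer it with a path query; the sketch below shows the same idea with a plain breadth-first search over a toy friend graph (the data is invented).

```python
from collections import deque

def degrees_of_separation(graph: dict[str, list[str]],
                          start: str, goal: str) -> int:
    """Breadth-first search: the fewest hops between two people, or -1."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        if person == goal:
            return dist
        for friend in graph.get(person, []):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return -1  # not connected

people = {"Ann": ["Bob"], "Bob": ["Ann", "Cam"], "Cam": ["Bob"]}
print(degrees_of_separation(people, "Ann", "Cam"))  # 2
```

Graph databases exist because expressing and optimizing this kind of traversal over billions of relationships is painful in a purely relational model.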

So, in summary, NoSQL databases would probably be a good choice for “systems of experience”: web applications and other systems that read very large volumes of data or manage very large numbers of on-line users.
