Data Catalogs and the Generative AI Wave

By neub9

The “Tide” of LLMs Has Reached Our Shores

As a career data professional who has spent most of the last 10 years with Collibra, Alation, and now as an independent, I've come to a somewhat painful, yet exciting realization: large language models (LLMs) will completely disrupt and displace data cataloging as we know it! That may seem like a wild assertion, but our corner of the world (metadata, governance, stewardship, data quality, and the rest) will not be "different" and remain untouched. In fact, it will be radically changed.

Revisiting the Data Catalog Value Proposition

The central purpose of a data catalog has been to act as an authoritative system of reference and knowledge base for data and data-related assets. The core value proposition has been to increase the productivity of a broad set of actors by sharing contextual knowledge (metadata) about assets, how assets are related, and how the assets are permitted to be used. These actors include business leadership, analysts, data engineers, data scientists, and risk professionals. To achieve this core value proposition, catalogs must be continuously populated, curated, and, most importantly, adopted.

How Should We Grade the Success of Data Catalogs?

We have seen several generations of catalogs over the past 20 years. First, there was a generation of very technical, metadata-oriented platforms, then a generation of data governance-centric platforms. More recently, we have seen analyst- and data-consumer-oriented catalog platforms. From a big-picture perspective, great strides have been made to fulfill the vision of a catalog serving as a core knowledge base that supports leaders trying to embed the use of data and analytics into their culture. But technological evolution has not equaled overwhelming success or the emergence of a formulaic approach. Achieving broad adoption across an enterprise is still quite rare, and the means of achieving it is a black art of human psychology, world-class communication, a heavy training orientation, and a healthy dose of senior leadership inspiring and driving a clear vision.

In part, it's because enterprises are still moving up the data and analytics maturity curve. That's fine and normal, but it's also because catalogs, as good as they are, still don't make it easy enough to find, understand, explore, trust, and govern data. My simple litmus test for "easy" is being able to open a file, spreadsheet, or report and be offered insight related to the origin of its data, business definitions of terms and metrics, an assessment of its trustworthiness, instructions for how it can and cannot be shared, etc. Catalogs get a D+ based on this test because they require people, who are trying to do their day jobs, to switch context into the catalog to search or navigate to the data assets of interest. For most people, it's still far easier to pop a quick question in a chat to a colleague and let the human "network" produce an answer.

I will add that data catalogs do grade out much higher, probably B-, for more technical and data-oriented roles such as data engineers, data analysts, and data scientists. That’s because they don’t see catalogs as friction, but as an accelerant for gaining deeper knowledge.

Catalog Front-Ends and Back-Ends

I am going to take a moment to decompose the elementary capabilities of modern catalogs. This is important to understanding how generative AI will eat their lunch. The modern catalog platform is an interesting beast with what I think of as a multiple personality disorder. Its first personality is that of a collector and 'stitcher' of assets. It continuously connects to and collects metadata from a vast array of data assets that exist in data stores, reporting systems, and applications across cloud, on-premise, and hybrid environments. In addition, it sports metadata augmentation capabilities such as classification, lineage identification, sensitive asset detection, and data quality measurement. All of this is maintained in the catalog's internal data structure.

Its second personality is that of an application that provides "views" of the assets in ways that drive productivity for different roles, as described above. This commonly includes capabilities such as searching, tagging, chatting, reviewing, approving, etc. Basically, things that are close to what we expect from consumer-grade social media apps, but for data. The third personality, which is less well developed, is that of an active participant in the enforcement of governance policies. This involves maintaining policy definitions and then acting as a system of record for granting and restricting access to data across a broad scope of systems. This is less well developed because, frankly, it's hard to unify technologies owned by vendors who all want to be the center of the universe.

Metadata as Language

All the metadata a catalog collects and stores, including data characteristics, profiles, classifications, usage, popularity, and its relationship to all kinds of data-related assets like reports, metrics, terms, and policies, can be easily expressed as language. That might seem like a very strange assertion, so I offer a simple scenario: Imagine we ingested Tableau reports into a catalog and they include a report called "Payment Forecast Report" for finance. Pretend it was tagged by the supply chain steward as also being important to its domain. Also assume that we have ingested tables from the data lake (the source of the report data), NetSuite (the originating source), and Azure Data Factory (the pipeline that moves the data). Finally, let's assume that some of the catalog's intelligent augmentation capabilities are being used: all the assets have been classified, scanned for potentially sensitive content, and assigned governance sharing and usage policies. After being notified, the finance business steward associates key metric descriptions for supplier payment thresholds and financial terms to the report and tags it as authoritative. All of this is now tucked into the catalog data store, waiting for someone to consume it using the catalog's traditional user interface's search and navigation capabilities.

Now, consider how the same metadata might be expressed as language: "Our finance department's accounts payable analysts and supply chain organization use a Tableau report called 'Payment Forecast Report' to understand accounts payable requirements for cash on hand to pay suppliers. The report is created using both straight-line and moving average statistical methods. The source of the data is the payment history fact table, time, and supplier tables in the accounts payable schema in the lakehouse. Those tables are populated from the NetSuite application using Azure Data Factory pipelines. The report is used monthly and has been certified to comply with our data quality standards."

As you can clearly see, all the necessary context is present to create this narrative version. But why would we want to do that? Why would we want to express everything we know about our data assets as narrative? The obvious answer is to increase accessibility and reduce the friction I described above. And the way that happens is by unleashing that knowledge across the enterprise through a large language model.
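To make the idea concrete, here is a minimal sketch of how structured catalog metadata could be flattened into narrative text. The field names and the `metadata_to_narrative` function are purely illustrative, not any vendor's actual schema or API; a real implementation would walk the catalog's relationship graph rather than a flat dictionary.

```python
# Illustrative sketch: render a hypothetical catalog metadata record as
# narrative text suitable for consumption by an LLM. All field names are
# assumptions for the example, not a real catalog schema.

def metadata_to_narrative(asset: dict) -> str:
    """Flatten one structured metadata record into a narrative paragraph."""
    sources = ", ".join(asset["source_tables"])
    consumers = " and ".join(asset["consumers"])
    certified = "certified" if asset["certified"] else "not certified"
    return (
        f"{consumers} use a {asset['tool']} report called "
        f"'{asset['name']}' to {asset['purpose']}. "
        f"The source of the data is {sources}, populated from "
        f"{asset['origin_system']} via {asset['pipeline']}. "
        f"The report is used {asset['refresh']} and is {certified} "
        f"against data quality standards."
    )

report = {
    "name": "Payment Forecast Report",
    "tool": "Tableau",
    "consumers": [
        "Finance accounts payable analysts",
        "the supply chain organization",
    ],
    "purpose": "understand cash requirements for paying suppliers",
    "source_tables": ["payment_history_fact", "time_dim", "supplier_dim"],
    "origin_system": "NetSuite",
    "pipeline": "Azure Data Factory pipelines",
    "refresh": "monthly",
    "certified": True,
}

print(metadata_to_narrative(report))
```

The interesting design question is not the string formatting but which relationships (lineage, stewardship, policies) get folded into each narrative snippet, since that determines what a downstream LLM can answer.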

Intelligent Data Assistants

Microsoft announced Copilot for its virtual agents in November. OpenAI launched its private GPT models in November, and Google announced its AI Studio for its Gemini LLM and Bard chat in early December. All of these are lowering the barrier for the creation of specialized, chat-driven assistants and agents. They are also promising to maintain a secure and private barrier between an enterprise's intellectual property and what is consumed by their public-facing LLMs. The clear opportunity is for the metadata being collected in the catalog to be converted to narrative text and then consumed by the private LLMs that sit behind an intelligent data assistant chat interface. The experience for users will be revolutionary. Imagine having a chat with a digital expert about a report's origin, meaning, provenance, trustworthiness, and data quality.
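A rough sketch of the retrieval half of such an assistant, assuming the catalog's metadata has already been rendered as narrative snippets as described above. A real system would use embeddings and a private LLM behind the vendor's API; here a simple word-overlap score stands in for semantic search, and the snippet texts are invented for the example.

```python
# Minimal retrieval sketch for an intelligent data assistant.
# Word-overlap ranking substitutes for real embedding-based search;
# the retrieved snippets would be prepended to the user's question
# in the prompt sent to a private LLM.

def tokenize(text: str) -> set:
    """Lowercase the text and strip simple punctuation from each word."""
    return {w.strip(".,'?\"").lower() for w in text.split()}

def retrieve(question: str, snippets: list, k: int = 1) -> list:
    """Return the k narrative snippets sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(snippets, key=lambda s: len(q & tokenize(s)), reverse=True)
    return ranked[:k]

snippets = [
    "The 'Payment Forecast Report' in Tableau is sourced from the accounts "
    "payable schema and populated from NetSuite via Azure Data Factory.",
    "The 'Churn Dashboard' is owned by the marketing analytics team and is "
    "refreshed weekly from the CRM.",
]

context = retrieve("Where does the Payment Forecast Report get its data?", snippets)
```

The point of the sketch is the shape of the pipeline (narrative metadata in, relevant context out, LLM last), not the ranking function, which any production system would replace with vector search.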

Extend that thought to the popular topics of glossaries and terms. With digital assistance, users can simply ask what terms mean, ask for suggested terms, and discuss term conflicts and overlaps. That may sound imaginary, like rocket science, but it's already starting to happen. For instance, a user can already drop a report PDF or spreadsheet into a chat and ask for help analyzing it. If the…
