Practical Data Dictionary Design

A data dictionary fully describes a dataset; it functions as a reference point to understand where a given data set came from, what it describes, how it got there, as well as nuances and potential pitfalls. In this post you will learn practical data dictionary design, how its maintenance is crucial, and how an enterprise-wide usage and understanding of one contributes to a healthy, data-driven business culture.

What Is a Data Dictionary?

When data is collected and stored, properly defining each field’s meaning and purpose is the job of a comprehensive data dictionary. It’s a tool that simultaneously requires a low-level amount of work while yielding a high-level amount of value. Yet, of all things data related, it is often the entity that receives the least amount of attention, sometimes left as a last step in a data program or not implemented at all.

A data dictionary labels and describes how an organization collects, transforms, and stores its data; without one, data may seem vague or mysterious as end users observe data without comprehending it. Each data field has an important role to play; clearly defining that role is one of the most pragmatic steps an organization can adopt into their data program.

When implementing a data dictionary, it’s ideal to house it as close to the data set as possible, providing added ease of accessibility. In other words, while running a query, ends users shouldn’t also be required to go on a hunting expedition for definitions. Keeping the data dictionary within the data warehouse or near the raw data is an optimal arrangement, with the perfect data dictionary being one that furnishes a high-level description of every single field in your data as well as the ability to learn more as needed.

Why is Data Definition Essential?

As mentioned above, and in our overview of data modeling, database tables without definitions are often counterintuitive (at best). Clarity is key here, whether it’s about marketing, web traffic, an email campaign, etc., the goal is exposing clean, raw data. It’s imperative to any successful data model that the definitions for the terms used are clear, concise, and uniform, and that any ambiguity when labelling and defining terms has been removed.

A data dictionary addresses key questions with respect to the dimensions of the data, and understanding its construction often starts with how a metric or piece of data fits into a business story. Initially, this starts with naming standards for the data elements — this must be consistent, and irrespective of which team uses the data and to what end. From there, clear definitions of purpose for each field must follow. The ultimate ideal, when people interact with the data, is that a pop up appears when hovering over the field, or embedding a direct link to the data itself, steering the end user to everything they may want to know about that field.

A hypothetical data quality issue provides an apt example: you’re tracking usage of a key feature of a mobile application, and for three days the instrumentation on that feature failed, resulting in zero usage data. Now that three-day gap will forever be within your data whenever anyone exams a timeframe that includes that gap. Without proper documentation keeping everyone informed of an anomaly or outlier in the data, your numbers could look strange. And while some documentation will describe data origin more so than a strict definition of what constitutes a data metric or field, it’s important for everyone within the organization to understand a fundamental principle about data collection: sometimes instrumentation breaks or falls out of calibration. On top of that, the metric calculations can (and do) change as business logic changes.

When Is It Time for a Data Dictionary?

Start at the time of modeling the data. When building a data model, define what each field means, how it’s calculated, and determine both who is responsible for that field as well as who oversees changes affecting that field. The minimal amount of information about your data is important at the time you model the data (think something you can fit into a few sentences). A straightforward Minimum Viable Data Dictionary is presented below.

One key component should remain front and center when setting up — and maintaining — a data dictionary: Questions are good. Not asking questions is dangerous, in that it facilitates making assumptions. This has the potential to not only misread analytics, but also paves the way for adopting an assuming posture into enterprise culture itself. For example, while one field might describe the volume of traffic that each campaign brings to a website, a second, equally important field describes the number of conversions for each campaign. Questioning why or if an organization relies too heavily on the former might expose the power both actually have when applied in concert.

Data Dictionary Roles and Responsibilities

Define Ownership

When developing an organization-wide data dictionary, integrate common data elements across the entire institution to ensure consistency, as consistency reinforces the objective: quality data interpretation. Should the data dictionary be owned by the BI team? They are at least the driving force behind it, and should monitor, and implement, all changes. Whoever is deemed responsible, it is essential that access is provided to that individual or team, as this shows — and promotes — data leadership.

So how is the data dictionary best created, both organizationally and practically?

How to Create a Data Dictionary WITH a Data Team in Place

It’s usually up to a data team to provide the infrastructure for a data dictionary. Something as common as a wiki page, or Google or Excel spreadsheet, with a compiled list of terms that can then be defined. With the infrastructure started, work with the different teams throughout the organization that populate the data dictionary, and achieve consensus on whether they should actually own the data and the data definitions (this may be desirable, as it is they who understand their aspect of the business, and best understand what the data is measuring). For example, a product usage metric should be defined by the product manager; likewise, with a marketing campaign dimension, such as a UTM parameter, the marketing team should be defining what that field means and how it’s used. Any inherent nuances to specific metrics will be best discerned and addressed by teams most familiar with those metrics. Seek out their assistance in locating any missing definitions or refining definitions, remembering that uniformity is key — focusing on how a metric or field should be defined, and making sure everyone across all teams understands and, if applicable, signs off on what that definition is. Structured this way, the entire company can leverage a larger framework, which is a good place to start.

How to Create a Data Dictionary WITHOUT a Data Team

And what to do in the absence of a data team? It’s a logical question, and in a smaller organization this responsibility may fall on one person’s shoulders. A good three-pronged method for initiating the adoption of a data dictionary from the ground up might be:

Build a prototype: This could be as simple as a Google spreadsheet that lists all the fields for the most important reporting tables, including the Minimum Viable Data Dictionary elements listed below.
Iterate: The goal is community contribution to this documentation. Open up a chat channel (in Slack or otherwise), soliciting feedback to keep the conversation going.
Evangelize!: Pull the community together around this new standard and request their input. Remind critical stakeholders of the importance of this standard with stories, as stories are highly effective. Pro tip: describe the pain of vague data definitions and the resultant problems they have caused in the past.

Maintenance

With metrics in place, clearly defined, and ready to track performance, the data dictionary is off and running. But who is responsible for upkeep, and how is this maintenance managed? Organizations of all sizes will have multiple people or teams touching all this data, so distributing responsibility as much as possible throughout the organization helps keep everyone engaged. This fosters self-reliant participation, and a “self-service” way of taking action: if the description of a field has a contact, that individual can be alerted with questions, a need to update, etc. This is responsive, efficient, and pragmatic.

Further iteration will reveal the need for expanding, editing, and even changing data field values, particularly as business needs themselves change, and here pragmatism is again front and center. Any changes or new definitions should be generated by the affiliate business team (sales, marketing, etc., and consensus achieved whenever the same metric is exposed to a plurality of teams), yet executed with help from the data team, both in implementation and communication. Keep the organization appraised of the upcoming change; if applicable, provide examples of both the old and new definitions for the metric, describing what possible changes in data to expect, and be sure to document and log the change in the data dictionary itself.

Data Dictionary Framework (Implementation of Best Practices)

1) Create the actual data dictionary documentation

Start by providing a template per field. These are all the fields that must be populated, working with the teams to build this into their daily process. A Minimum Viable Data Dictionary addresses key questions with respect to the dimensions of the data:

What is the field’s definition? What does the metric measure or the dimension describe?
How is the data collected? What instrumentation is used?
Are the values calculated or raw? What is the calculation or bucketing logic?
Who owns (oversees) data collection and quality? (An individual or team?)
What is the contact information for the data owner(s)?

2) Explicitly determine data owners and ways of accessing them

Contact is essential. Clearly delineate who owns the data, whether that means an individual or a team, and have the method of access baked into the data program. Knowing when to ask a question about the data dictionary is instructive; knowing where to ask a question is conducive to future iterations of the data model continuing to bear fruit.

Panoply integrates all your data sources and connects to any BI visualization tool
View Integrations

3) Deploy a community forum or Slack channel

A healthy compliment to any data dictionary is ability for the community to have a discussion around the data. Aside from the nonstop, automated data processing happening 24/7, people are also running queries at all hours of the night, on weekends, etc. Being able to communicate through a company forum brings the goal of data democratization that much closer, connecting everyone with questions, observations, and answers about the data.

It’s highly probable that utilizing Slack is already second nature to most. Its familiarity and immediacy dovetails well with someone seeking out where to go with a question, and it’s ideal for the company “champions” out there, eager to share their knowledge, and quickly. Much of what comes out of a community chat will be applied to the data dictionary, and it may be helpful to consider your data dictionary the “well-informed electorate” in the pursuit of data democratization.

4) Crowdsourced commentary and definition updates

As mentioned above, updates to the data dictionary will likely germinate within the business team that best understands the value of a particular data field, with the data team then implementing changes in how that metric is collected and quantified. However, in the quest for data truth, there may be added value obtained by simply crowdsourcing within the organization, as there is someone, somewhere within the company with additional insight that will actively engage if given the tools to do so.

Companies should view data as a type of currency. It’s as valuable as it is indispensable, and everyone across an enterprise can and should contribute to that value as part of their daily routine. Encourage participation with — and exposure of — the “tribal knowledge” within the company. This attitude lends itself to crowdsourcing information about your data. While it may be viewed as more high-risk than top-down governance (and therefore more democratic), it is certainly more high-benefit than the low-risk, low-yield approach of discouraging or partitioning community commentary.

Data Dictionary Template

In conclusion, essential components of any comprehensive data dictionary include:

1) Descriptions:

Clearly define each field, and who owns it (the BI team, a specific analyst, etc.), meaning with whom you speak to gain further clarity. Ideally, hovering over the field provides not only a definition but also a contact link as well. Be sure to publish the data dictionary as a document easily accessible by anyone within the organization.

2) Lineage:

Describe the data lifecycle, from where it originates to the steps it goes through before arriving at the data warehouse. Keep in mind there is more metadata about the data than there is data itself. End users should easily comprehend where data values come from.

3) Exceptions:

Anything that stands out as unusual must be recognized and recorded. Nuances or aberrations in the movement, characteristics, or quality of the data should be logged in as such, with a record kept within the data dictionary.

4) Changes:

Over time, fields will change. Some will be added while others dropped, and certain fields will see new definitions ascribed to them. Also, data corruption can occur with aging storage devices, sudden drops or spikes in power, etc., affecting data quality, and must be notated. Logging each change in the data over time helps ensure data consistency.

5) Past examples and sample queries:

Describe how fields have been populated in the past, with example queries revealing why a data dictionary should be a prioritized component of general business processes.

Start defining data!

The most important part of designing a data dictionary is getting started.

This article was also published on Panoply’s blog.