Designing the Data Model: Where to Start?

Building a Modern Data Platform in Azure from scratch can be very fun, and it’s rewarding to see the resources and code fit together to form an effective reporting solution for a client. While I’d love to start every project by deploying Azure Resources and fitting the pieces together right away, it’s important to understand the data model for the business first. A poorly defined data model can negate the benefits of an elegantly designed solution leading to poor refresh times, data swamp (hard to find or overwhelming amount of data), and altogether useless reporting for end users. So in reality, there must be a lot of up front work that happens in collaboration with the business to define the model before beginning the actual deployment. Even with the rise of the Lakehouse architecture, Star Schemas are still leveraged to ensure performance and easily derive business insights. Here are a few guidelines to help you start designing the dimensional data model for a business using a Star Schema.

  1. Hold Collaborative Discovery Sessions

    These Discovery Workshops are often called “Modelstorming” sessions and are essential to a successful design. For a solution dedicated to a single business unit with a handful of tables, you might be able to get the work done in about three two-hour sessions. You may need more meetings depending on the size of solution and the amount of business processes it touches.

    The goal of these sessions is to understand the current state of any existing databases and reports, determine the business processes that are tracked in these reports, and define the metrics that are captured by these business events. These metrics will eventually be translated to Facts in the data model.

    It is important to involve business users in the exercise, not just IT. These users have in depth knowledge about the functions they support and the insights they need from the data, so their buy-in is crucial. It would also be helpful to include any Data Governance personnel to get their buy-in as well. The sessions are meant to be highly interactive, and while there is no substitute for in-person collaboration, today’s virtual meetings can be just as productive.

  2. Trust the (Business) Process

    There are a couple patterns you could follow to guide your discovery and design sessions. Each of these patterns starts with defining a specific Business Process. These are business events such as taking an order, processing a claim, seeing a patient, etc. If you are struggling to define a process, start with the question: who does what?

    The Agile Data Warehouse Design framework defines events using the BEAM* (Business Event Analysis & Modeling - Star) method. This method employs useful diagrams to tell a “story” of a business process using the 7 W’s (who, what, when, where, why, how, & how many). BEAM* stories provide a familiar way for business users to describe events which can be easily translated into Facts & Dimensions. In the example below, a patient makes a telehealth visit at the Main Street Clinic. This story is tracked in a BEAM* table:

You can see we’ve already specified some measurements and descriptors for this business process just by filling in the table. Even some of the required database fields have been defined. As we repeat this exercise to include more business processes and handle nuances for individual events, the data model expands intuitively.

The more traditional Kimball Method uses a 4-Step method for designing the data model, but both methods have many overlapping qualities. There are also common questions that should be asked to appropriately define tables regardless of the method you choose to use.

  1. Select Business Process

  2. Declare Grain

    E.g. do you want to see expenses at the line-item level, or the account level? Start with the lowest possible (atomic) grain. Ensure all tables for a single process adhere to this grain.

  3. Identify Dimensions

    Describe the process using the 7 W’s to give context to the events and measurements. What are common attributes that the users want to filter or report on? Can the values change over time? What is the relationship to the Fact records? Can a single event be associated with multiple attributes?

  4. Identify Facts

    These are the numeric measurements or metrics. What is the process measuring? Do users need to see a snapshot of data at a point in time?

In addition to the business needs, it’s important to consider the availability of the data while designing the dimensional model. While the business may want to monitor the performance of a particular measurement, it’s of no use to anyone if the necessary source data is not available. For that reason, the target and source of each field should be tracked.

3. Track Your Findings

Besides BEAM* Tables & Stories (should you choose to use them), there are other useful ways of documenting the discovery results. The most familiar are the Star Schema and Bus Matrix. These are useful assets that can be referenced throughout the deployment of the Data Platform as well as future report design or ETL enhancement.

  • Star Schema

    This diagram represents the entirety of your data model in a visual format. The name refers to the way the Dimensions branch off the Fact table like the points of a star. Star schemas are useful for large and small datasets alike, as they minimize the number of join operations and thereby improve performance.

  • Bus Matrix

This is another way of representing the Star Schema in tabular form. The Bus Matrix can give an overview of the relationships in a data model at a higher table-level grain, rather than going down to the individual field.

  • Source-to-Target Mapping

    This document is a must-have for implementing the ETL. It contains all of the available source data and how it will eventually be used to create the target Facts & Dimensions in the data model. This documentation becomes more important in the common scenario in which multiple sources are used to create single target entities.

4. Listen

The final piece of advice is simply to listen to the customer. Try to tune into comments that the stakeholders might make about existing reports or nice-to-haves. Do people consistently bring up KPIs or metrics that are important to the organization? Is there a “gold-standard” report or dashboard that they can’t do without? Take note of pain points, such as filters they wish they could have or other ways to make their analysis easier. It is somewhat basic advice, but all of this information is extremely valuable when trying to transform data into insights using a Dimensional Model.

Previous
Previous

Synapse Serverless SQL Vs. Lake Database