A modern approach to modeling your data that can reduce the complexity and cost of building your reporting data model
Data modeling has been around for several decades and has evolved to help organizations structure their data for storage and consumption. In the business intelligence space, data modeling techniques (like Kimball and Inmon) have played a pivotal role in how an organization gathers data and organizes it into data warehouses and data marts. That structure also helps enterprise reporting tools access the data and deliver insights to end users.
As these enterprise systems were built, they started largely as on-premises databases, while reporting was limited to canned, reactive views of the organization's data. All the complexity was built into the underlying data structures using database dependencies and complex ETL processes, and the reporting tool was simply a mechanism to deliver a static report with limited ad-hoc capability.
With the advent of the cloud and advancements in data storage, data processing, and reporting and dashboarding tools, organizations now need to reimagine a modern data model to take advantage of these new capabilities.
Here we'll introduce a new data modeling concept that leverages advancements in modern reporting and analytics tools to reduce the cost, complexity, and maintenance burden of the data platform.
The foundational concept driving reporting data modeling for the past 20 years is the star schema, popularized by Ralph Kimball in his 1996 book The Data Warehouse Toolkit. A deep dive on how the star schema works is available here, but the primary concept is that tables are categorized into dimensions and facts. Dimensions describe actual entities within your organization (customers, stores, employees, accounts, etc.), and facts describe events that occurred (sales transactions, G/L transactions, orders, etc.).
The star schema gets its name from how the tables in the data model are laid out in an entity relationship diagram (ERD). The fact tables are the centers of the various stars, and dimension tables form the spokes of the stars.
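To make that layout concrete, here is a minimal sketch of a star schema using pandas DataFrames as stand-ins for warehouse tables. The table and column names (dim_customer, dim_store, fact_sales) are illustrative assumptions, not taken from any particular implementation.

```python
# A minimal star schema sketch: two dimensions (the spokes) and one
# fact table (the center), joined for a typical report.
import pandas as pd

# Dimension tables describe entities.
dim_customer = pd.DataFrame({
    "customer_key": [1, 2],
    "customer_name": ["Acme Corp", "Globex"],
    "region": ["East", "West"],
})

dim_store = pd.DataFrame({
    "store_key": [10, 20],
    "store_name": ["Downtown", "Airport"],
})

# The fact table records events and references each dimension by key.
fact_sales = pd.DataFrame({
    "customer_key": [1, 2, 1],
    "store_key": [10, 10, 20],
    "sale_amount": [250.0, 99.5, 400.0],
})

# A typical report: total sales by region and store.
report = (
    fact_sales
    .merge(dim_customer, on="customer_key")
    .merge(dim_store, on="store_key")
    .groupby(["region", "store_name"])["sale_amount"]
    .sum()
)
print(report)
```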
However, given the state of storage and reporting technologies when the star schema concept was born, significant focus was placed on enforcing dependencies and error checking within the ETL code responsible for loading the data model. Concepts like surrogate keys and load ordering were necessary to ensure referential integrity of the data, and reporting tools expected data to arrive with those pieces intact. The result is ETL that is complex and monolithic, where an error tends to halt the entire load process. That is why the ETL process has historically been the largest cost component of reporting implementations, both in initial implementation and in ongoing support.
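The sketch below illustrates, in simplified form, why that traditional approach becomes brittle: dimensions must load first so surrogate keys exist, facts must look every key up, and a single miss stops the batch. The table names, the in-memory key store, and the helper functions are hypothetical.

```python
# A simplified sketch of traditional surrogate-key ETL.
import itertools

surrogate_counter = itertools.count(1)
customer_key_map = {}  # natural business key -> surrogate key

def load_dim_customer(source_rows):
    """Dimensions must load first so surrogate keys exist for fact lookups."""
    dim_rows = []
    for row in source_rows:
        natural_key = row["customer_id"]
        if natural_key not in customer_key_map:
            customer_key_map[natural_key] = next(surrogate_counter)
        dim_rows.append({"customer_key": customer_key_map[natural_key], **row})
    return dim_rows

def load_fact_sales(source_rows):
    """Facts look up every surrogate key; a single miss halts the whole batch."""
    fact_rows = []
    for row in source_rows:
        try:
            customer_key = customer_key_map[row["customer_id"]]
        except KeyError:
            # Referential integrity failure: the entire load stops here.
            raise RuntimeError(f"No surrogate key for customer {row['customer_id']}")
        fact_rows.append({"customer_key": customer_key, "amount": row["amount"]})
    return fact_rows

# Load order matters: dimensions first, then facts.
load_dim_customer([{"customer_id": "C-001", "customer_name": "Acme Corp"}])
print(load_fact_sales([{"customer_id": "C-001", "amount": 250.0}]))
```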
Fortunately, with a slight modification to our approach, we can now greatly simplify our ETL processes while still retaining the spirit and the fundamentals of the star schema.
The loose dimensional model concept still leverages the idea of facts and dimensions, but it eliminates the surrogate keys used to tie everything together. Instead, all joins between dimensions and facts are done using natural business keys. When a business key consists of multiple source elements, we link those elements together to form a single composite key.
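As a rough sketch of the idea, the example below derives the same composite business key on both the dimension and the fact side and joins on it directly. The composite_key helper, the "|" delimiter, and the column names are illustrative assumptions, not a prescribed standard.

```python
# Joining on natural business keys instead of surrogate keys.
import pandas as pd

def composite_key(df, columns):
    """Link multiple source elements into a single composite business key."""
    return df[columns].astype(str).agg("|".join, axis=1)

dim_account = pd.DataFrame({
    "source_system": ["ERP", "ERP", "CRM"],
    "account_number": ["1001", "1002", "1001"],
    "account_name": ["Payroll", "Travel", "Acme Corp"],
})

fact_gl = pd.DataFrame({
    "source_system": ["ERP", "ERP"],
    "account_number": ["1001", "1002"],
    "amount": [5200.0, 315.75],
})

# The same composite key is derived on both sides; no surrogate key
# lookup or load-order dependency is involved.
dim_account["account_key"] = composite_key(dim_account, ["source_system", "account_number"])
fact_gl["account_key"] = composite_key(fact_gl, ["source_system", "account_number"])

print(fact_gl.merge(dim_account[["account_key", "account_name"]], on="account_key"))
```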
This modification to the data modeling approach greatly simplifies our data load processes: surrogate keys no longer need to be generated or looked up, dimension and fact tables can be loaded independently and in any order, and an error in one table's load no longer halts the entire process.
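A short sketch of how that load independence plays out: a fact row whose business key has no match yet (a hypothetical late-arriving product, for example) still loads, and the join in the reporting layer simply leaves it unlabeled until the dimension catches up. The table and column names here are illustrative.

```python
# Load independence: an unmatched business key becomes a gap in the
# report rather than a halted load.
import pandas as pd

fact_orders = pd.DataFrame({
    "product_code": ["A-100", "B-200", "C-300"],  # C-300 not yet in the dimension
    "quantity": [5, 2, 7],
})

dim_product = pd.DataFrame({
    "product_code": ["A-100", "B-200"],
    "product_name": ["Widget", "Gadget"],
})

# The reporting tool performs the join at query time, so the
# late-arriving C-300 row still loads and is simply unlabeled for now.
print(fact_orders.merge(dim_product, on="product_code", how="left"))
```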
With a much simpler ETL architecture, we can reduce the cost, timeline, and risk associated with the ETL portion of the platform build. Those hours can be shifted into more tangible, business-focused value areas (such as the final reports or KPIs).
The appeal of this approach is that it works with most modern toolsets in the market today. If we use a modern reporting toolset that supports data modeling (such as Tableau or Power BI), we can take advantage of the reporting tool's ability to handle the data modeling complexity internally and remove that complexity from our ETL process.
Whether you use a more traditional cloud or on-premises ETL tool with custom development (Azure Data Factory, SSIS, Informatica PowerCenter, or any of the other ETL tools on the market) or a more modern metadata-driven platform (such as West Monroe's own Intellio® DataOps), the loose dimensional modeling approach can be a great fit.
What we propose in the loose dimensional model is not a shift away from the foundational data modeling concepts that have served us well for more than 20 years. We still leverage fact and dimension tables in our final data model, and our requirements-gathering approach does not change as a result. Nor are we recommending a specialized toolset that would require additional investment in new skills.
Instead, we propose a gradual shift in how we tie those tables together on the physical data model that we expose for reporting and analytics.
Since today’s reporting toolsets can handle the complexity of entity relationships internally, there’s no longer a need to tightly control those relationships in the ETL layer.
We can still leverage the basics of dimensional modeling that have worked well for many years, keeping risk low, while a small tweak to our technical implementation saves on development cost.