This Is Not Your Dad's Data Warehouse: Optimizing Your Data Architecture
At the peak of the 1980s pop culture boom, a new idea was emerging in the field of computer science. The idea of a single storage location for business data came from IBM™ researchers Barry Devlin and Paul Murphy, who proposed the concept of the data warehouse. The purpose of this storage was to improve the ability of businesses to make data-grounded decisions. They called their idea a "Business Data Warehouse." During the 90s and 2000s, we saw the rise of data warehousing solutions by IBM (DB2), Oracle™, and Microsoft™ (SQL Server). As the amount of data being ingested continued to grow as well as the need to store disparate data, the traditional warehouse was strained. To answer those problems, we fast forward twenty years to 2010 when James Dixon, founder and former CTO of Pentaho, coined the "Data Lake."
The first data lakes were implemented in on-premise storage and eventually into custom warehouses. These were far better solutions than the original data warehouses used just several years ago. However, this still came with the same challenge of forecasting storage needs often up to a year in advance. Questions often asked were, ”If the company does well, what kind of storage will I need? What if it does extremely well? What if I’m completely wrong and the company does extremely poorly and I’m sitting on a huge deteriorating asset?” In other words, “Will I be fired if the company does too well or too poorly?” In addition, the monolith solution still did not provide the required customization to fit many institutional needs. Usher in our modern times with cloud providers coming to the rescue. Cloud providers delivered two significant improvements with both computing and storage virtualization. Firstly, through economies of scale, they could drive down the price of both computing and storage. Just as important, they could also autoscale to whatever capacity is needed to house the data. When asked which data to save, institutions would answer, "Let’s save it all. Who knows what we will need.”
“We have an enormous amount of data, but it isn’t actionable or even meaningful.”
Garbage in, data swamp out.
With cheap storage being available, many institutions save all data they generate, even when they are unsure how it relates or even if it will ever be useful. The term “data swamp” was coined to refer to the enormous amount of data being stored without regard to its organization and usefulness. An institution may ask itself, “I’m paying to store all of this ‘useful’ data, but where is the return on investment? Why are my key business metrics needed still unavailable? Why do members of my organization still question the data? What is missing?”
With the storage of such a massive amount of data, data engineers were tasked with becoming experts at ingesting, storing, and transforming the data, while also being experts in the domain the data was sourced from. This knowledge was necessary so the data could be mined, modeled, and visualized effectively. Due to the number of sources, this was an enormous task for even the most skilled data engineering professionals. The data mesh is meant to solve this dilemma.
A data mesh is a collection of “data products” sharing datasets under an umbrella of data governance. A vital aspect of the data mesh is that the engineers responsible for the quality of data being transmitted are the same engineers who generated the data. This model ensures that those who understand the data best are the ones helping others derive meaning from it. This is an excellent solution for the engineering and technical employees, but what about the non-technical users?
Now that we have meaningful data, are we done? The data engineers, scientists, and analysts are happy, (well, happier), but what about the end consumers? How can they get their questions answered without needing an intermediary? What tools are available to help them?
The next iteration of architecture is the “data semantic layer”. Through the semantic layer, large data graphs are created, giving meaning to data. The generated "triples", (a subject, a predicate, and an object for example, “Rob viewed his report”), can be used to lower the required technical knowledge needed by the end consumers, allowing individuals to develop reports and consume data using their natural language. It can be used to empower the end consumers to use data that matters most to them. Is generating this layer easy? Not quite, but it is getting easier as modern tools make it more intuitive to model datasets and provide the needed tagging. This has been expedited using the hand of data governance that was adopted during the implementation of the data mesh.
Every iteration of data architecture targets a specific segment of the organization in order to improve their relationship with data. Data Warehouses allowed decision makers the ability to make better-informed decisions from captured data. Data Lakes expanded the capability of that storage and allowed engineers to capture much more data. Data Meshes created an architecture to ensure the data was useful. Finally, we are entering the age of the Semantic Layer. The Semantic Layer will liberate the data and provide meaningful, actionable insights not just to the highly technical engineers, but to all data consumers.
Unicon's focus for 2022 is helping our clients reach a point of “optimization.” As discussed in this article, the evolution of data solutions have come a long way. Unicon partners with our clients to understand where they are in their data journey and help them optimize their solutions. Data Solutions Optimization will look different for each client. Our goal is to transition folks from just having data to actually using data to make decisions. Data is powerful but if you can’t get it, don’t trust it, or can’t see it, it’s meaningless.
We’d love to hear your thoughts on the latest data architecture trends. What do you think of the new capabilities offered by the modern data stack? Do you see useful ways to apply it to your organization? In the coming months, we will share more of our thoughts on the impact of the modern data stack, and the evolution of the modern data stack’s tooling in particular.