The Modern Data Stack Components
Over the last ten years, data workers have watched a trend of productization in the data ecosystem as SaaS solutions have flooded the market. This surprised some people, me included, as I once declared, “The war for data dominance is over, and the big players have already won.” New services have since rushed in to fill niches, streamlining every step of the traditional data pipeline and creating a “data division of labor.” What started as a few additional choices has bloomed into a vast array of options from ingestion to consumption. This is a good thing, as individual services keep interoperability at the forefront. The aggregation of these services is “The Modern Data Stack.” While the stack continues to evolve (and consolidate in some instances), new pathways have emerged to transform how we look at data. For many in the industry, this evolution in how we interact with data revives memories of the transformation of IT departments during the early years of cloud computing.
What should we expect from the modern data stack?
- Ease of use - Engineers should not need weeks of training to handle the simplest, most common use cases.
- Automation - The user should also expect a high level of automation, eliminating tedious tasks.
- Community Support - Broad industry adoption should be expected, as a lack of industry adoption is often a red flag, even for the most feature-rich services. Without a community supporting the service, the future of that service is in question.
One oddity in this trend of Lego-inspired services is that there is no consolidated billing. No central service aggregates the charges for your ingestion, storage, and observability services. Instead, each is calculated, billed, and paid for separately, often on top of your cloud provider’s service.
What makes a service a candidate for the modern data stack?
We consider three qualifications:
- The first qualification is that the service is managed. No more having to provision and configure your own servers or regularly apply updates. The expectation is that the hosting company will handle all those mundane tasks.
- The second qualification is that the service is cloud data warehouse-centric. Gone are the days when integrating your data warehouse with an external application was a labor-intensive project that often required custom development to build the connectors.
- The final qualification is that the service is operationally focused. In other words, this service performs as expected in a high-volume production environment.
In what areas do we need to consider services?
Get the data and make it worthwhile:
How do we get the data from our applications into the pipeline? Moving data into the data pipeline is often the first step into the modern data stack. Tooling such as Fivetran, Talend, or Airbyte makes ingesting data from various sources extremely easy. Data engineers are no longer required to develop a connector for each data source, nor to keep updating those connectors when a SaaS application ships a pipeline-breaking API change.
How do we know what customers are doing? Customer event tracking has become a treasure trove of behavioral data. This data is captured and often fed into another system such as HubSpot or even Salesforce. Segment is often chosen for this service, although competitors such as Snowplow and RudderStack are gaining traction.
How do we make the extracted data useful? Ingested data rarely arrives in the format needed to be helpful, so transforming the data is a vital aspect of any pipeline. Fields may need to be joined, split, renamed, exploded, or truncated, to name just a few of the possible operations. In addition, the data may need to be normalized.
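To make those operations concrete, here is a minimal sketch in plain Python. The record layout and every field name (`full_name`, `EMail`, `notes`, `tags`, and so on) are hypothetical and chosen only for illustration; in a real stack this logic would more likely live in a dedicated transformation tool or in SQL inside the warehouse.

```python
from typing import Any


def transform(record: dict[str, Any]) -> list[dict[str, Any]]:
    """Apply split, rename, normalize, truncate, join, and explode to one raw record."""
    # Split: break the single "full_name" field into two fields.
    first, _, last = record["full_name"].partition(" ")
    # Rename + normalize: "EMail" becomes "email", trimmed and lowercased.
    email = record["EMail"].strip().lower()
    # Truncate: cap free-text notes at 20 characters.
    notes = record["notes"][:20]
    # Join: combine city and country into one display field.
    location = f'{record["city"]}, {record["country"]}'
    # Explode: emit one output row per tag.
    return [
        {
            "first_name": first,
            "last_name": last,
            "email": email,
            "notes": notes,
            "location": location,
            "tag": tag,
        }
        for tag in record["tags"]
    ]


# Hypothetical raw record, as it might arrive from an ingestion tool.
raw = {
    "full_name": "Ada Lovelace",
    "EMail": "  Ada@Example.COM ",
    "notes": "Met at the analytics conference in London",
    "city": "London",
    "country": "UK",
    "tags": ["vip", "newsletter"],
}

rows = transform(raw)
for row in rows:
    print(row)
```

One raw record becomes two output rows here, one per tag, which is what "exploding" a field means in practice.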
Where do we store a large amount of data from various systems? How do we ensure that datasets from multiple sources can be joined to create insights? The central component of the data journey is the warehouse. It stores and organizes the data and serves it rapidly to analysts, data scientists, and general data consumers. Major cloud providers own a large portion of the warehousing pie with offerings such as AWS’s Redshift and Google’s BigQuery. Additional options such as Snowflake and Databricks are quickly gaining traction and finding their niche among users of particular skill sets. The trend in the warehousing space is to make data easy to model and use by letting users configure and query the warehouse with standard SQL. SQL, in its different variants, is the dominant and often preferred way to interact with these new high-powered warehouses.
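As a rough illustration of joining datasets from multiple sources with standard SQL, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a cloud warehouse. The table names (`crm_customers`, `billing_invoices`) and the data are made up; a real warehouse would hold tables loaded by your ingestion tooling, and its SQL dialect may differ in the details.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE crm_customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE billing_invoices (
        invoice_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount REAL
    );
    INSERT INTO crm_customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO billing_invoices VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0);
    """
)

# Join data that originated in two different source systems.
revenue_by_customer = conn.execute(
    """
    SELECT c.name, SUM(i.amount) AS total_billed
    FROM crm_customers AS c
    JOIN billing_invoices AS i ON i.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY total_billed DESC
    """
).fetchall()

print(revenue_by_customer)
```

The point is less the specific query than the fact that, once both sources land in one place, a single SQL statement can combine them.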
Keep the pipeline healthy:
How do I ensure the pipeline is healthy? While somewhat the “new kid on the block,” data observability is becoming a core part of ensuring pipelines are healthy and running as expected. Observability tools often provide dashboards showing how data moves among systems. In addition, they can provide the field-level lineage needed to spot potential problems in data pipelines and to see which downstream systems are affected.
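Observability services do far more than this, but two of the simplest health signals are freshness (did the table load recently?) and volume (did roughly the expected number of rows arrive?). The sketch below is a hand-rolled illustration under those assumptions; the `TableStats` class, the `check_table` function, and the thresholds are all hypothetical and do not correspond to any particular product's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TableStats:
    """Hypothetical metadata a pipeline might record after each load."""
    name: str
    last_loaded_at: datetime
    row_count: int
    expected_min_rows: int


def check_table(stats: TableStats, now: datetime, max_staleness: timedelta) -> list[str]:
    """Return a list of human-readable issues; an empty list means healthy."""
    issues = []
    # Freshness check: flag tables that have not loaded within the allowed window.
    if now - stats.last_loaded_at > max_staleness:
        issues.append(f"{stats.name}: stale, last loaded {stats.last_loaded_at.isoformat()}")
    # Volume check: flag loads that delivered fewer rows than expected.
    if stats.row_count < stats.expected_min_rows:
        issues.append(
            f"{stats.name}: only {stats.row_count} rows, expected at least {stats.expected_min_rows}"
        )
    return issues


# Fixed timestamps keep the example deterministic.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
orders = TableStats("orders", datetime(2024, 1, 2, 11, 0, tzinfo=timezone.utc), 5000, 1000)
events = TableStats("events", datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc), 40, 1000)

orders_issues = check_table(orders, now, max_staleness=timedelta(hours=6))
events_issues = check_table(events, now, max_staleness=timedelta(hours=6))
print(orders_issues)
print(events_issues)
```

In this example `orders` passes both checks while `events` fails both, which is the kind of signal a dashboard or alert would surface.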
What to do with the data:
How do I measure and provide data-driven decisions to my stakeholders? There are numerous tools available for data analytics. The big gorilla, Tableau, used to be the only serious game in town. Its ability to model multi-dimensional data made it possible to generate previously unavailable insights without massive SQL manipulation. Additional tools now compete with Tableau, including Power BI, Looker, Mode, ThoughtSpot, and Preset. Each tool comes with its pluses and minuses and has its own learning curve. Each also allows analysts to create dynamic dashboards and visualizations and provides a central location for data users to consume insights.
How do I predict outcomes from new data? Artificial intelligence and machine learning have been extremely hot this last decade. Machine learning has been essential to modern decision-making, allowing users to create statistical models to predict outcomes as new data flows into systems. This type of prediction has become the heart of modern-day marketing.
How do I ensure the data is both valuable and manageable? While collecting data is important, collecting it without rules for how that data is represented can be an expensive disaster. Governance plays a crucial role in ensuring data is high quality, accurate, and usable.
How do I get these derived insights back into my application? Now that the data pipeline has captured, stored, and made predictions on the data, reverse ETL provides the mechanism to push those insights back into your applications. It is often the final connection of the data “circle of life” as applications make decisions from the valuable insights created by the pipeline.
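The reverse ETL step can be pictured as a small sync loop: read scored rows out of the warehouse and upsert them into the application. The sketch below is only an illustration of that idea. `FakeCrm` is a hypothetical in-memory stand-in for an application's API client, and the `churn_risk` field is an invented example of a model output; a real reverse ETL service would handle authentication, batching, retries, and rate limits.

```python
from typing import Any


class FakeCrm:
    """In-memory stand-in for an application API client (hypothetical)."""

    def __init__(self) -> None:
        self.records: dict[int, dict[str, Any]] = {}

    def upsert(self, customer_id: int, fields: dict[str, Any]) -> None:
        # Create the record if it is missing, then merge in the new fields.
        self.records.setdefault(customer_id, {}).update(fields)


def sync_scores(warehouse_rows: list[dict[str, Any]], crm: FakeCrm) -> int:
    """Push each warehouse row's score into the application; return rows synced."""
    synced = 0
    for row in warehouse_rows:
        crm.upsert(row["customer_id"], {"churn_risk": row["churn_risk"]})
        synced += 1
    return synced


# Rows as they might come back from a warehouse query over model predictions.
warehouse_rows = [
    {"customer_id": 1, "churn_risk": 0.82},
    {"customer_id": 2, "churn_risk": 0.10},
]

crm = FakeCrm()
crm.upsert(1, {"name": "Acme"})  # a record that already exists in the application
synced = sync_scores(warehouse_rows, crm)
print(synced, crm.records)
```

Using an upsert rather than an insert means existing application fields, such as the customer name above, are preserved while the new insight is attached.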
How do we make the data applicable to the business without an analyst? The semantic layer attempts to make data more intelligible and usable to the business community without analysts as intermediaries. It does this by exposing data with “guardrails” and familiar interfaces that make building and customizing reports both accessible and efficient.
These areas provide the backbone of a modern data stack. The services are often pieced together on top of your cloud provider to create an understandable, reliable solution. The great thing about the modern data stack is that it is still evolving and will probably continue to evolve indefinitely. If one service does not meet your needs, there are often multiple alternatives that may be a better fit. The framework is engineered so that future innovative solutions can be developed and plugged into the stack without major reengineering. Unicon is focused on continuing to explore and provide insights on each layer of the stack in future articles.