What is a Data Lake Anyway?
At Unicon, we help many institutions and enterprises develop and implement their data analytics strategies. Two questions we hear often are “What is a data lake?” and “How does it differ from a data warehouse?”
By the end of this post, you should know what a data lake is, how it differs from a data warehouse, and how each fits into your overall data strategy.
A Short History
Back in the 1990s and early 2000s, the term data warehouse was used to describe a centralized relational database that stored business event data from disparate sources to support the creation of analytics and reports. As the amount of data stored in data warehouses grew, query times grew as well, so very powerful and expensive databases such as Teradata and Oracle RAC were introduced to handle the load.
When the World Wide Web exploded, enterprises suddenly had potentially millions of users generating millions of business events, and even the most powerful data warehouse products couldn't keep up with the load. New ways were needed to store all that data, and so the Apache Hadoop Distributed File System (HDFS) was born. This distributed file system architecture allowed terabytes of raw business event data to be stored on inexpensive commodity hardware.
Because it was not practical to run queries directly on the massive amounts of raw data in HDFS, a technique called MapReduce was used to summarize and reduce the data into data sets small enough for meaningful queries to be run. These reduced data sets were usually stored in a relational database and came to be known as “data marts.”
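To make the idea concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop code; the event records, field names, and function names are all made up for illustration. It only shows how many raw events collapse into a small summary that could be loaded into a data mart.

```python
from collections import defaultdict

# Hypothetical raw business events, as they might sit in a distributed file system.
RAW_EVENTS = [
    {"user": "u1", "type": "login", "date": "2020-01-01"},
    {"user": "u2", "type": "login", "date": "2020-01-01"},
    {"user": "u1", "type": "purchase", "date": "2020-01-01"},
    {"user": "u3", "type": "login", "date": "2020-01-02"},
]


def map_phase(event):
    """Map: emit one (key, value) pair per raw event."""
    return (event["date"], event["type"]), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single number."""
    return key, sum(values)


def run_job(events):
    """Run the three steps in sequence on one machine."""
    pairs = [map_phase(event) for event in events]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Event counts per (date, type): small enough to store and query in a data mart.
summary = run_job(RAW_EVENTS)
```

In a real cluster the map and reduce steps run in parallel across many machines. The sketch keeps everything in one process so the data flow is easy to follow.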
The term “data lake” was coined to describe the raw enterprise data originally stored in HDFS and other hosted distributed file systems, and later stored in cloud-based file systems such as Amazon Web Services S3 and Google Cloud Storage.
The Problem with the Data-Warehouse-Only Strategy
If you want to skip implementing a data lake and store all your institutional data directly in a relational data warehouse, you may face the following challenges:
- It is very difficult to model raw unstructured data in a relational database table, so you have to pick and choose which data elements you want to store to support analytics. If a future analytics use case requires an element not in your model, it is very difficult to go back and add it.
- Your query times may grow too long as the amount of data in your tables increases.
- Databases can have a much greater operational cost than a data lake.
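The first challenge above can be illustrated with a small sketch that uses Python's built-in sqlite3 module as a stand-in for a relational warehouse. The event, table, and column names are assumptions for illustration only. The table keeps only the columns chosen up front, while the raw record still holds a field that nobody modeled.

```python
import json
import sqlite3

# Hypothetical raw event; "browser" is a field nobody planned to analyze.
RAW_EVENT = (
    '{"user_id": "u1", "type": "page_view", '
    '"page": "/courses", "browser": "Firefox"}'
)
event = json.loads(RAW_EVENT)

# Warehouse-only approach: the columns are fixed when the table is designed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, type TEXT, page TEXT)")
conn.execute(
    "INSERT INTO events VALUES (?, ?, ?)",
    (event["user_id"], event["type"], event["page"]),
)

# PRAGMA table_info returns one row per column; index 1 is the column name.
warehouse_columns = [row[1] for row in conn.execute("PRAGMA table_info(events)")]

# Data lake approach: keep the raw record and decide what to read later.
lake = [RAW_EVENT]
browsers = [json.loads(line).get("browser") for line in lake]
```

Here `warehouse_columns` has no `browser` entry, so events already loaded into the table have lost that value, while `browsers` can still be read from the raw records whenever a new use case calls for it.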
What's so Great About a Data Lake?
The main benefit of a data lake is flexibility. You can store any type of data in any format: CSV, JSON, XML, clickstream data, log data, you name it. And because data lake storage is cheap, you can collect as much as you want. When a particular analytics use case comes up, you have access to all the data in your institution, and you can transform it into a data mart tailor-made for that use case.
Data lakes also break down data silos. A data lake contains all of the institution's data, and with proper data governance you can make that data available to anyone in any department.
Data lakes also make it easier to derive new value from your data. In recent years, machine learning techniques have matured to the point where they can sift through a data lake and find valuable patterns you may not have anticipated.
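As a rough sketch of that last step, the example below combines two raw sources in different formats (a CSV export and JSON-lines clickstream data) into one small data mart table. The file contents, field names, and the `build_mart` function are hypothetical, and sqlite3 stands in for whatever database hosts your marts.

```python
import csv
import io
import json
import sqlite3
from collections import Counter

# Hypothetical raw files from the lake, inlined as strings to keep this runnable.
ENROLLMENTS_CSV = "student_id,course\ns1,MATH101\ns2,MATH101\ns3,HIST200\n"
CLICKS_JSONL = "\n".join([
    '{"student_id": "s1", "page": "/syllabus"}',
    '{"student_id": "s1", "page": "/quiz"}',
    '{"student_id": "s2", "page": "/quiz"}',
    '{"student_id": "s3", "page": "/syllabus"}',
    '{"student_id": "s9", "page": "/home"}',
])


def build_mart(enrollments_csv, clicks_jsonl):
    """Join the two raw sources and load click counts per course into a table."""
    enrolled = {
        row["student_id"]: row["course"]
        for row in csv.DictReader(io.StringIO(enrollments_csv))
    }
    clicks = Counter()
    for line in clicks_jsonl.splitlines():
        if not line.strip():
            continue
        click = json.loads(line)
        course = enrolled.get(click["student_id"])
        if course is not None:  # skip clicks from students with no enrollment row
            clicks[course] += 1

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE course_activity (course TEXT PRIMARY KEY, clicks INTEGER)"
    )
    conn.executemany(
        "INSERT INTO course_activity VALUES (?, ?)", sorted(clicks.items())
    )
    return conn


mart = build_mart(ENROLLMENTS_CSV, CLICKS_JSONL)
rows = mart.execute(
    "SELECT course, clicks FROM course_activity ORDER BY course"
).fetchall()
```

The raw sources stay untouched in the lake, so a different use case can build a different mart from the same data later.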
The Evolution of the Data Warehouse
For institutions that implement data lakes, the term “data warehouse” has evolved to describe the database server used to store the data marts derived from the data lake via MapReduce and other techniques. The data warehouse and its marts support your analytics tools and dashboards.
Putting the Pieces Together
The following diagram shows how the data lake and data warehouse fit within an institution's typical data strategy.
Hopefully you now know what a data lake is, how it differs from a data warehouse, and how you can use both in your overall data strategy to gain maximum benefit.
Unicon has helped many institutions define and implement their data strategy and is ready to help!