Over the last several years, many articles and talks have argued for abandoning the Enterprise Data Warehouse (EDW) in favor of an Enterprise Data Lake. Some passionately promote the idea; others just as passionately deny that it is achievable. In this article, I would like to take a more pragmatic approach and lay out a process that enterprises should consider when designing a data management architecture.
The focus is on data lakes for enterprises, referred to here as Enterprise Data Lakes to distinguish them from data lakes created by internet, ad-tech or other technology companies, which have different types of data and access requirements.
The Enterprise Data Warehouse
The much-reviled and beleaguered Data Warehouse has been the mainstay of enterprises for over 20 years, supporting business reports and dashboards and allowing analysts to understand how the business is functioning. When built right, Data Warehouses provide robust security, audit and governance, which are critical, especially given the increasing number of cyber attacks today.
Alas, many data warehouse projects are so complex that they are never finished! Further, the strict, hierarchical governance that many IT departments created around the warehouse has caused a lot of frustration, because business analysts and researchers cannot explore the data freely.
The Hadoop Phenomenon
When Hadoop entered the mainstream, the big attraction for business analysts and data scientists was the ability to store and access data outside the restrictive bounds of IT! This raised the exciting possibility of finding new insights into business operations, optimizing spend and finding new revenue streams.
Defining the Enterprise Data Lake
James Dixon coined the term Data Lake in 2010 to mean data flowing from a single source, with the data stored in its natural state. We have come some way from that definition. The most common definition of a Data Lake today is a data repository for many different types and sources of data, be they structured or unstructured, internal or external, that facilitates different ways of accessing and analyzing the data. The Data Lake is built on Hadoop, with the data stored in HDFS across a cluster of systems.
The Data Lake must have the following characteristics:
- It must collect and store data from one or more sources in its original, raw form and, optionally, in its various processed forms.
- It must allow flexible access to the data from different applications; for example, structured access to tables and columns as well as unstructured access to files.
- Entity and transaction data must have strong governance defined to prevent the lake from becoming a swamp.
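To make the three characteristics concrete, here is a minimal sketch of how a catalog for such a lake might record them. It is an illustration only, not a description of any particular product: the `Dataset` and `Catalog` classes, the `GOVERNED_KINDS` set, and the idea of using a named owner as the governance check are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Assumption for this sketch: these kinds of data require governance.
GOVERNED_KINDS = {"entity", "transaction"}


@dataclass
class Dataset:
    name: str
    kind: str                       # e.g. "entity", "transaction", "log"
    raw_path: str                   # data kept in its original, raw form
    processed_paths: list[str] = field(default_factory=list)  # optional processed forms
    table_name: Optional[str] = None  # set when structured (table) access is offered
    owner: Optional[str] = None       # hypothetical governance field: accountable steward


class Catalog:
    """In-memory registry of the datasets held in the lake."""

    def __init__(self) -> None:
        self._datasets: dict[str, Dataset] = {}

    def register(self, ds: Dataset) -> None:
        # Refuse ungoverned entity and transaction data.
        if ds.kind in GOVERNED_KINDS and ds.owner is None:
            raise ValueError(f"{ds.name}: {ds.kind} data needs an owner")
        self._datasets[ds.name] = ds

    def file_paths(self, name: str) -> list[str]:
        # Unstructured access: the raw file first, then any processed forms.
        ds = self._datasets[name]
        return [ds.raw_path, *ds.processed_paths]

    def table(self, name: str) -> Optional[str]:
        # Structured access: the table name, if one has been defined.
        return self._datasets[name].table_name
```

In this sketch, a log dataset can be registered with nothing more than its raw path, while registering customer or order data without an owner raises an error. A real lake would need far richer governance than a single field, but the shape of the rule is the same.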
Let’s dig into the details.