Leo Chashnikov

Navigating Data Management: Warehouses, Lakes and Lakehouses

In today’s dynamic data management landscape, the terminology and concepts related to data storage and processing have become more intricate. Businesses face a significant challenge in efficiently handling the influx of data from diverse sources. In this article, I aim to unravel various approaches to data management, offering examples of tools for each concept and providing a roadmap of the modern data management landscape.


Database: The Foundation


Databases have long been the foundation of data management. They are structured repositories for storing, organising, and retrieving data efficiently. Databases come in various types, usually (very) broadly split into relational and NoSQL databases, each tailored to specific data requirements and use cases.

When we’re talking about SQL solutions, we usually imagine normalised schemas catering for OLTP use cases, while many NoSQL databases are designed to work well with denormalised data.


Key features of databases include:

  • Structured data storage. Databases excel at handling structured data, ensuring data integrity through predefined schemas.

  • Efficient row-level queries. Databases are optimised for row-level querying: when a query can use an index, single records or small sets of rows are retrieved very quickly.

  • Simple deletion and updates of existing data. Databases can usually update or delete a single row very quickly.


While databases are robust for managing structured data, they may face limitations when dealing with unstructured or semi-structured data, and they are not well suited to analytical queries that can read millions (or even billions) of rows at once. This led to the development of more specialised solutions like data warehouses and data lakes, which we will explore in the following sections.
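To make the row-versus-analytics distinction concrete, here is a minimal sketch using Python’s built-in SQLite (the table and data are invented for illustration): an indexed row-level lookup touches only a handful of rows, while an analytical aggregate has to scan the entire table, which is exactly where classic databases start to struggle at scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, i * 0.1) for i in range(100_000)],
)
# The index turns the row-level lookup below into a cheap tree traversal.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# OLTP-style query: retrieves a few rows via the index, very fast.
few_rows = conn.execute(
    "SELECT id, amount FROM orders WHERE customer_id = ?", (42,)
).fetchall()

# Analytical query: no index helps here - every row has to be scanned.
totals = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
).fetchall()
```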


There’s a huge selection of examples; I’d just mention PostgreSQL and MySQL as classic SQL options, and MongoDB and Cassandra on the NoSQL side. The term NoSQL itself is very broad and covers databases aimed at very different use cases.


[Image: a database depicted as stacks of disks]

Data Warehouse: Structured Insights


Data warehouses have long been the cornerstone of analytical data management. These structured repositories are designed for storing, managing, and analysing structured data, as well as providing good performance for analytical queries. Data warehouses are characterised by their schema-on-write approach, meaning data is structured, and might need to be transformed, before it's loaded into the warehouse.


Key features of data warehouses include:


  • Structured data. Data warehouses are best suited for structured data, such as sales records, financial data, and customer information.

  • Schema-on-write. Data is carefully structured and transformed before being loaded into the warehouse. This ensures data quality and consistency, but it also requires developers to write some code when integrating a new data source, or when an existing one changes its output - see the sketch after this list.

  • Optimised for analytics. Data warehouses are designed for fast query performance, making them ideal for business intelligence and reporting.
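As a minimal sketch of schema-on-write (the schema and field names below are invented for the example), incoming records are transformed and validated against a predefined schema before they are loaded, so a malformed record never reaches the warehouse table:

```python
from datetime import date

# Hypothetical warehouse schema every record must conform to before loading.
SALES_SCHEMA = {"sale_id": int, "sold_on": date, "amount_eur": float}

def to_warehouse_row(raw: dict) -> dict:
    """Transform a raw record into the warehouse schema, failing fast on bad data."""
    row = {
        "sale_id": int(raw["id"]),
        "sold_on": date.fromisoformat(raw["date"]),
        "amount_eur": float(raw["amount"]),
    }
    for column, expected_type in SALES_SCHEMA.items():
        if not isinstance(row[column], expected_type):
            raise TypeError(f"{column} must be {expected_type.__name__}")
    return row

# Runs at write time: this either yields a clean row or raises immediately.
print(to_warehouse_row({"id": "7", "date": "2023-11-05", "amount": "19.99"}))
```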


However, data warehouses have limitations when it comes to handling unstructured or semi-structured data and real-time data processing.


Some of the most popular examples include Snowflake, Amazon Redshift and Apache Hive.


[Image: a data warehouse depicted as a physical warehouse]

Data Lake: A Flood of Possibilities


As organisations started to grapple with larger volumes of diverse data types coming from many sources, data lakes emerged as a complementary solution. A data lake is a storage repository that can hold vast amounts of raw data in its native format, whether structured, semi-structured, or unstructured.


Key features of Data Lakes include:


  • Raw data storage. Data lakes usually store data in its raw form, making them suitable for a wide range of data types: tables exported from relational databases, plain-text logs collected from multiple systems, and even binary data like images.

  • Schema-on-read. Data is structured and transformed when it's read, allowing for flexibility in data exploration and analysis (illustrated in the sketch after this list).

  • Scalability. Data lakes can very easily scale horizontally to accommodate almost arbitrary data volumes.
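As a small illustration of schema-on-read (the file layout and field names are made up), raw events land in the lake exactly as they arrive, and structure is imposed only when somebody reads them - each consumer can pick just the fields it cares about:

```python
import json
from pathlib import Path

# Land raw events in the "lake" as-is: no upfront schema is enforced.
lake = Path("lake/events")
lake.mkdir(parents=True, exist_ok=True)
(lake / "2023-11-05.jsonl").write_text(
    '{"user": "alice", "action": "login"}\n'
    '{"user": "bob", "action": "purchase", "amount": 12.5}\n'
)

# Schema-on-read: structure is applied at query time, not at ingestion.
def read_events(path: Path):
    for line in path.read_text().splitlines():
        raw = json.loads(line)
        yield {"user": raw["user"], "action": raw["action"]}

for event in read_events(lake / "2023-11-05.jsonl"):
    print(event)
```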


Data lakes are excellent for storing big data, but without proper governance and data cataloguing they can become unwieldy, turning into the infamous “data swamps”. The usual definition of a data lake doesn’t include any utility for data management, governance or querying - and some companies have tried to solve that by introducing the concept of the “data lakehouse”.


[Image: a data lake depicted as a big body of water]

Data Lakehouse: Best of Both Worlds


Data lakehouses represent a relatively recent innovation in the world of data management, aiming to bridge the gap between the versatility of data lakes and the structured processing capabilities of data warehouses. They combine the best of both worlds by offering a unified and organised storage infrastructure for structured and semi-structured data, while supporting efficient analytical processing. In other words, a lakehouse allows traditional “warehouse-style” analytics and querying, built on top of a data lake.


Key features of Data Lakehouses include:


  • Still scalable. As lakehouses are built on top of lakes, they retain high scalability and can store data in different formats.

  • Schema evolution. They allow for evolving schemas, so data can be ingested in its raw form and structured as needed.

  • Analytics-ready. Data lakehouses provide features for performing queries and data indexing, akin to data warehouses.


Examples of popular data lakehouse systems include Delta Lake (by Databricks), an open-source storage layer that offers ACID transactions and schema enforcement for data lakes, and Apache Iceberg, an open-source project focused on providing an efficient and transactional table format for data lakes, allowing users to manage large-scale data lakes with the same ease and reliability as data warehouses.
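As a minimal sketch of what this looks like in practice with Delta Lake (assuming PySpark and the delta-spark pip package are installed; the table path and data are invented), writes are transactional and the schema can evolve on append:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Standard Delta Lake setup: enable the transactional table format in Spark.
builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
# ACID write: concurrent readers see either the old or the new snapshot.
df.write.format("delta").mode("append").save("/tmp/users_delta")

# Schema evolution: a new column is merged in rather than rejected.
df2 = spark.createDataFrame([(3, "carol", "de")], ["id", "name", "country"])
df2.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save("/tmp/users_delta")

spark.read.format("delta").load("/tmp/users_delta").show()
```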


Data lakehouses are gaining traction as organisations seek to streamline their data architectures, reduce data silos, and enable real-time analytics while maintaining data governance. They represent a promising evolution in the ever-changing landscape of data storage and processing, addressing the challenges posed by the diverse and dynamic nature of modern data.


[Image: a data lakehouse depicted as a house on the sea bank]

Data Mesh: Data as a Product


The concept of a data mesh proposes a different way of thinking about data: as a product, provided and managed (including its quality, uptime, etc.) by the respective teams. It might come in different forms, from a curated dataset to an API. Business units within the company then consume the data product on a self-service basis.

Data Mesh is a paradigm shift in data architecture that addresses the challenges posed by the increasing complexity and scale of data within organisations. It introduces a decentralised approach to data management, breaking away from the traditional centralised data warehouse model. 


Key principles of Data Mesh include:


  • Domain-oriented ownership. Data is owned and managed by cross-functional domain teams, which are responsible for data quality, governance, and access.

  • Data as a product. Data is treated as a product, with clear ownership, documentation, and service-level agreements (SLAs) for data consumers - see the sketch after this list.

  • Self-serve data platform. While individual teams are responsible for providing access to their data, that doesn’t mean data engineers are no longer necessary: they need to build a platform that makes it easy for teams to share and discover the data they need.

  • Federated compute. Data processing and analytics can now be performed close to where the data resides, reducing data movement and improving performance.
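There is no single canonical API for a data product, but the idea can be sketched as a contract that a domain team publishes alongside its dataset (all names and fields below are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class DataProduct:
    """A hypothetical contract a domain team publishes with its dataset."""
    name: str
    owner_team: str           # domain-oriented ownership
    location: str             # where consumers read it: a table, an API, ...
    schema: dict              # documented structure for consumers
    freshness_sla_hours: int  # the SLA: how stale the data may get

orders = DataProduct(
    name="orders.daily",
    owner_team="checkout",
    location="s3://lake/checkout/orders/",
    schema={"order_id": "string", "amount_eur": "double", "day": "date"},
    freshness_sla_hours=24,
)
# A self-serve platform would register this so other teams can discover it.
print(orders)
```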


While Data Mesh is a relatively new concept, it has gained attention in the data management community as a way to address the challenges of data decentralisation and democratisation within large organisations. This approach isn’t for everybody, though: in a smaller company, a single shared storage solution for everyone is easier to set up and manage.



Conclusion - Combining Approaches


While I’ve tried to present somewhat of a “timeline” here, with new tools and concepts emerging over time, it absolutely doesn’t mean that older approaches are deprecated and superseded. Organisations often adopt multiple approaches at once, combining the strengths of these technologies while mitigating their weaknesses.

One topic that wasn’t touched on here at all is the increasing use of ML tools for data management - automating tasks like data cleansing, quality monitoring, anomaly detection, and predictive analytics - making data more valuable and actionable.

