In our first instalment, we journeyed through the evolution of data storage from traditional data warehouses to data lakes, leading to the emergence of the Lakehouse architecture. Now, let’s unpack the inner workings of the Data Lakehouse architecture, exploring the layers and components that define a modern Lakehouse.

The Data Lakehouse is built on the following architecture principles:

  1. Discipline at the core, flexibility at the edge – The storage layers need to be disciplined and structured in terms of data governance and management, while the transformation and consumption layers should stay flexible. A good example is merging raw data from the data lake with transformed data to build an ML model.
  2. De-couple compute and storage – Previous generations of data platforms used architectures with integrated storage and processing layers. With a Lakehouse, the approach is to separate compute from storage, so we can add more storage without increasing compute capacity and vice versa. Think of modern laptops: the same Intel i5 can be paired with 320 GB or 1 TB of storage, and an i7 with a similar range of configurations. This gives us the flexibility to scale, and pay for, each dimension independently.
  3. Focus on functionality rather than technology – The aim here is to support all types of use-cases, including BI, AI/ML, ETL and streaming, which in turn means a data Lakehouse should cater to many different personas. Technology changes rapidly: for the underlying data infrastructure alone, we have gone from relational databases to non-relational databases, to on-prem Hadoop data lakes, to cloud data lakes, and now to the Data Lakehouse. The technology has evolved quickly, but the use-cases for data management and data consumption have stayed largely the same. Focusing on functionality is therefore the better way to accommodate technology evolution.
  4. Modular architecture – A modular architecture ensures that any part of the data platform can be replaced without impacting the rest of the platform. Different services, based on their functionality, can be instantiated to use the data as needed.
  5. Perform active cataloguing – One of the major problems with data lakes is that, with rapid ingestion of large volumes of data, they slowly turn into data swamps. Cataloguing is the key to preventing this, ensuring users know what data exists and where it resides; a brief cataloguing sketch follows this list.
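
To make the last principle concrete, here is a minimal cataloguing sketch using PySpark’s built-in catalog. It assumes a SparkSession backed by a Hive-compatible metastore; the s3://lakehouse/raw/orders/ path and the raw.orders table name are purely hypothetical.

```python
# Minimal active-cataloguing sketch (assumes PySpark with a Hive-compatible
# metastore; the path and table name below are hypothetical).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("active-cataloguing")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS raw")

# Register the raw dataset as an external table so users can discover it
# by name instead of hunting for file paths.
spark.sql("""
    CREATE TABLE IF NOT EXISTS raw.orders
    USING PARQUET
    COMMENT 'Raw order events, ingested daily from the orders source system'
    LOCATION 's3://lakehouse/raw/orders/'
""")

# Anyone on the platform can now list and inspect what exists.
print(spark.catalog.listTables("raw"))
spark.sql("DESCRIBE TABLE EXTENDED raw.orders").show(truncate=False)
```

At enterprise scale a dedicated catalogue service (Hive Metastore, AWS Glue, Unity Catalog and the like) plays this role, but the idea is the same: every dataset that lands in the lake gets a discoverable, documented entry.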

There are 5 major components of the Data Lakehouse:

  • Ingesting and processing the data
  • Storing and serving the data
  • Deriving insights from Lakehouse
  • Applying Data Governance
  • Applying Data security

In this blog, we will discuss the processes for ingesting and processing data, and for storing and serving it.

Ingestion Layer: Blending Flexibility with Structure

The ingestion process is one of the most important aspects of the overall architecture. In our experience, working with large volumes of data involves the following challenges in the ingestion process:

  • Time Efficiency – Connecting to the sources and creating manual mappings is a time-consuming activity that shifts focus away from engineering work towards repetitive, low-value tasks.
  • Schema Changes – Even a small change in the upstream data schema can have a large negative impact on the downstream data warehouse.
  • Changing Schedules – Think of thousands of inter-dependent data pipelines, where a change in one pipeline can ripple through an enterprise’s business predictive capabilities. One organization saw large swings in its customer-acquisition machine learning model because of a shift in ETL timing.
  • Managing streaming and batch data ingestion simultaneously.
  • Data Loss – One of the biggest problems we have seen while working with customers is data loss while synchronizing data between two systems.
  • Duplicate Data – Re-running a failed job can load duplicate data if proper checks and balances are not in place (see the idempotent-load sketch after this list).
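As one illustration of such checks and balances, below is a minimal sketch of an idempotent batch load. It assumes Delta Lake as the table format (the delta-spark package, with the Delta extensions configured on the SparkSession) and a hypothetical event_id column as the natural key; the paths are illustrative only.

```python
# Minimal idempotent-load sketch, assuming Delta Lake (delta-spark) and a
# hypothetical event_id key; paths are illustrative only.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent-ingest").getOrCreate()

# Newly landed batch of events (hypothetical landing path).
incoming = spark.read.json("s3://lakehouse/landing/events/2024-05-01/")

# Existing raw Delta table.
target = DeltaTable.forPath(spark, "s3://lakehouse/raw/events/")

# MERGE inserts only rows whose key is not already present, so re-running
# the same job after a failure cannot create duplicates.
(target.alias("t")
 .merge(incoming.alias("s"), "t.event_id = s.event_id")
 .whenNotMatchedInsertAll()
 .execute())
```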

So, as you would expect, these challenges make data ingestion a complicated process. It is in this regard that the Lambda architecture pattern for data ingestion is well suited to a Lakehouse architecture. The Lambda Architecture works as follows.

The Lambda Architecture is a deployment model for large-volume data processing that brings batch processing and stream processing together, solving for real-time data processing as well as responding to user queries. It consists of three layers – Batch, Streaming, and Publish/Consumption. Let’s understand these three layers (a minimal code sketch follows the list).

  • Batch Layer: The data is collected in a raw data store within a data lake, either through automated pulling or pushing methods. After data ingestion, a batch processing service starts that uses a distributed computing engine to process data efficiently and quickly. The processed data is then stored in two locations. Firstly, it’s stored in the processed data store within the data lake. Secondly, it’s made available to a component in the serving layer, which facilitates downstream access and use of the data.
  • Streaming Layer: The streaming data is ingested into the system as topics using an event publishing service. This raw data is simultaneously stored in the raw data store of the data lake. A stream processing service subscribes to these topics and processes the data in micro-batches, performing specific actions on either group of events or individual events of interest. Similar to the batch layer, the processed data is stored in two locations. It first goes to the processed data store within the data lake. Then, it’s made available to a component in the serving layer for downstream use and consumption.
  • Publish and Consumption Layer: The Publish and Consumption layer is used to serve the processed data to downstream consumers. We will discuss more about these layers in the next blog.
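
Below is a minimal PySpark sketch of the batch and streaming layers described above. It assumes Kafka as the event publishing service (via the spark-sql-kafka connector), and the object-storage paths, topic names and columns are purely illustrative; a real deployment would add schema management, checkpoint policies and error handling.

```python
# Minimal sketch of the batch and streaming layers (Kafka and all paths,
# topics and columns are assumptions for illustration only).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# --- Batch layer: read from the raw data store, process, and write to the
# --- processed data store.
raw_batch = spark.read.json("s3://lakehouse/raw/orders/")
daily_totals = (raw_batch
                .groupBy("customer_id", F.to_date("order_ts").alias("day"))
                .agg(F.sum("amount").alias("daily_spend")))
daily_totals.write.mode("overwrite").parquet("s3://lakehouse/processed/daily_totals/")

# --- Streaming layer: subscribe to a topic and process events in
# --- micro-batches.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load())

parsed = events.select(F.col("value").cast("string").alias("payload"))

(parsed.writeStream
 .format("parquet")
 .option("path", "s3://lakehouse/processed/orders_stream/")
 .option("checkpointLocation", "s3://lakehouse/checkpoints/orders_stream/")
 .start())
```

In both paths the processed output lands in the processed data store and is then exposed through the serving layer.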

 

Storing and Serving the Data

Data storage is critical from both a cost and a performance perspective. Once data is ingested into the platform, it needs to be managed and stored correctly, backed by a strategy that reduces unnecessary data duplication. In addition, utmost care needs to be taken with data access to ensure proper enterprise security standards are met. Once data is processed, it needs to be served to downstream applications and stakeholders, and each consumption pattern will require different technologies.

Data storage in the Lakehouse consists of four zones, each serving a different purpose (a minimal sketch of data flowing through the zones follows the list).

Figure: Data storage zones of the Lakehouse

  • Raw Zone: The initial storage area within the data Lakehouse, often referred to as the raw or bronze datastore, acts as a buffer that separates the incoming data sources from the core of the Lakehouse. Data from various sources is stored here in formats suited to large-scale data operations, such as Avro and ORC, or in its original source format (for example, XML). The organization of the data into fields and records in the raw datastore closely reflects that of the original data sources, and where the source format is preserved it is kept intact: if the original data is a CSV file, it is stored in the raw datastore as a .csv file.
  • Enrich Zone: After data lands in the raw datastore, it undergoes a series of transformations that progressively refine it in the Enrich, or silver, zone. A simple step might ensure that all date fields share the same format; beyond that, the data may be subjected to cleansing, selection, consolidation, augmentation, and similar procedures. These transitional datasets are beneficial for two main reasons:
    • If there’s a need to reinitiate processing tasks, these datasets provide a recovery point.
    • The transitional datastore serves as the foundation for creating the final, processed datastore. Leveraging specialized computational resources to advance data from the transitional to the final processed stage enhances efficiency.

    This approach to data management ensures that the processing is tailored to the data’s current state.

  • Consume Zone: Data within the intermediate, or silver, datastore is compiled and transferred to the Consume, or gold, datastore. This final layer houses data that has been cleansed and consolidated, ready for use. This processed data then becomes accessible for analytical tasks within the data analytics layer, including exploratory analysis, ad-hoc querying, and machine learning.
  • Archive Zone: The archive zone is the terminal storage layer, catering to long-term data retention needs. It is a cost-effective storage solution, and periodic archiving routines are established to systematically migrate data from the raw, silver, or gold datastores into this layer. The process is timed carefully to strike a balance between cost and performance, ensuring that only data which is not in active use is archived, thus optimizing both the utility and the cost of the storage infrastructure.
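
Here is a minimal sketch of data flowing through these zones, assuming plain Parquet files on object storage; the paths and columns are hypothetical, and a real pipeline would typically use an open table format (Delta, Iceberg or Hudi) and a scheduler for the archiving step.

```python
# Minimal raw -> enrich -> consume flow (paths and columns are hypothetical;
# plain Parquet is assumed instead of a full table format).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("zones-sketch").getOrCreate()

# Raw (bronze): the source file has been landed as-is.
raw = spark.read.option("header", True).csv("s3://lakehouse/raw/customers.csv")

# Enrich (silver): standardise dates, drop obviously bad records, deduplicate.
silver = (raw
          .withColumn("signup_date", F.to_date("signup_date", "yyyy-MM-dd"))
          .filter(F.col("customer_id").isNotNull())
          .dropDuplicates(["customer_id"]))
silver.write.mode("overwrite").parquet("s3://lakehouse/enrich/customers/")

# Consume (gold): aggregate into a consumption-ready dataset.
gold = silver.groupBy("country").agg(F.count("*").alias("customer_count"))
gold.write.mode("overwrite").parquet("s3://lakehouse/consume/customer_counts/")
```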

The common formats for storing data in these zones are CSV, Parquet, and JSON for structured and semi-structured data, while unstructured data such as video (MP4, AVI), images (JPEG, TIFF, GIF), and audio (WAV) is stored in its native format.

Since a Lakehouse can support the use-cases of a data warehouse, real-time data services, and data sharing services, we recommend choosing the consumption-layer technology based on the use-case. For data warehousing, SQL-based serving is the best fit for Business Intelligence and Artificial Intelligence use-cases (a brief sketch is shown below). For real-time data services, API-based serving backed by a NoSQL store can be used. Data sharing is gaining traction as curated data needs to be shared with both internal and external partners; it can be delivered through APIs as well as data clean rooms.
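
As an illustration of SQL-based serving, the sketch below registers the hypothetical gold dataset from the previous example in the catalogue and queries it the way a BI tool would (again assuming a Hive-compatible metastore).

```python
# Minimal SQL-serving sketch (hypothetical table name and path; assumes the
# same Hive-compatible metastore as in the cataloguing example above).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sql-serving")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS consume")
spark.sql("""
    CREATE TABLE IF NOT EXISTS consume.customer_counts
    USING PARQUET
    LOCATION 's3://lakehouse/consume/customer_counts/'
""")

# A BI tool or an analyst can now issue plain SQL against the gold zone.
spark.sql("""
    SELECT country, customer_count
    FROM consume.customer_counts
    ORDER BY customer_count DESC
    LIMIT 10
""").show()
```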

Conclusion: The Technical Symphony of Lakehouses

The Lakehouse architecture is a symphony of various technical components, each playing a crucial role in managing and processing data effectively. For organizations looking to harness their data’s full potential, understanding these components is key to leveraging the power of a Lakehouse.

Midoffice Data stands at the forefront of this technological revolution, simplifying the complexity of Lakehouse architecture for businesses seeking agility, efficiency, and data-driven decision-making. We provide an end-to-end suite, from data ingestion to storage and serving capabilities, to meet enterprises’ complex data needs.

 

