Data Lakehouse

The Lakehouse paradigm is a relatively new architecture that combines the best concepts from the data lake and data warehouse models. We at Y point have accumulated extensive experience in matching organizational needs and vision to the most effective architecture, balancing performance, end-user comfort and cost. This article explains the features, components and logical architecture of a Lakehouse in simple terms.

Lakehouse Features

  • Built to store both structured and unstructured data like a data lake
  • Imposes structure and schema on the data lake, similar to a data warehouse
  • Cloud deployment provides the ability to scale storage and compute independently
  • Reduces vendor lock-in by building on the following open source projects: Apache Spark, MLflow and Delta Lake
  • Stores a centralized metadata catalog of all data objects, their versions, access control lists and organizational governance rules. These rules can be applied at the logical Table / View / Attribute level rather than at the physical file level

Lakehouse Components

As the adjacent picture shows, a Lakehouse is a stack of layers. The bottom-most layer is the Data Lake, which is usually built on a cloud platform. Delta Lake is a set of APIs that can be treated as part of the Data Lake. Above that is a metadata, caching and indexing layer. ETL pipelines read directly from the Data Lake and have access to the metadata. A unified metadata API layer serves the consumer APIs, namely the SQL APIs and the APIs that feed Data Science and Machine Learning applications. This section explains each of these components.

Apache Spark is an open source, fault-tolerant, multi-workload distributed compute engine that supports in-memory distributed processing. It can run on a single node, such as a developer’s laptop, or be deployed on a cluster. ETL, data engineering and graph processing can be performed in a single data pipeline that draws on disparate data sources in structured, unstructured and streaming formats. Consumers include dashboards powered by both DW/BI features and ML capabilities. Spark provides Python, Scala, Java, R and SQL bindings, so a developer can use any of these languages along with the libraries available in that language.

MLflow is an open source platform for managing the machine learning lifecycle, including experimentation, reproducibility, deployment and a central model registry. It integrates well with Apache Spark. MLflow currently offers four components that streamline machine learning workflows. The MLflow Projects and MLflow Models components package project code and models so that they can be reproduced on different platforms. A packaged model can be saved in several “flavors” that downstream tools understand and can be served, for example, through a REST API. The MLflow Model Registry component is a centralized model store with an accompanying UI and API. The MLflow Tracking component records each “run”, which is one execution of a piece of code.
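To make the ideas of a tracked "run" and a central model registry concrete, here is a minimal sketch in plain Python. It does not use the MLflow API. The class ToyTracker, its methods, the model name and the parameter and metric values are all hypothetical and exist only for illustration.

```python
import itertools


class ToyTracker:
    """Toy stand-in for run tracking plus a model registry (not the MLflow API)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.runs = {}      # run_id -> {"params": {...}, "metrics": {...}}
        self.registry = {}  # model name -> list of run_ids; position + 1 is the version

    def log_run(self, params, metrics):
        """Record one execution of training code and return its run id."""
        run_id = next(self._ids)
        self.runs[run_id] = {"params": dict(params), "metrics": dict(metrics)}
        return run_id

    def best_run(self, metric):
        """Return the id of the run with the highest value for `metric`."""
        return max(self.runs, key=lambda r: self.runs[r]["metrics"][metric])

    def register(self, name, run_id):
        """Register the model produced by `run_id`; return its new version number."""
        versions = self.registry.setdefault(name, [])
        versions.append(run_id)
        return len(versions)


# Made-up parameters and metrics, purely for the demo
tracker = ToyTracker()
first = tracker.log_run({"max_depth": 3}, {"accuracy": 0.81})
second = tracker.log_run({"max_depth": 5}, {"accuracy": 0.86})
best = tracker.best_run("accuracy")
version = tracker.register("churn-model", best)
```

Keeping every run's parameters and metrics side by side is what makes experiments comparable and reproducible. The registry then gives downstream consumers a single place to look up the current version of a model.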

Delta Lake is an open source storage framework that sits on top of a Data Lake and adds structure (schema) to it. It also brings ACID compliance and efficient read/write capability to a Data Lake on any cloud, and it can work with a range of query engines. Additional features include data quality checks, schema evolution, audit trail, time travel (querying the data as of a certain time in the past) and data compaction. To provide these features, Delta Lake stores data in versioned Apache Parquet files in the Data Lake, alongside a transaction log. Together, the data files and the log make up a Delta table. The transaction log records every change to the data as an ordered commit. Periodically, the accumulated changes are consolidated into a single checkpoint file. A read operation reconstructs the latest state of the table from the most recent checkpoint and the changes recorded after it.
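The interplay of transaction log, checkpoint and time travel can be sketched in a few lines of plain Python. This is a toy model under simplifying assumptions, not Delta Lake's actual file format or API. The class ToyDeltaLog, the checkpoint interval of three commits and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaLog:
    """Toy table log: each commit lists the data files it adds or removes."""

    checkpoint_every: int = 3  # assumed interval, chosen only for the demo
    commits: list = field(default_factory=list)      # index = version number
    checkpoints: dict = field(default_factory=dict)  # version -> frozenset of live files

    def commit(self, add=(), remove=()):
        """Append one atomic commit and return its version number."""
        self.commits.append({"add": list(add), "remove": list(remove)})
        version = len(self.commits) - 1
        if (version + 1) % self.checkpoint_every == 0:
            # Consolidate the log so far into a single snapshot
            self.checkpoints[version] = frozenset(self.snapshot(version))
        return version

    def snapshot(self, version=None):
        """Return the set of live data files as of `version` (latest by default)."""
        if version is None:
            version = len(self.commits) - 1
        # Start from the newest checkpoint at or before the requested version
        usable = [v for v in self.checkpoints if v <= version]
        start = max(usable) if usable else -1
        files = set(self.checkpoints[start]) if usable else set()
        # Replay only the commits recorded after that checkpoint
        for entry in self.commits[start + 1 : version + 1]:
            files.update(entry["add"])
            files.difference_update(entry["remove"])
        return files


log = ToyDeltaLog()
log.commit(add=["part-0.parquet"])                             # version 0
log.commit(add=["part-1.parquet"])                             # version 1
log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])  # version 2, triggers a checkpoint
log.commit(add=["part-3.parquet"])                             # version 3

latest = log.snapshot()     # checkpoint at version 2 plus the commit after it
as_of_v1 = log.snapshot(1)  # time travel: state before part-0 was replaced
```

Because data files are never changed in place, an older version of the table is simply an earlier prefix of the log. This is why time travel and the audit trail come almost for free once the log exists.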

The primary purpose of the metadata layer is to raise the abstraction level of the underlying data. For example, a “Table” abstraction records which data objects belong to a table. Such a logical abstraction is the key to many advantages of the Lakehouse. In addition to schema, ACID compliance and mutable data objects, these abstractions make it possible to apply access control at a table/row/attribute level. Centralized governance rules and the audit trail are implemented in this layer. Indexing and caching are also added at this stage to improve performance.
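As a rough illustration of how a logical table abstraction enables attribute-level rules and an audit trail, consider the sketch below. It is a simplified model in plain Python, not the API of any real catalog product. The names ToyCatalog, TableEntry and plan_read, the user "analyst" and the s3:// paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    files: list    # physical objects in the lake that make up the table
    columns: list  # logical schema (column names only, for brevity)
    grants: dict = field(default_factory=dict)  # user -> set of readable columns


class ToyCatalog:
    """Toy metadata layer: maps table names to files and enforces column grants."""

    def __init__(self):
        self.tables = {}
        self.audit_log = []  # (user, table, requested columns, allowed?)

    def register(self, name, files, columns):
        self.tables[name] = TableEntry(list(files), list(columns))

    def grant(self, name, user, columns):
        unknown = set(columns) - set(self.tables[name].columns)
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        self.tables[name].grants.setdefault(user, set()).update(columns)

    def plan_read(self, name, user, columns):
        """Check the rules at the logical level, then resolve to physical files."""
        entry = self.tables[name]
        allowed = entry.grants.get(user, set())
        denied = [c for c in columns if c not in allowed]
        # Every attempt is recorded, whether or not it succeeds
        self.audit_log.append((user, name, tuple(columns), not denied))
        if denied:
            raise PermissionError(f"{user} may not read {denied} from {name}")
        return {"files": list(entry.files), "columns": list(columns)}


catalog = ToyCatalog()
catalog.register(
    "customers",
    files=["s3://lake/customers/part-0.parquet", "s3://lake/customers/part-1.parquet"],
    columns=["id", "name", "email"],
)
catalog.grant("customers", "analyst", ["id", "name"])

plan = catalog.plan_read("customers", "analyst", ["id"])
try:
    catalog.plan_read("customers", "analyst", ["email"])
    blocked = False
except PermissionError:
    blocked = True
```

The consumer only ever names a table and its columns. The catalog decides which files to read and whether the request is permitted, which is what allows governance rules to live in one place instead of being attached to individual files.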
