A Data Lakehouse offers a single platform, from data storage to consumption, for disparate data formats, different kinds of workloads and different types of consumers, all on a cost-effective, scalable cloud foundation. Databricks offers its Lakehouse on multiple clouds, e.g. AWS, Microsoft Azure and Google Cloud Platform. This article explains the layers and terminology, followed by the architecture. Please refer to this article for the concepts of a Lakehouse.
Databricks Lakehouse on AWS is a popular choice because it provides a platform that addresses all analytics and AI use cases with a better price-to-performance ratio and simplicity. Customers who already have their data lakes on AWS benefit in particular, because Databricks Lakehouse adds to, rather than replaces, their existing data lake.
The Lakehouse paradigm is a relatively new architecture that combines the best concepts from the data lake and data warehouse models: the scalability and flexibility of data lakes together with the reliability and performance of data warehouses. It has the following layers:
As of September 2022, Databricks Lakehouse can be built on AWS Graviton2-based Amazon Elastic Compute Cloud (Amazon EC2) instances. The custom Graviton processors in these EC2 instances work well with Photon, Databricks' high-performance query engine, to deliver good price-performance.
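As a rough illustration of how this pairing is requested in practice, the sketch below submits a cluster specification to the Databricks Clusters API asking for a Graviton-based instance type with the Photon runtime engine. The cluster name, runtime version and instance type are placeholder assumptions; availability varies by region and account.

```python
# Minimal sketch: request a Graviton-based EC2 node type with Photon via the
# Databricks Clusters API 2.0. Values below are illustrative placeholders.
import os
import requests

cluster_spec = {
    "cluster_name": "graviton-photon-demo",       # hypothetical name
    "spark_version": "11.3.x-photon-scala2.12",   # a Photon-enabled runtime
    "node_type_id": "m6gd.xlarge",                # Graviton2-based EC2 instance
    "runtime_engine": "PHOTON",
    "num_workers": 2,
}

# Assumes DATABRICKS_HOST and DATABRICKS_TOKEN are set for your workspace.
resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=cluster_spec,
)
print(resp.json())
```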
Storage is built on Amazon Simple Storage Service (Amazon S3) because of its durability, availability and scalability. S3 provides elastic scalability, and a single object can be terabytes in size. S3 offers different storage classes to fit different requirements: for example, Standard is for frequently accessed data, Intelligent-Tiering automatically moves data to the most appropriate tier, and S3 Glacier and S3 Glacier Deep Archive are for infrequently accessed data.
Data is stored as “objects” in logical containers called “buckets”. Every object has a key associated with it, and the key may contain a directory-like prefix to imply a hierarchy. S3 can be managed through the console or the S3 API. Databricks stores most of its data as these objects. However, it also uses “block storage”, similar to traditional hard drives: the disk cache, operating system and libraries are examples of files kept on block storage. Apache Spark uses blocks for efficient parallelization and data loading.
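The short sketch below shows buckets, directory-like keys and storage classes through the S3 API (boto3). The bucket and key names are hypothetical placeholders.

```python
# Minimal sketch using boto3 to show buckets, keys and storage classes.
import boto3

s3 = boto3.client("s3")

# Upload an object; the "/" in the key only implies a hierarchy, S3 itself is flat.
s3.put_object(
    Bucket="my-lakehouse-bucket",
    Key="raw/sales/2022/09/orders.json",
    Body=b'{"order_id": 1, "amount": 42.0}',
    StorageClass="INTELLIGENT_TIERING",  # let S3 move it to the appropriate tier
)

# List everything under the "raw/sales/" prefix, as if it were a directory.
response = s3.list_objects_v2(Bucket="my-lakehouse-bucket", Prefix="raw/sales/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["StorageClass"])
```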
The term “Delta” refers to technologies related to, or part of, the Delta Lake open source project. Delta Lake adds schema to the data lake and allows for transactions, versioning and rollback. It introduces the logical abstractions of Tables, Views and attributes on the data lake. A query optimizer, the “Delta engine”, improves the performance of Apache Spark by pushing computations to the data. A Delta table has an associated DeltaLog, or Delta Lake transaction log, that records all changes to that table. Databricks supports “Delta Sharing”, an open protocol for secure data sharing across organizations regardless of the computing platforms they use.
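A small sketch of the transaction log and versioning, run from a Databricks notebook (where the `spark` session is predefined); the table name is hypothetical.

```python
# Sketch of Delta Lake transactions, versioning and rollback.
df = spark.createDataFrame(
    [(1, "created"), (2, "created")], ["order_id", "status"]
)
df.write.format("delta").mode("overwrite").saveAsTable("orders_delta")

# Each write or update is recorded as a new version in the Delta transaction log.
spark.sql("UPDATE orders_delta SET status = 'shipped' WHERE order_id = 1")

# Inspect the transaction log ...
spark.sql("DESCRIBE HISTORY orders_delta").select("version", "operation").show()

# ... and roll back logically by querying an earlier version (time travel).
spark.sql("SELECT * FROM orders_delta VERSION AS OF 0").show()
```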
Unity Catalog is a unified governance solution for data and AI assets in a Lakehouse. An S3 bucket is associated with a “metastore” as its managed storage. A metastore is the top-level container for metadata; its hierarchy contains Catalog, Schema and Table, in that order. A Volume sits at the same level as a Table and provides governance for non-tabular data. Both Tables and Volumes can also be external, stored outside the data lake's managed storage. Data permissions are granted to users or groups by the metastore admin or the owner of the object (e.g. a table or schema). Securable objects in Unity Catalog are hierarchical, and privileges are inherited downward.
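The sketch below illustrates the Catalog–Schema–Table hierarchy and downward privilege inheritance using SQL from a notebook. The catalog, schema, table and group names are hypothetical, and the workspace is assumed to be attached to a Unity Catalog metastore.

```python
# Sketch of the Unity Catalog three-level namespace and privilege inheritance.
spark.sql("CREATE CATALOG IF NOT EXISTS sales_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales_catalog.retail")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_catalog.retail.orders (
        order_id INT, amount DOUBLE
    )
""")

# Privileges granted higher in the hierarchy are inherited downward
# by the schemas and tables beneath it.
spark.sql("GRANT USE CATALOG ON CATALOG sales_catalog TO `data_analysts`")
spark.sql("GRANT SELECT ON SCHEMA sales_catalog.retail TO `data_analysts`")
```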
Apache Spark forms the processing layer, and an organization may add specialized products or libraries for custom processing. The purpose of this layer is to host the data pipelines that transform source data into the form expected by the consumption layer.
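A minimal sketch of such a pipeline: read raw objects from S3, apply a couple of clean-up transformations, and write a curated Delta table. The paths and column names are hypothetical.

```python
# Sketch of a processing-layer pipeline on Apache Spark.
from pyspark.sql import functions as F

raw = spark.read.json("s3://my-lakehouse-bucket/raw/sales/")

curated = (
    raw.dropDuplicates(["order_id"])                 # remove duplicate records
       .filter(F.col("amount") > 0)                  # basic validation
       .withColumn("ingested_at", F.current_timestamp())
)

(curated.write
    .format("delta")
    .mode("overwrite")
    .save("s3://my-lakehouse-bucket/curated/sales/"))
```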
Processing-related assets such as notebooks, dashboards and experiments are organized into “folders” within a “workspace”. A special folder, a “Repo”, integrates with Git through Databricks Repos, which provides source code and version control.
As shown in the picture above, seven pillars sit beneath the layers described so far to complete the Lakehouse. The purpose of each pillar is as follows:
Interoperability, Usability, Reliability and Performance Efficiency are built into the product. The rest require the golden triangle of people, processes and technology for effective implementation. Data Governance, Security, Privacy and Cost Optimization are intricately intertwined and together culminate in Operational Excellence. From a technology point of view, Databricks provides the following levers to work with:
Databricks provides security features such as single sign-on to configure strong authentication. Access Control Lists (ACLs) are set up by admins to control who can perform which actions on the objects within a “Workspace”; an organization may have one or more Workspaces. Data in the data lake can be encrypted if required, or only the sensitive subset of it. To authenticate against external data sources, Databricks provides a mechanism called “Databricks Secret Management”. Databricks also provides controls that help meet the requirements of many compliance standards, such as HIPAA and PCI. Auditing features are provided out of the box to monitor user activity at the desired level of detail, and a “Security Analysis Tool” helps analyse the data collected.
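As a brief sketch of Secret Management in use, the snippet below reads a credential from a secret scope and uses it to authenticate a JDBC read. The scope name, key and connection details are hypothetical; the secret value itself is never displayed in the notebook.

```python
# Sketch: authenticate to an external source using Databricks Secret Management.
jdbc_password = dbutils.secrets.get(scope="prod-db", key="warehouse-password")

df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://example-host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", jdbc_password)   # redacted if printed in notebook output
    .load())
```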
The following diagram depicts our approach towards a Data Lakehouse using Databricks on AWS. The grey box is a pre-built library from our R&D division that speeds up development; it contains standard clean-up, validation and data engineering operations built on top of Apache Spark. We recommend add-ons such as Amazon Athena to complete the ecosystem.
A Data Lakehouse is the logical next step in the data storage and processing landscape. Databricks Lakehouse on AWS is the joint effort of Databricks and Amazon, both leaders in their own spaces. The architecture itself is based on open source projects, which, combined with open data sharing, protects an organization from long-term vendor lock-in. The product is considered pricey, but with careful design, architecture and development it can deliver a good performance-to-cost ratio. Alternatives include a lakehouse built with native AWS services such as AWS Lake Formation, Databricks Lakehouse on other clouds, or a pure open source implementation.