Service Oriented Architecture (SOA) ushered in the idea of software architecture as a set of loosely coupled, self-contained, interoperable, reusable “services”. Architects no longer need to stick to the stack provided by a single vendor and add custom code wherever required. The paradigm has shifted towards “composing” the architecture from these “software LEGO blocks” that work together to solve the problem at hand. The architect’s focus remains on the business problem and on matching requirements to the features provided by the available off-the-shelf services. The same applies to Lakehouse architecture design: an architect has to weigh the available choices and find the best fit for the organization’s requirements.
This article expands on the previous article, Introduction to Lakehouse, and maps the logical architecture to the software landscape. Given the plethora of offerings available today, we at Y Point have chosen a subset of offerings at each layer, as shown in the picture above. For example, giants like Oracle, IBM and Informatica have their own Lakehouse offerings. They become the default choice for organizations that have a long and happy relationship with these vendors and need look no further. If you are one of those organizations that prefer to put effort into making the right combination of choices, this article provides pointers to ponder.
At the lowest level, one requires a storage layer, or data lake. The top three cloud service providers are considered as candidates, given that they hold about two-thirds of the cloud market share. All three are competitive and a good fit for the task, and multi-cloud deployments have become commonplace in many organizational IT landscapes. Each has its own key differentiators, as discussed below:
Amazon Web Services (AWS) was the first to enter the market and is still the leader, though the gap is closing. With the most availability zones and services, AWS is a strong candidate. Spot instances can cut costs by up to 90% if they fit the requirements. Extensive data backup options and the wide variety of third-party offerings in the AWS Marketplace are added advantages.
Microsoft Azure is an excellent fit in a Microsoft-centric IT infrastructure, mainly due to familiarity, reuse of existing licenses, and established security and access-control processes. It has adequate availability zones and services, and its integration of on-premises and public cloud (hybrid cloud) is a significant plus.
Google Cloud Platform (GCP) is a relative newcomer but has its own strong niche. Startups that leverage open source extensively find a better partner in GCP. Multi-cloud and SMB focus are other significant differentiators in favour of GCP.
Delta Lake is a logical layer of software APIs that adds ACID properties and incremental changes (mutability), among other things, to the data lake. Several use cases can be implemented just by wrapping an existing cloud data lake with this layer, especially if data object cataloguing and security are not stringent requirements. All the options considered here provide ACID transactions, mutability, schema evolution and time travel. Other common features include the Parquet file format and Python scripting support.
Delta Lake was developed by Databricks on top of the Apache Spark distributed computing engine and later open sourced. In line with the unified-workloads principle, Delta Lake unifies stream and batch processing. It is an excellent fit for an Apache Spark based ecosystem and has an edge over the others in ACID transaction support.
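The snippet below is a minimal sketch of how Delta Lake adds versioned, ACID commits and time travel on top of plain object storage. It assumes a Spark session configured with the open source delta-spark package; the table path is hypothetical.

```python
# Minimal sketch: ACID writes and time travel with Delta Lake on Spark.
# Assumes the delta-spark package is on the classpath; the path is hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-demo")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/lakehouse/events"  # hypothetical data lake location

# Each write is an atomic commit that creates a new version in the transaction log.
spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"]) \
     .write.format("delta").mode("overwrite").save(path)

spark.createDataFrame([(3, "click")], ["id", "event"]) \
     .write.format("delta").mode("append").save(path)

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```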
Apache Hudi (Hadoop Upserts Deletes and Incrementals) was developed by Uber and open sourced. It is designed to enable real-time data access and streaming data analytics. Its indexing mechanism helps in efficient querying over partitions.
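As a rough illustration, the sketch below shows what an upsert through Hudi's Spark DataFrame API can look like, assuming Spark is launched with the Hudi Spark bundle on the classpath; the table name, key fields and path are hypothetical.

```python
# Minimal sketch: a Hudi upsert via the Spark DataFrame API.
# Assumes the hudi-spark bundle is available; names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-demo").getOrCreate()

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

updates = spark.createDataFrame(
    [(101, "london", "2024-01-02", 9.5)],
    ["trip_id", "city", "updated_at", "fare"])

# Records whose trip_id already exists are updated; new ones are inserted.
updates.write.format("hudi").options(**hudi_options) \
       .mode("append").save("/lakehouse/trips")
```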
Apache Iceberg is a table format developed by Netflix and open sourced. It is optimized for efficient query performance, schema evolution and versioning. It lacks built-in data compaction (merging many small files into a single larger file).
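The following sketch illustrates Iceberg's schema evolution and snapshot-based reads, assuming Spark 3.3+ with the Iceberg runtime and a catalog named demo already configured; the table and column names are hypothetical.

```python
# Minimal sketch: Iceberg schema evolution and snapshot reads from Spark SQL.
# Assumes an Iceberg catalog named "demo" is configured; names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (country STRING)")

# Versioned reads: pick an older snapshot from the metadata table and query it.
snap = spark.sql(
    "SELECT snapshot_id FROM demo.db.events.snapshots ORDER BY committed_at LIMIT 1"
).first()[0]
old = spark.sql(f"SELECT * FROM demo.db.events VERSION AS OF {snap}")
old.show()
```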
In summary, Delta Lake has an edge in ACID transactions, Hudi in streaming data analytics and Iceberg in data warehouse workloads.
The metadata layer raises the abstraction level of the underlying data. It primarily contains a repository of the data objects in the data lake, abstractions like tables, their metadata such as columns and constraints, and the access rules and permissions. It often includes indexes and caches to improve performance. This layer also offers lineage, search and discovery of data objects.
Apache Atlas is an open source data governance and metadata framework for Hadoop. If the organization is based on the Hadoop ecosystem and has complex data governance requirements, Apache Atlas is an open source option that fits the bill. Alternatively, if you want to build a custom metadata layer, Apache Atlas is extensible and can be a good starting point. Non-Hadoop environments may require additional effort to use Apache Atlas, and hence it may not be the right fit there.
AWS Glue Catalog is the component of AWS Glue responsible for the persistent technical metastore in the AWS cloud. AWS Glue is a serverless data integration service to discover, prepare and combine (ETL) data for analytics, machine learning and application development. Automatic data crawling and cataloguing, data lineage and schema evolution are some of the strengths of AWS Glue Catalog. It is the right fit for a Lakehouse built using AWS services alone; however, it may not be a good fit for multi-cloud and hybrid cloud scenarios or for integration with non-AWS services.
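As an illustration, the sketch below uses boto3 to read back the technical metadata a Glue crawler has registered; the region and database name are hypothetical.

```python
# Minimal sketch: listing tables and columns from the AWS Glue Data Catalog.
# Assumes AWS credentials are configured; region and database are hypothetical.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Walk the tables the crawler discovered in a hypothetical "sales" database.
for table in glue.get_tables(DatabaseName="sales")["TableList"]:
    columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], columns)
```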
Unity Catalog from Databricks is a component of Databricks specifically built as a metadata and governance solution for unified AI and data on the Lakehouse. Data artefacts like ML models also come under its governance purview. Fine-grained access control at the table, row or column level using ANSI SQL alleviates the need to learn a different language. The solution is cloud-agnostic and hence fits well in hybrid or multi-cloud environments. The centralized metadata store can also search and discover data from external sources like MySQL and Azure Synapse.
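The sketch below illustrates the ANSI SQL flavour of Unity Catalog access control, assuming it runs in a Databricks environment where a spark session is available; the catalog, schema, table and group names are hypothetical.

```python
# Minimal sketch: SQL-based access control with Unity Catalog on Databricks.
# Assumes a Databricks notebook/job where `spark` exists; names are hypothetical.

# Grant read access on a single table to an analyst group.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Column-level protection via a dynamic view that masks a sensitive column
# for everyone outside a privileged group.
spark.sql("""
CREATE OR REPLACE VIEW main.sales.orders_masked AS
SELECT order_id,
       CASE WHEN is_account_group_member('finance') THEN amount ELSE NULL END AS amount
FROM main.sales.orders
""")
```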
Google Cloud Data Catalog is a metadata management service within Dataplex, a data fabric that unifies distributed data and automates data management and governance for that data. Google Cloud Data Catalog by itself does not provide governance. This option is the right fit in a Google Cloud ecosystem for data catalog and metadata management; governance must be provided through some other means.
Microsoft Purview is a metadata management and Governance, Risk and Compliance (GRC) service from Microsoft. It can manage data in a hybrid on-premises and cloud model, and the data assets can be Azure Storage, Power BI, SQL, Databricks and so on. It is geared more towards business metadata.
All the options discussed provide metadata management services and varying degrees of governance capabilities. Each has unique strengths and operates best within its target ecosystem. You can also choose a combination of them, for example Unity Catalog and Microsoft Purview, to get the best of both services.
The compute engine performs the actual processing in the Lakehouse. Apache Spark has become the de facto standard for distributed data processing and analytics. It is an open source unified data processing engine that combines structured, unstructured, graph and streaming data processing. Its APIs support both analytics and machine learning data pipelines, and are offered in multiple languages such as Python, Scala, R and Java. Connectors are available for most components in the big data landscape. All these factors make Apache Spark a popular choice as the compute engine in data Lakehouses.
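The sketch below illustrates this unification on a single Spark session: a batch read, SQL analytics and a small machine learning pipeline, with hypothetical paths and column names.

```python
# Minimal sketch: one Spark session serving batch, SQL and ML workloads.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("spark-compute-demo").getOrCreate()

# Batch: load curated data from the lakehouse storage layer.
orders = spark.read.parquet("/lakehouse/curated/orders")

# SQL analytics on the same data.
orders.createOrReplaceTempView("orders")
spark.sql("SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region").show()

# Machine learning on the same engine, without moving the data elsewhere.
features = VectorAssembler(inputCols=["quantity", "discount"], outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="amount") \
            .fit(features.transform(orders))
print(model.coefficients)
```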
There are several other processing engines that perform one task well, often better than Apache Spark. For example, Presto is another open source distributed SQL query engine that can be faster than Apache Spark. Originally developed by Facebook, Presto is easy to use and deploy for querying data with SQL. Presto does not cater to machine learning use cases, but it can be used alongside Apache Spark to provide interactive analytics for data sources that can be queried using SQL.
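As a simple illustration, the sketch below queries a Presto cluster from Python, assuming the presto-python-client package and a hypothetical cluster endpoint, catalog and schema.

```python
# Minimal sketch: interactive SQL against Presto from Python.
# Assumes the presto-python-client package; endpoint and names are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.com", port=8080,
    user="analyst", catalog="hive", schema="sales")

cur = conn.cursor()
cur.execute("SELECT region, COUNT(*) FROM orders GROUP BY region")
for region, order_count in cur.fetchall():
    print(region, order_count)
```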
Real-time streaming data analytics and complex event processing form one area of use cases that often requires more than what Apache Spark can offer; in that case you can add a specialized stream processing engine to the Lakehouse. Apache Kafka and Apache Flink are popular open source stream processing systems, whereas Amazon Kinesis and Azure Stream Analytics are available for their respective ecosystems.
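As an example of pairing a streaming source with the Lakehouse, the sketch below reads a Kafka topic with Spark Structured Streaming and lands the events in a Delta table; it assumes the Kafka and Delta connectors are on the classpath, and the broker, topic and paths are hypothetical.

```python
# Minimal sketch: streaming ingestion from Kafka into a Delta table.
# Assumes the spark-sql-kafka and delta connectors are available; names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clickstream")
          .load()
          .selectExpr("CAST(value AS STRING) AS payload", "timestamp"))

# Continuously append incoming events to a Delta table in the data lake.
query = (events.writeStream.format("delta")
         .option("checkpointLocation", "/lakehouse/_checkpoints/clickstream")
         .start("/lakehouse/raw/clickstream"))
```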
So far, we have looked at the individual layers in a Lakehouse. This section focuses on some of the popular offerings from the crowded Lakehouse market. As mentioned earlier, vendor-tied Lakehouse offerings from Oracle, Informatica, IBM, Teradata and others are excluded because they are natural extensions of such vendor-centric IT landscapes.
Databricks is one of the pioneers of the data Lakehouse architecture. The Databricks data Lakehouse is built on three popular open source frameworks: Apache Spark, Delta Lake and MLflow. It can be deployed on most of the popular clouds, such as AWS, Microsoft Azure and Google Cloud Platform (GCP). It does not have its own storage layer and depends on object storage like Amazon S3, Azure Blob Storage or Google Cloud Storage. Databricks also offers auto-scaling, security and governance. Delta Sharing allows data sharing across heterogeneous Lakehouses, thereby avoiding vendor lock-in. Databricks Lakehouse is a versatile, all-purpose solution for unified workloads and use cases. You can augment its functionality by adding additional compute engines and/or libraries (e.g. machine learning) to enhance the performance of specific use cases.
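As an illustration of Delta Sharing from the consumer side, the sketch below uses the open delta-sharing Python client; the profile file and the share, schema and table names are hypothetical.

```python
# Minimal sketch: reading a table shared via Delta Sharing.
# Assumes the delta-sharing package; profile and table names are hypothetical.
import delta_sharing

profile = "/secure/config.share"              # credentials issued by the data provider
table_url = profile + "#retail_share.sales.orders"

# Pull the shared table into pandas without copying it into our own lakehouse.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```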
Snowflake Data Cloud is essentially a self-service, cloud-based data warehouse platform that can run on AWS, Microsoft Azure and Google Cloud Platform (GCP). Unlike Databricks, it has its own storage layer. Users can query data using familiar SQL. Snowflake manages compression, partitions, metadata and other data objects automatically and lets users focus on querying data. Similarly, auto-scaling and auto-suspend are also handled automatically. Snowflake is best suited for data warehouse / business intelligence use cases; it is not the right fit for AI / machine learning use cases or for streaming analytics.
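The sketch below illustrates Snowflake's SQL-first interface from Python using the snowflake-connector-python package; the account, credentials and object names are hypothetical.

```python
# Minimal sketch: querying Snowflake from Python with familiar SQL.
# Assumes snowflake-connector-python; account, credentials and names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="analyst", password="***",
    warehouse="BI_WH", database="SALES", schema="PUBLIC")

cur = conn.cursor()
cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
for region, revenue in cur.fetchall():
    print(region, revenue)

cur.close()
conn.close()
```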
Dremio is an easy-to-use, serverless, fast, SQL-based data Lakehouse providing self-service SQL analytics, data warehouse performance and functionality, and data lake flexibility, along with governance. It works on the popular clouds AWS, Azure and GCP and is built on top of Apache Arrow, an open source in-memory columnar data format. While the shared semantic layer is a plus, it does not have direct support for AI/ML; you can add that functionality by connecting to external applications. Dremio can be deployed on-premises or in a private cloud, or you can sign up for the Dremio Cloud SaaS product. If the focus of your use case is simplicity, open source, SQL-based interactivity and excellent query response, Dremio is the best fit.
Microsoft Fabric is an end-to-end analytics solution covering data movement, data lakes, data engineering, data integration, data science, real-time analytics and business intelligence. It has built-in data security, governance and compliance. A data fabric is conceptually a logical network of interconnected data platforms, which can be built using different technologies and architectures ranging from RDBMSs to data Lakehouses. As Microsoft Fabric is a relatively new entry, albeit with promising features, it is better to wait and watch how the industry embraces it.
Data mesh is an architecture, not a product. It uses API integrations across microservices to stitch together systems across the enterprise. Depending on the use case and your existing data infrastructure, you may consider making a mesh of what you already have instead of introducing an additional software product into the mix.
We have discussed a few of the products and technologies available for organizational data management using the Data Lakehouse architecture. As with many decisions, the starting point is to list the requirements and their priorities, along with existing assets and skills. Armed with this vital information, you can explore the available options and arrive at the right fit for your organization’s needs of today and the near future. Reuse, repurpose and rework, in that order.
You can leverage our professional expertise in enumerating your needs and designing the best, future-proofed data management architecture and strategy. We carefully match your current and future requirements with the right ensemble of technologies. The implementation choices are made jointly, with you in the driver’s seat and us as the advisors.