Microsoft Azure provides a broad range of services and solutions and has come a long way since its inception in October 2008. Azure combines different cloud offerings: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). Given its global footprint, its hybrid cloud capabilities and the ubiquity of Power BI and Office 365 in enterprises, Azure is a safe go-to choice for many organizations.
The Data Lakehouse is the current standard data architecture for future-proofing an organization's data strategy. This article presents different ways your organization can leverage its existing licenses and infrastructure while future-proofing its data strategy, by building a Lakehouse with the least disruption to the existing work culture.
The Lakehouse paradigm is a relatively new architecture that combines the best concepts from the data lake and data warehouse models. The scalability and flexibility of data lakes, together with the reliability and performance of data warehouses, plus governance, traceability, security and data discoverability, make a Lakehouse a must-have for modern data management.
A Lakehouse is made up of multiple layers. A Lakehouse built on Azure can be fitted to your organization's requirements by choosing the right service at each layer, as discussed below:
The term “compute” refers to the hosting model for the resources that your application runs on. Azure Virtual Machines (Azure VMs) are the choice for the compute layer. You can choose a virtual private cloud (VPC), the public cloud or a hybrid of the two. The most security-sensitive workloads can reside on on-premises infrastructure or a private cloud, with the rest in the public cloud. In addition, hybrid cloud is evolving to cater to edge computing use cases, where data collected from IoT devices can also be processed effectively.
The actual specification of the compute layer, including the number and types of VMs, is highly specific to individual organizational requirements and preferences.
Azure Blob Storage is the best fit for the storage layer. Depending on your data size and usage patterns, you may want to opt for Azure Data Lake Storage Gen2 to add support for big data analytics.
Azure Blob Storage can store massive amounts of structured and unstructured data. It is a scalable, durable, highly available storage service optimized for data lakes. Authentication with Azure Active Directory, role-based access control (RBAC) and encryption at rest keep the data highly secure. Tiered storage that keeps premium (performance-sensitive), hot, cool, cold (rarely used) and archive data under different pricing models reduces cost without compromising performance.
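As an illustrative sketch of how such tiering can be automated, the Python snippet below uses the azure-storage-blob SDK to demote blobs that have not been modified recently to the cool tier. The account URL, container name and 30-day threshold are assumptions for illustration, not recommendations.

```python
# pip install azure-storage-blob azure-identity
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient, StandardBlobTier

# Illustrative values -- replace with your own account and container.
ACCOUNT_URL = "https://<storage-account>.blob.core.windows.net"
CONTAINER = "lakehouse-raw"
COLD_AFTER = timedelta(days=30)  # assumption: "cold" means untouched for 30 days

service = BlobServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())
container = service.get_container_client(CONTAINER)

cutoff = datetime.now(timezone.utc) - COLD_AFTER
for blob in container.list_blobs():
    # Blobs that have not been modified since the cutoff are moved to the Cool tier,
    # which lowers storage cost for data that is rarely read.
    if blob.last_modified < cutoff:
        container.get_blob_client(blob.name).set_standard_blob_tier(StandardBlobTier.COOL)
```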
Azure Data Lake Storage Gen2 is a set of capabilities built on top of the Blob Storage service. It offers massive scalability, Hadoop-compatible access and optimized cost and performance. Its hierarchical directory structure, and the partition pruning it enables, brings dramatic optimization when working with tools like Spark. Attribute-based access control provides finer-grained control at the logical attribute level instead of the file level.
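A minimal PySpark sketch of why the hierarchical namespace matters: when data is laid out as directory partitions, a filter on the partition columns lets Spark prune whole directories instead of scanning the full dataset. The abfss path, container and column names below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

# Assumed layout in ADLS Gen2: .../events/year=2024/month=5/day=12/*.parquet
base_path = "abfss://lake@<storage-account>.dfs.core.windows.net/events"

events = spark.read.parquet(base_path)

# The filter on partition columns (year/month) lets Spark skip whole directories,
# reading only the matching partitions instead of the entire dataset.
may_2024 = events.where("year = 2024 AND month = 5")
may_2024.groupBy("day").count().show()
```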
The term “Delta” refers to technologies related to, or part of, the Delta Lake open source project. Delta Lake adds a transactional layer with schema on top of the data lake, providing ACID transactions, versioning (time travel), schema enforcement and schema evolution.
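The following PySpark sketch, using the open source delta-spark package, illustrates these capabilities. A local path is used to keep the example self-contained; in a Lakehouse this would typically be an abfss:// path in ADLS Gen2, and the table contents are purely illustrative.

```python
# pip install pyspark delta-spark
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/customers"  # placeholder; in the lake this would be an abfss:// path

# The initial write creates the Delta table: Parquet files plus a transaction log.
spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"]) \
    .write.format("delta").mode("overwrite").save(path)

# Appends are atomic, and schema enforcement rejects writes with mismatched columns.
spark.createDataFrame([(3, "carol")], ["id", "name"]) \
    .write.format("delta").mode("append").save(path)

# Time travel: read the table as it looked at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```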
You can implement Delta Lake using either Azure Data Factory (ADF) or Azure Synapse Analytics. Azure Data Factory is Azure's cloud ETL service for data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management. You can write transformations and copy data into and out of Delta Lake stored in Azure Data Lake Storage Gen2 or Azure Blob Storage using the “Delta” format.
Microsoft Purview is a cloud-native family of data governance, risk and compliance (GRC) solutions that helps your organization manage and govern all data assets. It provides a single, unified data management service for data from all sources, in the data lake and in end reporting tools. Insights in Microsoft Purview provide multiple predefined reports to gain a detailed understanding of the data landscape.
The “Microsoft Purview Data Catalog” can automatically capture and describe core characteristics of data at the source, including schema, technical properties and location. The “glossary” in the Data Catalog allows business-friendly definitions of data to be layered on top, improving search and discovery. You can define access control, data ownership and stewardship for catalog and glossary items.
The metadata itself is stored in the storage layer described above.
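For programmatic discovery, the Data Catalog can also be queried over the Purview REST API. The sketch below is an assumption-laden illustration: the API version, request payload and response fields should be verified against the current Purview REST reference before use.

```python
# pip install azure-identity requests
import requests
from azure.identity import DefaultAzureCredential

# Assumed account name and API version -- check the current Purview REST reference.
PURVIEW_ENDPOINT = "https://<purview-account>.purview.azure.com"
SEARCH_URL = f"{PURVIEW_ENDPOINT}/catalog/api/search/query?api-version=2022-08-19-preview"

token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token

# Search the Data Catalog for assets whose name or description mentions "customer".
resp = requests.post(
    SEARCH_URL,
    headers={"Authorization": f"Bearer {token}"},
    json={"keywords": "customer", "limit": 10},
)
resp.raise_for_status()
for asset in resp.json().get("value", []):
    print(asset.get("qualifiedName"), "-", asset.get("entityType"))
```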
Apache Spark has become the de facto standard for unified data processing and is a must-have in the processing layer. The purpose of this layer is to host the data pipelines that transform source data into the form expected by the consumption layer. In addition to Apache Spark, you can choose specialized products or libraries for your organization's unique requirements, for example Presto for distributed SQL queries or Azure Stream Analytics for streaming workloads.
“Apache Spark in Azure HDInsight” is the Microsoft implementation of Apache Spark in the cloud, and is one of several Spark offerings in Azure. “Spark pools in Azure Synapse Analytics” are managed Spark pools that allow data to be loaded, modeled, processed and distributed for analytic insights within Azure. You can also add a “Spark activity in Azure Data Factory”, which is simply a Spark job that runs on an HDInsight cluster.
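Whichever offering hosts it (a Synapse Spark pool, an HDInsight cluster or a Spark activity in ADF), a pipeline in this layer typically reads raw data, cleans and aggregates it, and writes a curated table. The sketch below assumes a bronze/silver lake layout; the paths and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

# Assumed lake layout: a "bronze" zone of raw files and a "silver" zone of curated tables.
bronze = "abfss://lake@<storage-account>.dfs.core.windows.net/bronze/orders"
silver = "abfss://lake@<storage-account>.dfs.core.windows.net/silver/orders_daily"

orders = spark.read.json(bronze)

# Typical transformation work: type casting, de-duplication and a daily aggregate.
daily = (
    orders.withColumn("order_ts", F.to_timestamp("order_ts"))
    .dropDuplicates(["order_id"])
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
)

daily.write.format("delta").mode("overwrite").save(silver)
```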
Azure Synapse Analytics also brings its own compute and visualization capabilities, combining the best of the SQL technologies used in enterprise data warehousing, Apache Spark technologies for big data, and Azure Data Explorer for log and time series analytics.
So far, we have covered the individual layers in the Lakehouse architecture and how to build them using the Azure ecosystem. This section introduces the Lakehouse as a single logical service.
Synapse Analytics is an enterprise analytics service that brings together the SQL technologies used in enterprise data warehousing and the Spark technologies used for big data. It includes Data Explorer for log and time series analytics, and Synapse pipelines for writing ETL/ELT data pipelines with low or no code. Being an Azure-native service, it has deep integration with other Azure services such as Power BI and Azure Machine Learning. In fact, all the Lakehouse layers mentioned earlier are either part of the Synapse Analytics service or well integrated with it.
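One practical benefit of this integration is that the Synapse serverless SQL pool can query Delta files in the lake directly with OPENROWSET, without loading them into a dedicated warehouse first. The Python sketch below, using pyodbc, is illustrative only: the workspace endpoint, authentication mode and storage path are assumptions.

```python
# pip install pyodbc
import pyodbc

# Assumed workspace name and sign-in; the "-ondemand" endpoint is the serverless SQL pool.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
    "UID=<user>@<tenant>;"
)

# OPENROWSET lets the serverless pool read Delta files in the lake in place.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/lake/silver/orders_daily/',
    FORMAT = 'DELTA'
) AS rows
"""
for row in conn.execute(query):
    print(row)
```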
The Azure Databricks Lakehouse Platform is the integration of the Databricks Lakehouse with the Azure cloud platform. It is a single bundle of all the architectural layers discussed in this article. The Databricks Lakehouse is built by Databricks, one of the pioneers of the Lakehouse architecture. It does not include a cloud of its own but can be deployed on most popular clouds, including Microsoft Azure. Databricks also offers auto-scaling, security and governance. Delta Sharing allows data sharing across heterogeneous Lakehouses, thereby avoiding vendor lock-in. The Databricks Lakehouse is a versatile, all-purpose solution for unified workloads and use cases. You can augment its functionality by adding compute engines and/or libraries (e.g. machine learning) to improve performance for specific use cases.
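Delta Sharing has an open source Python client, so a consumer does not even need a Databricks workspace to read shared data. A minimal sketch, assuming the provider has issued a profile file; the share, schema and table names are placeholders.

```python
# pip install delta-sharing
import delta_sharing

# The profile file is issued by the data provider and holds the sharing server URL and token.
profile = "config.share"  # placeholder path

# List the tables the provider has shared with us.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table)

# Load one shared table directly into pandas -- no copy pipeline, no vendor-specific client.
df = delta_sharing.load_as_pandas(f"{profile}#<share>.<schema>.<table>")
print(df.head())
```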
The fundamental focus of a Data Lakehouse is the unification of data silos and different use cases into a single logical data management entity. While the central idea is compelling, it is far too idealistic for many real-world scenarios. For example, the barriers between subsidiaries within the same organization, or the disjoint IT landscapes resulting from mergers and acquisitions, may be best left as they are.
There are two popular approaches to building a network of independent data platforms with clearly defined, mutually accepted data flows between them. One is the “data mesh”, which connects the separate data platforms via complex API integrations. The second, which is gaining a lot of attention these days, is the “data fabric”.
A data fabric is a logical network of interconnected data platforms built using different technologies, ranging from traditional RDBMSs to data Lakehouses. It enables centralized access while decentralizing ownership of the individual data platforms. Essentially, it creates a virtualized data layer on top of the data sets, removing the need for complex API and coding work.
Microsoft unveiled Microsoft Fabric in May 2023, and it is still in preview. It is a single product that encompasses OneLake as the storage layer (built on top of Azure Data Lake Storage Gen2), Data Factory for data ingestion and transformation, Synapse Data Engineering, Synapse Data Science, Synapse Data Warehousing, Synapse Real-Time Analytics, Power BI and Data Activator. Fabric is infused with the Azure OpenAI Service at every layer to help customers leverage the power of generative AI.
The Microsoft Azure cloud services ecosystem has become a core component of many organizational IT landscapes, with the ubiquitous Office 365 as the front-office suite. The Data Lakehouse is the logical next step in the data storage and processing landscape. This article has presented different Azure services and their roles in building the layers of a data Lakehouse. Mapping your organization's current and future requirements to the right mix of services and processes is the key to a successful data platform that delivers on the promise of timely, actionable insights. Such a mapping exercise requires expertise in several inter-related disciplines.
Y Point Analytics specializes in data management, including data strategy, data integration, business intelligence, Data Lakehouse and artificial intelligence. We support clients across the federal, health care, senior care and pharma verticals. Our long-standing, repeat customers are the true testimony to our service quality. If you would like to know more about how we can help you, please contact us.