From Flat files to Delta Lake

Evolution of Data Management

Over the past several decades, the field of data management has undergone a remarkable transformation, evolving from the early days of flat file processing to the sophisticated, cloud-based architectures of today. This article takes you on a captivating journey through the key milestones that have shaped the industry, shedding light on how organizations have adapted to the ever-growing data demands.

In this article, you learn how the initial focus on centralized data storage and relational database management systems (RDBMS) proved inefficient, as heavy analytical workloads hindered transaction speeds. This led to the separation of online transaction processing (OLTP) and online analytical processing (OLAP) workloads, giving rise to complex extract, transform, and load (ETL) workflows and the emergence of the data warehouse as a crucial enterprise-scale solution.

The article then delves into the impact of the big data revolution, as the sheer volume, velocity, and variety of data challenged traditional storage and computing paradigms. Concepts like horizontal scaling, massively parallel processing (MPP), and NoSQL data stores became integral to the data management landscape, while the importance of data governance was widely recognized.

The introduction of data lakes offered a new approach to data storage and processing, enabling the immediate ingestion of disparate data sources. However, the article highlights the shortcomings of this architecture, leading to the development of Delta Lake – an open-source framework that adds structure, ACID compliance, and schema evolution capabilities to the data lake model.

Looking to the future, this article explores the concept of the Data Lakehouse, which combines the strengths of data lakes and data warehouses, and the emerging trends of Data Fabric and Data Mesh, which aim to provide centralized access while decentralizing data ownership. The coming years will likely focus on driving down costs while processing exponentially more data, with self-service solutions and chatbots becoming the primary user interfaces.


Computer data processing emerged in the 1960s and has evolved significantly, transitioning from flat file processing to modern cloud-based architectures. Initially, organizations aimed for centralized data storage, using RDBMS to handle both daily transactions and reporting. However, this approach proved inefficient, as heavy analysis loads hindered transaction speeds, and normalized data storage was impractical for aggregated reporting. 


To address this disparity, workloads were separated into online transaction processing (OLTP) and online analytical processing (OLAP). Complex extract, transform, load (ETL) workflows with intermediate staging areas became crucial in IT landscapes. Data warehouses served enterprise-scale data effectively for 30 years. The advent of web-scale data, cost-effective hardware, and distributed processing then led to new use cases and solutions.


The term big data took the data management world and allied fields by storm. Traditional storage, compute, and database design rules were put to the test, and rigid schemas became inadequate. Horizontal scaling, massively parallel processing (MPP), distributed computing, cloud storage and compute, and NoSQL data stores became first-class citizens in the data management landscape. Data governance came to be recognized as a core function rather than a line item among the IT head's responsibilities.


Artificial intelligence and machine learning became more affordable, requiring specialized data engineering pipelines. Real-time streaming analytics and complex event processing have become integral parts of applications, especially in FinTech. Data management teams now carry the additional responsibility of creating and maintaining these data engineering pipelines alongside the traditional OLAP pipelines.


Almost all organizations have embraced cloud storage and compute, hybrid on-premises and cloud combinations, or multi-cloud setups. Data landscape designs are now heavily shaped by cloud provider pricing models. Cloud data warehouses are just one example of how solutions were reinvented to meet the changing demands.


The remainder of this article focuses on the changes of the past 15 to 20 years, with an attempt to look into the future.

Data lake

The rise of smart devices and real-time data sources like social media added complexity to the ingestion layer, necessitating changes in data architecture to accommodate these evolving usage patterns. Data lakes were introduced to facilitate storage of disparate data sources for immediate and possible future use. They offered several advantages over their predecessors, and many organizations readily implemented them. The key reasons for this adoption were:

  • Immediate ingestion of data without careful upfront design.
  • Extract, transform, load (ETL) gave way to extract, load, transform (ELT): only the required data is transformed.
  • The cloud provided cheap, distributed storage, while big data tools such as Apache Spark made complex, multi-engine processing possible at scale.
  • Well suited for use cases like real-time analytics (log monitoring, clickstream analysis) and predictive modelling.
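
The ELT pattern in the list above can be sketched as a toy in pure Python (all names here are hypothetical and for illustration only): raw records are loaded untouched, and only the fields a particular query needs are transformed at read time.

```python
import json

# Toy ELT: load raw, heterogeneous records as-is; transform only on read.
raw_zone = []  # stands in for cheap object storage (the "load" step)

def load_raw(record: dict) -> None:
    """Extract + Load: persist the record verbatim, with no upfront modeling."""
    raw_zone.append(json.dumps(record))

def read_amounts() -> list:
    """Transform on read: normalize only the field this query needs."""
    out = []
    for line in raw_zone:
        rec = json.loads(line)
        # Records may use different keys or types; reconcile lazily.
        amount = rec.get("amount_usd", rec.get("amount", 0.0))
        out.append(float(amount))
    return out

load_raw({"amount_usd": 10.0, "customer": "a"})
load_raw({"amount": "12.5", "region": "EU"})   # different shape, still accepted
print(read_amounts())  # [10.0, 12.5]
```

The point of the sketch is that ingestion never fails on schema drift; the cost of reconciling shapes is paid only by the queries that need those fields.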

Soon the shortcomings of the data lake architecture became evident. Its advantages, such as no need for a schema, also meant there was no enforcement of rules. Since a data lake is designed for read-only storage of raw files, modifications became impractical: the alternatives were to store many small append-only files or to keep adding full copies of the same file with minor modifications. Moreover, disparate file formats with possibly different interpretations (e.g., different currencies) made data pipelines complex to write and maintain.

Delta lake

The Delta Lake layer was introduced on top of the data lake to add schema and insert/update/delete capabilities. Delta Lake is an open-source framework built to work with any cloud and any query engine. It does not do away with the data lake but adds structure to it; it can be treated as a "format" for storing data in a data lake. Delta Lake provides many advantages over its predecessor. For example:

  • ACID compliance like a traditional RDBMS: a transaction is saved on an all-or-nothing basis, never leaving the data in an inconsistent state.
  • Saves serialized snapshots, thereby providing isolation among concurrent users: one user's write at time t1 does not impact another user's read that started before t1 (read consistency).
  • Schema evolution while maintaining backward compatibility.
  • Handles structured and unstructured data, like a data lake, as well as streaming data.
  • Logs all change details, providing an audit trail.
  • Time travel (querying data as of a certain time in the past) to support full historic audit trails and reproducible machine learning experiments.
  • Data is kept compacted, consolidating many small files into larger ones, which improves query performance and reduces storage overhead.
  • Powered by Apache Spark, and hence capable of scalable metadata handling.
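
The all-or-nothing commits and time travel in the list above can be sketched with a toy versioned store (pure Python, illustrative only; this is not the Delta Lake API): each successful commit publishes a complete new snapshot, so any past version stays readable, and a failed transaction publishes nothing.

```python
import copy

class ToyVersionedTable:
    """Illustrative versioned store: commits are all-or-nothing snapshots."""

    def __init__(self):
        self._versions = [{}]  # version 0 is the empty table

    def commit(self, changes):
        """Apply all changes atomically; on error, no new version appears."""
        snapshot = copy.deepcopy(self._versions[-1])
        for key, value in changes.items():
            if value is None:          # None means delete
                snapshot.pop(key, None)
            else:
                snapshot[key] = value
        self._versions.append(snapshot)  # publish only after full success
        return len(self._versions) - 1

    def read(self, version=None):
        """Time travel: read the latest version, or any earlier one."""
        return self._versions[-1 if version is None else version]

t = ToyVersionedTable()
t.commit({"order-1": 100})
t.commit({"order-1": 120, "order-2": 80})   # update + insert in one commit
print(t.read())            # {'order-1': 120, 'order-2': 80}
print(t.read(version=1))   # {'order-1': 100}
```

Because a snapshot is built off to the side and appended only at the end, a reader holding an older version never observes a half-applied change, which is the essence of the read-consistency guarantee described above.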

Delta Lake stores data as versioned Apache Parquet files in cloud storage. Alongside these files, which together form a Delta table, Delta Lake keeps a transaction log that tracks every commit made to the table. After every few commits, a checkpoint file is written that captures the table's full state at that point, so readers do not have to replay the entire log. Apache Spark internally reads the Delta table files and the transaction log to construct the latest data.
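
The log-plus-checkpoint mechanism just described can be mimicked with a toy in pure Python (illustrative only; the real Delta Lake log stores JSON actions and Parquet checkpoints): commits are replayed on read, except that a periodic checkpoint lets the reader start from the last materialized state.

```python
CHECKPOINT_EVERY = 3  # toy interval; Delta Lake checkpoints every 10 commits by default

commits = []       # ordered log of per-commit changes
checkpoints = {}   # commit count -> full table state at that point

def reconstruct():
    """Build the latest state: start at the last checkpoint, replay later commits."""
    start = max(checkpoints, default=0)
    state = dict(checkpoints.get(start, {}))
    for changes in commits[start:]:
        state.update(changes)
    return state

def write_commit(changes):
    commits.append(changes)
    if len(commits) % CHECKPOINT_EVERY == 0:
        checkpoints[len(commits)] = reconstruct()  # materialize full state

for i in range(7):
    write_commit({f"row-{i}": i})

print(sorted(checkpoints))        # [3, 6] -> checkpoints after commits 3 and 6
print(reconstruct()["row-6"])     # 6 -> checkpoint at 6, plus one replayed commit
```

The reader replays at most `CHECKPOINT_EVERY - 1` commits on top of a checkpoint instead of the whole history, which is why checkpointing keeps reads fast as the log grows.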

Delta Lake has simplified the usage of the data lake in many ways, but it has its own share of limitations, such as:

  • Lack of fine-grained access control at the logical level, for example at the row, view, or attribute level.
  • No support for multi-table transactions or foreign keys: ACID compliance is limited to single-table mutations.

Data Lakehouse

The Data Lakehouse combines the best concepts of the data lake and data warehouse models. It is built on three proven open-source frameworks (Apache Spark, MLflow, and Delta Lake) and thereby inherits the advantages of all three.


The Lakehouse architecture is composed of a sequence of layers: the data lake, Delta Lake, and then a metadata layer. The metadata layer stores a centralized catalog of all data objects, their versions, access control lists, and organizational governance rules. These rules can be applied at the logical table, view, or attribute level rather than at the physical file level.
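
A metadata layer of this kind can be sketched as a toy catalog (pure Python; names like `grants` and the S3 path are hypothetical, and this is not Unity Catalog or any real API) that checks access at the logical table and column level before resolving a query to physical files.

```python
class ToyCatalog:
    """Illustrative metadata catalog: logical objects plus access rules."""

    def __init__(self):
        self.tables = {}   # table name -> {"path": ..., "columns": [...]}
        self.grants = {}   # (user, table) -> set of readable columns

    def register(self, name, path, columns):
        self.tables[name] = {"path": path, "columns": list(columns)}

    def grant(self, user, table, columns):
        self.grants.setdefault((user, table), set()).update(columns)

    def resolve(self, user, table, columns):
        """Return the physical path only if the user may read every column."""
        allowed = self.grants.get((user, table), set())
        denied = [c for c in columns if c not in allowed]
        if denied:
            raise PermissionError(f"{user} may not read {denied} from {table}")
        return self.tables[table]["path"]

cat = ToyCatalog()
cat.register("orders", "s3://lake/orders", ["id", "amount", "ssn"])
cat.grant("analyst", "orders", ["id", "amount"])
print(cat.resolve("analyst", "orders", ["id", "amount"]))  # s3://lake/orders
# cat.resolve("analyst", "orders", ["ssn"]) would raise PermissionError
```

The governance decision happens against logical names before any file is touched, which is exactly the capability the plain Delta Lake layer lacks on its own.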

Data Fabric

Data Fabric is a logical network of interconnected data platforms built using different technologies, ranging from traditional RDBMSs to Data Lakehouses. It enables centralized access while decentralizing the ownership of the individual data platforms.

Figure: Generic Data Fabric reference architecture

Data Mesh

Data Mesh is a similar, but differently implemented, federation solution. A data mesh uses complex API integrations across microservices to stitch together systems across the enterprise. In contrast, a Data Fabric creates a virtualized data layer on top of the data sets, removing the need for that API and coding work.


We have covered the major milestones over decades of computerized record keeping and analysis. Enterprises have always wanted accurate data and the ability to get the answers they seek. Tremendous advances in hardware and software have kept pace with these fundamental requirements amid ever-growing data volumes and velocity. From enterprise scale to web scale to the advent of machine-generated streaming data (e.g., IoT, system logs), each wave has challenged the brightest IT minds and driven more effective design paradigms. Semantic knowledge graphs, active metadata, and prescriptive or pre-emptive analytics are already becoming mainstream.


Perhaps the next five years will focus on driving down costs while processing orders of magnitude more data. Self-service solutions and chatbots as user interfaces are likely to become the mainstay. The simple ask of any organization, to interact with its systems without depending on the IT department, seems set to become a reality very soon.

Data Management with Y Point

Y Point Analytics specializes in data management, including data strategy, data integration, business intelligence, Data Lakehouse, and artificial intelligence. We support clients across the federal, healthcare, senior care, and pharma verticals, and our long-standing, repeat customers are the true testament to our service quality. If you want to know more about how we can help you, please contact us.

Get in Touch