EVOLUTION OF DATA MANAGEMENT

From Flat files to Delta Lake

Over the past several decades, data management has evolved from the early days of flat file processing to the sophisticated, cloud-based architectures of today. This article traces the key milestones that have shaped the industry and shows how organizations have adapted to ever-growing data demands.

In this article, you will learn how the initial focus on centralized data storage and relational database management systems (RDBMS) proved inefficient, as heavy analytical workloads hindered transaction speeds. This led to the separation of online transaction processing (OLTP) and online analytical processing (OLAP) workloads, giving rise to complex extract, transform, and load (ETL) workflows and the emergence of the data warehouse as a crucial enterprise-scale solution.

The article then delves into the impact of the big data revolution, as the sheer volume, velocity, and variety of data challenged traditional storage and computing paradigms. Concepts like horizontal scaling, massively parallel processing (MPP), and NoSQL data stores became integral to the data management landscape, while the importance of data governance was widely recognized.

The introduction of data lakes offered a new approach to data storage and processing, enabling the immediate ingestion of disparate data sources. However, the article highlights the shortcomings of this architecture, leading to the development of Delta Lake – an open-source framework that adds structure, ACID compliance, and schema evolution capabilities to the data lake model.

Looking to the future, this article explores the concept of the Data Lakehouse, which combines the strengths of data lakes and data warehouses, and the emerging trends of Data Fabric and Data Mesh, which aim to provide centralized access while decentralizing data ownership. The coming years will likely focus on driving down costs while processing exponentially more data, with self-service solutions and chatbots becoming the primary user interfaces.

Computer data processing emerged in the 1960s and has evolved significantly, moving from flat file processing to modern cloud-based architectures. Initially, organizations aimed for centralized data storage, using an RDBMS to handle both daily transactions and reporting. This approach proved inefficient: heavy analytical loads slowed transactions, and normalized data storage was impractical for aggregated reporting.

To address this disparity, workloads were separated into Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). Complex Extract, Transform, Load (ETL) workflows with intermediate staging areas became crucial in IT landscapes. Data warehouses have served enterprise-scale data effectively for 30 years. The advent of web-scale data, cost-effective hardware, and distributed processing then led to new use cases and solutions.
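As a rough illustration of such an ETL workflow, the sketch below uses Python's built-in sqlite3 module. The table names (oltp_orders, staging_orders, dw_daily_sales) and the sample rows are hypothetical, and a single in-memory database stands in for what would normally be separate transactional and warehouse systems.

```python
import sqlite3

# One in-memory database stands in for three separate systems (an assumption for brevity).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE oltp_orders (id INTEGER PRIMARY KEY, order_date TEXT, amount_cents INTEGER);
    CREATE TABLE staging_orders (order_date TEXT, amount_cents INTEGER);
    CREATE TABLE dw_daily_sales (order_date TEXT PRIMARY KEY, total_cents INTEGER);
""")

# Hypothetical transactional rows written by the day-to-day application.
conn.executemany(
    "INSERT INTO oltp_orders (order_date, amount_cents) VALUES (?, ?)",
    [("2024-01-01", 1000), ("2024-01-01", 2500), ("2024-01-02", 400)],
)

# Extract: copy the rows out of the transactional table into the staging area.
conn.execute(
    "INSERT INTO staging_orders SELECT order_date, amount_cents FROM oltp_orders"
)

# Transform and load: aggregate the staged rows into a reporting-friendly table.
conn.execute("""
    INSERT INTO dw_daily_sales
    SELECT order_date, SUM(amount_cents) FROM staging_orders GROUP BY order_date
""")
conn.commit()

daily_sales = conn.execute(
    "SELECT order_date, total_cents FROM dw_daily_sales ORDER BY order_date"
).fetchall()
print(daily_sales)
```

Reports then query the small aggregated table instead of competing with transactions on the normalized one.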

 

The term Big Data took the data management world and its allied fields by storm. Traditional rules for storage, compute, and database design were put to the test, and rigid schemas became inadequate. Horizontal scaling, Massively Parallel Processing (MPP), distributed computing, cloud storage and computing, and NoSQL data stores became first-class citizens in the data management landscape. Data governance came to be recognized as a core function rather than a line-item responsibility of the IT head.

Artificial Intelligence and Machine Learning became more affordable and required specialized data engineering pipelines. Real-time streaming analytics and Complex Event Processing became integral parts of applications, especially in FinTech. Data management teams took on the additional responsibility of creating and maintaining data engineering pipelines alongside the traditional OLAP pipelines.

Almost all organizations have embraced cloud data storage and compute, whether as hybrid on-premises and cloud combinations or as multi-cloud combinations. Data landscape designs are strongly influenced by the need to stay in sync with cloud provider pricing models. Cloud data warehouses are just one example of how solutions were reinvented to meet changing demands.

The remainder of this article focuses on the changes of the past 15 to 20 years and attempts to look into the future.

Data Lake

The rise of smart devices and real-time data sources like social media added complexity to the ingestion layer, necessitating changes in data architecture to accommodate these evolving usage patterns. Data lakes were introduced to store disparate data sources for immediate and possible future use. They offered several advantages over their predecessors, and many organizations readily implemented them. The key reasons for this adoption are:

  • Immediate ingestion of data without careful up-front design.
  • Extract Transform Load (ETL) gave way to Extract Load Transform (ELT). Only the data that is actually required is transformed.
  • The cloud provided cheaper, distributed storage, while big data tools like Apache Spark made complex multi-engine processing possible at scale.
  • Well suited to use cases like real-time analytics (log monitoring, clickstream analysis) and predictive modelling.
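The ELT idea from the list above can be sketched in a few lines of plain Python. The event records and the page_views function are hypothetical. The point is only that raw data lands unchanged and a transformation runs later, touching just the fields a given report needs.

```python
import json

# Hypothetical raw events, landed in the lake exactly as received (the "EL" of ELT).
# Note that the records do not share an identical set of fields.
raw_zone = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/cart", "ms": 340, "coupon": "X1"}',
    '{"user": "a", "page": "/cart", "ms": 95}',
]


def page_views(raw_lines):
    """Transform at read time: parse each record and use only the 'page' field."""
    counts = {}
    for line in raw_lines:
        record = json.loads(line)
        counts[record["page"]] = counts.get(record["page"], 0) + 1
    return counts


views = page_views(raw_zone)
print(views)
```

The extra "coupon" field is ingested without any schema change, which is the convenience described above, and also the source of the missing rule enforcement that later became a problem.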

The shortcomings of the Data Lake architecture soon became evident. Advantages such as not needing a schema also meant that no rules were enforced. Since a Data Lake is designed for read-only storage of raw files, modifications became impractical. The alternatives were to store many small append-only files or to keep adding full copies of the same file with minor modifications. Moreover, disparate file formats with possibly different interpretations (e.g., different currencies) made data pipelines complex to write and maintain.

Delta Lake

The Delta Lake layer was introduced on top of the Data Lake layer to add schema and insert/update/delete capabilities. Delta Lake is an open-source framework built to work with any cloud and any query engine. It does not do away with the Data Lake but adds structure to it; it can be treated as a “format” for storing data in a Data Lake. Delta Lake provides many advantages over its predecessor. For example:

  • ACID compliance like a traditional RDBMS. A transaction is saved on an all-or-nothing basis, without leaving the data in an inconsistent state.
  • Saves serialized snapshots, thereby providing isolation among concurrent users. One user’s write at time t1 does not affect another user’s read that started before t1 (read consistency).
  • Schema evolution while maintaining backward compatibility.
  • Handles structured and unstructured data, as a data lake does, as well as streaming data.
  • Logs all change details, providing an audit trail.
  • Time travel (querying the data as of a certain time in the past) to support full historic audit trails and reproducible machine learning experiments.
  • Data is kept compacted: several small files are consolidated into a larger file, which improves query performance and reduces storage overhead.
  • Powered by Apache Spark and hence capable of scalable metadata handling.
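To make the snapshot isolation and time travel items above more concrete, here is a toy in-memory model in plain Python. It is not Delta Lake's API: the VersionedTable class and its commit and as_of methods are hypothetical names invented for this sketch, which only illustrates the idea of immutable, numbered table versions.

```python
class VersionedTable:
    """Toy table that keeps every committed version (illustrative, not Delta Lake's API)."""

    def __init__(self):
        self._versions = [()]  # version 0 is the empty table

    @property
    def latest_version(self):
        return len(self._versions) - 1

    def commit(self, rows_to_add=(), delete_where=None):
        """Apply a change on an all-or-nothing basis and return the new version number."""
        current = self._versions[-1]
        kept = tuple(
            row for row in current
            if delete_where is None or not delete_where(row)
        )
        # The new snapshot is fully built before it becomes visible to readers.
        self._versions.append(kept + tuple(rows_to_add))
        return self.latest_version

    def as_of(self, version):
        """Time travel: return the rows exactly as they were at the given version."""
        return list(self._versions[version])


table = VersionedTable()
v1 = table.commit(rows_to_add=[{"id": 1, "amount": 10}, {"id": 2, "amount": 20}])

reader_version = table.latest_version  # a reader pins the version it started on
v2 = table.commit(delete_where=lambda row: row["id"] == 1)  # a concurrent write

old_rows = table.as_of(reader_version)  # the reader still sees both rows
new_rows = table.as_of(v2)              # later readers see the delete
print(len(old_rows), new_rows)
```

Because earlier versions are never modified in place, the reader that started before the delete is unaffected, and any past version remains available for audits or for reproducing an experiment.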

Delta Lake stores data as versioned Apache Parquet files in cloud storage. Alongside the data files that make up a Delta table, it keeps a transaction log that records every commit made to the table. After every few commits, a checkpoint file is written that consolidates the log up to that point, so readers do not have to replay every commit from the beginning. Apache Spark reads the data files and the transaction log together to construct the latest state of the table.
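The interplay of commits and checkpoints can be sketched with a small pure-Python model. This is a simplification and not the actual on-disk format: the ToyLog class, the ("add" / "remove", path) action tuples, and the checkpoint interval of 3 are assumptions made for illustration, and the sketch assumes each checkpoint holds the full set of live data files as of its version.

```python
CHECKPOINT_INTERVAL = 3  # assumed value for this sketch


class ToyLog:
    """Toy transaction log: commits list file-level actions, checkpoints cache table state."""

    def __init__(self):
        self.commits = []      # commits[i] holds the actions of version i
        self.checkpoints = {}  # version -> frozenset of live data files at that version

    def commit(self, actions):
        self.commits.append(list(actions))
        version = len(self.commits) - 1
        if (version + 1) % CHECKPOINT_INTERVAL == 0:
            self.checkpoints[version] = frozenset(self._replay(version))
        return version

    def _replay(self, version):
        # Start from the newest checkpoint at or before the requested version, if any.
        usable = [v for v in self.checkpoints if v <= version]
        if usable:
            start = max(usable)
            live, first = set(self.checkpoints[start]), start + 1
        else:
            live, first = set(), 0
        # Replay only the commits that came after that checkpoint.
        for actions in self.commits[first:version + 1]:
            for op, path in actions:
                if op == "add":
                    live.add(path)
                elif op == "remove":
                    live.discard(path)
        return live

    def snapshot(self, version=None):
        """Return the sorted data files that make up the table at a version."""
        if version is None:
            version = len(self.commits) - 1
        return sorted(self._replay(version))


log = ToyLog()
log.commit([("add", "part-0.parquet")])                                # version 0
log.commit([("add", "part-1.parquet")])                                # version 1
log.commit([("remove", "part-0.parquet"), ("add", "part-2.parquet")])  # version 2, checkpointed
log.commit([("add", "part-3.parquet")])                                # version 3

latest = log.snapshot()
as_of_v1 = log.snapshot(1)
print(latest, as_of_v1)
```

Reading the latest version here touches one checkpoint and one commit rather than all four commits, which is the saving that checkpoints are meant to provide as the log grows.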

Delta Lake has simplified the use of the Data Lake in many ways, but it has its own share of limitations, such as:

  • Lack of fine-grained access control at the logical level, for example at the row, view, or attribute level.
  • Delta Lake does not support multi-table transactions or foreign keys. ACID compliance is limited to mutations of a single table.

Data Lakehouse

The Data Lakehouse combines the best concepts of the data lake and data warehouse models. It is built using three proven open-source frameworks, Apache Spark, MLflow, and Delta Lake, and thereby inherits the advantages of all three.

The Lakehouse architecture is composed of a sequence of layers: Data Lake, Delta Lake, and then a metadata layer. The metadata layer stores a centralized catalog of all data objects, their versions, access control lists, and organizational governance rules. These rules can be applied at the logical table, view, or attribute level rather than at the physical file level.
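A minimal sketch of attribute-level rules held in a central catalog might look like the following. The catalog structure, the sales.orders table, the role names, and the project_for_role function are all hypothetical, and real lakehouse catalogs expose far richer models than this dictionary.

```python
# Hypothetical central catalog: one entry per logical table, with per-role column grants.
catalog = {
    "sales.orders": {
        "columns": ["order_id", "customer_email", "amount"],
        "grants": {
            "analyst": {"order_id", "amount"},
            "support": {"order_id", "customer_email"},
        },
    }
}


def project_for_role(table_name, role, rows):
    """Return the rows with only the columns the catalog grants to the role."""
    entry = catalog[table_name]
    allowed = entry["grants"].get(role, set())
    visible = [column for column in entry["columns"] if column in allowed]
    return [{column: row[column] for column in visible} for row in rows]


rows = [{"order_id": 1, "customer_email": "a@example.com", "amount": 42.0}]
analyst_view = project_for_role("sales.orders", "analyst", rows)
guest_view = project_for_role("sales.orders", "guest", rows)
print(analyst_view, guest_view)
```

The rule is expressed against logical column names, so it keeps working no matter how many physical files the table is split into.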

Data Fabric

Data Fabric is a logical network of interconnected data platforms, built using different technologies ranging from traditional RDBMS to Data Lakehouses. Data Fabric enables centralized access while decentralizing the ownership of the different data platforms.

Figure: Generic Data Fabric Reference

Data Mesh

Data Mesh is a similar approach to federation, implemented differently. Data Mesh uses complex API integrations across microservices to stitch together systems across the enterprise. In contrast, Data Fabric creates a virtualized data layer on top of data sets, removing the need for the complex API and coding work.

Conclusion

We have covered the major milestones over decades of computerized record keeping and analysis. Enterprises have always wanted accurate data and the ability to get the answers they seek. Tremendous advances in hardware and software have kept addressing these fundamental requirements as organizations grew ever larger and faster. The progression from enterprise scale to web scale to machine-generated streaming data (e.g., IoT, system logs) has challenged the brightest IT minds and driven more effective design paradigms. Semantic knowledge graphs, active metadata, and prescriptive or pre-emptive analytics are already becoming mainstream.

Perhaps the next five years will focus on driving down costs while processing orders of magnitude more data. Self-service solutions and chatbots as user interfaces are likely to become the mainstay. The simple ask of any organization, to be able to interact with its systems without depending on the IT department, seems set to become a reality very soon.

Data Management with Y Point

Y Point Analytics specializes in data management, including Data Strategy, Data Integration, Business Intelligence, Data Lakehouse, and Artificial Intelligence. We support several clients in the federal, health care, senior care, and pharma verticals. Our long-standing, repeat customers are the true testimony to our service quality. If you want to know more about how we can help you, please contact us.
