A Machine Learning model is only as good as the data it is fed. This data is often sourced from disparate systems, each with its own errors and idiosyncrasies. Bringing it all into a homogeneous, machine-learning-algorithm-friendly form often takes 80% of a model-building exercise. The flow of data through the steps of ingestion, cleaning, pre-processing and preparation is called a data engineering pipeline; the process itself is called data wrangling or data munging.
Machine Learning models largely follow the process flow below, though each step is itself iterative in nature. Often you need to go back one or more steps and restart with some variation. Visual, exploratory analysis at every stage determines the next steps.
The steps mentioned above clearly involve iterations and backtracking, leading to many combinations; each attempt is aptly called an experiment. Even though the insights generated by the deployed model are what we are ultimately interested in, most of our time is spent in data preparation. Because the combinations quickly become unmanageable, the different data flows have to be managed carefully. Each atomic transformation is considered one step, and each data flow is a data engineering pipeline: a sequence of steps to follow.
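The idea of a pipeline as an ordered sequence of atomic steps can be sketched as plain functions applied in order. This is only an illustrative skeleton; the step names and helper structure here are assumptions, not a specific tool's API.

```python
# Minimal sketch: a pipeline is an ordered list of (name, function) steps.
# Running it applies each step in turn and records the lineage of steps taken.
def run_pipeline(record, steps):
    """Apply each (name, fn) step in order, tracking lineage."""
    lineage = []
    for name, fn in steps:
        record = fn(record)
        lineage.append(name)
    return record, lineage

# Two illustrative atomic transformations.
steps = [
    ("strip_whitespace", lambda r: {k: v.strip() if isinstance(v, str) else v
                                    for k, v in r.items()}),
    ("uppercase_country", lambda r: {**r, "country": r["country"].upper()}),
]

row = {"name": "  Alice ", "country": "us"}
clean, lineage = run_pipeline(row, steps)
# clean == {"name": "Alice", "country": "US"}
```

Keeping each step atomic and named is what later makes versioning and backtracking across experiments tractable.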
Fig 1: Data model building process
Fig 2: Data engineering process
Every data engineering pipeline is unique, but the majority have the following steps:
Identify useful data sources, data samples, lineage, business rules, domains (e.g. Male/Female vs. M/F), the organization's security and encryption rules, and the data owners.
Define the homogenized data structure, including units. For example, you may choose to convert all dates to DD-MON-YYYY, all currencies to USD, and so on.
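A homogenization step like this can be sketched in a few lines. The exchange rates and the list of input date formats below are assumptions for illustration; in practice these come from the sources and business rules identified earlier.

```python
from datetime import datetime

# Illustrative sketch: normalize dates to DD-MON-YYYY and amounts to USD.
RATES_TO_USD = {"EUR": 1.08, "USD": 1.0}   # hypothetical fixed rates
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")    # formats seen in the sources

def to_std_date(raw):
    """Try each known source format; emit the agreed DD-MON-YYYY form."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%d-%b-%Y").upper()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {raw!r}")

def to_usd(amount, currency):
    return round(amount * RATES_TO_USD[currency], 2)

print(to_std_date("2023-01-15"))   # 15-JAN-2023
print(to_usd(100, "EUR"))          # 108.0
```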
Every data source is cleaned before being merged with the others. De-duplication, dropping erroneous values, missing-value imputation, standardization of units, domain standardization and outlier treatment are some common cleaning activities.
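Three of these cleaning activities can be shown on toy records. The data, the valid-age range and the choice of median imputation are illustrative assumptions.

```python
from statistics import median

# Sketch of common cleaning steps on a list of dicts: de-duplication,
# treating erroneous values as missing, and median imputation.
rows = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},      # duplicate
    {"id": 2, "age": None},    # missing
    {"id": 3, "age": -5},      # erroneous
    {"id": 4, "age": 41},
]

# 1. De-duplicate on the identifier, keeping the first occurrence.
seen, deduped = set(), []
for r in rows:
    if r["id"] not in seen:
        seen.add(r["id"])
        deduped.append(r)

# 2. Treat impossible ages as missing (assumed valid range 0-120).
for r in deduped:
    if r["age"] is not None and not (0 <= r["age"] <= 120):
        r["age"] = None

# 3. Impute missing ages with the median of the valid ones.
fill = median(r["age"] for r in deduped if r["age"] is not None)
for r in deduped:
    if r["age"] is None:
        r["age"] = fill
```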
The cleaned data then undergoes further transformations to make it machine learning model-friendly. Common techniques include binning, one-hot encoding, outlier removal, and standardization and normalization of continuous values. These are performed to improve the predictive power of the input data.
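Two of these transformations, one-hot encoding and binning, fit in a short sketch. The category list and bin edges are illustrative choices, not fixed conventions.

```python
# Sketch: one-hot encode a categorical column and bin a continuous one.
CATEGORIES = ["red", "green", "blue"]   # assumed fixed domain

def one_hot(value):
    """Encode a category as a 0/1 vector over the known domain."""
    return [1 if value == c else 0 for c in CATEGORIES]

def bin_age(age, edges=(18, 35, 60)):
    """Return the index of the bin that `age` falls into."""
    for i, edge in enumerate(edges):
        if age < edge:
            return i
    return len(edges)

print(one_hot("green"))  # [0, 1, 0]
print(bin_age(42))       # 2
```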
Data enrichment can be invaluable for dimensionality reduction and for improving the semantic richness of the data. It is an optional step and makes sense only in some cases.
Examples include adding new fields such as business names alongside technical names, demographics joined to customer information, or nearby public transport points added to hotels based on zip code.
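The demographics example amounts to a join against a reference table keyed by zip code. The demographics table and field names below are hypothetical.

```python
# Enrichment sketch: join customer records with a (hypothetical) public
# demographics table keyed by zip code.
demographics = {
    "10001": {"median_income": 72000, "population": 21000},
    "94105": {"median_income": 110000, "population": 8000},
}

customers = [
    {"name": "Alice", "zip": "10001"},
    {"name": "Bob", "zip": "94105"},
]

# Unknown zips simply contribute no extra fields.
enriched = [{**c, **demographics.get(c["zip"], {})} for c in customers]
# enriched[0] now carries median_income and population for zip 10001
```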
By this point the data has gone through several transformations and should be verified for quality, accuracy and consistency.
Not all available data is useful for model building. For example, unique identifiers and nearly unique columns like timestamps do not reveal any patterns. Only columns with high predictive power should be selected for model building. Sometimes we may have to employ feature crosses (the product of two or more input columns) or custom transformations.
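Dropping identifier-like columns and building a feature cross can be sketched as below. The uniqueness threshold and column names are illustrative assumptions.

```python
# Sketch: drop columns whose values are (nearly) all unique, then add a
# feature cross as the product of two remaining columns.
rows = [
    {"id": "a1", "ts": 1, "width": 2.0, "height": 3.0},
    {"id": "a2", "ts": 2, "width": 4.0, "height": 1.0},
    {"id": "a3", "ts": 3, "width": 2.0, "height": 3.0},
]

def nearly_unique(col, threshold=0.9):
    """True if the column's distinct-value ratio exceeds the threshold."""
    values = [r[col] for r in rows]
    return len(set(values)) / len(values) >= threshold

keep = [c for c in rows[0] if not nearly_unique(c)]   # drops "id", "ts"
features = [{c: r[c] for c in keep} for r in rows]

# Feature cross: product of two input columns.
for f in features:
    f["width_x_height"] = f["width"] * f["height"]
```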
Non-numeric input must be converted into numerical vectors before being passed to machine learning models. This is called feature extraction.
Feature extraction is especially useful in processing natural language text, images and videos.
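For natural language text, the simplest form of feature extraction is a bag-of-words count vector. The sketch below is a minimal stand-in for what library tools such as scikit-learn's CountVectorizer do.

```python
# Feature-extraction sketch: turn raw text into bag-of-words count vectors
# over a vocabulary built from the corpus itself.
def build_vocab(docs):
    return sorted({w for d in docs for w in d.lower().split()})

def vectorize(doc, vocab):
    words = doc.lower().split()
    return [words.count(w) for w in vocab]

docs = ["the cat sat", "the dog sat on the cat"]
vocab = build_vocab(docs)          # ['cat', 'dog', 'on', 'sat', 'the']
vectors = [vectorize(d, vocab) for d in docs]
# vectors[1] == [1, 1, 1, 1, 2]
```

Images and videos follow the same principle with different extractors: pixel intensities, embeddings or learned representations take the place of word counts.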
After all the above steps, the data is ready for consumption. Publishing is the end point of the pipeline. Typically the published data is stored for reuse in a data store, often called a feature store. Features may be versioned. Mature organizations build their ML models from the feature store rather than from raw data as far as possible.
With our extensive experience in different yet related fields of data processing, we have built reusable data engineering pipelines and utilities that encapsulate our knowledge, best practices and common activities, along with visualizations. These reusable software libraries, coupled with best-in-class development processes backed by time-tested, template-driven requirements gathering, expedite application development. We have seen development time cut from weeks to days.
Given the complexity and iterative nature of the model-building exercise, version control is extremely important. Our disciplined methodology treats every data engineering pipeline as a versioned process flow, which is in turn defined as a sequence of versioned steps. Each versioned step has versioned dependencies. All steps are defined as code. Data sources are versioned and become the leaf-level dependencies. This versioning metadata is stored along with each experiment. Commonly used features are added to the feature store; depending on the requirements, these may be pre-executed data vectors or executable code with clearly identified versioned dependencies.
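The kind of versioning metadata described can be sketched as a small record stored with each experiment. The field names and fingerprinting scheme below are illustrative, not a specific tool's schema.

```python
import hashlib
import json

# Sketch: every step and data source carries a version, and the whole
# experiment gets a deterministic, content-addressed fingerprint.
experiment = {
    "pipeline": "churn_features",        # hypothetical pipeline name
    "pipeline_version": "2.1.0",
    "steps": [
        {"name": "clean_customers", "version": "1.3.0",
         "deps": [{"source": "crm_export", "version": "2024-01-31"}]},
        {"name": "one_hot_region", "version": "1.0.2", "deps": []},
    ],
}

# sort_keys makes the serialization, and hence the fingerprint, stable.
fingerprint = hashlib.sha256(
    json.dumps(experiment, sort_keys=True).encode()
).hexdigest()[:12]
# Stored alongside the experiment so any run can be reproduced exactly.
```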
Strict adherence to carefully laid-out processes, best practices and security measures makes the entire process manageable and the models reproducible.