Machine Learning

Building an end-to-end machine learning pipeline and developing a machine learning model involves several stages.

The following list includes the main ML stages:
1- Performing descriptive data exploration
2- Common techniques for data preparation
3- Choosing the right ML model to train data
4- Optimisation techniques
5- Deploying and operating models

1- Performing descriptive data exploration

1- Analyze the data distribution and check for the following:

  • Data types (continuous, ordinal, nominal, or text)
  • Mean, median, and percentiles
  • Data skew
  • Outliers and minimum and maximum values
  • Null and missing values
  • Most common values
  • The number of unique values (in categorical features)
  • Correlations (in continuous features)

2- Analyse how the target variable is influenced by the features and check for the following:

  • The regression coefficient (in regression)
  • Feature importance (in classification)
  • Categorical values with high error rates (in binary classification)

3- Analyse the difficulty of your prediction task.
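
The checks in steps 1 and 2 can be sketched with pandas and scikit-learn. This is only a minimal illustration: the toy DataFrame, its column names (`age`, `income`, `city`, `target`) and the choice of a random forest to estimate feature importance are assumptions made for the example, not part of any particular dataset or method described here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data; replace with your own dataset.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n).astype(float),
    "income": rng.lognormal(mean=10, sigma=1, size=n),  # right-skewed on purpose
    "city": rng.choice(["A", "B", "C"], size=n),
})
# Inject a few missing values so the null check has something to find.
df.loc[rng.choice(n, size=10, replace=False), "age"] = np.nan
# Binary target that depends on income only.
df["target"] = (df["income"] > df["income"].median()).astype(int)

numeric = df[["age", "income"]]

# Step 1: analyse the data distribution.
print(df.dtypes)                  # data types
print(numeric.describe())         # mean, percentiles, minimum and maximum
print(numeric.skew())             # data skew
print(df.isna().sum())            # null and missing values
print(df["city"].value_counts())  # most common values
print(df["city"].nunique())       # number of unique values (categorical)
print(numeric.corr())             # correlations (continuous features)

# Step 2: how the features influence the target (classification case).
X = numeric.fillna(numeric.median())
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, df["target"])
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))
```

For a regression task, the same idea applies with a linear model, whose `coef_` attribute holds the regression coefficients.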

2- Common techniques for data preparation

After the data experimentation phase, you should have gathered enough knowledge
to start preprocessing the data. This process is also often referred to as feature
engineering. It typically includes:

  • Labeling the training data
  • Normalisation and transformation in machine learning

For example, we can run a Python script to normalise data, as the examples later in this section show.

The first step to gaining insight from analytics is ensuring the data is organised, consistent, and workable.
Data wrangling is often considered the most challenging part of analytics, and it is commonly said to take around 80% of the time in a data project.

The first step of modern data science is to bring together structured and unstructured data, in any format, from diverse internal and external sources, and to preprocess it. Data preprocessing means discovering, structuring, cleaning, harmonising, validating, and interpolating data sets in order to derive deep insights from advanced analytics and machine learning.
Several data wrangling techniques can be used to clean up a messy data set before feeding it to the modelling algorithm. The main techniques include:
1- Normalise data
2- Standardise data
3- Remove rows with missing values
4- Impute missing values
5- Dealing with categorical data
The following sections describe each technique, with Python examples where available:

1- Normalise data
If the data set comprises several variables, whether explained or explanatory, many machine learning algorithms can benefit from rescaling (normalising) them so that they all share the same scale, typically the range between 0 and 1. This is useful for algorithms that weight inputs, such as regression and neural networks, and for algorithms that use distance measures, such as K-Nearest Neighbours.

# Rescale data (between 0 and 1)
import numpy
import pandas
from sklearn.preprocessing import MinMaxScaler

# URL or path of the CSV file (not specified here)
url = ""
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]

# fit the scaler on X and rescale every feature to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5, :])


2- Standardise data
Standardisation refers to subtracting the mean from each value and then dividing by the standard deviation (SD). It transforms our data such that the resulting distribution has a mean of 0 and a standard deviation of 1.
We can standardise data using scikit-learn with the StandardScaler class.

# Standardize data (0 mean, 1 stdev)
import numpy
import pandas
from sklearn.preprocessing import StandardScaler

# URL or path of the CSV file (not specified here)
url = ""
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]

# learn the mean and standard deviation of each feature, then apply them
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5, :])


3- Remove rows with missing values
The simplest strategy for handling missing data is to remove records that contain a missing value. The risk of preparing data this way is that the number of observations can drop drastically.

We can do this by creating a new pandas DataFrame with the rows containing missing values removed.
pandas provides the dropna() function, which can be used to drop either columns or rows with missing data.

import numpy
from pandas import read_csv

dataset = read_csv('pima-indians-diabetes.csv', header=None)

# mark zero values as missing or NaN
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, numpy.nan)

# drop rows with missing values
dataset.dropna(inplace=True)

# summarize the number of rows and columns in the dataset
print(dataset.shape)

4- Impute missing values
Another technique to deal with missing values in a dataset is to replace the empty cells in a variable's column with substitute values. The risk of this method is that the substituted values can artificially bias the analysis. There are several ways to choose the replacement. For example, we can use:
• A “constant value” to replace missing values in the column of a variable.
• The mean or mode of the column to replace its missing values.
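
Both options can be sketched with scikit-learn's SimpleImputer class. The small array below is made up for illustration, and the fill value of 0.0 is an arbitrary choice; in practice the right constant depends on the variable.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with two missing cells (illustrative values only).
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [3.0, 30.0],
])

# Replace missing values with a constant.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
X_constant = constant_imputer.fit_transform(X)

# Replace missing values with the column mean.
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)

# Replace missing values with the column mode (most frequent value).
mode_imputer = SimpleImputer(strategy="most_frequent")
X_mode = mode_imputer.fit_transform(X)

print(X_constant)
print(X_mean)
print(X_mode)
```

Fitting the imputer on the training data only, and then reusing it on new data with `transform`, keeps the replacement values consistent between training and scoring.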

5- Dealing with categorical data
Categorical data is very common in business datasets. For example, users are typically described by country, gender, age group, and so on, while products are often described by product type, manufacturer, seller, and so on.

Categorical data is very convenient for people but very hard for most machine learning algorithms. Many machine learning models (e.g., SVM) are algebraic, so their input must be numerical. To use these models, categories must first be transformed into numbers before we can apply the learning algorithm.
While some ML packages transform categorical data to numeric values automatically based on some default embedding method, many others (like our beloved scikit-learn) don’t support such inputs.
We consider three methods to deal with categorical data:

• Encoding to ordinal variables
The order is chosen arbitrarily (for example, the order of appearance in the dataset or alphabetical order). This method rarely makes sense for nominal data: if we encode three US states, namely New York, California, and New Mexico, our algorithm might assume that New Mexico is more similar to New York than to California.
• One-hot encoding (or dummy variables)
This method takes each category value and turns it into a binary vector of size |i| (the number of values in category i), where all columns are equal to zero except the column for that category.
• Feature hashing
Feature hashing is a very cool technique to represent categories in a “one-hot encoding style” as a sparse matrix but with a much lower dimension. In feature hashing, we apply a hashing function to the category and then represent it by the resulting index.
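
The three methods can be sketched with scikit-learn as follows. The list of states is the toy example from the text, and the hash size of 8 is an arbitrary assumption chosen to keep the output small.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

names = ["New York", "California", "New Mexico", "California"]
states = np.array(names).reshape(-1, 1)  # one categorical column

# Ordinal encoding: by default the categories are sorted alphabetically,
# so the resulting integers carry no real meaning.
ordinal = OrdinalEncoder().fit_transform(states)

# One-hot encoding: one binary column per distinct category.
one_hot = OneHotEncoder().fit_transform(states).toarray()

# Feature hashing: each category is hashed into one of n_features columns.
# n_features=8 is an illustrative choice; collisions become more likely
# as the number of categories approaches the number of columns.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[name] for name in names]).toarray()

print(ordinal.ravel())
print(one_hot)
print(hashed)
```

Note that the hashed values can be +1 or -1, because FeatureHasher alternates the sign by default to reduce the effect of collisions.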

Method: Python
Library: scikit-learn, using MinMaxScaler.

Common normalisation and transformation techniques include:

  • Scaling to unit length or standard scaling
  • Minimum/maximum scaling
  • Mean normalization
  • Quantile normalization

3- Choosing the right ML model to train data

Similar to data experimentation and preprocessing, training an ML model is an analytical, step-by-step process.

  • Define your ML task
  • Pick a suitable model to perform this task
  • Pick a train-validation split
  • Pick or implement an error metric
  • Train a simple model using cross-validation
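
The steps above can be sketched with scikit-learn. This is a minimal example under stated assumptions: the data is synthetic (generated with `make_classification`), the task is binary classification, logistic regression stands in for "a simple model", and accuracy is used as the error metric. None of these choices comes from a specific project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Define the ML task: binary classification on synthetic data.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Pick a split: hold out 20% of the data for the final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Pick a suitable (simple) model.
model = LogisticRegression(max_iter=1000)

# Pick an error metric and train with 5-fold cross-validation.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print("cross-validation accuracy:", cv_scores.mean())

# Refit on the full training set and evaluate on the held-out data.
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print("held-out accuracy:", test_accuracy)
```

The cross-validation score gives a baseline to compare against when the features or the model are changed later.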

Once these choices are made and a simple model is trained, you can go back to the fun part: improving the model performance through data analysis, feature engineering, and data preprocessing.

  • Choosing an error metric
  • The training and testing split
  • Achieving great performance using tree-based ensemble models
  • Modeling large and complex data using deep learning techniques

4- Optimisation techniques

If we have trained a simple ensemble model that performs noticeably better than
the baseline model and achieves acceptable performance according to the expected
performance estimated during data preparation, we can progress with optimization.

  • Hyperparameter optimization
  • Model stacking
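
Both techniques can be sketched with scikit-learn. The example below assumes synthetic data, a random forest as the ensemble model, a deliberately tiny parameter grid, and logistic regression as the second base model and as the final estimator of the stack. All of these are illustrative choices rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameter optimization: exhaustive search over a small grid,
# scored with 3-fold cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)

# Model stacking: the base models' cross-validated predictions become
# the inputs of a final estimator.
stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(random_state=0, **search.best_params_)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print("stacked accuracy:", stack_accuracy)
```

Grid search grows quickly with the number of parameters, so larger search spaces are often explored with random or Bayesian search instead.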


5- Deploying and operating models

Once you have trained and optimized an ML model, it is ready for deployment.

The deployment and operation of an ML pipeline is best evaluated by testing
the model on live data in production. Such tests collect insights and data that are used to
continuously improve the model. Hence, tracking model performance over time is an
essential step in guaranteeing and improving the performance of the model.
In general, we distinguish between two architectures for ML-scoring pipelines:

  • Batch scoring using pipelines
  • Real-time scoring using a container-based web service
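
The difference between the two architectures can be sketched in plain Python. This is a simplified illustration, not a production setup: the model is a logistic regression trained on synthetic data, `batch_score` and `score_request` are hypothetical helper names, and the JSON request format is an assumption. In a real deployment, the batch function would run inside a scheduled pipeline, and the request handler would sit behind a web framework inside a container.

```python
import json
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model and persist it, as a stand-in for a registered model.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(LogisticRegression(max_iter=1000).fit(X, y), model_path)


def batch_score(path, rows, batch_size=50):
    """Batch scoring: load the model once, then score the data in chunks."""
    model = joblib.load(path)
    predictions = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        predictions.extend(model.predict(chunk).tolist())
    return predictions


def score_request(model, body):
    """Real-time scoring: turn one JSON request into one JSON response."""
    features = np.asarray(json.loads(body)["features"], dtype=float)
    return json.dumps({"predictions": model.predict(features).tolist()})


# Batch: score the whole dataset in one run.
batch_predictions = batch_score(model_path, X)

# Real-time: the service loads the model at start-up and reuses it per request.
service_model = joblib.load(model_path)
response = score_request(service_model, json.dumps({"features": X[:2].tolist()}))
print(response)
```

Logging the inputs and predictions of either function over time is one simple way to collect the model performance data mentioned above.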