**The first step to gaining insight from analytics is ensuring the data is organised, consistent and workable.** Data wrangling is considered the most challenging part of analytics; it typically takes around 80% of the time in any data project.

Bringing structured and unstructured data together from diverse internal and external sources, in any format, and preprocessing it is the first step of modern data science that can unleash the power of data. Data preprocessing means discovering, structuring, cleaning, harmonising, validating and interpolating data sets in order to derive deep insights from advanced analytics and machine learning.

There are several techniques in the data wrangling stage to manipulate a messy data set and convert it before feeding it to a modelling algorithm. The main techniques include:

1- Normalise data

2- Standardise data

3- Remove rows with missing values

4- Impute missing values

5- Handle categorical data

The following are typical examples of data wrangling using Python:

**1- Normalise data**

If the data set comprises several variables, whether explained (target) or explanatory (input), many machine learning algorithms can benefit from rescaling (normalising) the variables so they all have the same scale, typically the range between 0 and 1. This is useful for algorithms that weight inputs, like regression and neural networks, and algorithms that use distance measures, like k-nearest neighbours.

```python
# Rescale data (between 0 and 1)
import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5, :])
```

**2- Standardise data**

Standardisation refers to subtracting the mean and then dividing by the standard deviation (SD). Standardisation transforms our data such that the resulting distribution has a mean of 0 and a standard deviation of 1.

We can standardise data using scikit-learn with the StandardScaler class.

```python
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]

scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5, :])
```

**3- Remove rows with missing values**

The simplest strategy for handling missing data is to remove records that contain a missing value. The risk of preparing data this way is that the number of observations can drop drastically.

We can do this by creating a new Pandas DataFrame with the rows containing missing values removed.

Pandas provides the dropna() function that can be used to drop either columns or rows with missing data.

```python
from pandas import read_csv
import numpy

dataset = read_csv('pima-indians-diabetes.csv', header=None)

# mark zero values as missing or NaN
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, numpy.nan)

# drop rows with missing values
dataset.dropna(inplace=True)

# summarize the number of rows and columns in the dataset
print(dataset.shape)
```

**4- Impute missing values**

Another technique for dealing with missing values is to replace the empty cells in a variable's column with a substitute value. The risk of this method is that it can artificially bias our analysis, particularly if an arbitrary value is used to replace the missing values. There are several strategies for choosing the replacement. For example, we can use:

• A constant value to replace missing values in the column of a variable.

• The mean or mode to replace missing values in the column of a variable.
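As a sketch, both strategies can be applied with scikit-learn's SimpleImputer class (not used in the listings above; the small array here is purely illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# toy data: two columns with one missing value each
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# replace each missing value with its column mean
# (strategy="constant" with fill_value=... would use a fixed value instead)
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

Here the missing entry in the first column becomes the column mean of 1.0 and 7.0, and the one in the second column becomes the mean of 2.0 and 3.0.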

**5- Categorical data**

Categorical data is very common in business datasets. For example, users are typically described by country, gender, age group and so on, while products are often described by product type, manufacturer, seller and so on.

Categorical data is very convenient for people but very hard for most machine learning algorithms. Many machine learning models (e.g., SVM) are algebraic, so their input must be numerical; to use these models, categories must first be transformed into numbers before we can apply the learning algorithm.

While some ML packages might transform categorical data to numeric automatically based on some default embedding method, many other ML packages don’t support such inputs (like our beloved scikit-learn).

We consider three methods to deal with categorical data:

• Encoding to ordinal variables

The order is typically selected arbitrarily (for example, the order in which values appear in the dataset, or alphabetical order). This method often does not make much sense: if we are encoding three US states, namely New York, California and New Mexico, our algorithm might assume that New Mexico is more similar to New York than California is.
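As a sketch of this method, scikit-learn's OrdinalEncoder assigns each category an integer based on alphabetical order (this class and the state values are illustrative, not from the listings above):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# one categorical column of US states
states = np.array([["New York"], ["California"], ["New Mexico"], ["New York"]])

# categories are ordered alphabetically:
# California -> 0, New Mexico -> 1, New York -> 2
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(states)
print(encoded.ravel())
```

The resulting integers imply an ordering and a notion of distance between states that has no real-world meaning, which is exactly the pitfall described above.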

• One hot encoding (or dummy variables)

This method takes each category value and turns it into a binary vector of size |i| (the number of distinct values in category i), where all columns are zero except the column for that value, which is one.
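A minimal sketch of one hot encoding using the pandas get_dummies function (the state values are illustrative):

```python
import pandas as pd

# one categorical column with three distinct values
df = pd.DataFrame({"state": ["New York", "California", "New Mexico"]})

# each distinct value becomes its own binary column
dummies = pd.get_dummies(df, columns=["state"])
print(dummies)
```

Each row now has exactly one non-zero entry among the three new columns, so no artificial ordering between states is implied.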

• Feature hashing

Feature hashing is a very cool technique to represent categories in a "one hot encoding style" as a sparse matrix but with a much lower dimension. In feature hashing we apply a hash function to the category and then represent it by its index.
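As a sketch, scikit-learn provides this technique as the FeatureHasher class; here the output dimension of 8 and the state values are illustrative choices:

```python
from sklearn.feature_extraction import FeatureHasher

# hash each state name into a fixed-size vector of 8 columns
hasher = FeatureHasher(n_features=8, input_type="string")
states = [["New York"], ["California"], ["New Mexico"]]
hashed = hasher.transform(states).toarray()
print(hashed.shape)
```

Unlike one hot encoding, the number of columns is fixed in advance regardless of how many distinct categories appear, at the cost of possible hash collisions between categories.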

Method: Python

Library: scikit-learn, using MinMaxScaler.