Data Preprocessing in Python

After basic exploratory analysis, the data usually needs to be preprocessed, mostly to convert it into a form that the model can accept.

Key Data Pre-processing Steps:

  1. Missing Value Imputation
  2. Checking for Correlation among Variables
  3. Checking for MultiCollinearity among Variables (VIF)
  4. Removing certain variables from the dataset
  5. Converting Numeric Data to Normalized form
  6. Label Encoding of Categorical Variables
  7. One-hot encoding of categorical variables
  8. Creating a target variable
  9. Transformation of data to another scale (e.g. log scale)
  10. Splitting data into Training and Testing/Validation Sets

1. Missing Value Imputation

Count the missing values in each variable

data.isnull().sum()  # outputs the number of missing values per variable

Filling the missing values with a value

median = train_df['Age'].median()  # median() skips NaN values by default
train_df['Age'] = train_df['Age'].fillna(median)  # fillna returns a new Series, so assign it back
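
The same idea extends to several columns at once. A minimal sketch, assuming a hypothetical DataFrame `df` with a numeric `Age` column and a categorical `Embarked` column (names borrowed from the Titanic dataset purely for illustration): numeric gaps are filled with the median and categorical gaps with the most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Numeric column: fill with the median of the observed values
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill with the most frequent observed value (the mode)
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df.isnull().sum())  # per-column count of remaining missing values
```

Whether the median or the mode is a sensible fill value depends on the variable, so this is only one of several reasonable choices.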

2. Checking for Correlation among Variables

Correlation between numeric attributes

data.corr()

Plot correlation matrix

import numpy as np
import matplotlib.pyplot as plt

corr = data.corr(method='pearson')
figure = plt.figure(figsize=(20,20))
plot = figure.add_subplot(111)
corr_col = plot.matshow(corr, vmin=-1, vmax=1, interpolation='none', cmap=plt.cm.Spectral_r)
figure.colorbar(corr_col)
columns = corr.columns

ticks = np.arange(0, len(columns), 1)  # one tick per variable instead of a hard-coded 14
plot.set_xticks(ticks)
plot.set_yticks(ticks)
plot.set_xticklabels(columns)
plot.set_yticklabels(columns)
plt.show()

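
For step 3 from the list above (multicollinearity), a minimal sketch of a VIF check. It assumes a hypothetical DataFrame `toy` that holds only numeric predictors, and it computes each VIF as the corresponding diagonal element of the inverse correlation matrix, which equals 1 / (1 - R²) for that predictor regressed on the others. The helper name `vif_table` is made up for this example.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """Return the VIF of each numeric column in df as a pandas Series."""
    corr = df.corr().values
    # The diagonal of the inverse correlation matrix holds the VIFs
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns, name="VIF")

# Hypothetical toy data: x3 is almost an exact sum of x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)
toy = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

vif = vif_table(toy)
print(vif)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of strong multicollinearity; variables flagged this way are candidates for removal in the next step.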

4. Removing certain variables from the dataset

data.drop(['species','variable2'],axis=1, inplace=True)

5. Converting Numeric Data to Normalized form

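from sklearn.preprocessing import StandardScaler, MinMaxScaler
scalerStandard = StandardScaler().fit(X_train)  # learns the mean and standard deviation of each column
rescaledX = scalerStandard.transform(X_train)  # each column now has zero mean and unit variance

scalerMinMax = MinMaxScaler()
rescaledXmm = scalerMinMax.fit_transform(X_train)  # rescales each column to the [0, 1] range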
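
A scaler is usually fitted on the training data only and then reused on the test data, so that no information from the test set leaks into the preprocessing. A minimal sketch, assuming hypothetical toy arrays `X_train_demo` and `X_test_demo`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric columns on very different scales
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_demo = np.array([[2.5, 25.0], [5.0, 50.0]])

# Fit on the training data only
scaler = StandardScaler().fit(X_train_demo)

# Apply the same fitted parameters to both sets
train_scaled = scaler.transform(X_train_demo)
test_scaled = scaler.transform(X_test_demo)

print(train_scaled.mean(axis=0))  # column means of the scaled training data
print(test_scaled)
```

The test rows are scaled with the training mean and standard deviation, so their scaled values need not have zero mean or unit variance.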

6. Label Encoding of Categorical Variables

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
Y_encoded = encoder.fit_transform(Y)

or

label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoder_y = label_encoder.transform(Y)

When a variable has only 2 or 3 categories, it can be encoded with the map function, e.g. in the Titanic dataset

train_df['Gender'] = train_df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
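
For step 7 from the list above (one-hot encoding), a minimal sketch using `pandas.get_dummies`. The DataFrame `df` and its `Embarked` and `Fare` columns are hypothetical toy data chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column
df = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 8.46, 8.05],
})

# Replace the categorical column with one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=["Embarked"], prefix="Embarked", dtype=int)

print(encoded)
```

Unlike label encoding, this does not impose an artificial order on the categories, at the cost of one extra column per category.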

8. Creating a target variable

Converting to a numeric array and then splitting (only if all the values are numeric; no categorical values unless encoded)

array = data.values
X = array[:,0:4]
Y = array[:,4]

or take Y out of the dataset and drop it from the current data

Y = data['species']
data.drop(['species'],axis=1, inplace=True)
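
For step 9 from the list above (transformation to another scale), a minimal sketch of a log transform with NumPy. The `Fare` column is hypothetical toy data; `np.log1p` computes log(1 + x), which is used here as an assumption so that zero values stay defined.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a right-skewed numeric column that contains a zero
df = pd.DataFrame({"Fare": [0.0, 7.25, 71.28, 512.33]})

# log(1 + x) compresses large values and is defined at zero
df["Fare_log"] = np.log1p(df["Fare"])

# expm1 is the inverse, useful for mapping values back to the original scale
restored = np.expm1(df["Fare_log"])

print(df)
```

A plain `np.log` works just as well when every value is strictly positive.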

10. Splitting Data into Training and Testing

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_encoded, test_size=0.2, random_state=42)  # hold out 20% of the rows for testing
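
If the target classes are imbalanced, the `stratify` argument of `train_test_split` keeps the class proportions the same in both parts. A minimal sketch on hypothetical toy arrays `X_demo` and `y_demo`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features, two balanced classes
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

# Hold out 20% of the rows and preserve the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))  # number of test rows per class
```
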
Written on January 4, 2018