Exploratory Data Analysis in Python

Exploratory Data Analysis, or EDA, is the most important part of any data project, as it helps you understand the data before forming any hypothesis.

Without too much lecturing, let's jump into the code :)

Python EDA Steps:

Read Data

Read data from CSV, Excel, TXT, or .data files (parsing dates where needed); inspect it with info(), describe(), shape, and head()

Data Frame Slicing:

Separate the target variable; remove the ID variable
Merge the train and test datasets, adding a flag column to tell them apart
Separate the train and test datasets again based on the flag
Add a new variable by combining two existing ones
Convert the date format to yyyy-mm-dd; extract columns for year, month, and day
Filter on multiple conditions (a sketch of these steps follows this list)
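
A minimal sketch of these slicing steps; the train/test frames and the ID, date, and target column names are hypothetical, not from the iris example used below:

import pandas as pd

# Hypothetical train/test frames with an ID, a date string, and a target
train = pd.DataFrame({'ID': [1, 2], 'date': ['2018-01-04', '2018-01-05'],
                      'x': [1.0, 2.0], 'target': [0, 1]})
test = pd.DataFrame({'ID': [3], 'date': ['2018-01-06'], 'x': [3.0]})

target = train.pop('target')                      # separate the target variable
train = train.drop('ID', axis=1)                  # remove the ID variable
test = test.drop('ID', axis=1)

train['is_train'] = 1                             # flag column
test['is_train'] = 0
combined = pd.concat([train, test], ignore_index=True)

combined['date'] = pd.to_datetime(combined['date'])   # yyyy-mm-dd
combined['year'] = combined['date'].dt.year
combined['month'] = combined['date'].dt.month
combined['day'] = combined['date'].dt.day

combined['x2'] = combined['x'] * combined['day']  # new variable from two existing ones

train = combined[combined['is_train'] == 1].drop('is_train', axis=1)
test = combined[combined['is_train'] == 0].drop('is_train', axis=1)
filtered = train[(train['x'] > 1.0) & (train['month'] == 1)]   # multiple conditions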

Individual Variable Analysis:

value_counts() for individual variables
Separate lists of numerical and categorical variables; store the column names in new variables
Count missing values, sort by count, store in a new table
Fill missing values: numeric by mean, categorical by mode
Missing value imputation done separately for the numerical and categorical variables
Correlation between numeric variables, with a plot
Drop columns with correlation > 0.7
Outlier detection: delete a specific data point

Visualize: [use both matplotlib (plt) and seaborn]

Plot the target variable: histogram / distplot
QQ plot for checking normality
Category vs. quantitative boxplots (multiple)
Plots with a secondary axis
Relationship of the target variable to all variables: pairplots / boxplots
(a sketch of these plots follows this list)
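
A minimal sketch of these plots on the iris data, assuming the Spec column created in the sections below; scipy's probplot stands in for the QQ plot, and distplot is the seaborn call of this era (newer versions use histplot/displot):

import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

sns.distplot(data['sepal_length'])                    # histogram + KDE
plt.show()

stats.probplot(data['sepal_length'], dist='norm', plot=plt)   # QQ plot for normality
plt.show()

sns.boxplot(x='Spec', y='sepal_length', data=data)    # category vs. quantitative
plt.show()

fig, ax1 = plt.subplots()
ax1.plot(data.index, data['sepal_length'], color='b')
ax2 = ax1.twinx()                                     # secondary y-axis
ax2.plot(data.index, data['sepal_width'], color='r')
plt.show()

sns.pairplot(data, hue='Spec')                        # all pairwise relationships
plt.show()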

Data Preprocessing for modeling:

Split into train and test based on a split ratio
Categorical: label encoding and one-hot encoding (IMPORTANT)
Label encoding; frequency encoding
Normalizing the numerical variables: StandardScaler, MinMaxScaler
(a sketch follows this list)
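
A minimal sketch with scikit-learn and pandas, assuming the iris frame with the Spec column and the numeric column list from the sections below; the encodings are alternatives, shown one after another:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

data['Spec_freq'] = data['Spec'].map(data['Spec'].value_counts())   # frequency encoding

le = LabelEncoder()
data['Spec_label'] = le.fit_transform(data['Spec'])                 # label encoding

data = pd.get_dummies(data, columns=['Spec'])                       # one-hot encoding

data[numeric] = StandardScaler().fit_transform(data[numeric])       # or MinMaxScaler()

train, test = train_test_split(data, test_size=0.3, random_state=42)  # 70/30 split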

Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline  # for Jupyter notebooks

Data Loading

data = pd.read_csv("iris.csv")
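
The outline above also mentions parsing dates and reading other formats; a few hedged variants (all file names are hypothetical):

data = pd.read_csv("sales.csv", parse_dates=['date'])   # parse a date column while reading
data = pd.read_excel("sales.xlsx", sheet_name=0)        # Excel sheet
data = pd.read_csv("log.txt", sep='\t')                 # tab-delimited text file
data = pd.read_csv("iris.data", header=None)            # UCI-style .data file without a header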

Basic data exploration statements

data.head()
data.shape     # number of rows and columns
data.info()    # column dtypes, non-null counts, memory usage
data.describe()  # count, mean, std, min, percentiles, max (numeric columns)

Map Function with a dictionary

# Map species names to short codes
# (avoid naming the dictionary 'dict' -- that shadows the built-in)
species_map = {'setosa': 'SE', 'versicolor': 'VE', 'virginica': 'VI'}
data["Spec"] = data['species'].map(species_map)

Drop a column

data.drop('species', axis=1, inplace=True)   # keep Spec for the sections below

Delete a specific Row

# Delete the row(s) where sepal_length = 5.1 and sepal_width = 3.5
data.loc[(data['sepal_length'] == 5.1) & (data['sepal_width'] == 3.5), :]   # inspect the rows first

data.drop(data[(data['sepal_length'] == 5.1) & (data['sepal_width'] == 3.5)].index, inplace=True)

Merge dataframes data and target column by column (side by side)

target = data['Spec']
data = data.drop('Spec', axis=1)            # 'species' was already dropped above
data = pd.concat([data, target], axis=1)    # re-attach side by side

Extract SE (and SE+VI) dataframes

SE = data.loc[data['Spec'] == 'SE']                  # rows of a single species
SEVI = data.loc[data['Spec'].isin(['SE', 'VI'])]     # rows of two species

Counts

SEVI.Spec.unique()
data.Spec.value_counts()

Separate Numerical and Categorical Variables

categorical = data.select_dtypes(include=['object']).columns
numeric = data.select_dtypes(include=['float64','int64']).columns

data.loc[:,categorical].head()

Missing Value Counts

data.isnull().sum()
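
Per the outline, a minimal sketch that sorts the counts and fills numeric columns by mean and categorical by mode (iris has no missing values, so this is purely illustrative):

missing = data.isnull().sum().sort_values(ascending=False).to_frame('n_missing')

for col in numeric:
    data[col] = data[col].fillna(data[col].mean())      # numeric: impute with mean
for col in categorical:
    data[col] = data[col].fillna(data[col].mode()[0])   # categorical: impute with mode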

Correlation

corr = data[numeric].corr()   # correlation matrix of the numeric columns
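
Per the outline, a heatmap of the correlation matrix and a simple drop of one column from each highly correlated pair, using the 0.7 cutoff mentioned above:

import numpy as np
import seaborn as sns

sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

# Keep only the upper triangle, then drop one column from each pair with |corr| > 0.7
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
data = data.drop(to_drop, axis=1)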

Aggregate Functions

Groupby

data.groupby('Spec')['sepal_length'].mean()
data.groupby('Spec')[['sepal_length', 'sepal_width']].sum()
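
A hedged extension: agg computes several statistics per group in one call:

data.groupby('Spec')[['sepal_length', 'sepal_width']].agg(['mean', 'min', 'max'])
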
Written on January 4, 2018