# Build a Python Package from your ML Model

This is the second part of the multi-part series on how to build and deploy a machine learning model -
**building and installing a python package out of your predictive model in Python**

Unless you have pushed a data science model into production, there is a high chance you have not logged anything going on in your model framework. Preparing proper logs is an essential part of good software engineering, and if you are preparing any model for deployment to production, it is imperative that proper logging is in place.

Originally published in Analytics India Magazine

A traditional convolutional layer takes a patch of an image and produces a number (patch -> number). A “transpose convolution” does the reverse: it takes a number and produces a patch of an image (number -> patch). We need this layer to “undo” the convolutions in the encoder, which makes it a natural fit for decoder-type operations.
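The number -> patch idea can be sketched without any framework. Below is a minimal NumPy illustration (the 2x2 kernel, stride 2, and input values are arbitrary choices for the example): each input number "paints" a kernel-sized patch of the output.

```python
import numpy as np

def transpose_conv2d(x, kernel, stride=2):
    """Minimal transpose convolution: each input number 'paints'
    a kernel-sized patch of the output (number -> patch)."""
    k = kernel.shape[0]
    n = x.shape[0]
    # Output size of a transpose convolution: (n - 1) * stride + k
    out = np.zeros(((n - 1) * stride + k, (n - 1) * stride + k))
    for i in range(n):
        for j in range(n):
            # Overlapping patches (stride < k) would simply add up here.
            out[i*stride:i*stride+k, j*stride:j*stride+k] += x[i, j] * kernel
    return out

x = np.array([[1., 2.],
              [3., 4.]])        # 2x2 input
kernel = np.ones((2, 2))        # 2x2 kernel, stride 2 -> 4x4 output
y = transpose_conv2d(x, kernel)
```

With these values, every input number expands into a constant 2x2 patch, so the 2x2 input becomes a 4x4 output; a real decoder layer (e.g. Keras `Conv2DTranspose`) learns the kernel instead of fixing it.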

The Sequential API and the Functional API are the two primary ways to build a Deep Learning model in Keras. The Sequential API lets you build a model step by step, layer by layer. However, it does not allow you to build a model with multiple inputs or multiple outputs. This is something the Keras Functional API can handle.
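As a quick sketch of the contrast (layer sizes here are arbitrary): the Sequential model is a single stack, while the Functional model below merges two separate inputs, something Sequential cannot express.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sequential API: one input, one output, layer by layer.
seq = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Functional API: two inputs, branched and merged into one output.
inp_a = keras.Input(shape=(4,))
inp_b = keras.Input(shape=(6,))
merged = layers.concatenate([
    layers.Dense(8, activation="relu")(inp_a),
    layers.Dense(8, activation="relu")(inp_b),
])
out = layers.Dense(1, activation="sigmoid")(merged)
multi_input = keras.Model(inputs=[inp_a, inp_b], outputs=out)
```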

This post dissects a basic TensorFlow program, highlighting the typical structure of a Deep Learning algorithm implemented in TensorFlow.

This is not intended as a detailed tutorial explaining the functions, but as an overview of the structure of a sample program solving a simple problem. For details on the components, see the following posts:

TensorFlow Mathematical Functions

TensorFlow Optimizer Functions

TensorFlow Loss Functions

CodeUp Goa was organized by PayU at 91Springboards Goa. The theme of the talk was Evolving AI and IoT. I presented on the evolution of AI specific to the lending space in FinTech.

Saving and loading models is one of the key components of building Deep Learning solutions. Not only are they used in model deployment, but also in Transfer Learning: a model already trained on millions or billions of records can be saved and reused by others who just want to deploy it, or who do not have access to huge amounts of training data.
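Keras has its own `model.save()` / `load_model()` utilities; as a framework-agnostic sketch of the same save-and-restore round trip, here is a scikit-learn model serialized with `pickle` (the dataset and file path are arbitrary for the example):

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model on synthetic data.
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the trained model to disk...
path = os.path.join(tempfile.gettempdir(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and load it back later, e.g. inside a deployment service.
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model predicts exactly like the original.
same = (restored.predict(X) == model.predict(X)).all()
```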

The graph generated in a TensorFlow session can be visualized using TensorBoard, which renders the graph model defined in the code in a UI. The standard way is to save the graph to a file on disk, and then load the file via the tensorboard command, which runs TensorBoard on port **6006** (which looks like **g00g**le)

Loss functions are the key to optimizing any machine learning algorithm in TensorFlow. It is important to select the right loss function for the machine learning problem at hand; its output is then fed into the optimizer functions.

A short list, with details, of the most commonly used optimizer functions in TensorFlow. Not all functions are listed here.

This is a summary and brief write-up of common (and not-so-common) mathematical functions in TensorFlow - just a few lines specifying the syntax and how they should be written in TensorFlow.

A classification algorithm using TensorFlow - application of a 4-layer Neural Network architecture to the Sonar Mines & Rocks dataset classification problem. The program is at this location on Github

This post details the terms obtained in SAS output for logistic regression. The definitions are generic and referenced from other great posts on this topic. The aim is to provide a summary of definitions and a statistical explanation of the output obtained from the Logistic Regression code in SAS.

Anomaly Detection is a class of semi-supervised (close to unsupervised) learning algorithms widely used in manufacturing, data centres, fraud detection and, as the name implies, anomaly detection in general. Normally this is used when we have an imbalanced classification problem where, say, y=1 (anomaly) has approximately 20 examples and y=0 has 10,000. An example would be identifying faulty aircraft engines based on a wide range of parameters, where the anomalous data might not be available at all - or, if it is, makes up less than 0.1% of the data.
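One common formulation of this setup fits a Gaussian density to the (mostly normal) data and flags points of very low probability as anomalies. A minimal NumPy sketch, with synthetic data and an arbitrary threshold standing in for one tuned on a labeled validation set:

```python
import numpy as np

def fit_gaussian(X):
    # Fit an independent Gaussian per feature on (mostly normal) data.
    return X.mean(axis=0), X.var(axis=0)

def density(X, mu, var):
    # Product of per-feature Gaussian densities p(x).
    p = np.exp(-(X - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return p.prod(axis=1)

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(10_000, 2))   # y=0: normal operation
anomaly = np.array([[6.0, -6.0]])             # y=1: a rare fault, far from the bulk

mu, var = fit_gaussian(normal)
eps = 1e-6  # threshold: in practice chosen on a labeled validation set
flagged = density(anomaly, mu, var) < eps     # flag low-probability points
```

The anomalous point sits ~6 standard deviations out, so its density falls far below the threshold while essentially all normal points stay above it.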

ROC, or receiver operating characteristic curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. Essentially, it illustrates the ability of the classifier to segregate the classes. A higher AUC (Area Under the Curve) of the ROC denotes a better classifier.
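A small scikit-learn sketch (labels and scores are made up for illustration): a classifier whose scores perfectly separate the classes gets an AUC of 1.0, while sweeping the threshold traces the ROC curve itself.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
# Scores from a well-separating classifier: higher for the positive class.
y_score = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

auc = roc_auc_score(y_true, y_score)            # perfect separation -> 1.0

# The ROC curve: true positive rate vs. false positive rate
# as the discrimination threshold is varied.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```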

Similarity functions are used to measure the ‘distance’ between two vectors, numbers, or pairs. It is a measure of how similar the two objects being compared are: the two objects are deemed similar if the distance between them is small, and vice versa.
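Two of the most common choices can be written in a few lines of NumPy (the example vectors are arbitrary): cosine similarity compares direction, Euclidean distance compares position.

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (very similar), 0.0 = orthogonal (dissimilar).
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # Small distance = similar objects, large distance = dissimilar.
    return np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))

sim_close = cosine_similarity([1, 2, 3], [2, 4, 6])   # parallel vectors -> 1.0
sim_far = cosine_similarity([1, 0], [0, 1])           # orthogonal vectors -> 0.0
dist = euclidean_distance([0, 0], [3, 4])             # 3-4-5 triangle -> 5.0
```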

**Term frequency–inverse document frequency (TF-IDF)** is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval searches, text mining, and user modeling. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
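The definition fits in a few lines of plain Python (the tiny corpus is invented for illustration; real libraries such as scikit-learn's `TfidfVectorizer` use smoothed variants of the same idea):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cats and the dogs".split(),
]

def tf_idf(term, doc, corpus):
    tf = Counter(doc)[term] / len(doc)       # term frequency within the document
    df = sum(term in d for d in corpus)      # number of documents containing the term
    idf = math.log(len(corpus) / df)         # inverse document frequency
    return tf * idf

# "the" appears in every document, so its idf (and weight) collapses to 0;
# "cat" appears in only one document, so it gets a positive weight there.
w_the = tf_idf("the", docs[0], docs)
w_cat = tf_idf("cat", docs[0], docs)
```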

This is a basic and simple LSTM implementation (also called the vanilla LSTM) in Keras. It is a two-layer model with a simple LSTM layer followed by a final Dense layer. The LSTM model performs **Echo Sequence Prediction**.

This post is just a ready reference material for some statistical terms which come up a lot in machine learning. It will be expanded from time to time to keep the data and material relevant.

Apart from classification and regression problems, time series models are a separate entity in themselves, not easily tackled by standard methods and algorithms (well, they can be, after some smart tweaks). The main aim of a time series analysis is to forecast future values of a variable using its past values. Time series models are also very business friendly, and directly solve business problems like “What will my store’s sales be in the next two months?” or “How many customers are going to come to my pizza store tomorrow, so that I can optimize my ingredients?”

Handwritten digit recognition using MNIST data is the absolute first step for anyone starting with CNN/Keras/TensorFlow. It is a well-defined problem with a standardized dataset - not complex, but suitable for running deep learning models as well as other machine learning models (logistic regression, XGBoost, or random forest) to predict the digits.

GloVe stands for Global Vectors for word representation. Previously we picked the context (c) and target (t) words in the window randomly; GloVe makes this selection explicit.

How do we learn the embedding matrix for a task? For example, the task could be to predict the next word in the sequence “I want a glass of orange ????”. Building a neural language model is one way to learn word embeddings.

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

Word Embeddings are at the core of applying RNNs to Natural Language Processing tasks. Embeddings are used to convert words and sentences into ‘numbers’ which the computer can not only understand, but also use for NLP tasks such as Man:Woman = King:? -> Queen.

The typical RNN model works in a way such that past sequence elements affect the next output, while in reality a particular output could be influenced both by the elements before it and the elements after it. Bidirectional RNNs (or BRNNs) take this effect into account in their architecture.

In LSTM, as compared to GRUs, $\Gamma_r$ is not used. In place of $\Gamma_r$, two separate gates $\Gamma_u$ and $\Gamma_f$ (forget gate) are used. Also, $a^{\langle t \rangle} \ne c^{\langle t \rangle} $.

Gated Recurrent Units (GRUs) are a form of RNN which can capture long-range dependencies in sequential data.

The sequence input to RNNs can be really long, and it is quite possible that inputs at the beginning of the sequence determine output units much later, by which time the gradients will not be strong enough to affect the output.

A language model estimates the probability of a sentence (a sequence of words) occurring, and provides a comparison between different possible combinations or variants of similar sentences.
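The simplest concrete instance is a bigram model, sketched below in plain Python on an invented three-sentence corpus: the probability of a sentence is the product of P(next word | current word) estimated from counts.

```python
from collections import Counter

corpus = [
    ["i", "like", "green", "tea"],
    ["i", "like", "black", "tea"],
    ["i", "drink", "black", "coffee"],
]

# Count bigrams so that P(b | a) = count(a, b) / count(a as a bigram start).
bigrams = Counter()
starts = Counter()
for sent in corpus:
    for a, b in zip(sent, sent[1:]):
        bigrams[(a, b)] += 1
        starts[a] += 1

def sentence_prob(sentence):
    # Probability of the whole sequence: product of bigram probabilities.
    p = 1.0
    for a, b in zip(sentence, sentence[1:]):
        p *= bigrams[(a, b)] / starts[a]
    return p

# The model prefers the variant it has seen evidence for.
p_likely = sentence_prob(["i", "like", "black", "tea"])
p_unlikely = sentence_prob(["i", "drink", "green", "tea"])
```

An unseen bigram ("drink green") gets probability zero here; real language models smooth the counts (and neural language models replace the table with a network) to avoid exactly this.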

There are various variations of RNNs, depending on the type of input data provided, the type of output required, and the problem we are trying to solve using Recurrent Neural Networks. Each variation has a different RNN architecture, and the implementation depends on this architecture.

Backpropagation in an RNN is required to calculate the derivatives of all the different parameters for the optimization function using Gradient Descent. The gradient is propagated back through the network across all layers and time steps <1>, <2>, …

Standard neural networks cannot take into account the sequence elements that come before or after a data point. For example, to identify a name in a sentence, we need knowledge of the other words surrounding it. In the sentences below, ‘Teddy’ refers to a name in (1), while it refers to a toy in (2).

The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other, but for many tasks that is a very bad idea. If you want to predict the next word in a sentence, you had better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps (more on this later).

Text mining and analysis is one of the most widely used implementations of data science and deep learning. The challenge in performing text mining stems from the fact that humans and computers perceive text data differently. While a human can figure out the context easily, it is not so easy for computers. Also, an algorithm sees a corpus of text as a matrix of numbers, while we see it through the eyes of ‘language’ structure. There are many more such differences, but what the latest deep learning algorithms have been able to do is simply astonishing!

The XGBoost algorithm belongs to the family of boosting algorithms, which provide better and faster results than traditional classification/regression algorithms. They are widely used as the go-to algorithms for a lot of machine learning tasks.

Some key snippets for date-time manipulation using the datetime functionality of pandas. Useful for time series problems and for feature engineering based on dates.
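A few representative snippets using the pandas `.dt` accessor (the column name and dates are made up for the example): parse strings into timestamps, then derive the date features a model typically consumes.

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2021-01-04", "2021-06-15", "2021-12-25"]})
df["order_date"] = pd.to_datetime(df["order_date"])

# Typical date-based features for time series / tabular models.
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["day_of_week"] = df["order_date"].dt.dayofweek        # Monday = 0
df["is_weekend"] = df["day_of_week"] >= 5
df["days_since"] = (df["order_date"].max() - df["order_date"]).dt.days
```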

LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks.

Clustering can be considered the most important unsupervised learning problem; as with every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects which are “similar” to each other and “dissimilar” to the objects belonging to other clusters.

Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It also undertakes dimensionality reduction, treats missing and outlier values, and handles other essential steps of data exploration, and it does a fairly good job. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.

Model evaluation metrics are used to assess the goodness of fit between model and data, to compare different models in the context of model selection, and to estimate how accurate the predictions (associated with a specific model and data set) are expected to be. The choice of metric depends completely on the type of model and its implementation plan.

Ensemble methods combine a group of predictive models to achieve better accuracy and model stability. Ensemble methods are known to give a significant boost to tree-based models.

There are various metrics for choosing the variable on which a node is split. Different algorithms deploy different metrics to decide which variable splits the dataset in the best way.
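One such metric is Gini impurity, used by CART-style trees; a minimal pure-Python sketch (the toy label lists are arbitrary) scores a candidate split by the weighted impurity of its two child nodes, lower being better:

```python
def gini(labels):
    # Gini impurity: probability of misclassifying a random element
    # if it is labeled according to the class distribution in this node.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(left, right):
    # Weighted impurity of a candidate split; lower is a better split.
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

pure_split = split_gini([0, 0, 0], [1, 1, 1])    # perfect separation -> 0.0
mixed_split = split_gini([0, 1, 0], [1, 0, 1])   # classes still mixed -> high
```

Other algorithms swap in different criteria (entropy/information gain for ID3 and C4.5, variance reduction for regression trees) in exactly this slot.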

Overfitting is one of the key challenges faced while modeling decision trees. If no limit is set on a decision tree, it will give you 100% accuracy on the training set, because in the worst case it will end up making one leaf for each observation. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in 2 ways:
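The memorization effect is easy to demonstrate with scikit-learn (synthetic data; the depth and leaf-size limits below are arbitrary example values): the unconstrained tree reaches 100% training accuracy, while constraining growth keeps the tree small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: grows until it effectively memorizes the training set.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Constrained tree: limits on depth and leaf size stop it from overfitting.
pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_tr, y_tr)

train_acc_deep = deep.score(X_tr, y_tr)   # 100% on training data
```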

Data Visualization is a part of Exploratory Data Analysis - charts and graphs can tell you much more than a simple table or a bunch of numbers can.

Exploratory Data Analysis, or EDA, is the most important part of any project or code related to data, as it helps you understand the data before arriving at any hypothesis.

There are various ways to preprocess data after the basic exploratory analysis - mostly to convert the data into a form that fits the model.

A really cool application of CNNs is Neural Style Transfer for art generation. It merges two images - one **Content** image and one **Style** image - to create a new image which is a combination of the two.

Understanding the algorithms behind Face Recognition and Face Verification technologies, and the associated loss functions and technical details. I will also be building the code from scratch (to be posted separately - this post is mostly algorithms and mathematics) for Face Recognition using CNNs.

**Bounding Box Prediction**

Convolutional Neural Networks are a class of neural network which take into account not just the vector of inputs, but also the spatial arrangement of the data. An example would be transactional data, which can be analyzed using a traditional neural network because the spatial arrangement or order of inputs does not matter, as compared to images, which are almost exclusively analyzed using CNNs, where the arrangement of pixels around each other is of paramount importance (in fact, this is exactly what makes up an image).

A lot of researchers have done great research in proposing and formulating different architectures for CNN which have been proven in different scenarios with different datasets. All of them are constantly evolving, with later ones performing better than the older versions due to novel techniques and emerging algorithms.

There are multiple building blocks in a CNN architecture. With each operation (filters, pooling, convolution, etc.), the dimensions of the output matrix change. It is extremely important to keep track of matrix dimensions to make sure the calculations are done correctly.
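The bookkeeping follows one standard formula, floor((n + 2p - f) / s) + 1, applied per operation; a small sketch tracking a 32x32 input through a few hypothetical layers:

```python
def conv_output_size(n, f, padding=0, stride=1):
    # Standard conv/pooling output size: floor((n + 2p - f) / s) + 1,
    # where n = input size, f = filter size, p = padding, s = stride.
    return (n + 2 * padding - f) // stride + 1

# Track a 32x32 input through a small (hypothetical) stack of layers:
n = 32
n = conv_output_size(n, f=5, padding=0, stride=1)   # 5x5 conv      -> 28
n = conv_output_size(n, f=2, stride=2)              # 2x2 max-pool  -> 14
n = conv_output_size(n, f=3, padding=1, stride=1)   # "same" conv   -> 14
```

The channel depth changes separately: it equals the number of filters in the layer, while the formula above governs only the spatial dimensions.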

Network in Network, more commonly known as **1x1 Convolution**, is used to manipulate the depth of input channels.

Hyperparameter tuning in Deep Learning: learning rate $\alpha$; $\beta$, $\beta_1$, $\beta_2$, $\epsilon$; number of layers; number of hidden units; learning rate decay; mini-batch size.

Batch normalization makes hyperparameter tuning easier and makes the neural network more robust. It also enables training a big network more easily and quickly.

How to keep the derivatives/slopes from becoming too big or too small (exploding or vanishing gradients).

Gradient Descent is widely used as an optimization algorithm for minimizing cost functions. There are various improved versions of the algorithm, like Stochastic Gradient Descent, Gradient Descent with Momentum, RMSprop, and Adam.
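To make one of those variants concrete, here is a minimal NumPy sketch of Gradient Descent with Momentum on an arbitrary quadratic cost (the learning rate, $\beta$, and starting point are example values): the velocity accumulates an exponentially weighted average of past gradients.

```python
import numpy as np

def gd_momentum(grad, x0, lr=0.1, beta=0.9, steps=200):
    # v accumulates an exponentially weighted average of past gradients;
    # the parameter update then follows v instead of the raw gradient.
    x = np.asarray(x0, float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(x)
        x = x - lr * v
    return x

# Minimize f(x, y) = x^2 + 10*y^2, an elongated bowl where plain GD zig-zags.
grad = lambda p: np.array([2 * p[0], 20 * p[1]])
x_min = gd_momentum(grad, [5.0, 5.0])   # converges toward the minimum (0, 0)
```

Swapping the `v` update for a running average of squared gradients gives RMSprop, and combining both averages gives Adam.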

How to make sure the implementation of Backpropagation is correct?
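The standard answer is gradient checking: compare the analytic gradient against a two-sided numerical estimate. A minimal NumPy sketch on a toy cost function (the function and epsilon are example choices):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    # Two-sided difference per component: (f(x+eps) - f(x-eps)) / (2*eps).
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

# Check an analytic gradient for f(x) = sum(x^2), whose gradient is 2x.
f = lambda x: np.sum(x ** 2)
analytic = lambda x: 2 * x

x = np.array([1.0, -2.0, 3.0])
num = numerical_grad(f, x)
ana = analytic(x)

# Relative error: a tiny value means backpropagation is likely correct.
rel_err = np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))
```

In a real network, `f` is the cost as a function of all (flattened) parameters and `analytic` is the gradient produced by backpropagation; a relative error around 1e-7 or smaller is the usual pass criterion.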

Activation functions are required in every neuron of a neural network layer to convert the parametrized linear combination of inputs into a number which can be fed into the next layer. The backpropagation algorithm also uses the derivatives of these activation functions to propagate the error back through the network and minimize the final cost.
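Two of the most common activations and the derivatives backpropagation relies on, as a small NumPy sketch (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative used by backpropagation: sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    # Passes positives through, zeros out negatives.
    return np.maximum(0.0, z)

def relu_prime(z):
    # Derivative: 1 for positive inputs, 0 otherwise.
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 2.0])
a = sigmoid(z)   # activations fed into the next layer
```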

Testing the Github Blog