
Build a Python Package from your ML Model

This is the second part of the multi-part series on how to build and deploy a machine learning model: building and installing a Python package out of your predictive model.


The first part on building pipelines can be read here

The first part covers how to rewrite your model code as a sklearn pipeline for easier understanding, management, and editing. A model can be deployed without the pipeline structure, but it is best practice to build pipelines and separate the different parts of the code (config, preprocessing, feature engineering, data, and tests).
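To recap, a minimal sketch of what such a pipeline looks like; the step names and Lasso parameters here are illustrative placeholders, not the repo's actual code:

# A minimal, illustrative sklearn pipeline; the step names and the
# Lasso parameters are placeholders, not the repo's actual code.
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

price_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # fill missing values
    ('scaler', StandardScaler()),                   # standardize features
    ('model', Lasso(alpha=0.005, random_state=0)),  # final estimator
])

# price_pipe.fit(X_train, y_train) learns all steps end to end;
# price_pipe.predict(X_test) applies them in the same order.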

This post builds on the pipeline code from the previous article. If you had difficulty following it, you can read up on building sklearn pipelines on the Internet, and then look at the GitHub repos for each stage of package building.

Part 1: Organizing code in pipelines and training the model

The directories are restructured as in the image below

[Image: restructured directory layout]

This is just a part of the code, which uses three main files: pipeline.py, preprocessors.py, and train_pipeline.py. Apart from these, train.csv and test.csv are stored in the folder packages/regression_model/datasets.

Every folder must have an __init__.py file (they are not present in the GitHub repo).

The GitHub repo for Part 1 is here

Details of directories:

packages: root folder containing the package.

regression_model: name of the package.

datasets: test.csv and train.csv, the Kaggle Housing Prices dataset downloaded from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

trained_models: the place where trained models are saved as .pkl files.

Files:

pipeline.py: builds the pipeline with all the operations.
preprocessors.py: all the fit and transform classes used in the pipeline (a minimal sketch of such a class follows this list).
train_pipeline.py: trains the model and saves it to disk.
requirements.txt: all the necessary packages, with versions, that need to be installed.
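
For orientation, the transformers in preprocessors.py follow the standard sklearn fit/transform pattern. Here is a minimal hypothetical example; the class and variable names are mine, not the repo's:

# An illustrative sklearn-compatible transformer of the kind
# preprocessors.py contains; the names here are hypothetical.
from sklearn.base import BaseEstimator, TransformerMixin


class CategoricalImputer(BaseEstimator, TransformerMixin):
    """Fill missing values in categorical variables with 'Missing'."""

    def __init__(self, variables=None):
        self.variables = variables if isinstance(variables, list) else [variables]

    def fit(self, X, y=None):
        # nothing to learn here, but fit() must exist so the class
        # can be used as a pipeline step
        return self

    def transform(self, X):
        X = X.copy()  # never mutate the caller's data
        for var in self.variables:
            X[var] = X[var].fillna('Missing')
        return X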

Prerequisites before training and running the model

Create a new environment

pip install virtualenv
virtualenv my_env_name
source my_env_name/bin/activate

Building a new environment is recommended for various reasons. Read about it here.

Add your directory to PYTHONPATH

Here is how to do it on a Mac (search for other OSes; it is quite straightforward):

1. Open Terminal.app
2. Open the file ~/.bash_profile in your text editor - e.g. atom ~/.bash_profile
3. Add the following line to the end, using your actual project path: export PYTHONPATH="/path/to/your/project"
4. Save the file and close the terminal
5. Reopen the terminal and test: $ echo $PYTHONPATH

Installing dependencies: run the following command with the correct location of the requirements.txt file

$ pip install -r requirements.txt

Running the model (training):

$ python packages/regression_model/train_pipeline.py

[Image: output of the training run]

Output: a new file, regression_model.pkl, is generated in the packages/regression_model/trained_models folder.
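
As a quick sanity check that the artifact is usable, you can load it back in a Python shell. This assumes the pipeline was persisted with joblib (typical for sklearn pipelines, though the repo may use plain pickle):

import joblib

# hypothetical sanity check, not part of the repo
pipe = joblib.load('packages/regression_model/trained_models/regression_model.pkl')
print(type(pipe))  # expect sklearn.pipeline.Pipeline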

Part 2: Restructuring the project, making predictions and writing tests

The project needs to be restructured (the reason will become clear when building the package) so that we have a separate package directory with its own requirements.txt file, as well as a separate test module for testing the model before deployment.

GitHub repo for part 2 is here

Folder Structure

[Image: folder structure for Part 2]

Note the new structure - there is a regression_model folder inside regression_model, which in turn sits inside packages.

The GitHub repo does not include __init__.py files; please add them (blank files, no content) before running.

Adding the tests folder is covered just after this block; PyTest needs to be installed for this.

Major Changes

packages/regression_model/regression_model/config/config.py:

A config file holding all the fixed variable names, the feature lists, the names of the train and test data files, and the target variable. This is done to clean up the code and make it more readable. Also, if something needs to be changed (say, the name of a file, or removing a feature), it can be done in one place rather than by going through the whole code.

Using the config files:

from regression_model.config import config
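
For illustration, config.py might look roughly like this; the constants below are hypothetical examples (with the real Kaggle target column SalePrice), not the repo's exact contents:

# hypothetical sketch of config.py
import pathlib

import regression_model

PACKAGE_ROOT = pathlib.Path(regression_model.__file__).resolve().parent
DATASET_DIR = PACKAGE_ROOT / 'datasets'
TRAINED_MODEL_DIR = PACKAGE_ROOT / 'trained_models'

TRAINING_DATA_FILE = 'train.csv'
TESTING_DATA_FILE = 'test.csv'
TARGET = 'SalePrice'
FEATURES = ['MSSubClass', 'MSZoning', 'LotArea']  # truncated example list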

packages/regression_model/regression_model/processing/data_management.py: contains functions to load_dataset, save_pipeline, and load_pipeline, which clean up the train_pipeline.py code.

Using data_management.py:

from regression_model.processing.data_management import (
    load_dataset,
    save_pipeline,
)
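
A rough sketch of what these functions do, inferred from their names and how they are used (not the repo's exact code), and reusing the config constants sketched above:

# hypothetical sketch of data_management.py
import joblib
import pandas as pd
from sklearn.pipeline import Pipeline

from regression_model.config import config


def load_dataset(*, file_name: str) -> pd.DataFrame:
    # read a csv from the package's datasets directory
    return pd.read_csv(config.DATASET_DIR / file_name)


def save_pipeline(*, pipeline_to_persist: Pipeline) -> None:
    # persist the fitted pipeline into trained_models
    save_path = config.TRAINED_MODEL_DIR / 'regression_model.pkl'
    joblib.dump(pipeline_to_persist, save_path)


def load_pipeline(*, file_name: str) -> Pipeline:
    # load a previously persisted pipeline
    return joblib.load(config.TRAINED_MODEL_DIR / file_name)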

Training the model (ensure you have added your project directory to PYTHONPATH as explained earlier):

$ python packages/regression_model/regression_model/train_pipeline.py

Make Predictions

$ python packages/regression_model/regression_model/predict.py

This will not print anything. To check that the modules are working correctly, test modules have to be added.
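
For context, predict.py exposes a make_prediction function (used by the tests below). A rough, assumed sketch of its shape:

# hypothetical sketch of predict.py
import pandas as pd

from regression_model.processing.data_management import load_pipeline

_price_pipe = load_pipeline(file_name='regression_model.pkl')


def make_prediction(*, input_data) -> dict:
    # accept records-oriented JSON, run it through the pipeline
    data = pd.read_json(input_data, orient='records')
    predictions = _price_pipe.predict(data)
    return {'predictions': predictions}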

Testing

A new directory for tests at packages/regression_model/tests

[Image: tests directory structure]

test_predict.py contains the code for testing the model

Add to requirements.txt:

# testing
pytest>=4.6.6,<5.0.0

Writing tests is optional, but it is always recommended: tests ensure that your model does not break after any major or minor change.

Read more about tests here

Contents of test_predict.py - just check that the first prediction is correct:

import math

from regression_model.predict import make_prediction
from regression_model.processing.data_management import load_dataset


def test_make_single_prediction():
    # Given
    test_data = load_dataset(file_name='test.csv')
    single_test_json = test_data[0:1].to_json(orient='records')

    # When
    subject = make_prediction(input_data=single_test_json)

    # Then
    assert subject is not None
    assert isinstance(subject.get('predictions')[0], float)
    assert math.ceil(subject.get('predictions')[0]) == 112476

Running Tests:

$ pytest packages/regression_model/tests -W ignore::DeprecationWarning

[Image: pytest run output]

Part 3: Building the package

At this stage, your code is complete and has passed all the tests. The next step is building a package.

GitHub repo for Part 3 is here

The following files need to be added to the package directory (packages/regression_model/):

MANIFEST.in: specifies which files to include in the package

include *.txt
include *.md
include *.cfg
include *.pkl
recursive-include regression_model *

include regression_model/datasets/train.csv
include regression_model/datasets/test.csv
include regression_model/trained_models/*.pkl
include regression_model/VERSION

include ./requirements.txt
exclude *.log

recursive-exclude * __pycache__
recursive-exclude * *.py[co]
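
If you want to double-check that MANIFEST.in captures what you expect, the third-party check-manifest tool can compare it against your source tree (this requires the project to be under version control, and is an optional extra, not part of the original setup):

$ pip install check-manifest
$ cd packages/regression_model && check-manifest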

setup.py: metadata about the package - description, author, requirements, license information, and other packaging details

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import io
import os
from pathlib import Path

from setuptools import find_packages, setup


# Package meta-data.
NAME = 'regression_model'
DESCRIPTION = 'Train and deploy regression model.'
URL = 'your github project'
EMAIL = 'your_email@email.com'
AUTHOR = 'Your name'
REQUIRES_PYTHON = '>=3.6.0'


# What packages are required for this module to be executed?
def list_reqs(fname='requirements.txt'):
    with open(fname) as fd:
        return fd.read().splitlines()


# The rest you shouldn't have to touch too much :)
# ------------------------------------------------
# Except, perhaps the License and Trove Classifiers!
# If you do change the License, remember to change the
# Trove Classifier for that!

here = os.path.abspath(os.path.dirname(__file__))

# Import the README and use it as the long-description.
# Note: this will only work if 'README.md' is present in your MANIFEST.in file!
try:
    with io.open(os.path.join(here, 'README.md'), encoding='utf-8') as f:
        long_description = '\n' + f.read()
except FileNotFoundError:
    long_description = DESCRIPTION


# Load the package's __version__.py module as a dictionary.
ROOT_DIR = Path(__file__).resolve().parent
PACKAGE_DIR = ROOT_DIR / NAME
about = {}
with open(PACKAGE_DIR / 'VERSION') as f:
    _version = f.read().strip()
    about['__version__'] = _version


# Where the magic happens:
setup(
    name=NAME,
    version=about['__version__'],
    description=DESCRIPTION,
    long_description=long_description,
    long_description_content_type='text/markdown',
    author=AUTHOR,
    author_email=EMAIL,
    python_requires=REQUIRES_PYTHON,
    url=URL,
    packages=find_packages(exclude=('tests',)),
    package_data={'regression_model': ['VERSION']},
    install_requires=list_reqs(),
    extras_require={},
    include_package_data=True,
    license='MIT',
    classifiers=[
        # Trove classifiers
        # Full list: https://pypi.python.org/pypi?%3Aaction=list_classifiers
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: Implementation :: CPython',
        'Programming Language :: Python :: Implementation :: PyPy'
    ],
)
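
setup.py reads the package version from a plain-text VERSION file inside regression_model/, whose entire content is just the version string, e.g. 1.0.0. The package's __init__.py can expose the same value; an illustrative sketch (not necessarily the repo's exact code):

# hypothetical snippet for regression_model/__init__.py
import pathlib

with open(pathlib.Path(__file__).resolve().parent / 'VERSION') as version_file:
    __version__ = version_file.read().strip()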

packages/regression_model/regression_model/requirements.txt:

This is another requirements.txt file, inside the package itself, and it needs to be provided. There are two additional packages that need to be installed for packaging, so make sure you run:

$ pip install -r packages/regression_model/regression_model/requirements.txt

Its contents:
# production requirements
numpy==1.15.4
scikit-learn==0.20.2
pandas==0.23.4

# packaging
setuptools==40.6.3
wheel==0.32.3

# testing requirements
pytest>=4.6.6,<5.0.0

Run the following command to build the source distribution (sdist) and the wheel distribution (bdist_wheel):

$ python packages/regression_model/setup.py sdist bdist_wheel

[Image: output of the build command]

If all goes well, you'll have new files in your directory: a dist/ folder containing the .tar.gz source distribution and the .whl wheel, along with build/ and egg-info directories.

The exact file names will depend on your OS and Python version; this example was built on macOS 10.15.

Your package is now ready to be installed and used, just like a normal Python package.

Install Package

$ pip install -e packages/regression_model/

Use Package

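A quick way to try the installed package from Python, reusing the same calls as the test above:

# load one row of the test set and run it through the installed package
from regression_model.predict import make_prediction
from regression_model.processing.data_management import load_dataset

test_data = load_dataset(file_name='test.csv')
single_test_json = test_data[0:1].to_json(orient='records')
print(make_prediction(input_data=single_test_json))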

The next post will cover some of the best practices (I know, there are a lot of them) - versioning & logging - and how to host this package on the web so that anyone can install it. Future posts will cover deployment as an API, on Heroku and AWS.

Sources:

  • https://www.geeksforgeeks.org/python-virtual-environment/
  • https://realpython.com/python-testing/
  • Udemy Course on Deployment of ML Models Course by Soledad Galli & Christopher Samiullah - If you really want to go deep with proper software writing, logging, CI/CD, Flask and deployment on multiple platforms - You should do this course https://www.udemy.com/course/deployment-of-machine-learning-models/
Written on January 29, 2020