Modern Machine Learning Project Workflow

The first step is already behind you, and you can rightly expect it to open up new possibilities for your startup.
The next phase is planning. Which tools and algorithms match your case best? What kind of infrastructure do you have to invest in? How long will the experiment iterations take?
These are fairly straightforward questions that challenge a team planning Machine Learning features. Let’s assume we have the answers. If we are ready, the next step is development and experimentation.
Do you think the most important questions and doubts are behind us? Not at all. It takes both precise planning and a reliable development process to do the job right. Let’s focus on the second part today.
We might think that a Machine Learning developer creates a baseline model, trains and evaluates it on the dataset, analyses the results and repeats the process with some feedback. That doesn’t sound like a very complex job, at least in terms of IT architecture, and it is true: in the most basic approach, an engineer can repeat the whole process manually. But does that sound like an efficient, stable and safe solution? We can do better.
To find the right workflow for your Machine Learning project, you have to focus on three stages: data management, the model and experiments flow, and deployment. In this article I will cover the first two of them.
Data
First of all, running lots of Machine Learning experiments means dealing with a large volume of data. We often work in a distributed way, with other developers, on different machines and environments. It is not hard to imagine the basic data issues that follow - broken paths to files, differences between data versions and so on - especially after the initial deployment of the model to the client’s environment and the initial validation of results. It is a good idea to get rid of these problems early. The solution we introduced at Netguru is Quilt, a Python library and a cloud service for managing data packages. You can find the project website here: https://quiltdata.com/.
Quilt is a really smart Python package that works smoothly with data science libraries.
The package authors promise that, with this tool, data management can be handled much like code management.
It works in the following way: you upload your dataset first, then in Python you import it like a Python module. Here is a basic example:
$ pip install quilt
$ quilt install uciml/iris
$ python
>>> from quilt.data.uciml import iris
# you've got data
In effect, this gives us a reproducible way of keeping data as a single input. Since we import data from a single source, we handle versioning properly and do not introduce data redundancy.
Quilt works fine with the Pandas library and Jupyter notebooks. As you can see, importing data in its most basic form is really fast.
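For example, here is a minimal sketch of pulling the installed iris package into a DataFrame (the exact node name inside the package is an assumption - check the package page on the Quilt website):
>>> from quilt.data.uciml import iris
>>> df = iris.tables.iris()  # data nodes are callable and return a pandas DataFrame
>>> df.head()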
The second advantage is data versioning. With Quilt you can push several versions of the same dataset when the data is changed, extended or corrected. Each version is represented by a unique hash, which lets us jump between dataset versions much like checking out code versions in Git. That looks promising! Data management really is starting to resemble code management.
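As a sketch, this is how it looks from the command line (the exact flags are an assumption based on the Quilt 2 CLI, so check quilt --help for your version):
$ quilt log <username>/<dataset_name>  # lists the hash of every pushed version
$ quilt install <username>/<dataset_name> -x <hash>  # installs the exact version behind that hash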
The first step with Quilt is to set up your account. If you work on the project as a team, it is worth considering the Quilt Team Edition plan; alternatively, you can start with a personal account.
If you only want to work with sample datasets accessible online, you don’t even need an account. To download the data, first search for it on the Quilt website, note the username and dataset name, and then use them to install and import the data in Python:
$ quilt install <username>/<dataset_name>
$ python
>>> from quilt.data.<username> import <dataset_name>
And that’s it! Within that namespace you have access to your data in the well-known DataFrame format, compatible with the NumPy and Pandas libraries.
As you can see above, Quilt offers handy CLI commands to install and maintain data packages. We can discover its full potential by creating and deploying our own package.
To do that, we need to build the package first. We can approach it in two ways.
First, we can build the package implicitly by simply packing unstructured data, such as text files or images, into a package. Assuming we have all the data files (for example .csv, .txt or .jpg files) in a ./data subdirectory of the current directory, we can type:
quilt build <our_username>/<package_name_in_quilt> ./data
That builds a local package. Then we can push the package to the Quilt registry with:
quilt push <our_username>/<package_name_in_quilt> --public
And that’s it!
Second, we can describe the structure of our data in something similar to a metadata file, written in YAML. We can generate a sample configuration by typing:
quilt generate ./data
This simply creates build.yml and README.md files. An example YAML configuration is shown below:
contents:
  iris:
    file: iris.data
    transform: csv
    kwargs:
      header:
      dtype:
        sepal_length: float
        sepal_width: float
        petal_length: float
        petal_width: float
        class: str
Then we can build and push our package using the metadata created previously:
quilt build <our_username>/<package_name_in_quilt> build.yml
quilt push <our_username>/<package_name_in_quilt> --public
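Once the package has been pushed, any collaborator (or any other machine) can pull it and use it straight from Python, just like the uciml/iris example earlier. A minimal sketch, assuming the package was built from the build.yml above and therefore exposes an iris node:
$ quilt install <our_username>/<package_name_in_quilt>
$ python
>>> from quilt.data.<our_username> import <package_name_in_quilt> as pkg
>>> df = pkg.iris()  # the node defined in build.yml, returned as a pandas DataFrame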
Experiments
We are now done with data management issues and their solutions. At this point we can take care of the versions and consistency of our datasets, use them from different machines and share them with other contributors.
We are ready to build a baseline model first and then improve its performance by iterating over a variety of experiments. Without a good plan we would lose a lot of time manually triggering consecutive runs of our Machine Learning framework. A well-fitted model needs numerous trials. Let’s handle this task in a smarter way.
Netguru recommends Polyaxon, an open-source platform created to help with Machine Learning lifecycle management. It supports the most common Machine Learning libraries - Keras, TensorFlow, Caffe, PyTorch - and it lets you deploy the whole management system on Amazon Web Services, Kubernetes and Docker.
What it does
A distributed management system is a great idea. You can handle your models’ training processes from different machines and configurations. Polyaxon unifies access to your experiments and introduces scheduling of training tasks. It also promotes a good coding practice in Machine Learning: the code describing your model should contain no hardcoded values - all the magic numbers defining the current model should be passed to the training and evaluation code as parameters.
Let’s imagine the following scenario:
- We create a baseline model based on our intuition, to get the very first results. We collect the accuracy metric along with the loss value.
- After reviewing those results, we apply the first round of fine-tuning. We reconsider the number of units in particular layers, along with other changes, for example different activation functions or a different loss function.
- The next part is continuing the fine-tuning process. It is possible that we need to try another model architecture - a different number of layers, or introducing LSTM or convolutional layers, depending on what we want to achieve.
- Once we have found the best architecture, we stick with it and reconsider the number and types of units. In this step we simply try to achieve the best score and minimise the loss.
This recipe tells us we need to evaluate lots of different variations of the model. That is why we want to avoid hardcoding architecture details in the code. Instead, we define an abstract model driven by external parameters and define the training and evaluation methods inside it. Next, we want to define an experiment plan: a list of specific experiments, each describing the number and type of units, the activation functions, the number of epochs and any other parameters that need to be specified. In the case of parallel computation we also need to define resource limits reserved for a single experiment. Polyaxon allows us to set all of this up in an efficient way.
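Below is a minimal, hypothetical sketch of such a parameterized script (assuming TensorFlow/Keras and the MNIST dataset; the flag names match the example polyaxonfile shown later, but the script itself is only an illustration, not part of Polyaxon):
# run.py - every "magic number" arrives as a command-line parameter
import argparse
import tensorflow as tf

def build_model(learning_rate, momentum, num_layers=2, units=64):
    # Build a simple fully connected classifier entirely from parameters.
    layers = [tf.keras.layers.Flatten(input_shape=(28, 28))]
    for _ in range(num_layers):
        layers.append(tf.keras.layers.Dense(units, activation="relu"))
    layers.append(tf.keras.layers.Dense(10, activation="softmax"))
    model = tf.keras.Sequential(layers)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=momentum),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-batch-size", type=int, default=128)
    parser.add_argument("--train-steps", type=int, default=5000)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--momentum", type=float, default=0.9)
    args = parser.parse_args()

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    train_ds = (tf.data.Dataset.from_tensor_slices((x_train / 255.0, y_train))
                .shuffle(10000)
                .batch(args.train_batch_size)
                .repeat())

    model = build_model(args.learning_rate, args.momentum)
    model.fit(train_ds, steps_per_epoch=args.train_steps, epochs=1)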
Configuration
To integrate our project with Polyaxon, we first have to create a new project with the command:
polyaxon project create --name="Name_of_the_project" --description="description of the project"
Then we can initialize the Polyaxon configuration files with:
polyaxon init "Name_of_the_project"
We can define each experiment (or group of experiments) as a Polyaxon file in YAML format, called polyaxonfile.yml by default. Below you can see an example:
---
version: 1
kind: group
framework: tensorflow
tags: [examples]
hptuning:
  concurrency: 5
  random_search:
    n_experiments: 10
  matrix:
    learning_rate:
      linspace: 0.001:0.1:5
    num_layers:
      values: [44, 45, 46]
    momentum:
      values: [0.85, 0.9, 0.93]
declarations:
  batch_size: 128
  num_steps: 5000
build:
  image: tensorflow/tensorflow:1.4.1
  build_steps:
    - pip install --no-cache-dir -U polyaxon-client==0.4.2
run:
  cmd: python run.py --train-batch-size={{ batch_size }} \
                     --train-steps={{ num_steps }} \
                     --learning-rate={{ learning_rate }} \
                     --momentum={{ momentum }}
The file specifies the framework to use, the build and run commands for our experiments, and the hyperparameters we want to explore. It also allows us to specify the resources to use, for example GPU or CPU cores. Here, linspace: 0.001:0.1:5 generates five evenly spaced learning rate values between 0.001 and 0.1, and since we use random search with n_experiments: 10, Polyaxon creates ten experiments by sampling combinations of the learning_rate, num_layers and momentum values.
When the experiment plan is ready, we can upload everything to our Polyaxon deployment with:
polyaxon upload
And then we can run experiments by:
polyaxon run -f ./polyaxonfile.yml
Summary
With the Quilt and Polyaxon tools you can easily set up and configure an elegant workflow for your Machine Learning project. The first serves data in a way similar to how a developer imports libraries into a Python project, and it guarantees data consistency in terms of format, versioning and availability. The second one lets you schedule a variety of Machine Learning experiments - training and evaluating successive versions of your models.
These improvements help the Machine Learning engineer achieve their goals in a more reproducible, faster and more flexible way.