Machine Learning (ML) and Deep Learning (DL) are evolving at a faster pace than ever. There is now a vast selection of different competitions, many of which are hosted on Kaggle, where thousands of data scientists and ML engineers compete to obtain higher positions on the scoreboard. Don't get me wrong - I am very fond of Kaggle. It is a great place to start and enhance your machine learning adventure. Also, competitions are places where science thrives - we all remember the space race, don’t we?
However, such places reinforce an endless pursuit of the best score. Everyone wants to beat the state-of-the-art results, which surely leads to everlasting fame and glory (at least until someone gets a better score, which nowadays takes a month or two). There is even a term coined specifically to describe this notion: Kagglefication, which is discussed here. It’s posed as one of the major problems in ML at the moment - to get to the top we not only build large models, but also group them into ensembles of tens (soon maybe hundreds?) at a time. Systems like that make DL even more obscure and uninterpretable than it already is. This is not maintainable; you cannot just stack models on top of one another like there is no tomorrow, caring only about the score. Sadly, the chase for the best score is usually passed on to new adepts of ML, who also strive to make their models reach a score that is even a tiny bit higher.
I would like to point out that it can be worthwhile to step back a little and realise what exactly your ultimate goal is. Usually, it is to deliver a model yielding the best performance, and that might not be equivalent to a model yielding the best score.
But what exactly is that performance? What does the word even mean? When in doubt - have a look at the dictionary:
Performance - the manner in which or the efficiency with which something reacts or fulfils its intended purpose.
Well, okay. So what exactly is efficiency?
Efficiency - the state or quality of being efficient, or able to accomplish something with the least waste of time and effort; competency in performance.
That makes more sense. Since we are talking computer science here, I’ll allow myself to add one sentence about algorithmic efficiency from Wikipedia:
In computer science, algorithmic efficiency is a property of an algorithm which relates to the number of computational resources used by the algorithm.
Okay, it is getting clearer now. In order for our ML model to achieve the best performance, we have to ensure that it not only fulfils its intended purpose, but also does it with the least waste of time, effort, and computational resources. Probably some of you already know where I am going with this...
It does not do us any good if we make the best model in terms of accuracy, or any other metric for that matter, if we don’t have the resources to run inference on the end device. Nor does it make any sense to have a face recognition model which needs an hour to identify a person. And it doesn’t matter by how much it has beaten the state-of-the-art on a given dataset.
Therefore there is an important step in designing an ML model - recognise the environment it is supposed to run in. There is a vast difference between running a model on a mobile device, a Raspberry Pi, a PC with CUDA capabilities, or a specially designed computing unit. You wouldn't like your mobile app to weigh 60 MB while the model alone takes 58 MB, would you? This should give us some upper bound on how large a model we can afford. Also, training large models can be tricky due to memory constraints. Convolutional networks in particular are memory-hungry and can exhaust your resources if you are not careful.
What about the lower bound? What can we do about it? Intuition tells us that smaller models, with fewer parameters, usually perform worse. Sometimes that is not the case, as a lower number of parameters helps with regularization: the model is forced to learn more general features and doesn't have enough capacity to focus on the specific features of the training set. On the other hand, a lower parameter count directly grants a smaller memory footprint, and usually means a shorter training and inference time (usually, not always - the relation is not monotonic).
It means that there is a decision to be made - a trade-off between the model’s accuracy and the computational resources required to run inference and training - a thing that is usually forgotten. A lot of people like to claim that their model beat the state-of-the-art, but not many are brave enough to add that it literally took a month to train. The training time also influences how fast and often you can deploy new models, which might matter a lot in a dynamically evolving application.
Now, there are some important questions which you have to ask yourself, all of which are related to your ultimate goal:
What do you prefer - a model that takes 1 hour to train for 20 epochs and reaches 90% accuracy, or one with a training time of 45 minutes and 85% accuracy?
Would a customer enjoy your app more if its size was 10 MB instead of 20 MB, and made predictions 5 times faster at a cost of an 8% higher error rate?
What about the training itself - do you think the gain in score after training for more epochs would bring more revenue than you spend paying for the cloud (or electricity)?
What about your time? Is it worth spending three days of your time on hyperparameter tuning? How much do you gain in return?
These are not questions which can be answered easily. The answers are different if you are an ML engineer in a big company or an ML enthusiast having fun. It all depends on your ultimate goal.
How do you get the ability to estimate the right number of parameters? First you have to realise that there is no such thing. However, you can make your expectations sound, and adjust the parameter count to reach these expectations. To do that you have to rely on your experience and expertise (at least for now; work is being done towards performing this step automatically, like fastPSO or AutoML). The more models you build and evaluate, the better your intuition about this whole parameter estimation thingy will be. Unfortunately, that is very bad news for beginners with very little or no experience.
Let's assume that you've just finished your ML course in Python and you are ready to dive deep into DL. Probably somewhere along the way you learned that there are some basic datasets, quite suitable for new adepts of this discipline. Probably the most well-known example is MNIST - a database of images of handwritten digits. So you want to build your very first model and train it on MNIST. You set up your project, download the data, write all the imports and loaders, start to define your graph, and now... what? How do you pick the number of filters? How large should your fully-connected layer be? Well, one of the solutions is to go to Google and ask!
I've collected three models, originating from the repositories of Keras and PyTorch, two of the most famous Python deep learning packages:
Keras MLP - it has two fully-connected hidden layers with 512 nodes each, with a dropout layer in between. It has 669,706 parameters. The authors claim that it gets to 98.40% test accuracy after 20 epochs.
Keras CNN - it has three hidden layers: two convolutional with 32 and 64 filters with the kernel size at 3 x 3, a max pooling layer, dropout, fully-connected with 128 nodes, followed by a second dropout layer. It has 1,199,882 parameters (almost twice as many as MLP). Gets to 99.25% test accuracy after 12 epochs.
PyTorch - it has four hidden layers: two convolutional with 10 and 20 filters with the kernel size at 5 x 5, each followed by a max pooling layer, a dropout, two fully-connected layers with 320 and 50 nodes, with a dropout layer in between. It has only 21,840 parameters, which is around 50 times fewer than Keras CNN, and 30 times fewer than Keras MLP!
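The parameter counts quoted above can be checked by hand. The snippet below is a minimal sketch in plain Python, not the code from either repository. It assumes a 10-class output layer, unpadded convolutions and 2 x 2 max pooling, none of which is spelled out in the descriptions. The helper names `dense_params` and `conv_params` are made up for this illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit of a fully-connected layer."""
    return n_in * n_out + n_out


def conv_params(c_in, c_out, k):
    """Weights plus one bias per filter of a k x k convolution."""
    return c_in * c_out * k * k + c_out


# Keras MLP: 784 inputs -> 512 -> 512 -> 10 outputs (dropout has no parameters)
keras_mlp = dense_params(784, 512) + dense_params(512, 512) + dense_params(512, 10)

# Keras CNN: two unpadded 3x3 convs (28 -> 26 -> 24), then 2x2 pooling (24 -> 12)
keras_cnn = (conv_params(1, 32, 3) + conv_params(32, 64, 3)
             + dense_params(12 * 12 * 64, 128) + dense_params(128, 10))

# PyTorch: two unpadded 5x5 convs, each followed by 2x2 pooling
# (28 -> 24 -> 12 -> 8 -> 4), so the first dense layer sees 4 * 4 * 20 = 320 inputs
pytorch = (conv_params(1, 10, 5) + conv_params(10, 20, 5)
           + dense_params(4 * 4 * 20, 50) + dense_params(50, 10))

print(keras_mlp, keras_cnn, pytorch)
print(round(keras_cnn / pytorch, 1), round(keras_mlp / pytorch, 1))
```

Under these assumptions the three totals come out to 669,706, 1,199,882 and 21,840, with ratios of roughly 55 and 31. Note where the weight sits: almost all of the Keras CNN's parameters belong to the first fully-connected layer, not to the convolutions.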
Let's see how they perform. I've trained all of them with the same training schedule - the cross-entropy loss was minimized for 20 epochs using an SGD optimizer with Nesterov momentum (0.9) and a starting learning rate of 5e-3. The learning rate changed to 5e-4 after the 10th epoch, and to 5e-5 after the 15th. The batch size was chosen empirically and set to 16. The graph below shows the test loss and test accuracy during training for each model.
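For reference, the learning rate schedule described above can be written down in a few lines. This is only a sketch of the schedule's shape in plain Python, with a hypothetical helper name `learning_rate` and 1-based epoch numbering as an assumption. It is not the training code used for the experiments. In PyTorch, a step schedule like this is usually set up with `torch.optim.lr_scheduler.MultiStepLR`.

```python
def learning_rate(epoch, base_lr=5e-3, milestones=(10, 15), gamma=0.1):
    """Step schedule: multiply base_lr by gamma for every milestone already passed.

    `epoch` is counted from 1, so epochs 1-10 use base_lr, epochs 11-15 use
    base_lr * gamma, and epochs 16-20 use base_lr * gamma ** 2.
    """
    drops = sum(1 for m in milestones if epoch > m)
    return base_lr * gamma ** drops


# Learning rate for each of the 20 epochs
schedule = [learning_rate(epoch) for epoch in range(1, 21)]

for epoch in (1, 10, 11, 15, 16, 20):
    print(epoch, f"{schedule[epoch - 1]:.0e}")
```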
All the models reach more than 95% test accuracy after the first epoch. Keras CNN takes first place with 99.26%. Second place goes to PyTorch with 98.96%, and last place belongs to Keras MLP with 98.60%. What is interesting is that the error rate of the PyTorch model is only 0.3 percentage points higher than that of the Keras CNN, while it has 50 times fewer parameters. I shall focus on the Keras CNN and PyTorch models from now on.
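It helps to turn the accuracies into error rates, because the gap can be read in two ways. The sketch below uses only the accuracies reported above. The variable names are made up for the illustration.

```python
# Final test accuracies in percent, as reported above
accuracy = {"Keras CNN": 99.26, "PyTorch": 98.96, "Keras MLP": 98.60}

# Error rate in percent; rounding removes floating-point noise
error = {name: round(100.0 - acc, 2) for name, acc in accuracy.items()}

# Absolute gap between PyTorch and Keras CNN, in percentage points
absolute_gap = round(error["PyTorch"] - error["Keras CNN"], 2)

# Relative gap: how many times more mistakes the smaller model makes
relative_gap = round(error["PyTorch"] / error["Keras CNN"], 2)

print(error)
print(absolute_gap, relative_gap)
```

The absolute gap is 0.3 percentage points. In relative terms the smaller model makes about 1.4 times as many mistakes. Which of the two numbers matters more depends on the application.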
Now, coming back to the trade-offs - would you accept an error rate 0.3 percentage points higher in exchange for 50 times fewer parameters? You might ask, what does this 50-fold reduction in parameter count give you? That is a very good question. Here is some info about the training and the models themselves:
The PyTorch model requires 539 MB of GPU memory to train with batch size 16 (and 1247 MB with batch size 8192). Moreover, it takes 20 seconds to train one epoch on a CPU, and 2 seconds to run inference on the whole test set. The physical size of the model (size of all the weights) is 0.08 MB.
The Keras CNN model requires 909 MB to train with batch size 16 (and 5911 MB with batch size 8192). Also, one epoch of training takes 120 seconds, and the inference time is 10 seconds. The physical size of the model is 4.58 MB.
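The physical sizes follow almost directly from the parameter counts. Here is a minimal sketch, assuming that every weight is stored as a 32-bit float and that 1 MB means 2^20 bytes. Both are assumptions that happen to reproduce the numbers above. The helper name `weights_size_mb` is made up, and file format overhead is ignored.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as 32-bit floats


def weights_size_mb(n_params, bytes_per_param=BYTES_PER_FLOAT32):
    """Size of the raw weights in MB, with 1 MB taken as 2**20 bytes."""
    return n_params * bytes_per_param / 2 ** 20


pytorch_mb = round(weights_size_mb(21_840), 2)
keras_cnn_mb = round(weights_size_mb(1_199_882), 2)

print(pytorch_mb, keras_cnn_mb)
```

This gives 0.08 MB and 4.58 MB. With these assumptions, halving the parameter count halves the size, and so would storing the weights in 16 bits instead of 32.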
I compared training and testing times on a CPU, not a GPU, to make these differences more evident, as the performance of GPU training strongly depends on memory allocation, and MNIST is not the best dataset to exploit the GPU speedup. Using the smaller model saves us 100 seconds of CPU time per epoch, and time is money. Literally, if you train in the cloud.
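Over the whole 20-epoch schedule the per-epoch difference adds up. The sketch below multiplies the measured per-epoch CPU times and converts the saving into money. The hourly price in it is a made-up placeholder and not a real quote from any cloud provider. The helper names are hypothetical as well.

```python
EPOCHS = 20  # length of the training schedule used in this post


def total_training_seconds(seconds_per_epoch, epochs=EPOCHS):
    """Total training time, assuming every epoch takes the same time."""
    return seconds_per_epoch * epochs


def compute_cost(seconds, price_per_hour):
    """Cost of `seconds` of compute at a given hourly price (any currency)."""
    return seconds / 3600 * price_per_hour


pytorch_total = total_training_seconds(20)     # 20 s per epoch on a CPU
keras_cnn_total = total_training_seconds(120)  # 120 s per epoch on a CPU
saved = keras_cnn_total - pytorch_total

print(pytorch_total, keras_cnn_total, saved)

# HYPOTHETICAL price of 1.00 per CPU-hour, only to show the conversion
print(round(compute_cost(saved, price_per_hour=1.00), 2))
```

That is 400 seconds against 2400 seconds for a single training run. The saving repeats for every run, for example for every point of a hyperparameter search.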
Moreover, if we use larger images, then the difference in memory usage would be even more significant, up to a point where we wouldn't be able to train a large model even with batch size = 1 (for example for 3D data like satellite hyperspectral images or medical MRI scans). Now, if you wanted to incorporate your Keras CNN model into a mobile app, you would have to account for an additional 4.6 MB just for the model, while for the PyTorch model, the required space is only 0.08 MB.
It is worth mentioning that MNIST is a very simple and undemanding dataset, therefore using a model with 1M+ parameters is definitely overkill. Because of that this 50-fold reduction sounds like a blast. More demanding datasets probably wouldn't allow you to get so much of an improvement. Still, every percent matters, because it influences performance, so don't feel bad if you managed to reduce it by "only" 5%. Some models are so optimized that further reduction is not possible without a significant drop in the score.
I hope you are convinced now that sometimes using a smaller model instead of a larger one could be beneficial, even if it achieves lower scores. It all depends on the dataset and the environment that you want to deploy your model in. Also, there are factors that not everybody counts as resources - the training time, the inference time, and finally - your personal time. In the industry you have to mind your surroundings and optimize for the best performance, not necessarily the best scores.
The real world is not a Kaggle competition.
Let's see how we can strip down the models even more. Looking at the PyTorch model summary (which is a very informative thing to do) we can have an idea where the majority of the parameters lie.
The convolutional layers have reasonable sizes, but most of the parameters are packed in the first fully-connected (Linear-4) layer. Yeah, let's lose those and replace them with one convolutional layer to restore the required (10 for MNIST) number of outputs followed by a Global Average Pooling (GAP) layer, which is named AdaptiveAvgPool2d-4 here.
I reduced the number of parameters by a factor of three, down to 7090, and the size to 0.03 MB! The training revealed that the model (let's call it PyTorch CNN) achieved a test set loss of 0.0344 and an accuracy of 98.93%. This means that I lost practically no accuracy while saving resources, and thus improved the performance! All that with a few lines of code.
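The count of 7090 can be reconstructed with a bit of arithmetic. The sketch below keeps the two original 5 x 5 convolutions and assumes that the replacement convolution has a 3 x 3 kernel. The kernel size is my assumption, chosen because it reproduces the reported total. The GAP layer itself has no parameters. The size estimate assumes 32-bit weights and 1 MB = 2^20 bytes.

```python
def conv_params(c_in, c_out, k):
    """Weights plus one bias per filter of a k x k convolution."""
    return c_in * c_out * k * k + c_out


conv1 = conv_params(1, 10, 5)    # first original convolution
conv2 = conv_params(10, 20, 5)   # second original convolution
head = conv_params(20, 10, 3)    # ASSUMED 3x3 conv producing the 10 class maps

total = conv1 + conv2 + head     # global average pooling adds nothing
size_mb = round(total * 4 / 2 ** 20, 2)
reduction = round(21_840 / total, 2)

print(conv1, conv2, head, total)
print(size_mb, reduction)
```

With these choices the total is 7090 parameters, about 0.03 MB, and roughly a threefold reduction from the 21,840 parameters of the original PyTorch model.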
I played around a bit more (I strongly encourage any of you to do the same) with even further reduction of the size. When I went down to two layers I did a grid search over the parameter space to perform the final fine-tuning of the model's hyperparameters, keeping in mind that I don't want to lose more than 5% of the score, meaning that the final model has to reach at least 95% accuracy. I came up with this Small Net:
It consists of three layers plus Batch Normalization (BN) and GAP. The first is a convolutional layer with 5 filters and kernel size 5 x 5 with BN. The second layer is another convolutional with 8 filters 3 x 3, followed by a GAP layer. The last layer is fully-connected (it might as well be convolutional with kernel 1 x 1). The whole model has only 518 parameters! I had to add BN to speed the training up a bit, at a cost of 10 parameters. Also, I've noticed that such small models tend to suffer from dead neurons, therefore I hoped that LeakyReLU would help (it did, not much though). This reduced the training time to 11 seconds per epoch, and inference time to just 1 second. Probably playing with batch size would give a broader view on the performance, but this is out of the scope of this post.
The Small Net achieved a test set loss of 0.1387 and a test set accuracy of 96.01%! Interestingly enough, the number of parameters is actually lower than the number of pixels in a single image, which is 784 (28 x 28).
There is a trade-off between the score (which should be high) and the number of parameters (which should be low). There is a simple rule of thumb - the number of parameters should be of the same order of magnitude as the number of observations, or data points. MNIST has 60k training images (observations), therefore something like 100k is a good starting number for the upper bound. On the other side - the number of parameters should be of the order of magnitude of your feature count. The MNIST images have 784 pixels (features), therefore 1000 is a good lower bound. It’s worth knowing that there are ways to reduce the number of features, like the one described here. I recommend starting with architectures that have more parameters and then trying to lower the count, rather than the other way around.
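The rule of thumb is easy to turn into a quick sanity check. In the sketch below, rounding up to the next power of ten is my reading of "order of magnitude". It reproduces the 100k and 1000 bounds for MNIST, but it is an assumption and not a formal definition. The helper name `next_power_of_ten` is made up, and the parameter counts are the ones reported in this post.

```python
def next_power_of_ten(n):
    """Smallest power of ten that is greater than or equal to n."""
    power = 1
    while power < n:
        power *= 10
    return power


upper = next_power_of_ten(60_000)  # number of training observations in MNIST
lower = next_power_of_ten(784)     # number of features (pixels) per image

# Parameter counts of the models discussed in this post
models = {
    "Keras CNN": 1_199_882,
    "Keras MLP": 669_706,
    "PyTorch": 21_840,
    "PyTorch CNN": 7_090,
    "Small Net": 518,
}

within = [name for name, n in models.items() if lower <= n <= upper]

print(lower, upper)
print(within)
```

Only the PyTorch model and the PyTorch CNN land between the two bounds. Both Keras models are above the upper one, and the Small Net is below the lower one, which matches its noticeably lower accuracy.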