As you might expect, there was a significant machine learning component to this project, and I was the machine learning engineer responsible for it. Namely, we wanted to train a machine learning model that detects when the baby is crying.
To anyone familiar with machine learning, this task sounds like a simple classification problem and, to a degree, it is. However, I ran into several interesting problems connected with deploying models on mobile devices and dealing with audio input. In this article, I’ll describe the main challenges I’ve overcome and the insights I’ve gained from my work on this project as a machine learning engineer.
Ok, so, at this point, I had a clearly defined goal (i.e. to detect whether a baby is crying or not) and no data. Where to start? What to do first? The initial step in a machine learning engineering loop is to “get a number” as quickly as possible - to build up enough of the system so that we can evaluate its performance and begin iterating. For example, we can find something online that can easily be adapted to our goals and start from there.
Fortunately, someone had already worked on this problem before and even published all the code on GitHub, so I didn’t have to reinvent the wheel. There’s even training data included in the repository.
How does this ML system work?
Here’s how the machine learning system in this repo works. The input to the system is a 5-second-long audio file, preferably in .wav format. This audio file is decoded into samples and split into frames, where each frame is a roughly 10-ms-long chunk of the original audio. Then, for each frame, various audio features, like the spectral roll-off or 13 Mel-frequency cepstral coefficients (MFCCs), are computed with librosa, a Python package for music and audio analysis. This way, we obtain a matrix of size (number of features computed for each frame) x (number of frames).
And here comes a crucial step: all features are averaged over frames. So, eventually, we end up with a single vector of length (number of features computed for each frame).
This is an important point, so I’ll repeat it. Let’s assume that we compute 18 features for each frame (e.g. 13 MFCCs + some other spectral features). Then, in our ML system, each 5-second audio file is summarized as a single vector of just eighteen numbers.
This approach has its upsides and downsides. The obvious con is the huge information loss: it can’t be helped when we compress a complex, long sound wave into a mere eighteen numbers. There’s a chance that this simplification is just too much, and no classification system will be able to tell apart audio classes that could easily be heard as different before the feature extraction. On the other hand, now that we have a much simpler data representation (18 numbers instead of the whole sound wave), we need dramatically less data to train our classifier.
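The frame-splitting and averaging described above is mostly shape bookkeeping, so here’s a minimal numpy-only sketch of it. In the real repo the per-frame features come from librosa (MFCCs, spectral roll-off, etc.); here I substitute two toy features (RMS energy and zero-crossing rate) purely to illustrate how a long clip collapses into one short vector.

```python
import numpy as np

def summarize_audio(samples, sr=16000, frame_ms=10):
    """Split audio into ~10 ms frames, compute per-frame features,
    then average over frames -> one fixed-length vector per clip."""
    frame_len = int(sr * frame_ms / 1000)           # samples per frame
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)

    # Two toy per-frame features standing in for librosa's spectral
    # features / MFCCs: RMS energy and zero-crossing rate.
    rms = np.sqrt((frames ** 2).mean(axis=1))
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)

    features = np.stack([rms, zcr])                 # (n_features, n_frames)
    return features.mean(axis=1)                    # (n_features,)

# A 5-second clip at 16 kHz collapses to a vector of length n_features.
clip = np.random.randn(5 * 16000)
vec = summarize_audio(clip)
print(vec.shape)  # → (2,)
```

With librosa’s 18 features instead of these 2 toy ones, the same averaging step yields the eighteen-number summary the classifier sees.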
Finally, a Support Vector Classifier (SVC) is applied to these short vectors extracted from the audio data. The dataset included in the repo contains four classes: crying baby, laughing baby, noise, and silence. So, this is a classification task with four target classes.
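In scikit-learn, this last step is only a few lines. The sketch below uses synthetic stand-in vectors (the real ones would come from the feature extraction above); the class names mirror the repo’s four categories, but the data itself is fabricated for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: one 18-number summary vector per clip,
# four target classes, mirroring the repo's setup.
rng = np.random.default_rng(0)
classes = ["crying_baby", "laughing_baby", "noise", "silence"]
X = rng.normal(size=(400, 18))
y = rng.integers(0, 4, size=400)
X += y[:, None] * 2.0   # shift each class's mean to make them separable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```

On such cleanly separated toy data, the SVC scores near 100% - which foreshadows the point made below about suspiciously easy datasets.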
The results achieved by this ML system are amazing: train accuracy is 100% and test accuracy is 98%. It would seem that, at this point, the problem is solved - we just found the repo with the solution. Not so fast, though; that conclusion would be premature. There are several issues with this approach preventing us from using the model in production.
Deploying the ML model on Android
The first is that the previously mentioned Support Vector Classifier is implemented in scikit-learn. The problem is that, to date, there are no good ways of deploying scikit-learn models on Android, and our app is supposed to work on both Android and iOS (we’ll focus solely on Android in this article, however).
Luckily, this obstacle is easily bypassed by training a classifier in a framework that is much more compatible with mobile environments than scikit-learn. And the beauty of machine learning is that it doesn’t even have to be an SVC. The exact algorithm serving as the classifier doesn’t matter that much in ML. What really matters is the expressiveness of the model (how complex a function it is able to fit).
This is why I could safely replace the SVC with a simple two-layer, densely connected neural network implemented in Keras. It also worked like a charm, achieving 96% accuracy on the test set without much hyperparameter tuning. Keras models are easily deployable on Android thanks to TensorFlow’s conversion tool, tflite_convert.
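A sketch of such a replacement network is below. The layer sizes are illustrative assumptions, not the project’s exact configuration; the input is the 18-number feature vector and the output is a softmax over the four classes.

```python
import tensorflow as tf

# A small, densely connected network taking the 18-number feature
# vector and predicting one of the four classes. Layer sizes here
# are illustrative, not the project's exact configuration.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(18,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)  # → (None, 4)

# After training and saving, the model can be converted for Android
# with the TensorFlow Lite converter, e.g.:
#   tflite_convert --keras_model_file=model.h5 --output_file=model.tflite
```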
Deploying feature engineering on Android
This leads us to the second issue. Even if we can now easily transfer the classifier part of the system to Android, all the feature engineering and feature extraction is still left. librosa, used for this purpose on the desktop, is a Python package. That means it can’t simply be used on the Android (which supports Java and Kotlin) side of things. Indeed, people who have tried to deploy librosa-powered machine learning models on Android usually end up rewriting parts of librosa in Java. Here are a very relevant blog post and the corresponding repo.
We tried rewriting librosa in Java, too. It didn’t seem so difficult at first, so we thought it would take only one week. Unfortunately, it turned out that there are no strict equivalents of numpy and scipy in Java, and librosa depends on them heavily. Additionally, the most promising Java library for this purpose, nd4j, turned out to be too large to be used without multidex. Eventually, we abandoned this approach after a week of trying.
This problem was finally solved by switching to TensorFlow’s audio recognition, described in detail here. The main idea is, just like before, to split the input audio into frames and calculate features for each frame. This time, the only features are the mel-frequency cepstral coefficients. This way, we again obtain a matrix of size (number of features computed for each frame) x (number of frames), a.k.a. a spectrogram. And here comes the trick: we treat this spectrogram as a one-channel (no RGB, just black and white) image and feed it into a Convolutional Neural Network. CNNs are known for how well they perform on image recognition tasks, so this approach makes a lot of sense.
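The spectrogram-as-image trick can be sketched as a small Keras CNN. All the shapes and layer sizes below are my own illustrative assumptions, not the exact ones from the TensorFlow audio recognition tutorial; the point is only that the MFCC matrix enters the network as a one-channel image.

```python
import tensorflow as tf

# The MFCC matrix for a clip, of shape (frames, coefficients), is
# treated as a one-channel image. Shapes and layer sizes are
# illustrative assumptions.
n_frames, n_mfcc = 49, 40   # e.g. ~1 s of audio, 40 coefficients/frame

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_frames, n_mfcc, 1)),      # spectrogram "image"
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),   # crying / not crying
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
print(model.output_shape)  # → (None, 2)
```

Unlike the averaged-features approach, nothing is collapsed over time here, so the network can exploit how the sound evolves across frames.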
Too easy data
The last issue was the dataset. The data in our repo contains four classes, each consisting of 100 examples, each 5 seconds long. Part of this data is taken from ESC-50. This isn’t much data, so the extremely high accuracy of the model is somewhat suspicious. Maybe the data was simply too easy for the model? Indeed, when we examine the data samples manually (i.e. when we actually listen to these audio files), it turns out that they’re very clear-sounding, maybe even recorded in a studio. Unfortunately, our model is supposed to work in the real world, not in a recording studio, so we have to make sure that it still performs well on less ideal inputs.
This problem was addressed by downloading recordings of crying babies and other, random sounds from YouTube, based on Google’s AudioSet. There’s no shortage of YouTube videos with crying babies. There is a shortage of annotated YouTube videos with crying babies. Even when we find a crying-baby-compilation video, how do we know that every (or at least nearly every) 5-second fragment of it actually contains the sound of a crying baby? There are probably breaks in the crying. We need some human labor to label which parts of the audio are relevant to us and which class they belong to (e.g. crying, not crying). Thankfully, the strong side of AudioSet is its reliable data annotation. Importantly, another desired property of the downloaded data is that it’s much more real-world: noisy and less studio-like.
The data I downloaded from YouTube, based on AudioSet, for my crying baby detection consisted of two classes: “crying baby” and “other”. Each class consisted of 1000 examples, each 10 seconds long. The “other” class was created simply from YouTube audio not labeled as “crying baby” in AudioSet.
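AudioSet distributes its annotations as CSV files where each row lists a YouTube video ID, a start/end time, and the labels present in that 10-second segment. Here’s a sketch of selecting the relevant segments; the embedded CSV is a fabricated sample, and the label ID used for “crying baby” is an assumption that should be checked against AudioSet’s ontology file.

```python
import csv
import io

# Assumption: verify this label ID against the AudioSet ontology.
TARGET_LABEL = "/t/dd00002"

# Fabricated two-row sample in the AudioSet segments CSV format.
SAMPLE_CSV = """\
# Segments csv header lines start with '#'
abc123def45, 30.000, 40.000, "/m/09x0r,/t/dd00002"
xyz987uvw65, 10.000, 20.000, "/m/04rlf"
"""

def crying_segments(csv_text):
    """Return (ytid, start, end) for every segment carrying TARGET_LABEL."""
    segments = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row or row[0].startswith("#"):
            continue
        ytid, start, end, labels = row[0], float(row[1]), float(row[2]), row[3]
        if TARGET_LABEL in labels.split(","):
            segments.append((ytid, start, end))
    return segments

print(crying_segments(SAMPLE_CSV))  # → [('abc123def45', 30.0, 40.0)]
```

Each selected (ytid, start, end) triple can then be fetched and cut into a 10-second training clip, e.g. with youtube-dl and ffmpeg; segments without the target label go into the “other” class.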
Let’s summarize what has happened so far. At this stage, we’ve got two datasets: a small one with clear, studio-like recordings, and a bigger one with more real-world recordings downloaded from YouTube. We also have two categories of machine learning systems: a much less expressive one with feature averaging over frames (you can think of it as averaging over time), and a second one with convolutional neural networks, from TensorFlow, taking whole spectrograms as inputs and treating them as images (i.e. without feature averaging). You can read more details about the latter in this official TensorFlow tutorial.
What’s going to happen when we train a CNN model on the smaller dataset? Because of the much more complex data representation, the convolutional neural network overfits the first, small dataset: training accuracy is 100%, but test accuracy drops to something like 80%. On the other hand, this approach is much better suited to perform well on our real-world-like YouTube data.
A surprise: different data distribution on Android
When I managed to get to 80% accuracy on the AudioSet-based dataset, we decided to test the model on Android. At this stage, it should just politely give reasonable results most of the time, right? Well, it didn’t. In fact, it recognized everything as a crying baby. Everything was crying. People singing “happy birthday” were crying. It transpired that the audio was somehow processed by the Android app after being captured by the smartphone’s microphone but before becoming the input to the model. As a result, for the model, this distorted audio sounded completely unfamiliar, even though we could hardly notice the difference with our own ears.
Since the distribution of data downloaded directly from YouTube and the distribution of data downloaded from YouTube and then processed by our Android app on a phone were completely different (at least from the neural network’s perspective), we simply re-recorded the whole dataset (2000 x 10-second clips) on a phone through our Android app. Then we trained the model on this freshly obtained data.
It worked out even better than expected. Not only did the model start to give reasonable predictions on Android; as a bonus, the added noise and lower sound quality had a regularizing effect, and all the overfitting was gone. This way, we obtained a model achieving 95% test set accuracy and working properly on Android.
In this article, I’ve described all the major challenges and insights from my work on crying baby detection in this project. In the future, after the MVP of our app is released, I’ll get access to data gathered from real users, with real babies, and move beyond YouTube.
This is important because, for example, it might transpire that there are some common practical situations that are not well represented in the AudioSet data, and therefore the model hasn’t had a chance to learn to perform well in them yet. We will then make sure these situations are well represented in our new dataset.
Who knows what other machine learning adventures await me then?