Audio classification is the foundation of many apps that let users automatically identify artists, instruments, or simply recognize someone’s voice. But how does it work in practice? We had a chance to implement audio processing with machine learning on iOS and Android mobile devices. Read on to see how our journey went.
As part of a bigger project at Netguru, we had the chance to implement audio classification using Machine Learning and deploy the trained models on mobile, targeting both Android and iOS devices. Audio classification is an interesting domain in its own right, and the strict constraints imposed by the comparatively low computational power of mobile devices made our work quite challenging and, ultimately, very rewarding. Here are the challenges we encountered and how we solved them.
Audio Processing Theory
At first glance, audio classification might seem to be quite different from image classification. However, in recent years, there has been a strong convergence towards applying computer vision approaches to audio classification. By virtue of being able to represent audio signals as images, we can take advantage of many well-known image classification techniques to solve audio classification problems.
Nearly 200 years have passed since the French mathematician Fourier proved that certain functions can be represented as infinite sums of harmonics – that is, we can represent a function as an infinite sum of sine and cosine functions, or approximate it with arbitrary precision using a finite sum. His work has had a major impact on many scientific fields, most notably signal processing. On the basis of Fourier’s work, we can transform the representation of an audio signal as a function of time into a representation of its frequencies, and vice versa. This deceptively simple result has underpinned countless scientific achievements since Fourier first studied it, and it is also what allows us to represent audio as an image. Specifically, by using discrete Fourier transform algorithms, we can create an image showing the presence of various frequencies in an audio signal across time – a spectrogram.
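To make this concrete, here is a minimal NumPy sketch (not our production code) of building a magnitude spectrogram with a short-time Fourier transform – the signal is sliced into overlapping windowed frames and each frame is FFT’d. All parameter values are illustrative:

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256):
    """Magnitude spectrogram: overlapping Hann-windowed frames, FFT per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies (frame_len // 2 + 1 bins)
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy example: one second of a 1 kHz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bin = spec.mean(axis=0).argmax()
print(peak_bin * sr / 512)  # frequency of the loudest bin: 1000.0 Hz
```

Each row of `spec` is one moment in time and each column one frequency band – exactly the image the spectrogram figures below visualize.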
Audio Spectrogram of a Crying Baby
Audio Spectrogram of a Laughing Baby
The above spectrograms cover a frequency range up to 20 kHz, the well-known upper limit of an average person’s audible range. Human hearing is quite a tricky thing – it turns out that humans don’t perceive audio frequencies linearly. A 1000 Hz note is not perceived to be twice as high a note as a 500 Hz note. Additionally, sounds generated by humans tend to occupy a narrower frequency range than the one shown in our spectrograms; in the case of crying babies, it starts around 1000 Hz and extends up to 6000 Hz. This is why, instead of applying traditional image classification methods directly to our spectrograms, we transform them once more. By extracting what are known as Mel-frequency cepstral coefficients (MFCCs), we account for both the non-linearity of human pitch perception and the limited frequency range of human-generated sounds – neither of which is reflected in the raw spectrograms we obtained via repeated use of the fast Fourier transform.
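The transform can be illustrated with a simplified NumPy sketch (our actual pipeline used TensorFlow’s built-in ops, described next): map the power spectrogram onto triangular filters spaced evenly on the perceptual mel scale, take the log, and decorrelate with a DCT. Filter counts and coefficients below are common textbook defaults, not our exact settings:

```python
import numpy as np

def hz_to_mel(f):
    # The mel scale compresses high frequencies, mirroring human pitch perception
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(power_spec, sr, n_mels=40, n_mfcc=13, fmin=0.0, fmax=None):
    """Power spectrogram (frames x fft_bins) -> log-mel energies -> MFCCs."""
    fmax = fmax or sr / 2
    n_fft_bins = power_spec.shape[1]
    # Triangular filters with centres evenly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    bin_freqs = np.linspace(0, sr / 2, n_fft_bins)
    fbank = np.zeros((n_mels, n_fft_bins))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    log_mel = np.log(power_spec @ fbank.T + 1e-10)
    # Type-II DCT over the mel axis; keep the first n_mfcc coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T

# Toy usage: a random "power spectrogram" with 100 frames and 257 FFT bins
coeffs = mfcc(np.random.rand(100, 257), sr=16000)
print(coeffs.shape)  # (100, 13)
```

The resulting compact time-by-coefficient matrix is what the convolutional layers consume instead of the raw spectrogram.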
We chose to implement our model architecture in TensorFlow (TF), thanks to the wide range of algorithms supported directly inside the framework. Specifically, we were able to use built-in TF functionality to compute spectrograms and Mel-frequency cepstral coefficients and then, finally, add the convolutional layers – borrowed from image classification – that make up the rest of our model architecture. However, having a trained model didn’t mean our work was done: we still had to find a way to deploy it on both Android and iOS devices.
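A minimal sketch of such a pipeline using TF 2.x and its `tf.signal` module looks roughly like this – the layer sizes, sample rate, and class count are illustrative stand-ins, not our actual architecture:

```python
import tensorflow as tf

def audio_to_mfcc(waveform, sr=16000, frame_len=512, hop=256,
                  n_mels=40, n_mfcc=13):
    """Waveform -> spectrogram -> log-mel -> MFCCs, all via tf.signal."""
    stft = tf.signal.stft(waveform, frame_length=frame_len, frame_step=hop)
    spec = tf.abs(stft)
    mel_mat = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=frame_len // 2 + 1,
        sample_rate=sr)
    mel = tf.tensordot(spec, mel_mat, 1)
    return tf.signal.mfccs_from_log_mel_spectrograms(
        tf.math.log(mel + 1e-6))[..., :n_mfcc]

# A small convnet over the MFCC "image" (time x coefficients x 1 channel)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(61, 13, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g. crying vs. laughing
])

# One second of random audio -> (1, 61, 13) MFCC tensor -> class probabilities
mfcc = audio_to_mfcc(tf.random.normal([1, 16000]))
probs = model(mfcc[..., tf.newaxis])
print(mfcc.shape, probs.shape)
```

Keeping the feature extraction inside the TF graph is what later made deployment tricky: it is precisely these `tf.signal` operations that are missing on some targets.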
On Android, deployment was straightforward. Thanks to TF’s Android support, we can make direct use of a trained model. Once the weights are frozen, we can add the protobuf file representing the model to our Android app as an asset, load it while the app is running, and run inference directly inside the app.
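The freezing step – turning every trained variable into a constant and serializing the whole graph as a single protobuf – can be sketched with the TF1-style `graph_util` API on a toy graph standing in for the trained model; all node names here are illustrative:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # freezing works on a TF1-style graph

graph = tf.Graph()
with graph.as_default():
    # Toy stand-in for the trained model: named input -> matmul -> named output
    x = tf.compat.v1.placeholder(tf.float32, [None, 4], name="input")
    w = tf.compat.v1.get_variable("w", initializer=tf.ones([4, 2]))
    tf.identity(tf.matmul(x, w), name="output")

    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        # Replace every variable with a constant holding its current value
        frozen = tf.compat.v1.graph_util.convert_variables_to_constants(
            sess, graph.as_graph_def(), ["output"])

# The resulting .pb file is what gets bundled as an Android asset
with tf.io.gfile.GFile("frozen_model.pb", "wb") as f:
    f.write(frozen.SerializeToString())
print(sorted({n.op for n in frozen.node}))
```

On the Android side, the app loads this asset into a TF inference interface and feeds it audio buffers by the input and output node names chosen above.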
On iOS, the story is a bit more complicated. There are various solutions when it comes to deploying models on iOS; in the case of models built and trained with TF, there are two: using TF on iOS directly, or converting the model to CoreML before deploying.
In our case, both proved unfeasible without a lot of additional work. Because of missing support for some of the TF kernels and operations our model architecture relies on, we had to take a few extra steps: we wrote our own custom CoreML layers for AudioSpectrogram and MFCC – the two key operations our model performs before feeding the standard convolutional layers. Although initially frustrating, this challenge turned out to be extremely rewarding and allowed us to dive deep into hardware-accelerated computing on Apple’s platform. Ultimately, we were quite impressed by the computational capabilities of the iPhone.
Thanks to the excellent Accelerate framework, we were able to implement the missing operations without compromising the performance of our model. Using its hardware-accelerated signal processing functionality allowed us to reduce inference time by a factor of around 30, which means we can not only run inference with our model architecture in real time – we can do so while keeping the computational cost and, consequently, the battery use to a minimum.
By far the biggest technical challenge was deploying the trained model on iOS. Implementing the missing TF operations in CoreML required not only a lot of coding but also a lot of research into Apple's ecosystem. While Swift is an amazing language, it’s not a Machine Learning engineer’s language of choice, and we all had to take a deep breath as we taught ourselves to use it. Learning a language with the direct objective of implementing low-level, hardware-accelerated Machine Learning operations may not be the gentlest introduction, but it ultimately turned out to be rewarding and quite educational.
The challenges we had to overcome teach an important lesson about modern software development. Writing code that ultimately adds value – software providing real functionality to its users – requires more than expertise in a particular area or mastery of a particular programming language. Sometimes, as a developer or a Machine Learning engineer, there simply aren’t great libraries available, and the only way to achieve the desired outcome is to broaden one’s skill set and build up in-depth knowledge. Moving forward, I expect growing demand for developers and engineers whose knowledge spans domains – such as a combination of mobile and machine learning expertise – rather than for developers with a highly specialized profile.
It also shows that a solid command of several programming languages and the ability to work comfortably across different platforms and frameworks are highly desirable skills in the field of Machine Learning and Data Science. I expect that this will only become more relevant moving forward.
Having worked on many computer vision projects as a Machine Learning engineer, I found working in the field of audio processing interesting and quite surprising. Seeing techniques and methods traditionally used for computer vision problems work so well in audio processing shows the potential Machine Learning has for solving challenging problems across a multitude of domains. Additionally, a Machine Learning project for mobile devices comes with its own set of constraints that we don’t normally face as Machine Learning engineers, such as rigid computational power restrictions and the need to work across various libraries and platforms.