Ever since Apple introduced Siri a few years ago to rival Android’s voice assistant, speech-to-text has been a staple tool in Apple’s and Google’s mobile ecosystems.
After the initial introduction, Apple opened up the API to developers, allowing them to write apps that make heavy use of speech-to-text transcription. As part of a project for one of our clients, we implemented a speech-to-text transcription feature that takes advantage of the Apple transcription API. To our client’s delight, we were able to successfully integrate transcription into the app. The only shortcoming was the lack of any punctuation marks in the transcriptions produced by Apple’s SFSpeechRecognizer.
After internal discussions between our Mobile and Machine Learning teams, we decided to propose a solution to this problem to our client. The idea that we came up with involved using Machine Learning to train a model capable of adding punctuation marks to unpunctuated text. Given our client’s commitment to respecting their users’ privacy, deploying a model on the device itself instead of deploying it in the cloud was the only viable option. This meant we would have to get the model running with Core ML.
To our delight, everyone involved was more than happy with this proposed solution, and we launched a pilot project to produce a working proof of concept. Here’s how we successfully delivered an ML-based solution for automatic sentence punctuation.
Deep Learning and the Data Collection
As is the case with any major project centered around Deep Learning, the first hurdle that has to be overcome is securing access to good data, since the quality of the final model is a direct consequence of the quality of the dataset that is used during the training phase.
In NLP, access to data is generally less of an issue than in many other areas of Machine Learning, because large text corpora are publicly available. During the early phases of this project, we came across the Europarl dataset, a corpus of European Parliament proceedings, which turned out to be perfect for our use case. We were able to use it directly, and it caused no issues later in the project.
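To make the data preparation step more concrete, here is a minimal sketch of how punctuated text, such as sentences from Europarl, could be turned into training pairs: the unpunctuated words as input and one punctuation label per word as target. The label set, the lowercasing, the tokenization rule and the name `make_training_pair` are assumptions made for illustration, not the exact pipeline we used.

```python
import re

# Hypothetical label set: the mark that follows a word, or "O" for no mark.
PUNCTUATION_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def make_training_pair(sentence):
    """Split punctuated text into lowercase words and one label per word."""
    # A token is either a word (letters, digits, apostrophes) or one of the marks above.
    tokens = re.findall(r"[A-Za-z0-9']+|[,.?]", sentence)
    words, labels = [], []
    for token in tokens:
        if token in PUNCTUATION_LABELS:
            # Attach the mark to the preceding word; a leading mark is ignored.
            if labels:
                labels[-1] = PUNCTUATION_LABELS[token]
        else:
            words.append(token.lower())
            labels.append("O")
    return words, labels


words, labels = make_training_pair("Madam President, I agree. Do you?")
print(words)   # ['madam', 'president', 'i', 'agree', 'do', 'you']
print(labels)  # ['O', 'COMMA', 'O', 'PERIOD', 'O', 'QUESTION']
```

Framed this way, the model never has to generate text. It only has to classify each word by the punctuation mark that should follow it.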
Computers cannot read words; they can only operate on numbers. More often than not, the quality of an NLP model depends on how the words are converted to numbers. The simplest option is to build a dictionary of words and assign each word an arbitrary integer index. Count-based representations built on such a dictionary, such as “bag of words”, were very popular and achieved some successes back in the day.
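As a small illustration of this kind of encoding, the sketch below assigns every distinct word an integer index and then builds a bag-of-words count vector on top of that dictionary. The toy corpus and the function names are made up for this example.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Give each distinct word an arbitrary index (here: order of first appearance)."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def encode(sentence, vocabulary):
    """Replace every word by its index, keeping the word order."""
    return [vocabulary[word] for word in sentence.split()]


def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs, discarding the word order."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]


corpus = ["the dog chased the cat", "the cat slept"]
vocabulary = build_vocabulary(corpus)
print(encode("the cat slept", vocabulary))                 # [0, 3, 4]
print(bag_of_words("the dog chased the cat", vocabulary))  # [2, 1, 1, 1, 0]
```

Note that the indices carry no meaning: index 3 is not "closer" to index 4 than to index 0 in any useful sense, which is the limitation that word vectors address.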
Nowadays, the standard way to encode words is different: instead of treating a word as a particular combination of letters with an arbitrarily assigned number, we represent it as a vector. These vectors allow us to work with context and relationships between words, which would not be possible with arbitrary numbers alone.
The recent advances in contextual language representation modeling, such as BERT developed at Google, have shown a lot of promise. However, we chose to stick to the more traditional and established methods and made use of Global Vectors for Word Representation (GloVe).
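To show what working with word vectors looks like in practice, here is a minimal sketch that reads vectors in the plain-text layout used by pretrained GloVe files (one word per line, followed by its components) and compares words by cosine similarity. The three-dimensional vectors below are invented for the example; real pretrained GloVe vectors typically have 50 to 300 dimensions.

```python
import io
import math

# Invented toy vectors in the GloVe text layout: "<word> <v1> <v2> ...".
SAMPLE_GLOVE_TEXT = """\
king 0.9 0.8 0.1
queen 0.85 0.82 0.15
banana 0.1 0.05 0.9
"""


def load_vectors(handle):
    """Parse lines of the form '<word> <float> <float> ...' into a dict."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# With a real file, io.StringIO(...) would be replaced by open(path, encoding="utf-8").
vectors = load_vectors(io.StringIO(SAMPLE_GLOVE_TEXT))
royal = cosine_similarity(vectors["king"], vectors["queen"])
fruit = cosine_similarity(vectors["king"], vectors["banana"])
print(royal > fruit)  # True
```

Related words end up pointing in similar directions, which gives the network downstream a far more useful starting point than arbitrary indices.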
However, word embeddings are only the first step in language modelling – they provide a smart way of representing words as vectors. Now we had to design a neural network that would learn how to add punctuation for a given set of words represented as vectors.
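One way to picture the network's job is as a per-word classification: for every input word, predict which punctuation mark, if any, should follow it. The sketch below shows only the final step, turning such predicted labels back into readable text. The label names and the capitalization rule are assumptions made for this illustration.

```python
# Hypothetical label set: what to append after each word.
LABEL_TO_MARK = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}


def apply_punctuation(words, labels):
    """Rebuild text from words and one predicted label per word."""
    pieces = []
    capitalize_next = True  # the first word starts a sentence
    for word, label in zip(words, labels):
        text = word[:1].upper() + word[1:] if capitalize_next else word
        pieces.append(text + LABEL_TO_MARK[label])
        # A sentence-ending mark means the next word starts a new sentence.
        capitalize_next = label in ("PERIOD", "QUESTION")
    return " ".join(pieces)


words = ["how", "are", "you", "i", "am", "fine", "thanks"]
labels = ["O", "O", "QUESTION", "O", "O", "COMMA", "PERIOD"]
print(apply_punctuation(words, labels))  # How are you? I am fine, thanks.
```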
Among the most successful neural network architectures are those that make heavy use of convolutional layers. Convolution has proven to be a key ingredient in recent advances in tasks such as image or audio classification.
However, when working with text data for a task like ours, convolutional layers are suboptimal. Convolutional layers work well when we’re trying to identify patterns based on relative position, regardless of where in the input these patterns occur. This follows from the translation equivariance of convolution: shifting the input simply shifts the output.
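The position-independence of convolution is easy to see in a toy example. The sketch below applies the same small kernel to two signals that contain the same pattern at different positions; the kernel and the signals are arbitrary values chosen for the demonstration.

```python
def convolve_1d(signal, kernel):
    """Slide the kernel over the signal without padding.

    This is cross-correlation, which is what deep learning libraries
    usually implement under the name "convolution".
    """
    width = len(kernel)
    return [
        sum(signal[start + offset] * kernel[offset] for offset in range(width))
        for start in range(len(signal) - width + 1)
    ]


kernel = [1, -1]
pattern_early = [0, 5, 0, 0, 0, 0]
pattern_late = [0, 0, 0, 5, 0, 0]  # same pattern, shifted right by two positions

print(convolve_1d(pattern_early, kernel))  # [-5, 5, 0, 0, 0]
print(convolve_1d(pattern_late, kernel))   # [0, 0, -5, 5, 0]
```

The response to the pattern is identical in both cases and merely moves along with it. The layer detects the pattern but, on its own, treats every position the same way.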
On the other hand, automatic sentence punctuation happens to be almost the exact opposite. Instead of the relative position and proximity of particular words, the model needs to be capable of capturing the global context and absolute positioning of words.
LSTM (long short-term memory) layers, unlike feed-forward or convolutional layers, process input sequences by looking at the data sequentially while keeping track of long-term dependencies in the input. Using a memory mechanism, the model can interpret a word not just based on the word itself but based on all previous words in the sequence. In the slightly more complex case of bidirectional LSTMs, we essentially duplicate this process in order to interpret a sequence of words in the original and in the reverse order.
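To illustrate the mechanics, here is a deliberately tiny sketch of an LSTM cell and a bidirectional pass over a sequence. It is a simplification under stated assumptions: inputs and states are single numbers rather than vectors, and the weights in `WEIGHTS` are arbitrary, untrained values. In practice one would use the LSTM and bidirectional wrappers of a framework such as Keras rather than code like this.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """One step of a scalar LSTM cell.

    weights maps each gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, activation):
        w_x, w_h, bias = weights[name]
        return activation(w_x * x + w_h * h_prev + bias)

    forget = gate("forget", sigmoid)           # how much old memory to keep
    write = gate("input", sigmoid)             # how much new information to store
    expose = gate("output", sigmoid)           # how much memory to reveal
    candidate = gate("candidate", math.tanh)   # the new information itself
    c = forget * c_prev + write * candidate    # updated cell memory
    h = expose * math.tanh(c)                  # output for this position
    return h, c


def run_lstm(inputs, weights):
    """Process the sequence left to right, returning one output per position."""
    h, c, outputs = 0.0, 0.0, []
    for x in inputs:
        h, c = lstm_step(x, h, c, weights)
        outputs.append(h)
    return outputs


def run_bidirectional(inputs, w_forward, w_backward):
    """Pair each position's left-to-right output with its right-to-left output."""
    forward = run_lstm(inputs, w_forward)
    backward = run_lstm(inputs[::-1], w_backward)[::-1]
    return list(zip(forward, backward))


# Arbitrary, untrained weights chosen only to make the example run.
WEIGHTS = {
    "forget": (0.5, 0.5, 1.0),
    "input": (0.8, 0.3, 0.0),
    "output": (0.6, 0.4, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}

states = run_bidirectional([1.0, 0.5, -1.0], WEIGHTS, WEIGHTS)
print(len(states))  # 3
```

The backward pass is what lets the representation of the first word depend on the last word of the sentence, something a purely left-to-right model cannot do.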
Although LSTMs are capable of keeping track of long-term dependencies in sequences, they frequently end up overweighting nearby words, meaning that they can prioritize less relevant information close to the current position in a sequence over more relevant information further away.
To address this, we introduced an additional attention mechanism directly following the bidirectional LSTM layers to improve the model’s ability to identify words that are important to the overall sentence structure, even if they are placed relatively far from the punctuation that can be inferred from them.
By merging the output of our attention mechanism with the output of the bidirectional LSTM layers, we are able to improve the predictive performance of the bidirectional LSTM architecture that we started out with.
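The idea of attending over the LSTM outputs and merging the result back in can be sketched in a few lines. This is one possible reading, not a description of our exact layer: we assume dot-product self-attention over the hidden states and assume that "merging" means concatenation. The hidden states below are toy values.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend_and_merge(hidden_states):
    """For each position, build a context vector and append it to the state.

    hidden_states is a list of equal-length lists of floats, one per word.
    """
    merged = []
    for query in hidden_states:
        # Score every position against the current one, near or far alike.
        weights = softmax([dot(query, key) for key in hidden_states])
        # The context is the weighted average of all hidden states.
        context = [
            sum(weight * state[d] for weight, state in zip(weights, hidden_states))
            for d in range(len(query))
        ]
        merged.append(query + context)  # concatenate state and context
    return merged


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy outputs for three words
merged = attend_and_merge(hidden)
print(len(merged), len(merged[0]))  # 3 4
```

Because the attention weights depend on content rather than distance, a relevant word at the start of a long sentence can contribute as much to the context vector as the word right next to the current position.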
To train our model, we implemented a training routine in Python using Keras and TensorFlow. Thanks to the great compatibility of these two frameworks with Core ML, converting our trained model to Core ML and deploying it as part of the iOS app did not cause any issues and took no more than a few hours.
Shipping a machine-learning feature on either of the big mobile platforms is still a challenge overall, because neither Android nor iOS fully supports frameworks such as TensorFlow and PyTorch.
The additional constraints on computational power on mobile hardware require carefully weighing the model’s predictive performance against its size and inference cost. We are very excited that we could achieve the initial goals we set for ourselves, and we intend to focus further on automatic punctuation.
In particular, we are looking forward to exploring other language representation models and studying how they impact the performance of automatic punctuation.