Complex tasks require complex technology
The way the application works on the surface is quite straightforward. A user uploads a photograph with some cranes in it. Then an algorithm looks for cranes and identifies every pixel belonging to them. Finally, another algorithm – similar to the ones used in photo editors like Photoshop – replaces those pixels with new ones. Eventually the user gets a beautiful picture without unwanted cranes.
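The two-stage flow just described can be sketched in a few lines of Python. Everything below is a hypothetical illustration – the function names, the "255 means crane" detection rule, and the global-average fill are made-up stand-ins, not the app's real code:

```python
# Illustrative sketch of the two-stage pipeline: (1) segment the unwanted
# object, (2) replace its pixels. Images are nested lists of grayscale values.

def detect_crane_mask(image):
    """Stage 1 stand-in: the real app uses a segmentation model; here we
    simply treat any pixel with value 255 as 'crane'."""
    return [[pixel == 255 for pixel in row] for row in image]

def fill_masked_pixels(image, mask):
    """Stage 2 stand-in: replace masked pixels with the average of all
    unmasked pixels (a real fill would be far more sophisticated)."""
    unmasked = [p for row, mrow in zip(image, mask)
                for p, m in zip(row, mrow) if not m]
    fill = sum(unmasked) // len(unmasked)
    return [[fill if m else p for p, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]

def remove_cranes(image):
    mask = detect_crane_mask(image)
    return fill_masked_pixels(image, mask)

photo = [[10, 12, 255],
         [11, 255, 13],
         [10, 12, 11]]
print(remove_cranes(photo))  # the two 255 "crane" pixels are replaced
```

The point is the separation of concerns: segmentation and fill are independent steps, which is why (as described below) the two could be built with completely different techniques.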
This may seem simple on the surface, but it requires complicated computations inside. Even five years ago it wouldn't have been possible to perform such tasks. Fortunately, these days we can use Deep Learning, a dynamically evolving branch of Machine Learning. Just like traditional Machine Learning, it uses a statistical process to infer the rules inherent in a dataset. It uses much bigger models – the number of parameters is usually an order of magnitude larger than in traditional approaches – which allows us to solve much richer and more complex problems.
A good example showing how the algorithms work is the MNIST dataset. It is a simple computer vision dataset used to recognise handwritten digits. Each image of a handwritten digit is broken into pixels, and the pixel intensities form the image's features. The network goes through a training process in order to build a function that maps the inputs to the desired outputs. Simply put, the algorithm needs to learn how to map a set of pixels that describe an image to a digit from 0 to 9. Today, this is the "Hello, World!" of Deep Learning – most people get an almost perfect solution during their first few hours with Deep Learning. However, when it was introduced it was considered challenging.
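To make the "pixels in, digit out" idea concrete, here is a deliberately tiny toy version of the task. The 3×3 patterns below are invented for illustration (real MNIST images are 28×28 grayscale), and instead of a trained network we use simple nearest-neighbour matching against stored templates:

```python
# Toy "MNIST": classify tiny 3x3 binary images by finding the stored
# digit template that differs from the input in the fewest pixels.
# These templates are made up for illustration only.

TEMPLATES = {
    0: (1, 1, 1,
        1, 0, 1,
        1, 1, 1),
    1: (0, 1, 0,
        0, 1, 0,
        0, 1, 0),
}

def classify(pixels):
    """Return the digit whose template is closest to `pixels`."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(TEMPLATES, key=lambda d: distance(pixels, TEMPLATES[d]))

# A slightly noisy "1" (one pixel flipped) is still recognised.
noisy_one = (0, 1, 0,
             0, 1, 1,
             0, 1, 0)
print(classify(noisy_one))  # -> 1
```

A real network does not compare against fixed templates, of course – it *learns* the mapping from examples – but the interface is the same: a flat list of pixel intensities goes in, a digit label comes out.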
Our algorithm needed to perform a much more difficult task. Instead of simple images, we had complex photos of city landscapes. Instead of just recognising a digit, we needed to select the precise shape of cranes. The bar was set way higher.
Convolutional Neural Networks explained
Deep Learning uses neural networks to perform such tasks. Neural networks are information processing paradigms inspired by the way biological nervous systems, such as the brain, process information. For image recognition we use a special type of neural network: Convolutional Neural Networks (CNNs). CNNs are artificial neural networks that are able to detect patterns and make sense of them.
A neural network typically receives its input as a list of numbers between 0 and 1. What makes CNNs special is that the data they receive are structured the same way as an image – usually RGB pixels on a 2-dimensional grid. This allows them to easily exploit features like geometrical proximity or color similarity.
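A minimal sketch of that preprocessing step, assuming 8-bit RGB input: each channel value (0–255) is scaled into [0, 1], while the 2-D grid layout – the thing convolutional layers exploit – is kept intact:

```python
# Scale 8-bit RGB pixel values into [0, 1] while preserving the 2-D grid.
# `image` is a list of rows; each row is a list of (r, g, b) tuples.

def normalize(image):
    return [[tuple(c / 255 for c in pixel) for pixel in row]
            for row in image]

tiny = [[(255, 0, 0), (0, 255, 0)],
        [(0, 0, 255), (51, 51, 51)]]
grid = normalize(tiny)
print(grid[0][0])  # -> (1.0, 0.0, 0.0)
```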
Convolutional layers contain so-called "filters". Each filter detects whether a particular pattern is visible in some region of the image. Filters in the initial layers work with very simple features, like ridges or vertical lines. Their outputs are passed to the successive layers, where more abstract features are detected. As we go deeper, we start recognising textures or geometrical shapes, up to things like eyes or ears in the layers closest to the top. Check out this video that explains how CNNs work.
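A single filter can be demonstrated in a few lines. The kernel below is a classic hand-written vertical-edge detector; in a real CNN the kernel values are learned during training, not chosen by hand:

```python
# One convolutional filter: a 3x3 kernel slides over a grayscale image
# and produces a large response wherever a vertical edge appears.

VERTICAL_EDGE = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

def convolve(image, kernel):
    h, w, k = len(image), len(image[0]), len(kernel)
    out = []
    for y in range(h - k + 1):          # slide the kernel over every
        row = []                        # valid position (no padding)
        for x in range(w - k + 1):
            row.append(sum(kernel[j][i] * image[y + j][x + i]
                           for j in range(k) for i in range(k)))
        out.append(row)
    return out

# A dark-to-bright vertical boundary gives strong positive responses.
img = [[0, 0, 9, 9],
       [0, 0, 9, 9],
       [0, 0, 9, 9],
       [0, 0, 9, 9]]
print(convolve(img, VERTICAL_EDGE))  # -> [[27, 27], [27, 27]]
```

Stacking many such filters, and feeding their outputs into further layers, is what lets the network build up from edges to textures to whole objects.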
Our Deep Learning model for detecting cranes had 37 layers and was developed in Python. The biggest obstacle in creating it was data annotation. Labelling the training datasets was a very time-consuming task, and there is still plenty of room for improvement on the technology side.
Photoshop done automatically
The second part of building a system that removes undesired objects from pictures was a much easier task to accomplish. Once the image recognition algorithm had detected all the pixels belonging to any crane, we needed a solution that would replace those pixels with new ones – obviously, not just any pixels, but ones that would blend into the picture.
We used an algorithm that is widely used in many photo editors. In Photoshop it’s called Content-Aware Fill, and its main job is to fill the selected object with surrounding pixels and blend them. This ready-made solution fulfilled its function, so we decided against building a new custom algorithm.
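The core idea – fill the selection from its surroundings – can be caricatured in a few lines. The sketch below is *nothing like* Photoshop's actual Content-Aware Fill algorithm; it just shows the simplest possible version of "grow replacement pixels inward from the unmasked neighbourhood":

```python
# Naive inpainting sketch (NOT the real Content-Aware Fill algorithm):
# each masked pixel becomes the average of its already-known neighbours,
# repeated until the whole masked region is filled from the outside in.

def inpaint(image, mask):
    img = [row[:] for row in image]
    todo = {(y, x) for y, row in enumerate(mask)
            for x, m in enumerate(row) if m}
    while todo:
        filled = set()
        for y, x in todo:
            # average the known pixels in the 8-neighbourhood of (y, x)
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(len(img), y + 2))
                    for i in range(max(0, x - 1), min(len(img[0]), x + 2))
                    if (j, i) not in todo]
            if vals:
                img[y][x] = sum(vals) // len(vals)
                filled.add((y, x))
        if not filled:      # nothing left to grow from
            break
        todo -= filled
    return img

img = [[10, 10, 10],
       [10, 99, 10],
       [10, 10, 10]]
mask = [[False, False, False],
        [False, True,  False],
        [False, False, False]]
print(inpaint(img, mask))  # the 99 becomes the neighbourhood average, 10
```

Real content-aware fill additionally matches textures and patches rather than averaging single pixels, which is why the off-the-shelf solution produced far better-looking results than anything this simple.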
Creating a cover
The inner workings would be of little use without a user interface, so we also needed to build the frontend of the application. The product was created during a three-day hackfest in which the Netguru team took part together with partners from MicroscopeIT, who designed all the Deep Learning algorithms for the app. The team was determined to show a working product at the end of the event (despite the fact that the algorithms were not yet ready to give precise results), so they needed a framework that would let them write the frontend code efficiently.
Adam Romanski, our Frontend Developer, opted for Vue.js, a trending technology for building user interfaces. Its promise is to deliver products faster without compromising on performance or reliability.
With this technology, it took him approximately twenty hours to deliver the frontend of the application. In any other framework it could have taken twice as long. Vue has a rich ecosystem, and adding new packages required much less boilerplate code than in similar frameworks. It was easy to integrate with the backend of the application, so the showcase version of PrettyCity was done by the end of the hackfest.
Enjoy pretty cities in your pictures
We are excited to release the working product to the public. On top of that, we're happy to have cooperated with the team from MicroscopeIT, who brought great expertise and skills in machine learning technologies, while we contributed the frontend of the app. If you want to learn more about other ML projects, contact us. Check out the app and stay tuned for the next edition of the PrettyCity app!