Machine learning is one of the hottest topics in modern software development. Still, it is a fairly young technology, which is undergoing rapid change and development. We wrote this article to aid you in selecting the right flavor of ML for your project, showcasing the top frameworks along with their upsides and shortcomings.
Scikit-learn is a Python library used for machine learning. More specifically, it’s a set of – as the authors say – simple and efficient tools for data mining and data analysis. The framework is built on top of several popular Python packages, namely NumPy, SciPy, and matplotlib. A major benefit of this library is the BSD license it's distributed under. This license allows you to decide whether to upstream your changes without any restriction on commercial use.
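To give a feel for those "simple and efficient tools", here is a minimal sketch of a typical scikit-learn workflow. The choice of the built-in iris dataset, the logistic regression model, and the 75/25 split are our own assumptions for illustration, not a recommendation from the library's authors.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 iris flowers, 4 numeric features each.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# scikit-learn estimators share the same fit / predict / score pattern.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping in another estimator usually means changing only the line that constructs the model, which is a large part of the library's appeal.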
OpenCV (Open Source Computer Vision Library) is a computer vision library with machine learning modules. It offers C++, Python, and Java interfaces and supports Windows, Linux, macOS, iOS, and Android. As the creators say, it was designed for computational efficiency with a strong focus on real-time applications – and it’s blazing fast as a result. Its main advantages include its cost (it is free for commercial use: releases up to 4.4 are distributed under the BSD license, and version 4.5.0 onward under Apache 2.0), efficient use of RAM, and incredible speed. These, however, come with a tradeoff – the tech can be hard for beginners to get used to.
Dlib is, to directly quote its creators, a modern C++ toolkit containing ML algorithms and tools for creating complex software to solve real-world problems. Although its website might look uninviting, dlib offers an impressive range of features, including exhaustive documentation, support for machine learning, numerical, and graphical model inference algorithms, as well as a GUI API, and a number of other perks and utilities. One popular use of dlib is face recognition (including face landmark detection). Because the framework has been in development since 2002, it can be used for many other things, although doing so might require a significant time investment to become familiar with its many features.
The essence of Gensim is aptly captured by its tagline – “topic modelling for humans”. A topic model is a statistical model used to discover abstract "topics" in documents. While the concept was initially developed for text mining, topic models have been used to detect structures in other types of data, such as genetic information. The main advantages of Gensim, as mentioned by its creators, are its clarity, efficiency, and scalability. As for its disadvantages, Gensim is not a general-purpose framework, so using it for things like image recognition is out of the question.
MLlib is, as implied by its fairly straightforward name, a machine learning library, maintained as a part of Apache Spark. Spark itself is an open-source cluster computing platform intended for big data processing, and MLlib brings machine learning to that platform.
It interoperates with the NumPy library in Python and with R libraries, and it runs wherever Spark runs, including EC2, Hadoop YARN, Mesos, and Kubernetes. MLlib can access data in HDFS, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
Surprise (Simple Python Recommendation System Engine) – whose name can be annoyingly hard to google – is “a Python scikit for building and analyzing recommender systems”. This means that it can be used to do things like recommending items to users based on their past purchases or ratings, or matching users with people who share similar tastes. Its advantages are, as the creators say, great documentation, full control over your experiments, and easy implementation. As for the cons, the kit is fairly limited in its applications, so make sure that it’s a good fit for your project before investing any significant resources into it.
Pandas (Python Data Analysis Library) is not precisely a machine learning library, but it's widely used in the machine learning community. It boasts many benefits, most of which are related to the fact that in pandas, your data has labels. This means that it’s much easier to keep track of your data than with, say, plain NumPy arrays. The labels and data come in an R-style data frame, which makes it easier to get into for developers familiar with that language. Another benefit of pandas is its good I/O capabilities – data can be easily pulled into (and extracted from) your pandas data frame.
Its main (and best) use case is data wrangling (sometimes referred to as data munging), that is, processing and transforming raw data from one format into another for analytical purposes.
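A small data-wrangling sketch shows what working with labeled data looks like in practice. The sales records and column names below are hypothetical and exist only to illustrate the workflow.

```python
import pandas as pd

# Hypothetical raw sales records, as they might arrive from a CSV export.
raw = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south", "north"],
        "product": ["A", "A", "B", "B", "A"],
        "units": [10, 7, None, 3, 5],
        "unit_price": [2.5, 2.5, 4.0, 4.0, 2.5],
    }
)

# Columns are addressed by label rather than by position.
# Drop the record with a missing unit count, then derive a new column.
clean = raw.dropna(subset=["units"]).copy()
clean["revenue"] = clean["units"] * clean["unit_price"]

# Reshape: total revenue per region, indexed by the region label.
revenue_by_region = clean.groupby("region")["revenue"].sum()
print(revenue_by_region)

# Export the result as CSV text to illustrate the I/O helpers.
csv_text = revenue_by_region.to_csv()
print(csv_text)
```

The same few steps (load, clean, derive, aggregate, export) cover a large share of everyday data preparation before any model is trained.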
TensorFlow is, according to its inviting tagline, an open-source machine learning framework for everyone. The tool was initially developed for internal use at Google, but was released publicly under the Apache 2.0 open-source license on November 9, 2015, and has gained enormous popularity since then.
The framework is extremely powerful, which cuts both ways: you can do a lot with it, but using it for simple tasks may be huge overkill. One thing TensorFlow is great at is deep learning. One major example is RankBrain, an advanced keyword processing tool used by Google in its search engine.
PyTorch, a Python-centric deep learning framework, is all about flexible experimentation and customization, as well as strong GPU acceleration. Some example applications include word-level language modeling, time sequence prediction, and training ImageNet classifiers. A major advantage of this framework is its support for dynamic computation graphs (DCGs), which is a boon for uses such as linguistic analysis. As for its downsides, PyTorch is still fairly immature, so it may be more difficult to find information about it and to recruit experienced devs.
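The following sketch illustrates what a dynamic computation graph means in practice: the graph is rebuilt on every forward pass, so ordinary Python control flow can change its shape depending on the input. The `dynamic_net` function, its stopping threshold, and the step limit are hypothetical and chosen only to keep the example small.

```python
import torch


def dynamic_net(x: torch.Tensor, weight: torch.Tensor):
    """Apply `weight` a data-dependent number of times."""
    steps = 0
    # Plain Python control flow decides how deep the graph becomes:
    # keep transforming x until its norm reaches 10 (at most 5 steps).
    while x.norm() < 10 and steps < 5:
        x = torch.relu(x @ weight)
        steps += 1
    return x.sum(), steps


# A fixed weight matrix (2 * identity) keeps the example deterministic.
weight = (2.0 * torch.eye(3)).requires_grad_()

small_input = torch.tensor([1.0, 1.0, 1.0])
large_input = torch.tensor([4.0, 4.0, 4.0])

out_a, steps_a = dynamic_net(small_input, weight)
out_b, steps_b = dynamic_net(large_input, weight)
print(f"small input: {steps_a} steps, large input: {steps_b} steps")

# Autograd follows whichever graph was actually built for this input.
out_a.backward()
print(weight.grad.shape)
```

Because the number of loop iterations differs per input, each call records a different graph, which is convenient for variable-length data such as sentences.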
Keras, whose name is a fairly obscure reference to a passage from Homer’s Odyssey (and which means horn in Greek), is a Python deep learning library. Its self-touted advantages are user-friendliness, modularity, easy extensibility, and the ability to work with the well-loved programming language that is Python. Moreover, it supports doing computation with both CPUs and GPUs, and there is a whole bunch of applications made for it that you can use in your project.
Drawbacks? Some claim that Keras is hard to customize, that its data processing tools are subpar, and that its behavior with regard to low-level issues can range from quirky to incomprehensible if you lack the required advanced knowledge.
There are many ML libraries on the market. When deciding on which one to choose, you should consider the application (such as image recognition versus text analysis), your preferred programming language (C++ and Python are the popular options), the maturity of the framework, and community/corporate support (there are offerings from such industry giants as Apache or Google).
If the choice seems daunting – well, that’s because it is. Machine learning, although it was conceptualized in the late 1950s, has gained a lot of popularity only recently, mostly thanks to the newly-found abundance of data and a dramatic increase in the availability of cheap computational power. Still, it’s a paradigm that might be particularly hard to comprehend. If you’re looking for advice – or a team of experts to do the ML heavy lifting for you – don’t hesitate to get in touch. We’re always happy to share our knowledge or offer our expertise.