With artificial intelligence becoming more and more capable, interest in NLP (natural language processing) has also been growing rapidly. Developing and applying technology that helps machines understand and process human language has become a hot topic in many industries, because it has the potential to finally bridge the gap in human-computer interaction.
Gensim was designed specifically for semantic analysis and unsupervised topic modeling, using raw and unstructured digital (plain) text. It’s often used for discovering similarities in documents (by Mindseye, Amazon or 12k), exploring customer complaints (by CapitalOne), and detecting large-scale fraud (by Cisco). Plus, the Gensim Word2Vec module makes it a great framework for any machine learning processes that involve word embedding in NLP, such as document classification or processing academic publications.
The framework is praised for being efficient, clearly structured, fast, and scalable. It can easily handle large data sets and data streams. Aside from Word2Vec, Gensim also features efficient implementations of some very popular algorithms, like Latent Dirichlet Allocation (a method of topic modeling) or Random Projection (a method of dimensionality reduction). That said, Gensim is also a highly specialized framework, which means it’s not suitable for all purposes. So, if you’re looking for one general framework for everything, keep searching.
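To make the idea of "discovering similarities in documents" a bit more concrete, here is a minimal plain-Python sketch. It is not Gensim's API: it swaps in simple word counts and cosine similarity as a stand-in for the vector models Gensim provides. The helper names and the sample complaint texts are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, documents):
    """Return (index, score) of the document closest to the query."""
    q = bag_of_words(query)
    scores = [cosine_similarity(q, bag_of_words(d)) for d in documents]
    best = max(range(len(documents)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical customer complaints, invented for this example.
documents = [
    "The card was charged twice for the same purchase",
    "The mobile app crashes when I open my statement",
    "I was charged a late fee even though I paid on time",
]
best, score = most_similar("App crashes on opening the statement page", documents)
print(best, round(score, 2))
```

A library like Gensim would typically replace the raw counts with weighted or learned vectors and use an index that scales to large corpora, but the comparison step follows the same principle.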
Scikit-learn (also known as sklearn) is a framework that offers an easy way of implementing regression, clustering, and classification for text data. Sklearn is great for classifying news articles into a number of predefined categories (such as politics, lifestyle, or sports) or for analyzing newsgroup posts on different topics. Many businesses use sklearn for NLP-related projects, such as PeerIndex (for classifying tweets) or Data Publica (for categorizing companies using their website communications).
Scikit-learn provides various algorithms for building ML models and intuitive classification methods. It is also thoroughly documented and extremely beginner-friendly, allowing engineers to see the effects of their work without having to spend too much time on it. On the other hand, other frameworks are better suited for more sophisticated NLP tasks such as part-of-speech tagging or text preprocessing.
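As a rough illustration of the news-classification use case, the sketch below trains a tiny scikit-learn pipeline (TF-IDF features plus a Naive Bayes classifier). The headlines and labels are invented for the example, and the choice of `MultinomialNB` is just one reasonable option, not a recommendation from the library authors. A real project would need far more training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training set, made up for this example.
train_texts = [
    "Parliament votes on the new budget bill",
    "The president met the opposition leader before the election",
    "The striker scored twice in the cup final",
    "The team won the match after extra time",
    "Ten easy recipes for a healthy breakfast",
    "How to decorate a small apartment on a budget",
]
train_labels = ["politics", "politics", "sports", "sports", "lifestyle", "lifestyle"]

# Turn text into TF-IDF vectors, then fit a Naive Bayes classifier on them.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

headlines = [
    "The goalkeeper saved a penalty in the final match",
    "Senators debate the election bill in parliament",
]
predictions = list(model.predict(headlines))
print(predictions)
```

Wrapping both steps in one pipeline means new text goes through exactly the same vectorization as the training data, which is an easy detail to get wrong when the steps are handled separately.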
SpaCy is considered to be the fastest NLP (and NLP-only!) framework in existence. It comes with a lot of pre-trained models to solve many problems straight out of the box. It can be used, for example, to identify money entities in news articles and extract both the value and the currency from the text. SpaCy has been leveraged by some big players on the market, such as Airbnb or Quora.
The framework is fairly specialized and comes with in-depth documentation and active support. It features built-in word vectors and is a more object-oriented library in comparison with others. SpaCy is also pretty easy to learn for newcomers, as it offers just a single highly optimized tool for each task, which leaves no room for hesitation. On the flip side, many developers complain about the lack of flexibility and the limited number of supported languages.
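To show what "extracting both the value and the currency" from a money mention can look like, here is a small sketch. Note the assumption: it uses a plain regular expression as a stand-in, not spaCy's pre-trained entity recognizer, so it only catches the simple symbol-plus-number patterns listed in its (hypothetical) lookup tables.

```python
import re

# Hypothetical lookup tables for this example; extend as needed.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}
MULTIPLIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

# A currency symbol, a number with optional thousands separators and
# decimals, and an optional scale word such as "million".
MONEY_PATTERN = re.compile(
    r"(?P<symbol>[$€£])\s?(?P<number>\d+(?:,\d{3})*(?:\.\d+)?)"
    r"(?:\s(?P<scale>thousand|million|billion))?"
)


def extract_money(text):
    """Return a list of (value, currency_code) pairs found in the text."""
    results = []
    for match in MONEY_PATTERN.finditer(text):
        value = float(match.group("number").replace(",", ""))
        scale = match.group("scale")
        if scale:
            value *= MULTIPLIERS[scale]
        results.append((value, CURRENCY_SYMBOLS[match.group("symbol")]))
    return results


article = "The startup raised $2.5 million, while its rival was fined €300,000."
print(extract_money(article))
```

A statistical model has the advantage of also recognizing mentions that no fixed pattern anticipates (for example, amounts written out in words), which is where a pre-trained pipeline earns its keep.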
Stanford CoreNLP is primarily written in Java, but it’s also accessible through multiple Python wrapper libraries, created and maintained by the Python community. CoreNLP integrates natively with other NLP libraries developed by Stanford and makes a great foundation for building microservices because it can be run as a web service. It’s a fast annotator for arbitrary text, definitely used a lot in production. On top of that, CoreNLP provides very accurate techniques for tagging and parsing.
This NLP toolkit is very flexible and easily extendable since it has APIs for most programming languages and provides support for a great number of the most popular human languages. It’s super useful when building complex NLP systems. On the other hand, the specific client-server architecture may not be very efficient for simpler projects or prototyping.
I know, I know… NLP is only one of many different purposes for which TensorFlow can be used. But as deep learning becomes more and more popular (not just within academic research but in production as well), the need for stable and capable frameworks also grows, and TensorFlow is definitely one of them. Moreover, it’s perfectly geared towards NLP. A good example here is the recent rise of recommender systems such as DeepCoNN or TransNet, which use NLP to model users by analyzing the product reviews they’ve written and to model products by analyzing the reviews each product has received. TensorFlow is also used for voice recognition and text-based applications, such as Google Translate.
TensorFlow comes with great documentation and a lot of useful guidelines. It features TensorBoard (a nicely designed visualization tool), supports distributed training, and provides model serving. However, despite being backed by a community of devs, the framework is not very newbie-friendly, and Python is the only fully supported language.
Natural language processing requires thorough consideration before you can even start to process anything. If you’re completely new to NLP and expect quick results, I would suggest betting on scikit-learn first, and then maybe moving on to more advanced frameworks.
But if you’ve got the basics covered and want to start working on a more sophisticated project, then CoreNLP or TensorFlow may be better options. During the selection process, it is also pretty important to know exactly what you’ll be doing within machine learning, because some libraries (like Gensim) are fairly specialized, and there’s no way you could use one of them for both determining document similarities and, for example, image recognition.