About this project
Netguru experts cooperated with CLARIN-PL and the Wrocław University of Science and Technology to create a system that would protect consumers by automatically detecting illegal clauses in terms of service agreements. The goal was to train a machine learning-powered Natural Language Processing (NLP) model proving that it is possible to automatically detect abusive clauses in legal agreements.
CLARIN (the Common Language Resources & Technology Infrastructure) is a research consortium that gathers organizations working on natural language processing.
Poland’s Office of Competition and Consumer Protection (Urząd Ochrony Konkurencji i Konsumentów) wanted to use CLARIN’s language processing capabilities to create a system that would automatically detect abusive clauses in legal agreements.
As part of CLARIN, Netguru collaborated with the Wrocław University of Science and Technology to create a model proving that it is possible to automatically detect illegal clauses in legal agreements to protect consumers from corporate abuse.
Goals and expectations
For almost every online service, it is necessary to agree to the provider’s terms and conditions. Many consumers don’t read these lengthy documents, and even when they do, they often lack the legal knowledge to judge whether the service provider can legally include a given clause.
Such abusive clauses are often worded almost identically to legal ones, with only minor differences. Hence, people end up signing agreements that they shouldn’t.
Only legal experts can detect illegal clauses by closely reading the documents. However, it is a time-consuming process that is prone to human error. So, in practice, the review mostly happens after the fact: the consumer files a complaint because they are already experiencing the consequences of signing an unfair agreement.
Poland’s Office of Competition and Consumer Protection wanted to create an automated process that would alert consumers by highlighting suspicious parts of the text.
This required creating a tool that can analyze the language of complex legal texts, detecting abusive clauses before the consumer signs the agreement.
Role of Netguru and services provided
Netguru collaborated on the project with the Wrocław University of Science and Technology (Politechnika Wrocławska), which is also part of the CLARIN-PL consortium. The goal was to create a system that would detect the abusive clauses.
In order to do this, we had to train a machine learning-powered Natural Language Processing (NLP) model to be able to classify contractual terms (parts of a document’s text) as abusive or valid. It is a binary classification task with a potentially huge imbalance of classes, with many more valid clauses than abusive ones in each document.
We were responsible for the scientific aspects of the project — dataset creation and NLP model training to validate the dataset and provide proof that it is possible to achieve the desired outcome.
How Netguru did it — approach to the project
The first step was dataset creation, a base for training the machine learning (ML) models.
We created a dataset containing both valid and abusive contractual terms. The Office of Competition and Consumer Protection provided examples of abusive clauses from its archive. Because more samples generally lead to a better model, data collection continued throughout the project.
The Office’s experts delivered new examples of abusive clauses as they went through more documents, and we continuously added these to the dataset.
Computers cannot interpret text the way humans do, so it has to be converted to numerical form. To find non-abusive clauses, we used NLP language models to represent the text as multidimensional vectors and then selected candidate clauses whose vectors were least similar to those of the abusive ones. The similarity measure was cosine similarity, which is widely used in NLP tasks.
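The selection step can be illustrated with a minimal sketch. It assumes the clauses have already been turned into vectors; the toy three-dimensional vectors, the helper name `pick_dissimilar`, and the 0.3 threshold are all hypothetical and are not taken from the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def pick_dissimilar(candidates, abusive, threshold=0.3):
    """Return indices of candidates that are far from every abusive vector.

    A candidate is kept when its highest cosine similarity to any abusive
    vector stays below `threshold` (an illustrative value).
    """
    selected = []
    for i, cand in enumerate(candidates):
        max_sim = max(cosine_similarity(cand, ab) for ab in abusive)
        if max_sim < threshold:
            selected.append(i)
    return selected


# Toy embeddings standing in for real language-model vectors.
abusive_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
candidate_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(pick_dissimilar(candidate_vecs, abusive_vecs))  # prints [1, 2]
```

The first candidate points in almost the same direction as the abusive vectors, so it is rejected. The other two are nearly orthogonal to them and are kept as likely valid clauses, which would then go to human verification.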
The valid clauses found by the model were additionally verified by the Office of Competition and Consumer Protection experts.
The final dataset had twice as many examples of valid clauses as abusive ones to represent a realistic proportion of clauses found in actual documents.
Once the dataset was ready, Netguru experts trained many different ML models, improving the results with each iteration.
- The programming language was Python.
- ML and data science frameworks used included scikit-learn, pandas, NumPy, TensorFlow, spaCy, PyTorch, transformers, and others.
- The classifiers were built on state-of-the-art language models. The best results came from transformer models for sequence classification, namely HerBERT, created by the Allegro research team.
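As a rough illustration of the transformer setup named in the list above, the sketch below loads a HerBERT checkpoint with a two-class sequence classification head through the transformers library. The checkpoint name `allegro/herbert-base-cased`, the label names, and the helper functions are assumptions made for this example; the project's actual training code and configuration are not described in this text.

```python
# Label mapping for the binary task (names chosen for this sketch).
ID2LABEL = {0: "valid", 1: "abusive"}

# Assumed public checkpoint; the project may have used a different variant.
MODEL_NAME = "allegro/herbert-base-cased"


def load_classifier(model_name=MODEL_NAME):
    """Load a tokenizer and a model with a two-class classification head."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(ID2LABEL),
        id2label=ID2LABEL,
        label2id={label: idx for idx, label in ID2LABEL.items()},
    )
    return tokenizer, model


def classify(clauses, tokenizer, model):
    """Return a predicted label for each clause in `clauses`."""
    import torch

    inputs = tokenizer(clauses, padding=True, truncation=True, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted = logits.argmax(dim=-1).tolist()
    return [ID2LABEL[i] for i in predicted]


# Example usage (downloads the checkpoint on first run):
# tokenizer, model = load_classifier()
# print(classify(["Przykładowa klauzula umowna."], tokenizer, model))
```

Note that the classification head of a freshly loaded checkpoint is randomly initialized, so its predictions are meaningless until the model has been trained on labeled clauses.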
Results of the cooperation
- Together with scientists from the Wrocław University of Science and Technology, the NLP team at Netguru was able to demonstrate that it is possible to create a system that could successfully identify abusive contract clauses in most cases. Our best model obtained a macro average F1 score of 0.87. F1 scores range from 0 to 1, with higher values being better, so we consider this a very good result.
- The dataset we used was passed on to the client to serve as a benchmark for the systems that will be created in the future.
- Furthermore, Netguru and the University will co-author a scientific paper to present our findings so that they can be used in future approaches to unfair contractual terms recognition.
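For readers unfamiliar with the macro average F1 score quoted in the results, here is a minimal sketch of how it is computed: F1 is calculated separately for each class and the per-class scores are averaged with equal weight, so the rarer class counts as much as the common one. The labels below are made-up toy data, not project results.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no correct positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels: 0 = valid clause, 1 = abusive clause.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print(macro_f1(y_true, y_pred))  # prints 0.625
```

In this toy case the valid class scores 0.75 and the abusive class 0.5, giving a macro average of 0.625. In practice the same figure can be obtained with scikit-learn's `f1_score(y_true, y_pred, average="macro")`.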