In a world full of communication across different mediums, contexts, and languages, understanding it all is a global challenge.
Many researchers working with large collections of text are turning to modern solutions such as CLARIN-PL and its Natural Language Processing tools.
In this episode of Disruption Talks, Jan Kocoń (Emotion Analysis Coordinator), Ewa Rudnicka (Polish-English WordNet Coordinator), and Maciej Piasecki (CLARIN-PL Coordinator) from CLARIN-PL tell us more. They discuss their roles, the challenges they face, what NLP is, and what CLARIN-PL is capable of.
What is Natural Language Processing?
Natural Language Processing, or NLP, is the automatic manipulation of natural language by software. Natural language is the way we communicate with each other through speech and text.
Natural Language Processing in machine learning aims to build machines that can understand and respond to text or voice data in much the same way humans do. This is a big challenge because of the complexity of language and human understanding, especially across multiple languages.
That’s where CLARIN-PL comes in. CLARIN, the Common Language Resources & Technology Infrastructure, is a pan-European scientific infrastructure that helps researchers work with large collections of text.
Robert Kostrzewski: We can assume that modern machine learning techniques work best for English texts. What are the tasks you can perform with Polish texts?
Jan Kocoń: All the tasks we do are designed to get the computer closer to understanding language the way that humans do. This typically begins with low-level text processing: splitting the text into sentences and words. Then we assign grammatical categories to words, such as noun or verb.
Those texts then move to a higher level of processing, which involves extracting information from them. One example is recognizing that Warsaw is the name of a city and New Zealand is the name of a country.
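The pipeline Jan describes, sentence splitting, tokenization, and then entity recognition on top, can be sketched in a few lines of plain Python. This is an illustrative toy, not CLARIN-PL's actual tooling: the regex splitter is naive and the gazetteer stands in for a trained named-entity recognizer.

```python
import re

def split_sentences(text):
    # Naive sentence splitter: break after ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Split a sentence into word tokens, dropping punctuation.
    return re.findall(r"\w+", sentence)

# A toy gazetteer standing in for a trained named-entity recognizer.
GAZETTEER = {
    ("Warsaw",): "CITY",
    ("New", "Zealand"): "COUNTRY",
}

def find_entities(tokens):
    # Scan for gazetteer matches of length 2, then length 1.
    entities = []
    i = 0
    while i < len(tokens):
        for span in (2, 1):
            key = tuple(tokens[i:i + span])
            if key in GAZETTEER:
                entities.append((" ".join(key), GAZETTEER[key]))
                i += span
                break
        else:
            i += 1
    return entities

text = "Warsaw is beautiful. I also visited New Zealand last year."
for sentence in split_sentences(text):
    tokens = tokenize(sentence)
    print(tokens, find_entities(tokens))
```

Real systems replace each of these steps with statistical or neural models, but the layered shape, low-level segmentation feeding higher-level information extraction, is the same.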
With CLARIN-PL, we have now developed most of the basic tools and services for Polish that already exist for English. We also prepare our own general-purpose language models based on deep neural networks, dedicated to Polish.
The grammar rules in Polish are quite complex. Is that a challenge as well?
Jan Kocoń: This was a big problem when we were developing these tools. At the beginning of our work with Polish, we had to create solutions that could handle its loose syntax. The looser the syntax, the more freedom you have in constructing a sentence, but that also means the tool needs more rules to understand it.
What is the difference between multilingual models and language-agnostic models?
Jan Kocoń: The main difference is that multilingual models are trained in different languages. But they cannot transfer knowledge in a simple manner from one language to another. Language-agnostic models, on the other hand, are built using a different method.
Normally, there is a set of sentences, tens of thousands or even millions of them, which are translations of each other in different languages. So the major difference is that language-agnostic models map text into a shared semantic space. The benefit of this is that you can train your model in Polish and then use it for multiple languages.
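The idea of a shared space can be illustrated with a small NumPy sketch. The 4-dimensional vectors below are made up for illustration; in a real language-agnostic model they would come from an encoder trained on parallel sentences, so that translations land close together while unrelated sentences land far apart.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, near 0 for unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical sentence embeddings in a shared cross-lingual space.
embeddings = {
    "pl: Kocham psy":        np.array([0.90, 0.10, 0.20, 0.00]),
    "en: I love dogs":       np.array([0.88, 0.12, 0.18, 0.05]),
    "en: Stock prices fell": np.array([0.00, 0.90, 0.10, 0.40]),
}

query = embeddings["pl: Kocham psy"]
for label, vec in embeddings.items():
    print(label, round(cosine(query, vec), 3))
```

Because the Polish sentence and its English translation are near-neighbors in this space, a classifier trained only on Polish embeddings can be applied to embeddings of other languages without retraining.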
Ewa, could you introduce us to the WordNet project?
Ewa Rudnicka: WordNet is a kind of lexical-semantic database. It’s a machine-readable dictionary of words. The Polish WordNet is currently the largest in the world and perhaps the richest network.
It’s a bit like a thesaurus because words are grouped into synonym sets, but the relations go beyond simple synonymy. One word’s meaning can be more specific or more general than another’s, for example, types of animals or makes of cars.
A common misunderstanding is that it’s an automatic translation tool. Instead, it’s a Polish lexical resource that’s very corpus-driven. Relations between words are extracted directly from the information in texts, which is a unique method of matching words.
Thanks to the manual mapping to the English Princeton WordNet, we have a Polish-English resource. For some parts, we have direct translations or word equivalents, but others are more complex relations. It’s a great resource for translators, teachers, learners, and NLP specialists.
If it’s harder to analyze sentiment in some languages, what about other languages like French, German, and Spanish?
Jan Kocoń: There are a number of studies for these languages. But the major difference is that for each language, we have different resources. So it's pretty hard to compare them, because the complexity of the task usually depends on what kind of data we have and what kind of domain it is.
Some domains are more controversial. An opinion about a hotel being expensive, for example, is fine, but political discussion becomes trickier. It’s not hard from a language perspective, but it is from the other perspective.
CLARIN-PL also contributes to conferences and scientific papers. Could you explain a bit about that?
Maciej Piasecki: CLARIN is a research infrastructure, which means it facilitates research done by wide groups of scientists, particularly in the humanities and social sciences. Our mission is to help other people do science.
On the other side, we are a research group that has various projects we’re working on. Polish is a starting point, but when you are working in Natural Language Processing, there’s more of an emphasis on the practical dimension. We must also show that our method works for other languages, especially English.
How does Polish WordNet benefit Polish science?
Ewa Rudnicka: We have already supported a number of projects. One of them is a project funded by the National Science Center. It’s focused on the analysis of the Yiddish language in relation to English, Polish, and German.
One of the outcomes was a special dictionary of the Yiddish language linked with the Polish WordNet, the Princeton WordNet, and the German WordNet. It was a comparative and historical linguistics project, and the results were published in the International Journal of Lexicography.
What about the users of the Polish-English Wordnet? Could you elaborate on how people are using it?
Ewa Rudnicka: We currently have around 1,000 downloads of the PL WordNet database per year. It’s used both by individual users for personal projects and language learning, and by the big players.
PL WordNet is available on a fully open license for both scientific and business purposes, so it’s a part of online dictionaries and is quoted by Google Translate as an official source of data.
What’s the difference between me as a machine learning developer in NLP in business and your approach to your daily work?
Jan Kocoń: I think that the major difference is the different expectations in the science world and the business world. In science, what matters is the solution that gives you the highest score among available solutions. Business is often not as focused on quality alone. For example, a solution that is 80% accurate might be more attractive than one that is 83% accurate if it is, say, ten times faster.
At Netguru, we typically try to get people focused on a single project at a time, but what about CLARIN-PL?
Jan Kocoń: That’s the challenging thing because we deal with several projects along with our main project. We have a lot of tasks we’re dealing with simultaneously, so an important part of the job is switching between tasks and coordinating all of them.
However, by looking at problems from different perspectives and working with different scientific teams, you get fresh ideas from one solution and can apply them immediately in another. So that's actually very beneficial.
This discussion is part of our Disruption Talks recordings, where we invite experts to share their insights on winning innovation strategies, the next generation of disruptors, and scaling digital products. To get unlimited access to this interview and many more, sign up here: www.netguru.com/disruption/talks