Data Labeling - Best Practices for AI-Based Document Processing

Łukasz Ruczyński

Updated Jul 11, 2024 • 17 min read

Setting up an effective data labeling, or data tagging, service increases the potential performance and usefulness of your machine learning model. Sharpen your AI algorithms, claim back valuable time for your business, and develop a more convincing business strategy. Here’s everything you need to know.

Data labeling is essential to making any data preparation worthwhile. Your invoices, reports, documents, or any other text data can rarely be used by a machine learning model without undergoing the data labeling process first. Some machine learning models will suffer a significant loss in performance if data has not been labeled correctly, while others will be impossible to run at all.

Building a machine learning solution is often described as being data-centric. It transforms your data into actionable suggestions so that you can improve the performance of your business. But, to get the most out of the thousands of complex algorithms that are waiting in the wings to be used by your business, you need organized data sets.

This is where data labeling works hand-in-hand with modern AI models. Once you have trained your automatic data labeling system, the process becomes quicker and the results arrive faster.

What is data labeling?

Put simply, data labeling is the process of assigning desired information to each data sample.

Raw data, in the form of images, text, etc., is given informative labels according to its content and context to guide machine learning models. If, for example, a photograph shows a cat, or specific words are identified within a piece of text, a meaningful output can be returned by a machine learning algorithm after reading the corresponding data label.

The data tagging process typically starts by asking a development team to assign labels manually. This can range from simple binary options (a yes or no response to a question) to identifying the individual pixels where the specified object (e.g. a cat) can be seen. Once an example dataset has been prepared and labeled, a machine learning model can use it to learn how to process as-yet-unlabeled data to get the desired output.

In a nutshell, this is one of the main goals of labeling - to prepare high-quality training data for your model. It can be used further on, for instance in intelligent document processing.

Types of data labeling

The type of data labeling that you choose to implement will be guided by the machine learning model you intend to use on the data. Here are three of the most popular data annotation types, intended for image, text, and audio labeling.

Computer vision

Computer vision data labeling helps machines understand visual data. This can take four different forms:

  • Image classification - assigning visual data tags (binary or multiple) according to their content
  • Image segmentation - isolating objects from their backgrounds; this enables detection of all images within a dataset that contain a specified object
  • Object detection - highlighting objects within images using rectangular bounding boxes; multiple different objects can be highlighted and labeled within each image
  • Pose estimation - interpreting the pose or expression of a person in an image by detecting and correlating key points.
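To make the difference between these label types more concrete, the sketch below shows one possible way to store them. The class and field names are assumptions made for illustration, not the format of any particular labeling tool. Segmentation masks are left out because they are usually stored as per-pixel arrays.

```python
from dataclasses import dataclass, field


@dataclass
class BoundingBox:
    """Object detection label: a class name plus a rectangle in pixels."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


@dataclass
class ImageAnnotation:
    """All labels attached to a single image (hypothetical schema)."""
    image_path: str
    # Image classification: one or more tags for the whole image.
    tags: list[str] = field(default_factory=list)
    # Object detection: zero or more labeled rectangles.
    boxes: list[BoundingBox] = field(default_factory=list)
    # Pose estimation: named key points as (x, y) pixel coordinates.
    keypoints: dict[str, tuple[int, int]] = field(default_factory=dict)


# Made-up example: one image tagged "cat" with a single bounding box.
annotation = ImageAnnotation(
    image_path="cat_001.jpg",
    tags=["cat"],
    boxes=[BoundingBox("cat", x_min=40, y_min=30, x_max=240, y_max=180)],
)
print(annotation.boxes[0].area())
```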

Natural language processing (NLP)

Natural Language Processing is the analysis of human language and speech. The abilities of NLP have greatly improved thanks to AI and deep learning. NLP can take the form of:

  • Entity annotation and linking - identifying and tagging names within text while distinguishing nouns from verbs, prepositions, and so on; entities can then be linked to data repositories and meaning within a text can be clarified
  • Text classification - assigning labels to blocks of text as a whole, rather than to individual words; labels can be determined by sentiment or topic
  • Phonetic annotation - analyzing where commas, stops, and other punctuation marks are used in the text to influence meaning.

Audio processing

Audio processing first identifies and tags all background noise in an audio file. Then, it develops a transcript of the recorded speech with the help of NLP algorithms. This data can also be used to help with speaker identification models and linguistic tag extraction.

Labeled vs unlabeled data

Data without additional information is by definition unlabeled. Not having any additional information doesn’t make the data worse; in fact, it is much easier to acquire and store as it is cheaper and less time-consuming to create. The distinction between labeled and unlabeled is clearer when you first consider the machine learning model you intend to use.

Some models cannot be trained on unlabeled data, whereas others can. But, on the whole, the vast majority of models need labeled data, which means that the data labeling process is indispensable. Labeled data helps businesses derive actionable insights, while unlabeled data can be used to reveal new data clusters that can then be meaningfully interpreted as they are.

Every machine learning model needs to process data to gain its predictive power. Models that work using labeled data are called supervised learning algorithms, while the ones based on unlabeled data are unsupervised learning algorithms.

To have a better idea of what kind of problems these different types of algorithms can solve, have a look at the examples below.

Examples of supervised vs unsupervised machine learning problems

Supervised learning:

  • Is there a car in the picture?
  • How many people are in the picture?
  • What is the answer to a question?
  • Is a stock going to go up or down?
  • How much does a house cost?
  • Find all organizations in the text
  • Translate this sentence into French

Unsupervised learning:

  • Signal if there is some anomaly in stock price
  • Split these documents into coherent categories
  • Learn grammar of the language

As you can see from the examples of NLP below, many problems require data to be labeled. In both cases, NLP intelligently processes the text and provides an answer or solution. Here are a few examples.

Question answering

Task: Finding an answer to a question (question answering; the answer is the labeled part of the text)
Question: What is a balance sheet?
Answer delivered by the NLP model: A balance sheet is a financial statement that reports a company's assets, liabilities, and shareholder equity. The balance sheet is one of the three core financial statements that are used to evaluate a business.

Named entity recognition

Task: Finding an organization in a text (named entity recognition; the answer is the labeled part of the text)
Answer delivered by the NLP model: Facebook doubled its revenue last year.

Labeling can be done manually or automatically. In the case of manual labeling, the labeler inspects every piece of data and tags it accordingly, using data labeling software.

Suppose that your business operates in a supply chain. Its role is to accept orders from different clients and then submit the orders to the hub closest to the desired destination. As there are lots of different orders and lots of different products, no single employee can know the full list of products by heart.

Checking each order against the database is time-consuming. So, we need to automatically extract information about brand, product number, and some additional characteristics from text in order to speed up the process.

For example, take the following text: “Apple iPhone 12 Pro Max - 256GB - Blue”.

In your database, you have many orders that may be almost identical. The task of a human annotator is to highlight the desired information in the text.
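As a rough illustration, the highlighted information could be stored as character-offset spans over the raw text. The label names (BRAND, MODEL, STORAGE, COLOR) and the `to_spans` helper are assumptions made for this sketch, not the schema of any particular labeling tool.

```python
text = "Apple iPhone 12 Pro Max - 256GB - Blue"

# Hypothetical label names; a real project would define its own schema.
labeled_values = [
    ("BRAND", "Apple"),
    ("MODEL", "iPhone 12 Pro Max"),
    ("STORAGE", "256GB"),
    ("COLOR", "Blue"),
]


def to_spans(text, labeled_values):
    """Convert (label, value) pairs into character-offset spans."""
    spans = []
    for label, value in labeled_values:
        start = text.find(value)  # first occurrence only, enough for this toy case
        if start == -1:
            raise ValueError(f"{value!r} not found in text")
        spans.append({"label": label, "start": start, "end": start + len(value)})
    return spans


spans = to_spans(text, labeled_values)
for span in spans:
    print(span["label"], repr(text[span["start"]:span["end"]]))
```

Storing offsets rather than the highlighted strings alone keeps each label tied to an exact position, which matters when the same word appears more than once in an order.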

Order before labeling:


Order after labeling:


An important question arises: how many samples should be labeled in order to get good, reliable results? As you might expect, there is no single correct answer.

It’s dependent on a number of technical details, but you should ensure you have as much data at your disposal as possible. Explained below are some general guidelines that can help you navigate your specific case.

Best practices of labeling data for machine learning

Data labeling might seem simple at first, but finding the right approach for your case can make it harder to implement than expected. Before you begin, assess the size, scale, and length of the data labeling project for your specific case. Then follow the key steps outlined below to set you on the road to success.

Data labeling process

The role of data is twofold. The first role is to let an algorithm fit its parameters. The second role is to measure the performance of the trained model.

These two distinct roles mean that your data should be separated into two sets: a training set and a test set. Usually, a test-to-training proportion of 10:90 or 20:80 is applied, where the larger chunk of data corresponds to the training set.

In deep learning, this proportion is more flexible. The general rule of thumb is that the more data you have, the smaller proportion of it needs to be set aside for testing. Start by focusing your labeling efforts on the test data.
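The split described above can be sketched in a few lines of plain Python. The function name, the fixed seed, and the toy document IDs are assumptions for illustration; libraries such as scikit-learn offer their own splitting utilities.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle samples and split them into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


documents = [f"doc_{i}" for i in range(1000)]
train, test = train_test_split(documents, test_fraction=0.2)
print(len(train), len(test))  # 800 200
```

Fixing the seed means the same documents always end up in the test set, so scores from different training rounds stay comparable.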

Choose the right data samples and data labeling software

It is rarely possible to label all your data, so you need to single out the best samples. They should be diverse; using the same example multiplied hundreds of times won’t bring anything new, and your algorithm will very quickly stop learning.

For example, if your documents belong to many categories and the final processing will be applied to all of them, then it is advisable to have some representatives from each category.
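One simple way to get representatives from each category is to sample a fixed number of documents per category. The sketch below assumes the category of each document is already known (for example from metadata); the function name and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def pick_samples_per_category(documents, per_category, seed=0):
    """Pick up to `per_category` documents from every category.

    `documents` is a list of (doc_id, category) pairs.
    """
    by_category = defaultdict(list)
    for doc_id, category in documents:
        by_category[category].append(doc_id)

    rng = random.Random(seed)
    selected = []
    for category in sorted(by_category):
        doc_ids = by_category[category]
        # Small categories contribute everything they have.
        k = min(per_category, len(doc_ids))
        selected.extend((doc_id, category) for doc_id in rng.sample(doc_ids, k))
    return selected


# Toy pool: many invoices, few reports and contracts (made-up data).
pool = (
    [(f"invoice_{i}", "invoice") for i in range(500)]
    + [(f"report_{i}", "report") for i in range(40)]
    + [(f"contract_{i}", "contract") for i in range(5)]
)
to_label = pick_samples_per_category(pool, per_category=20)
print(len(to_label))
```

Sampling purely at random from this pool would yield mostly invoices; capping each category keeps the rare ones represented.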

The second thing to consider is the choice of data labeling software. There are many commercial products that might speed up your work, especially in the case of computer vision, where the object you are labeling is an image. Many tools can automatically detect the object to be selected once you give them a hint about where it is.

A good example is V7. Alternatively, there are open-source products that provide good performance; for example, at Netguru, we successfully use Label Studio for the purpose of labeling data for NLP projects.

Measure your model’s performance

Labeling can be split into batches. The performance of your model will depend on how much labeled data you have, and how quickly you obtain it depends on how many labelers are working on it.

As your model learns, it may require additional custom data sets. For example, if you label 1000 samples and use them for model training, and then you do the same for another 1000 samples (2000 samples in total), you may still see a considerable improvement in the model’s performance. This means that it is still hungry for data and has a lot to learn.

At some point, the improvement is no longer as impressive. Given that labeling requires time and money, you might decide to stop labeling when the margin of improvement becomes negligible. If you remain unsatisfied with the performance, it is the role of Machine Learning Engineers, acting as the human-in-the-loop (HITL), to figure out why a certain threshold cannot be surpassed. HITL brings human judgment into fine-tuning your model.
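The stopping decision can be written down as a simple rule: keep labeling while the latest batch still improves the test score by more than some threshold. The function name, the threshold of 0.01, and the scores below are assumptions chosen for illustration.

```python
def should_keep_labeling(scores, min_gain=0.01):
    """Return True if the latest batch of labels still improved the model.

    `scores` holds one test-set score per training round, where each round
    adds another batch of labeled samples.
    """
    if len(scores) < 2:
        return True  # not enough rounds to judge yet
    return scores[-1] - scores[-2] >= min_gain


# Made-up F1 scores after training on 1000, 2000, 3000 and 4000 samples.
f1_by_round = [0.62, 0.74, 0.79, 0.795]

print(should_keep_labeling(f1_by_round[:2]))  # gain of 0.12
print(should_keep_labeling(f1_by_round))      # gain of 0.005
```

In practice the threshold is a business decision: it weighs the cost of labeling another batch against the value of the extra performance.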

There might be a need to change your model or approach. You might consider some changes in test data size or quality. If making these changes helps, you might go on to label more training data.

Benchmark solutions

In every machine-learning project, evaluating your solution against the benchmark is a necessary step. You need to know what results can be obtained, and expected, from as little effort as possible. In the case of NLP projects, you have thousands of pre-trained models at your disposal, against which you can check how well your own model needs to perform.

Of course, there might be some structural requirements. There might not be a model that extracts exactly the information we need. But even if it is the case, you should have a look at how close you can get to your desired end-point with ready-to-use models. Measure the performance of these models, and use this as a reference point for your initial iterations.
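A minimal sketch of such a comparison, assuming an organization-extraction task: both a ready-to-use model and your own model are scored against the same labeled test set with precision, recall, and F1. The model outputs below are made up; in a real project they would come from running each model on the test texts.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of predicted entities against the labeled (gold) set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Labeled test data: (text_id, organization) pairs (made-up example).
gold = {(1, "Facebook"), (2, "Netguru"), (3, "Google"), (3, "Amazon")}

# Hypothetical outputs of an off-the-shelf model and of your own model.
baseline_output = {(1, "Facebook"), (3, "Google"), (3, "French")}
own_model_output = {(1, "Facebook"), (2, "Netguru"), (3, "Google")}

for name, output in [("baseline", baseline_output), ("own model", own_model_output)]:
    p, r, f1 = precision_recall_f1(output, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The baseline score is the number your own model has to beat before further labeling and training effort is justified.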

Work organization

The number of human annotators affects the pace of labeling roughly linearly, as it is a task that can be easily divided. It might be good practice to have at least two labelers for tasks where the labeling is loosely defined, for example in the case of translation.

In some cases, you might want to have more than one person label the same sample, just to see if they agree and measure how flexible each label should be: it’s important to remember that a human annotator might introduce their own bias.
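One common way to quantify whether two annotators agree is Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. The text does not prescribe a specific metric, so treat this as one reasonable option; the annotator labels below are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, given each annotator's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["invoice", "invoice", "report", "report", "invoice", "report"]
annotator_2 = ["invoice", "report", "report", "report", "invoice", "invoice"]
print(round(cohen_kappa(annotator_1, annotator_2), 2))
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance. A low value usually signals that the labeling instructions need to be made more precise.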

External data labeling services

There are many companies that can cover data labeling tasks, if provided with exact and exhaustive instructions. This could be a good option if you don’t want to dedicate the time of your development team to the project. One of the cheapest options is to use Amazon Mechanical Turk, where workers from across the world can offer their services.

Why label data in the first place?

The simple reason for labeling data is this: labeling can often dictate whether your initial problem can be solved at all, and almost always helps you to increase the performance of your solution. Data labeling gives you greater flexibility in building machine-learning algorithms specifically for your business’ Business Intelligence (BI) and analytics purposes, and increases the value of their results.

But, if you have less training data at your disposal, or want to reduce the amount of training your model needs, there is another option: pre-trained models. It is, however, important to know the risk of using these models from the outset: potentially worse performance and output.

Many models are already pre-trained on huge amounts of data. The best example is Google’s BERT (Bidirectional Encoder Representations from Transformers), which comes with pre-trained parameters. It has already inspected millions of samples of data, and has therefore learned much of the grammar and syntax of the language.

But, BERT does not have the specific knowledge required for your business. The process of re-training a pre-existing model, but this time with your data, is called fine-tuning. Fine-tuning is generally quicker, less expensive, and less data-intensive than starting from scratch.

There’s also a theoretical reason for using your data to fine-tune: with your data, working for your business, the model should not, in principle, decrease in performance on your task. If you’re an advanced reader, you can find more detail in this research paper.

Get started with data labeling

Data labeling should always be considered as an option. The process of labeling is not difficult, particularly when you’ve considered the points outlined in this article. Whether you choose to devise your own model, or simply re-train a pre-existing one, data labeling can help you identify those unique quirks in your business data and leverage them in a powerful way.

Fine-tuning a model over time typically increases performance, and helps you bring even a pre-trained data labeling tool up to the desired standard. There are many tools available to help you, as well as specialists in NLP projects.

So, now there is no excuse: use your data; it will make all the difference.

Łukasz Ruczyński

Łukasz works as a Machine Learning Engineer at Netguru.
