8.8.2019 · 20 min read · Scientific

Revealing Sensitive Documents with Entity Recognition


In this blog series, I share an overview and our team’s hands-on experience with researching and implementing a system that evaluates the sensitivity of documents based on their textual content. Our approach is to identify sensitive entities of selected types, which makes this a problem of Named Entity Recognition (NER).

This part gives an overview of the approaches applicable to Entity Recognition. We also focus on data collection for NER, which many find the most challenging part of practical applications.

In the upcoming second part, we’ll guide you through integrating the technologies described here into a functional product that evaluates the sensitivity of a textual document.

Named Entity Recognition

Given a sequence of words, NER methods aim to identify subsequences whose words represent an entity of a given type. What makes the problem particularly interesting is that the set of true entities is not finite, and the set of candidates - all subsequences of words in a given text - is extremely large.

The following example illustrates the challenge of detecting addresses (green) and organization names (blue) - two classic types of entities - in a text:

We see that the format of addresses is not consistent across languages, and a qualified human reader relies mostly on knowledge of the names of organizations and locations when separating relevant and irrelevant segments of the text.

We continue with a short survey of methods that are widely used for NER and that we have tried ourselves. We order them by complexity, from the simplest, most intuitive and naive ones to the complex but expressive neural models.

Word n-gram classification

We fix a window size n and extract from the text all subsequences of length n: for n=2 we get {[Sender: Planet], [Planet Express], [Express Limited], ...}. We then classify each n-gram into one of the selected entity types, e.g. by its most common representative in the training data.

This approach, though easy and fast, needs a huge training set to cover every possible n-gram in a text at least once. Optionally, we can introduce the Levenshtein distance between n-grams and classify each one into the closest category. This would, however, work poorly with e.g. new company names or addresses. Additionally, it would be tricky to identify entities of a length other than n.
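A minimal sketch of the idea follows below. The lookup table and threshold are illustrative assumptions, not part of our system; a real deployment would use a large tagged corpus and a proper Levenshtein implementation instead of Python’s difflib ratio.

```python
# Sketch: classify word bigrams by fuzzy-matching them against a tiny lookup
# of known entities. `known_entities` and the 0.8 threshold are illustrative.
from difflib import SequenceMatcher

known_entities = {
    "Planet Express": "ORG",
    "New New York": "LOC",
}

def ngrams(words, n=2):
    """Return all word n-grams of a token sequence."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def classify(ngram, threshold=0.8):
    """Return the tag of the closest known entity, or 'O' if nothing is close enough."""
    best_tag, best_score = "O", 0.0
    for entity, tag in known_entities.items():
        score = SequenceMatcher(None, ngram.lower(), entity.lower()).ratio()
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag if best_score >= threshold else "O"

text = "Sender : Planet Express Limited , New New York"
for gram in ngrams(text.split(), n=2):
    print(gram, "->", classify(gram))
```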

Conditional Random Fields

The probabilistic method with such a fancy name is essentially a conditional probability model over words and their entity tags. Knowing that we have seen “Planet” tagged as ORG at the previous position, we are interested in the chances that the tag will also be ORG for the following word, “Express”.

Is this chance higher than the chance that the tag will be something else, e.g. LOC, or no tag at all (marked as O)? We estimate the probability of each tag X from our training data.

You might argue that there will be plenty of word pairs that have not been seen in the training data and only appear later, and you are right: the method is quite demanding on the size of the training data.

We can ease this demand by “relaxing” the condition in the probability: if we have not seen a word pair in a tagged sequence before, we look only at the tags, not the words.

So, if we tagged the previous word as ORG, we are asking: what is the chance that the next tag will again be ORG? That’s easy: we estimate P(ORG | previous ORG) as the number of times an ORG tag is followed by another ORG tag in the training data, divided by the total number of ORG tags.
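To make the counting concrete, here is a minimal sketch of estimating such tag transition probabilities from a tiny hand-tagged corpus. The two example sentences are illustrative, not taken from our data set.

```python
# Sketch: estimate P(next tag | previous tag) by counting tag transitions
# in a (toy) tagged corpus.
from collections import Counter, defaultdict

tagged_sentences = [
    [("Sender", "O"), (":", "O"), ("Planet", "ORG"), ("Express", "ORG"), ("Limited", "ORG")],
    [("Invoice", "O"), ("from", "O"), ("Planet", "ORG"), ("Express", "ORG")],
]

transitions = defaultdict(Counter)
for sentence in tagged_sentences:
    for (_, prev_tag), (_, next_tag) in zip(sentence, sentence[1:]):
        transitions[prev_tag][next_tag] += 1

def p_transition(prev_tag, next_tag):
    """Probability of `next_tag` given `prev_tag`, estimated by counting."""
    total = sum(transitions[prev_tag].values())
    return transitions[prev_tag][next_tag] / total if total else 0.0

print(p_transition("ORG", "ORG"))  # chance that ORG is followed by another ORG
```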

This concept remained the gold standard for text tagging for many years, and even the best-performing sequential neural models that we describe below still use conditional probabilities internally.

Regular Expressions

If the entities always follow the same pattern or a set of patterns, we have already won, since we can use Regular Expressions. The best part is that we do not even need to care about training data, as we construct the patterns ourselves. This approach was indeed the most successful one for several entity types: email addresses, phone numbers, account numbers.

In practice, however, the parsed texts are dirty, and regular expressions often break on a single extra space or comma. Additionally, as we make the expressions more robust, we usually cannot avoid including more false positives.

Including the corner cases while not ruining the precision often leads to over-complicated patterns. Note the one we developed to match phone numbers, one of the most regular entities:
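The actual pattern is not reproduced here; as an illustration of how such expressions grow, a simplified sketch for international phone numbers might look like this (an assumption for demonstration, not the pattern from our system):

```python
import re

# Illustrative, simplified phone-number pattern: an optional international
# prefix followed by further digits separated by optional spaces, dots or dashes.
PHONE_RE = re.compile(
    r"(?<!\w)"              # not preceded by a word character
    r"(?:\+|00)?\d{1,3}"    # optional country prefix, e.g. +420 or 00420
    r"(?:[ .\-]?\d){6,12}"  # 6-12 further digits with optional separators
    r"(?!\w)"               # not followed by a word character
)

text = "Call us at +420 123 456 789 or 212-555-0142, office no. 42."
print(PHONE_RE.findall(text))  # ['+420 123 456 789', '212-555-0142']
```

Even this toy pattern already needs lookarounds and digit-count limits to avoid matching things like “office no. 42”, and it still misses many real-world formats.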

Recurrent Networks (RNN)

A concept that has brought a new wind of progress into the field of NER in recent years lies in the newly attained ability to train neural networks containing inner loops. Such architectures allow the network to remember the history of the text read so far together with its tags, and thus to capture relations that span several words.

For example, in “Sender: Planet Express Limited” the network still remembers the intro word “Sender” when reading the word “Limited”. Further, it can learn that an ORG entity often ends with a comma, if that is usually the case.

Fig.: Recurrent network architecture, with recurrent cells (e.g. LSTM) in the middle layer. Taken from Huang et al.
Fig.: Long Short-Term Memory (LSTM) cell of the recurrent network.

Recurrent networks are highly expressive models able to capture quite complex sequential relations. However, they also inherit the curses of all neural network models: they are sensitive to under- or overfitting when not trained for the right amount of time, as the number of trained parameters is by far higher than the number of training samples. They are difficult to debug, so even dully repetitive, simple errors must often be fixed in post-processing. In general, they require a lot of training data to avoid overfitting, although to our surprise, in our use case they could capture the easier conditional or word-pattern dependencies from as few as about 100 occurrences.
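To make the architecture from the figure above concrete, here is a minimal sketch of a bidirectional LSTM tagger in Keras. The vocabulary size, sequence length, layer sizes and the random dummy data are illustrative assumptions, not the values used in our system.

```python
# Sketch of a BiLSTM sequence tagger: word ids -> embeddings -> BiLSTM -> per-word tag.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, NUM_TAGS = 10_000, 50, 5  # e.g. tags O, ORG, LOC, ADDR, ...

inputs = tf.keras.Input(shape=(MAX_LEN,))                       # padded word ids
x = layers.Embedding(VOCAB_SIZE, 100)(inputs)                   # dense word vectors
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)  # left + right context
outputs = layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax"))(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy data, only to show the expected shapes of inputs and targets.
X = np.random.randint(0, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, NUM_TAGS, size=(32, MAX_LEN))
model.fit(X, y, epochs=1)
```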

Sequential networks and word embeddings

In addition to a direct mapping of words onto the network inputs (e.g. via sparse one-hot encoding), the more common practice is to use dense semantic representations of words, called word embeddings. Word embeddings project a set of words into a space of a selected size. Such a projection respects the relative distances of words, so that similar pairs of words are projected as close to each other as possible, and dissimilar ones vice versa.

The similarity of words is a matter of great research attention. The well-known Word2Vec considers words similar if they share the same context. More sophisticated methods, like ELMo or BERT, consider, in addition to the context, the inner structure of words, such as uppercasing, punctuation or shared suffixes/prefixes. This is especially useful for languages with inner word logic, like German or the Slavic languages.
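As a minimal sketch, word embeddings of the Word2Vec kind can be trained with gensim (version 4 or newer is assumed here). The toy corpus below is illustrative; a real model would be trained on millions of sentences or downloaded pre-trained.

```python
# Sketch: train Word2Vec embeddings on a toy corpus and inspect them.
from gensim.models import Word2Vec

sentences = [
    ["sender", "planet", "express", "limited"],
    ["invoice", "from", "planet", "express"],
    ["delivery", "address", "new", "new", "york"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["express"].shape)        # a dense 50-dimensional vector
print(model.wv.most_similar("planet"))  # nearest neighbours in the embedding space
```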

How should you pick the entity detection method for your own use case? It is crucial to consider your volume of training data, the required production speed, interpretability, and the required accuracy. Additionally, each of the methods can perform very well if its flaws are compensated by other methods covering its weak points.

In [our approach](link), we’ve combined several of the mentioned methods to smooth out their flaws.

Data for NER

Training the more sophisticated methods in particular requires a large amount of text with associated entity tags. When selecting training data, we advise spending a decent amount of time considering the heterogeneity of the training texts – including most of the corner cases and balancing the dominant patterns of entities is crucial for the system to generalize properly.

If 90 % of your training texts come from one source, but you expect more diverse sources in production, your system will overfit the 90 % and will make significantly more errors in production on data dissimilar to that 90 %. There are methods to balance such categories, but first you need to identify the problem.

Whenever it is possible to estimate the characteristics of the production data, try to cover as much of its variance as possible in the training set. In our case, for example, it was clearly easier to cover the variance of ordinary administrative documents on corporate storage within a closed set of entity types than to try to learn any entity type in arbitrary text such as Shakespeare’s poetry.

Fig.: Whenever possible, estimate the variance and cardinality of patterns in your production environment and try to cover them in your training set, unlike in the illustrative example above.

Creative hacks to deal with a small dataset

From our experience, it seems that the inability to obtain a reasonably big data set within a reasonable amount of time and money scares off many aspirants from using machine learning at the very beginning, and not only in language processing.

Before burning huge amounts on your own data labelling, consider the following:

1. Look for similar data sets: maybe you can cover most of your production variance with publicly available data sets. Look into Kaggle Datasets, Google Dataset Search, or governmental public databases – these are especially useful for sensitive entity data, or for data sets in your target localization.

For example, in the case of Entity Detection, look for the biggest possible tagged corpora in your target languages, publicly available from universities or linguistic research institutes.

Further, in another use case of visual document classification where we had only tens of labelled samples, we trained a standalone classifier on a US administrative data set containing thousands of black-and-white administrative scans, and then transferred the pre-trained model to our classes. Since only a single adjacent layer of the network had to be trained, the model converged quickly and reached a usable error rate.

2. Try to transform your problem into a different task with more data sets available, or with easier labelling. With Entity Detection, you can transform the task into a classification of word n-grams, where a database might already exist for some types of entities that you can exact-match to obtain the ground truth. Conversely, for document classification, it might be useful to first detect entities in your documents and classify them based on the more informative entities they contain.

3. Use your own structured database: if you have been dealing with the same task in the past, there’s a good chance that you have the tags of older data in an SQL database or in application logs. In the case of Entity Detection, if a dedicated person has so far been manually identifying entity values, e.g. for categorization or search, consider full-text matching the identified entities back to the source text, and thus creating your own tagged training corpus with your own entities (see the sketch after this list).

4. Combine several sources of data into a single data set. Including more than one source of data makes your system harder to fit, but enforces better generalization. It might also allow you to reasonably increase the number of training epochs or to oversample the training data. In the case of our Entity Detector, we combined three data sources into the final data set: the Czech NER corpus (CNEC), an English Annotated Corpus for NER, and our own labelled data, to allow for bilingual support.
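Referring back to point 3, here is a minimal sketch of matching entity values from an existing database back into the source text to produce word-level tags. The database content and tagging scheme are illustrative assumptions.

```python
# Sketch: exact-match known entity values against tokenized text to create tags.
database_entities = {"Planet Express Limited": "ORG", "New New York": "LOC"}

def tag_text(text):
    """Return (word, tag) pairs by matching known entity values in the text."""
    words = text.split()
    tags = [(word, "O") for word in words]
    for value, tag in database_entities.items():
        value_words = value.split()
        for i in range(len(words) - len(value_words) + 1):
            if words[i:i + len(value_words)] == value_words:
                for j in range(i, i + len(value_words)):
                    tags[j] = (words[j], tag)
    return tags

print(tag_text("Sender: Planet Express Limited New New York"))
```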

It is rarely the case that all the entity types of your interest are present in the training corpus. When collecting tags for additional entity types, in your own or a pre-tagged corpus, it is tempting to use unsupervised methods like Regular Expressions. Beware that such approaches are likely to make the neural model overfit and mimic the errors of the worse-performing model. Rather, consider the aforementioned database tagging, or check for similar data sets elsewhere, e.g. on Kaggle. In general, always remain suspicious of overfitting, which may be the main enemy of production performance (as illustrated in the graph above).

Check out how, [in our case](link), we separately collected multiple sources of tags and combined them into unified entity matches – more on this in the upcoming second part.
