The main motivation of our research is to empower clients using Sodat Analytics, an analytical tool for inspecting, managing and protecting corporate storage. These shared, human-organized storages are often weakly organized, and the document formats they contain do not even allow indexing and searching their inner content. The sensitivity of such documents is thus known only to their owners, if they still remember the content at all.
Fig: Sensitive document example
We observe that what we perceive as the sensitivity of a text relies essentially on the presence of particular types of textual entities: personal full names, geographical addresses, company identifiers, and phone or personal numbers. Thus, we are dealing mostly with the problem of Named Entity Recognition (NER), which we introduced in the previous part.
We suggest combining the output of the neural NER detector with other available metadata that is independent of the tagged corpus. In particular, we “validate” and “extend” the matches of the neural tagger and of regular expressions, based also on matches against databases.
We have observed that the network rarely misses an entity completely. Rather, it fails to delimit the entity properly or produces false positives where a learned pattern turns up by chance: for example, a sequence of two or more Capital Case words often falsely triggers the Fullname tag. Apparently, this is difficult even for humans who do not know the relevant database of human names (e.g. for Indian names), especially in weakly contextual texts such as formal contracts or invoices:
“(...) aforementioned arrangement of Zoidberg & Young, Inc, New New York City, signed here by Aakar Achale, acclaimed Consulting Representative, and (...)”
Here, “Consulting Representative” fits both the context and the internal structure of a Fullname, but it can be eliminated as a false positive by consulting a database of given names. The same applies to “New New York City”; here, however, a database of geolocations can ensure the transition from the PER category to LOC. Similarly, a database of valid surnames can eliminate random uppercase false positives like the one in the first case.
Fig: Detection pipeline
The detection process integrates the result of neural entity recognition into a more robust pipeline by building a vertical for each word of the input. The vertical of one word consists of the assigned NER tag and the binary matches of the Regular Expression Matchers and Database Matchers.
Matching each word against ~1 million geolocations or ~1 million Czech surnames is quite costly using standard structures, but with static trie structures we get below 0.5 s for documents of 10k words when matching each word against three databases of 1 million entries each. We use DB Matchers analogously to match a selected set of keywords that we know must reside close to some entities.
Each entity type has its own implementation of the vertical matcher. The entity matcher has two conditions: a pivot condition and an incremental condition. As the names suggest, the pivot condition first selects the major candidate words, from which the entity candidate grows locally, one word at a time, in the directions (forwards, backwards, or both) that satisfy the incremental condition. The incremental condition tolerates a selected number of non-matching words, which it accepts while growing.
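The database matching described above can be illustrated with a minimal sketch of a dictionary-based trie. This is not the implementation used in the pipeline: the names `Trie` and `db_matches` and the tiny example databases are hypothetical, and a production version would load the full lists into a static, memory-efficient structure.

```python
class Trie:
    """Minimal dict-based trie for exact whole-word lookups (illustrative only)."""

    _END = "$end"  # multi-character sentinel key, cannot collide with a single character

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node


def db_matches(words, databases):
    """Return, for each word, a dict of binary matches against every named trie."""
    return [
        {name: int(word in trie) for name, trie in databases.items()}
        for word in words
    ]


# Hypothetical toy databases standing in for the ~1 million-entry lists.
surnames = Trie(["Achale", "Young"])
cities = Trie(["York", "Prague"])
flags = db_matches(["Aakar", "Achale", "York"], {"surname": surnames, "city": cities})
```

Each lookup walks at most as many nodes as the word has characters, so the cost per word does not depend on the number of database entries.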
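The pivot-and-grow behaviour can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual matcher: the function name `grow_entities`, the dictionary keys `ner` and `db_name`, and the reading of the tolerance as a number of consecutive non-matching words are all hypothetical choices made for the example.

```python
def grow_entities(vertical, pivot, increment, tolerance=0, backward=True, forward=True):
    """Grow entity spans outward from pivot words.

    vertical  -- list of per-word dicts (NER tag plus binary matcher flags)
    pivot     -- predicate selecting the words an entity may grow from
    increment -- predicate deciding whether a neighbouring word may be added
    tolerance -- consecutive non-matching words that may be bridged (assumption)
    Returns a list of (start, end) index pairs with end exclusive.
    """
    def extend(pos, step):
        last_ok, misses = pos, 0
        i = pos + step
        while 0 <= i < len(vertical) and misses <= tolerance:
            if increment(vertical[i]):
                last_ok, misses = i, 0
            else:
                misses += 1
            i += step
        return last_ok  # trailing non-matching words are never included

    spans, covered_until = [], 0
    for idx, row in enumerate(vertical):
        if idx < covered_until or not pivot(row):
            continue  # word already belongs to an entity, or is not a pivot
        start = extend(idx, -1) if backward else idx
        end = extend(idx, +1) if forward else idx
        start = max(start, covered_until)  # never overlap the previous entity
        spans.append((start, end + 1))
        covered_until = end + 1
    return spans


# Toy vertical for "signed here by Aakar Achale , acclaimed Consulting Representative"
words = ["signed", "here", "by", "Aakar", "Achale", ",", "acclaimed", "Consulting", "Representative"]
ner = ["O", "O", "O", "PER", "PER", "O", "O", "PER", "PER"]
name_db = [0, 0, 0, 1, 1, 0, 0, 0, 0]
vertical = [{"word": w, "ner": t, "db_name": d} for w, t, d in zip(words, ner, name_db)]

is_pivot = lambda r: r["ner"] == "PER" and r["db_name"] == 1
may_grow = lambda r: r["ner"] == "PER" or r["db_name"] == 1

strict = grow_entities(vertical, is_pivot, may_grow, tolerance=0)
loose = grow_entities(vertical, is_pivot, may_grow, tolerance=2)
```

With zero tolerance the candidate stops at the comma, so only “Aakar Achale” is returned, and “Consulting Representative” is never a pivot because it has no database match. With a tolerance of two, the same candidate would bridge the comma and “acclaimed” and absorb the false positive, which illustrates why the tolerance has to be chosen per entity type.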
- First, the word embeddings are retrieved for the words longer than one character. We do not apply any preprocessing, in order to preserve the words' formal meta-information. Then, ELMo embeddings are inferred. The ELMo model that we use is transfer-learned from a model trained on the full English Wikipedia corpus. We update the model with a NER-tagged corpus of Czech and English and a sample of our desired document types. We use a two-layer bidirectional network with LSTM units.
- Subsequently, we match the regular expressions for the applicable entities. We have found some entities (e.g. phone numbers) to be matched more accurately by regular expressions, and thus we combine the predictions of both into the words' vertical.
- Some entity matches can be verified by checking that they contain a reference to a real-world entity. Address entities are a good example: to be valid, each address must contain the name of a city. Where applicable, we verify these using Database Matchers.
- After the word vertical is created, we feed the vertical to each entity matcher.
- Once a candidate is separated, it is validated against a minimal and maximal length. This ensures that if multiple entities are located next to each other, they are split so that each pivot belongs to at most one entity.
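The steps above can be sketched as a small routine that assembles the word vertical and checks the length of a candidate. This is an illustration under assumptions: the NER tags are passed in as a stand-in for the output of the neural tagger, the regular expressions are hypothetical simplifications, and the databases are plain sets instead of tries.

```python
import re

# Hypothetical patterns; real ones would be tuned for each entity type.
REGEXES = {
    "phone": re.compile(r"^\+?\d{9,12}$"),
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
}


def build_vertical(words, ner_tags, databases):
    """Combine the NER tag, regex matches and database matches for every word."""
    vertical = []
    for word, tag in zip(words, ner_tags):
        row = {"word": word, "ner": tag}
        for name, pattern in REGEXES.items():
            row["re_" + name] = int(bool(pattern.match(word)))
        for name, entries in databases.items():
            row["db_" + name] = int(word in entries)
        vertical.append(row)
    return vertical


def valid_length(span, min_len, max_len):
    """Check a candidate (start, end) span, end exclusive, against length bounds."""
    start, end = span
    return min_len <= end - start <= max_len


words = ["Jan", "Novak", "+420123456789", "Brno"]
tags = ["PER", "PER", "NUM", "LOC"]  # stand-in for the neural tagger's output
dbs = {"surname": {"Novak"}, "city": {"Brno"}}
vertical = build_vertical(words, tags, dbs)
```

Each row of `vertical` then carries everything an entity matcher needs to evaluate its pivot and incremental conditions for that word.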
As you might notice, the implementation involves several empirical decisions. In particular, choosing the conditions for the Vertical Matchers is crucial for obtaining the most accurate result. These conditions, as well as the length post-conditions, could also be inferred automatically, since the set of conditions for each entity is finite. However, as we relied on a minimal amount of self-tagged texts, we found well-performing rules empirically, which is far more time-efficient than tagging a sufficiently large volume of our own texts.
We see that the selection of the rules depends significantly on the characteristics of the data set. For example, in potentially sensitive documents, full names are often followed by the address of the subject. Thus, we ensure the correct delimitation of this type by adding a condition:
PER.increment_condition = (NERtype is PER or Fullnames.DB is 1) and not LOC
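Expressed as a predicate over one row of the word vertical, the rule could look like the sketch below. The key names `ner` and `db_fullname` are hypothetical, and the trailing `not LOC` is read here as "the NER tag is not LOC", which is an assumption about the notation.

```python
def per_increment_condition(row):
    """Incremental condition for full names, evaluated on one word's vertical."""
    looks_like_name = row["ner"] == "PER" or row["db_fullname"] == 1
    is_location = row["ner"] == "LOC"  # assumed meaning of "not LOC"
    return looks_like_name and not is_location


name_word = {"word": "Achale", "ner": "PER", "db_fullname": 1}
city_word = {"word": "Brno", "ner": "LOC", "db_fullname": 1}
other_word = {"word": "signed", "ner": "O", "db_fullname": 0}
results = [per_increment_condition(r) for r in (name_word, city_word, other_word)]
```

The second row shows the purpose of the extra clause: a word that matches the name database but is tagged as a location stops the full name from growing into the address that follows it.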
Fig: The words' vertical is the key structure that allows combining all available information into a robust prediction. The vertical is the input for the rule-based Matchers of each entity.
Below is a confusion matrix of the aforementioned neural NER tagger on a stratified ⅕ test split containing a set of 20 contracts that we aim to classify as sensitive, together with a list of the entity types that we detected. The NUM type covers all the numerical types that we aimed to detect (birth numbers, phone numbers, VAT numbers, etc.) and serves only as secondary support for their regular expressions. An interesting observation is that emails, matched primarily by regular expressions, reach a decent performance with only ~55 samples of the category.
| | precision | recall | F1 |
| avg / total | 0.84 | 0.86 | 0.85 |
Note that the results for our own Czech data are still lower than for the original English corpus, where we reach an average precision of 0.82 and recall of 0.84. This suggests that further extension of our data sets would most likely boost the performance of the neural tagger.
Another counterintuitive observation is that adding more entity types, even ones not interesting for our use case, can increase the performance on other entities. For example, adding TIME, which has many occurrences in foreign corpora, has increased the quality of PER (full name) predictions, which often occur close to TIME in the sensitive texts. This is most likely an effect of the aforementioned Conditional Random Fields, which reflect the ordering of entity types.
The table above shows the vertical for a sample of the organic text of a Czech invoice, with the Named Entity tags as retrieved by our sequential tagger and the corresponding matches against our databases, in the form that is subsequently processed by the Incremental Matcher. Pivots for the found entities are coloured.