[LWM] NLP: Bag-of-words

September 28, 2021 / Alex Alderman

Welcome to another blog-isode of Learn with me — a weekly educational series by Gauss Algorithmic. We take cutting-edge technological concepts and break them down into bite-sized pieces for everyday business people. Today’s topic will be the second in our series on Natural Language Processing, or NLP.


Welcome back to Learn with me, where we break down complicated tech topics so simply that even a marketer like me can understand. 🤷‍♂️

We started our series on NLP last week, and we will continue by discussing two interconnected concepts: Bag-of-words and Text classification. Both of these concepts are used in spam filters and other forms of automated document sorting.

Let’s look at a (very real) email you might find in your spam folder.




Like we talked about last time, machine learning algorithms don’t typically process text without turning it into numerical data first. The bag-of-words method is one way to do that.

This process takes all the words in a body of text (for our example, we’ll just use the subject line of an email), and simply counts how many times each word appears. The result is a list of numbers called a vector.
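For the curious, here is a minimal sketch of that counting step in Python. The subject line is made up for illustration (it is not the email from this post), and the `bag_of_words` helper is a hypothetical name, not part of any library.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Count how many times each word appears in the text."""
    # Lowercase the text, then pull out runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical subject line, invented for illustration.
subject = "Hot singles want to meet you, meet hot singles now"
counts = bag_of_words(subject)

print(counts["hot"])    # 2
print(counts["hello"])  # 0, because "hello" never appears
```

Word order is thrown away on purpose: the "bag" only remembers how often each word showed up.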



Notice that a Bag-of-words vector can include words that AREN’T in the text (“hello”). This is because the absence of words can also be important when sorting, or classifying, a text.
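A small Python sketch of this idea, assuming a fixed word list (the vocabulary) chosen in advance. The vocabulary, the subject line, and the `to_vector` helper are all invented for illustration.

```python
def to_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    words = text.lower().split()
    # Words missing from the text still get a slot, filled with 0.
    return [words.count(term) for term in vocabulary]

# Hypothetical vocabulary and subject line, invented for illustration.
vocabulary = ["hello", "hot", "nights", "sexy", "meeting"]
subject = "sexy singles want hot hot nights"

vector = to_vector(subject, vocabulary)
print(vector)  # [0, 2, 1, 1, 0]
```

Because every email is measured against the same vocabulary, every vector has the same length, and the zeros for "hello" and "meeting" carry information too.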

Text classification

A spam filter will compare the bag-of-words vector of an incoming email to the profile of a typical spam email and the profile of a typical email you actually want. For example, the profile of a typical spam email might have many more instances of the word “sexy” than your average email that you want in your inbox.

(Your inbox experience may vary, we don’t judge. 😉)



Machine learning algorithms will go through a huge number of emails to learn the patterns that separate emails that belong in your inbox from those that belong in your spam folder. Bag-of-words is one method they use for this kind of text classification, but there are other methods too, which we’ll talk about next week.
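To make the idea concrete, here is a toy Python sketch. It is a deliberate simplification, not how a production spam filter works: it averages the bag-of-words vectors of a few labeled examples into one profile per folder, then assigns a new email to whichever profile is closer. The vocabulary, the example emails, and all function names are made up for illustration.

```python
import math

# Hypothetical vocabulary, invented for illustration.
VOCABULARY = ["hello", "hot", "nights", "sexy", "meeting", "invoice"]

def to_vector(text):
    """Bag-of-words vector: one count per vocabulary word."""
    words = text.lower().split()
    return [words.count(term) for term in VOCABULARY]

def average_profile(texts):
    """Average the vectors of many emails into one typical profile."""
    vectors = [to_vector(t) for t in texts]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def distance(a, b):
    """Straight-line (Euclidean) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(text, spam_profile, inbox_profile):
    """Pick the folder whose typical profile is closer to this email."""
    vector = to_vector(text)
    if distance(vector, spam_profile) < distance(vector, inbox_profile):
        return "spam"
    return "inbox"

# Tiny made-up training set; a real filter would learn from far more emails.
spam_examples = ["sexy hot nights await", "hot sexy singles nearby", "sexy nights tonight"]
inbox_examples = ["hello team meeting moved", "invoice for the meeting room", "hello here is the invoice"]

spam_profile = average_profile(spam_examples)
inbox_profile = average_profile(inbox_examples)

print(classify("hot nights with sexy singles", spam_profile, inbox_profile))   # spam
print(classify("hello can we move the meeting", spam_profile, inbox_profile))  # inbox
```

Real systems replace the "closest average" rule with a trained model, but the input is the same kind of word-count vector.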

The result? No hoot nights for me. 🥺



How can this help my business?

Businesses handle lots of different types of text, and machine learning algorithms can sort and process them automatically. This is called document automation. The cost of setting this up can quickly fall below the cost of the man-hours it replaces.

A few examples:
⚖️ Sort health records into low- and high-risk
✉️ Redirect info@ emails to the right team
💫 Organize YEARS of old documents

Tell us about your business use case and we will let you know how we can use machine learning and/or document automation to save you money.
