Welcome back to Learn with me, where we break down complicated tech topics so simply that even a marketer like me can understand. 🤷‍♂️
Today we’re going to do something a little different. I want to catch everyone up on today’s largest challenges for NLP translation models. We can dive into the technical aspects a bit more at a later date, but for now, let’s look at the top five machine translation issues NLP engineers are trying to solve.
#1 Vocabulary mismatch
Remember that, just like any other machine learning model, a translation model looks at lots of data (in this case, text) and recognizes patterns. And as with any other model, the data you put in largely determines what you get out. In other words: “Garbage in, garbage out.”
This means that the text you train the translation model on should be similar to the text you want it to produce. Want it to translate legal documents? Make sure to build the model on lots and lots of legal documents! You don’t want to train it on Twitter posts, or else you’re going to have a bad time during a contract review.
“Upon violation of the terms of this agreement, the contractual relationship between The Employee and The Company will be immediately terminated. lololololz get rekt!”
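If you want to put a rough number on "similar", one simple check is to count how many words in your target text never appear in your training text. Here's a minimal Python sketch of that idea. The function names, the crude tokenizer, and the tiny sample texts are all made up for illustration; a real project would use a proper tokenizer and far more data.

```python
import re

def tokenize(text):
    # Lowercase and split into words. Deliberately crude, for illustration only.
    return re.findall(r"[a-z']+", text.lower())

def oov_rate(training_texts, target_text):
    """Share of words in target_text that never appear in training_texts."""
    vocab = set()
    for text in training_texts:
        vocab.update(tokenize(text))
    words = tokenize(target_text)
    if not words:
        return 0.0
    unseen = [w for w in words if w not in vocab]
    return len(unseen) / len(words)

# Hypothetical mini-corpora standing in for real training data.
tweets = ["lol get rekt", "this band is the worst lol"]
contracts = [
    "the employee agrees to the terms of this agreement",
    "the contractual relationship between the employee and the company",
]
clause = "the agreement between the employee and the company is terminated"

print(oov_rate(tweets, clause))     # 0.6
print(oov_rate(contracts, clause))  # 0.2
```

The lower the number, the closer the training text is to what you actually want to translate. In this toy example the contract-style text wins easily.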
#2 Unknown symbols
We’re still at a stage in technology where an obscure emoji can blow up a translation model, especially if it’s one the model has never encountered before. It’s not the model’s fault … it just has no idea what to do with that information.
Common, everyday symbols and emojis are fine. But when you start messing around with some of the rarer ones, then you’re just playing with 🔥.
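To see why an unseen symbol is a problem, it helps to know that a model works with a fixed list of tokens it learned during training. Below is a minimal Python sketch of one common fallback: anything outside that list gets swapped for a generic "unknown" placeholder. The names `build_vocab`, `encode`, and `<unk>` are illustrative choices here, not any particular library's API, and many real systems split text into smaller subword or byte-level pieces instead.

```python
UNK = "<unk>"

def build_vocab(training_tokens):
    # Id 0 is reserved for anything the model never saw during training.
    vocab = {UNK: 0}
    for token in training_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Known tokens get their own id; everything else collapses to the UNK id.
    return [vocab.get(token, vocab[UNK]) for token in tokens]

vocab = build_vocab(["great", "job", "👍", "🔥"])
ids = encode(["great", "job", "🦩"], vocab)
print(ids)  # [1, 2, 0]
```

The flamingo ends up as id 0, the same as every other unknown symbol. The model can no longer tell what it was, so whatever meaning the emoji carried is gone before translation even starts.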
#3 Boring output
This one I don’t blame on the tech. Most people are bad writers. (I’m not complaining — it’s the main reason that I have a job. 😎) Again, the model is limited by the data it was trained with. That means it’s much more comfortable with common, boring sentences — and often that is what it will replicate.
Pro tip: Do you want better translation results? Type in short sentences. Use simple vocabulary. It makes it easier for the translation model. Yes, this is an example. 😉
#4 Avoiding catastrophic translations
Ok. This one sounds dramatic. 😱 A catastrophic translation error is when an incorrect translation completely changes the meaning of a sentence. Obviously it’s not great when you can’t understand a translation, but it’s a much bigger problem when you’re getting the wrong information without realizing it.
Source: “Wash your hands, or you will catch the coronavirus”
Translation: “Shake hands, or you will catch the coronavirus”
Source: “From that point, turn right and drive 20 kilometers”
Translation: “From that point, turn right and drive 20 miles”
Source: “They are the worst band!”
Translation: “Cold Play are the worst band!” (yes, this is 100% real 🤣)
#5 Resource management
Here’s an issue that most users would never think about: it can take up to 300 MILLION computations to correctly choose a single word in a translation. That’s a LOT of resources, especially for user-level hardware. And when you’re dealing with millions (and millions) of requests per day, well, even Google needs to set a limit.
You may have never thought of this if you’re not a developer, but it’s not enough to come up with JUST ANY solution to a technical problem. Your solution also needs to be resource-efficient enough to not blow up your servers or use up half the energy available on the planet. 🌎💥
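To get a feel for the scale, here's a quick back-of-the-envelope calculation in Python. Only the 300 million figure comes from above. The words per request and requests per day are numbers I made up for illustration, so treat the result as a rough sketch and not a measurement.

```python
COMPUTATIONS_PER_WORD = 300_000_000  # upper figure quoted above
WORDS_PER_REQUEST = 20               # assumption, for illustration only
REQUESTS_PER_DAY = 5_000_000         # assumption, for illustration only

def daily_computations(per_word, words_per_request, requests_per_day):
    # Total work per day = cost per word x words per request x requests per day.
    return per_word * words_per_request * requests_per_day

total = daily_computations(
    COMPUTATIONS_PER_WORD, WORDS_PER_REQUEST, REQUESTS_PER_DAY
)
print(f"{total:.1e} computations per day")  # 3.0e+16 computations per day
```

Even with these modest made-up numbers, that's 30 quadrillion computations a day. Shaving a little off the per-word cost adds up fast.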
When we help companies with translation solutions, one of the most important services we offer is acquiring the data they need. We know data. We love data. We know where to find data. 🔍
Want to integrate language translation into your project? We’d love to help! Write to us, and we can tell you how!