Lemmatization in NLP and Machine Learning – Built In

Lemmatization is one of the most common text pre-processing techniques used in natural language processing (NLP) and machine learning in general. Lemmatization is not that different from stemming. In both stemming and lemmatization, we try to reduce a given word to its root word. The root word is called a stem in the stemming process, and it's called a lemma in the lemmatization process. But there are a few more differences between the two than that. Let's see what those are.

Lemmatization is a text pre-processing technique used in natural language processing (NLP) models to break a word down to its root meaning to identify similarities. For example, a lemmatization algorithm would reduce the word better to its root word, or lemma, good.
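To make this concrete, here is a minimal sketch of that behavior using NLTK's WordNetLemmatizer. The article doesn't prescribe a particular library, so NLTK is just one common choice, and the sketch assumes the WordNet data is available (it downloads it if not).

    import nltk
    from nltk.stem import WordNetLemmatizer

    # Assumption: WordNet data may not be present yet, so fetch it quietly.
    nltk.download("wordnet", quiet=True)

    lemmatizer = WordNetLemmatizer()

    # The part-of-speech tag matters: "better" maps to "good" only when
    # it is treated as an adjective ("a").
    print(lemmatizer.lemmatize("better", pos="a"))   # good
    print(lemmatizer.lemmatize("running", pos="v"))  # run
    print(lemmatizer.lemmatize("flies", pos="n"))    # fly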

In stemming, part of the word is simply chopped off at the tail end to arrive at the stem of the word. Different algorithms decide how many characters have to be chopped off, but they don't actually know the meaning of the word in the language it belongs to. In lemmatization, the algorithms do have this knowledge. In fact, you can even say that these algorithms refer to a dictionary to understand the meaning of the word before reducing it to its root word, or lemma.
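The chopping behavior is easy to see with NLTK's Porter stemmer, one of several stemming algorithms; the word list below is just for illustration.

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    # The stemmer strips suffixes by rule; it has no dictionary,
    # so the results are not always real English words.
    for word in ["caresses", "ponies", "running", "flies", "studies"]:
        print(word, "->", stemmer.stem(word))
    # caresses -> caress, ponies -> poni, running -> run,
    # flies -> fli, studies -> studi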

So, a lemmatization algorithm would know that the word better is derived from the word good, and hence, the lemma is good. But a stemming algorithm wouldn't be able to do the same. There could be over-stemming or under-stemming: the word better could be reduced to bet, to bett, or simply retained as better. But there is no way stemming can reduce better to its root word good. This is the difference between stemming and lemmatization.
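You can see this difference directly by running both tools on the word better. The sketch below uses the Porter stemmer as the stemming example; other, more aggressive stemmers may chop further, but none of them will arrive at good.

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # assumption: WordNet not yet downloaded

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # The Porter stemmer leaves "better" as-is; a lemmatizer that knows
    # the word's meaning can map it to "good".
    print(stemmer.stem("better"))                   # better
    print(lemmatizer.lemmatize("better", pos="a"))  # good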


As you can probably tell by now, the obvious advantage of lemmatization is that it is more accurate than stemming. So, if you're dealing with an NLP application such as a chatbot or a virtual assistant, where understanding the meaning of the dialogue is crucial, lemmatization would be useful. But this accuracy comes at a cost.

Because lemmatization involves deriving the meaning of a word from something like a dictionary, it's very time-consuming. So most lemmatization algorithms are slower than their stemming counterparts. There is also a computational overhead for lemmatization; however, in most machine learning problems, computational resources are rarely a cause for concern.
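If you want to gauge the overhead on your own data, a rough comparison like the sketch below is enough. The word list is an arbitrary assumption, and the timings are illustrative only; they will vary with the corpus, the library versions and the hardware.

    import timeit
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # assumption: WordNet not yet downloaded

    words = ["running", "flies", "studies", "better", "corpora"] * 1000
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Time both approaches over the same word list.
    stem_time = timeit.timeit(lambda: [stemmer.stem(w) for w in words], number=10)
    lemma_time = timeit.timeit(lambda: [lemmatizer.lemmatize(w) for w in words], number=10)

    print(f"stemming:      {stem_time:.3f}s")
    print(f"lemmatization: {lemma_time:.3f}s")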


So, should you use stemming or lemmatization? I can't give a blanket answer to that question. Lemmatization and stemming are both much more complex than I've made them appear here, and there are a lot more things to consider about both approaches before making a decision. But I've rarely seen any significant improvement in the efficiency or accuracy of a product that uses lemmatization over stemming. In most cases, at least in my experience, the overhead that lemmatization demands is not justified, so it depends on the project in question. I do want to add a disclaimer here: most of the work I have done in NLP is text classification, and that's where I haven't seen a significant difference. There are applications where the overhead of lemmatization is perfectly justified, and where lemmatization would, in fact, be a necessity.
