Dealing With Noisy Labels in Text Data – KDnuggets

With the rising interest in natural language processing, more and more practitioners are hitting the wall not because they can't build or fine-tune LLMs, but because their data is messy!

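We will show simple yet very effective coding procedures for fixing noisy labels in text data. We will deal with two common scenarios in real-world text data: a catch-all category that mixes tickets from several other categories, and two or more categories that refer to the same topic and should be merged.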

We will use an ITSM (IT Service Management) dataset created for this tutorial (CC0 license). It's available on Kaggle from the link below:

https://www.kaggle.com/datasets/nikolagreb/small-itsm-dataset

It's time to start with the import of all libraries needed and basic data examination. Brace yourself, code is coming!
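As a minimal sketch of this first look at the data, the snippet below builds a tiny in-memory stand-in for the Kaggle file so that it runs on its own. The column names `Text` and `Category` and the example rows are assumptions made for illustration; with the real dataset you would load the downloaded CSV with `pd.read_csv` instead.

```python
import pandas as pd

# Tiny stand-in for the Kaggle CSV (rows and column names are assumptions).
# With the real file: df = pd.read_csv("<path to the downloaded CSV>")
df = pd.DataFrame(
    {
        "Text": [
            "Asana does not load my project board",
            "Asana startet nicht mehr",
            "Outlook does not connect, mails do not arrive",
            None,
        ],
        "Category": ["Asana", "Help Needed", "Outlook", "Mail"],
    }
)

print(df.shape)         # number of rows and columns
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # missing values per column
print(df.head())        # first few tickets
```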

Each row represents one entry in the ITSM database. We will try to predict the category of the ticket based on the text of the ticket written by a user. Let's examine more closely the most important fields for the described business use case.

If we take a look at the first two tickets, although one ticket is in German, we can see that the described problems refer to the same software, Asana, but they carry different labels. This is the starting distribution of our categories:
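A category distribution of this kind can be computed with `value_counts`. The sketch below uses a toy `Category` column (an assumed column name with made-up rows), so the numbers it prints are not those of the real dataset.

```python
import pandas as pd

# Toy labels for illustration only; not the real dataset.
df = pd.DataFrame(
    {"Category": ["Asana", "Help Needed", "Help Needed", "Outlook", "Mail"]}
)

# Absolute number of tickets per category, largest first.
category_counts = df["Category"].value_counts()

# Share of each category in the whole dataset.
category_share = df["Category"].value_counts(normalize=True).round(2)

print(category_counts)
print(category_share)
```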

The Help Needed category looks suspicious, like a category that can contain tickets from multiple other categories. Also, the categories Outlook and Mail sound similar; maybe they should be merged into one category. Before diving deeper into these categories, we will get rid of missing values in the columns of interest.
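Dropping rows with missing values only in the columns we care about can be sketched as follows, again assuming the columns are named `Text` and `Category` and using made-up rows.

```python
import pandas as pd

# Toy data with one missing text and one missing label.
df = pd.DataFrame(
    {
        "Text": ["Outlook crashes on start", None, "Asana login fails"],
        "Category": ["Outlook", "Mail", None],
    }
)

# Keep only rows where both the ticket text and its label are present.
clean = df.dropna(subset=["Text", "Category"]).reset_index(drop=True)

print(len(df), "rows before,", len(clean), "rows after")
```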

There isn't a valid substitute for examining data with the naked eye. The fancy function to do so in pandas is .sample(), so we will do exactly that once more, now for the suspicious category:
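A minimal sketch of sampling from a single category is shown here. The rows and the column names `Text` and `Category` are assumptions; fixing `random_state` only makes the draw repeatable.

```python
import pandas as pd

# Toy tickets for illustration only.
df = pd.DataFrame(
    {
        "Text": [
            "Discord keeps crashing, please help",
            "Cannot open the CRM dashboard",
            "Asana tasks are not syncing",
            "Need a new keyboard",
            "Outlook does not connect",
        ],
        "Category": ["Help Needed", "Help Needed", "Help Needed", "Help Needed", "Outlook"],
    }
)

# Filter to the suspicious category, then draw a few random tickets.
help_needed = df[df["Category"] == "Help Needed"]
sample = help_needed.sample(n=min(3, len(help_needed)), random_state=42)

# Print full ticket texts so nothing is truncated by the table display.
for text in sample["Text"]:
    print(text)
    print("-" * 40)
```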

Bundled problems with Office since restart:

Messages not sent

Outlook does not connect, mails do not arrive

Error 0x8004deb0 appears when Connection attempt, see attachment

The company account is affected: AB123

Access via Office.com seems to be possible.

Obviously, we have tickets talking about Discord, Asana, and CRM. So the category should be changed from Help Needed to existing, more specific categories. For the first step of the reassignment process, we will create a new column, Keywords, that records whether the ticket's Text column contains a word from the list of categories.
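One way to build such a Keywords column is sketched below. It is an illustration under stated assumptions, not necessarily the original code: the helper name `extract_keywords`, the toy rows, and the column names `Text` and `Category` are assumed, and `words_categories` is built here as a set of lower-cased category names.

```python
import string

import pandas as pd

# Toy tickets for illustration only.
df = pd.DataFrame(
    {
        "Text": [
            "Asana does not load my board.",
            "Discord keeps crashing, please help!",
            "Cannot hear anyone on the call",
            "Outlook issue after update; Outlook is gone",
            "Mails do not arrive",
        ],
        "Category": ["Asana", "Help Needed", "Discord", "Mail", "Outlook"],
    }
)

# Lower-cased category names. Multi-word names such as "help needed"
# can never equal a single token, so they are effectively ignored.
words_categories = {c.strip().lower() for c in df["Category"].unique()}


def extract_keywords(text):
    """Return every token in the text that is also a category name."""
    no_punct = str(text).translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.lower().split() if word in words_categories]


df["Keywords"] = df["Text"].apply(extract_keywords)
print(df[["Category", "Keywords"]])
```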

Also, note that using "if word in str(words_categories)" instead of "if word in words_categories" would catch words from categories with more than 1 word (Internet Browser in our case), but would also require more data preprocessing. To keep things simple and straight to the point, we will go with the code for categories made of just one word. This is how our dataset looks now:

(The output was shown as an image in the original article.)

After extracting the Keywords column, we will make assumptions about the quality of the tickets. Our hypothesis:

We made our new distribution, and now it is time to examine the tickets classified as a potential problem. In practice, the following step would require much more sampling and looking at larger chunks of data with the naked eye, but the rationale would be the same. You are supposed to find problematic tickets and decide if you can improve their quality or if you should drop them from the dataset. When you are facing a large dataset, stay calm, and don't forget that data examination and data preparation usually take much more time than building ML algorithms!
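To make the idea of flagging tickets concrete, here is a sketch built on an assumed rule, which may differ from the hypothesis used in the original analysis: a ticket whose keywords include its own category is treated as unproblematic, a ticket with keywords that all point elsewhere is a potential problem, and a ticket without keywords is neutral. The function name `ticket_quality`, the column name `Ticket_quality`, and the toy rows are hypothetical.

```python
import pandas as pd

# Toy data: the label plus the category words found in each ticket text.
df = pd.DataFrame(
    {
        "Category": ["Asana", "Help Needed", "Discord", "Mail"],
        "Keywords": [["asana"], ["discord"], [], ["outlook", "outlook"]],
    }
)


def ticket_quality(row):
    """Assumed rule: compare the found keywords with the ticket's own label."""
    category = row["Category"].strip().lower()
    if not row["Keywords"]:
        return "neutral"            # no category word in the text
    if category in row["Keywords"]:
        return "no_problem"         # text mentions its own category
    return "potential_problem"      # text mentions only other categories


df["Ticket_quality"] = df.apply(ticket_quality, axis=1)

print(df["Ticket_quality"].value_counts())

# Tickets worth reading by hand.
to_review = df[df["Ticket_quality"] == "potential_problem"]
print(to_review)
```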

outlook issue , I did an update Windows and I have no more outlook on my notebook ? Please help !

We understand that tickets from the Outlook and Mail categories are related to the same problem, so we will merge these two categories to improve the results of our future ML classification algorithm.
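Merging two labels comes down to a single `replace` call. In this sketch the merged category keeps the name `Mail`, which is an arbitrary choice made for illustration; the toy rows and the `Category` column name are assumptions as well.

```python
import pandas as pd

# Toy labels for illustration only.
df = pd.DataFrame({"Category": ["Outlook", "Mail", "Asana", "Outlook"]})

# Map every "Outlook" label onto "Mail"; all other labels are untouched.
df["Category"] = df["Category"].replace({"Outlook": "Mail"})

merged_counts = df["Category"].value_counts()
print(merged_counts)
```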

Last, but not least, we want to relabel some tickets from the meta category Help Needed to the proper category.
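A possible relabeling step is sketched below under an assumed rule: a Help Needed ticket is moved only when its keywords point to exactly one known category, and is left alone when the text is ambiguous or contains no keyword. The `known` mapping, the `relabel` helper, and the toy rows are hypothetical.

```python
import pandas as pd

# Toy data: labels plus the category words found in each ticket text.
df = pd.DataFrame(
    {
        "Category": ["Help Needed", "Help Needed", "Help Needed", "Asana"],
        "Keywords": [["discord"], ["asana", "crm"], [], ["asana"]],
    }
)

# Hypothetical mapping from lower-cased keyword to the category label.
known = {"asana": "Asana", "discord": "Discord", "crm": "CRM"}


def relabel(row):
    """Move a Help Needed ticket only if it names exactly one category."""
    distinct = set(row["Keywords"])
    if row["Category"] == "Help Needed" and len(distinct) == 1:
        keyword = next(iter(distinct))
        return known.get(keyword, row["Category"])
    return row["Category"]


df["Category"] = df.apply(relabel, axis=1)
print(df)
```

Tickets that mention several categories stay in Help Needed here, so they can still be reviewed by hand or dropped later.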

We did our data relabeling and cleaning, but we shouldn't call ourselves data scientists if we don't do at least one scientific experiment and test the impact of our work on the final classification. We will do so by implementing the Complement Naive Bayes classifier in sklearn. Feel free to try other, more complex algorithms. Also, be aware that further data cleaning could be done; for example, we could also drop all tickets left in the "Help Needed" category.
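The experiment can be sketched as the same pipeline trained twice, once on the original labels and once on the cleaned ones. The TF-IDF features, the 75/25 split, the `evaluate` helper, and the toy tickets are all assumptions chosen to keep the example small. The toy set is far too small to say anything about real accuracy, so the printed numbers carry no meaning beyond showing the mechanics.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline


def evaluate(texts, labels, seed=42):
    """Train on 75% of the tickets and return accuracy on the other 25%."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    model = make_pipeline(TfidfVectorizer(), ComplementNB())
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Toy tickets: "Original" splits mail problems across two labels,
# "Cleaned" merges them into one.
toy = pd.DataFrame(
    {
        "Text": [
            "outlook does not connect", "mails do not arrive",
            "outlook crashes on start", "cannot send mail",
            "mailbox is full", "outlook calendar missing",
            "asana board not loading", "asana task disappeared",
            "cannot log in to asana", "asana project archived",
            "asana notifications broken", "asana sync is slow",
        ],
        "Original": ["Outlook", "Mail", "Outlook", "Mail", "Mail", "Outlook"] + ["Asana"] * 6,
        "Cleaned": ["Mail"] * 6 + ["Asana"] * 6,
    }
)

acc_original = evaluate(toy["Text"], toy["Original"])
acc_cleaned = evaluate(toy["Text"], toy["Cleaned"])
print(f"accuracy with original labels: {acc_original:.2f}")
print(f"accuracy with cleaned labels:  {acc_cleaned:.2f}")
```

Repeating the call with several values of `seed` gives a feel for how much the score moves on a small dataset.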

Pretty impressive, right? The dataset we used is small (on purpose, so you can easily see what happens in each step), so different random seeds might produce different results, but in the vast majority of cases, the model will perform significantly better on the dataset after cleaning compared to the original dataset. We did a good job!

Nikola Greb has been coding for more than four years, and for the past two years he has specialized in NLP. Before turning to data science, he was successful in sales, HR, writing and chess.
