A Comprehensive Guide to NLP Preprocessing


Preprocessing is essential for natural language processing. Text is a series of characters and is unstructured, so it is difficult to process as-is. In particular, web text contains noise such as HTML tags and JavaScript code. Such noise should be removed during preprocessing to achieve the expected results.
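
For instance, stripping HTML and JavaScript noise from web text can look like this (a minimal sketch using BeautifulSoup; the sample markup is made up for illustration):

from bs4 import BeautifulSoup

raw_html = "<div><script>var x = 1;</script><p>MLB Cincinnati Reds T Shirt</p></div>"

# Drop <script>/<style> blocks entirely, then keep only the visible text
soup = BeautifulSoup(raw_html, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
print(soup.get_text())  # MLB Cincinnati Reds T Shirt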

Here I will describe the types of preprocessing and their power in natural language processing.

After you read this, you’ll get:

  • A strong understanding of preprocessing
  • A well-structured, ordered note of an NLP implementation
  • A cheat code-kit for NLP preprocessing

Chapter 1. 3 Types of NLP Tasks

It’s always better to start from the goal. The output of NLP can be classified into the following 3 types:

  • Continuous Value in a Regression problem
  • Class in a Classification problem (Sentiment Analysis, Spam Filter, Topic Model…)
  • Text in a Communication application (Chatbot, Machine Translator using Seq2Seq…)

The hottest topic is the third one, which can be defined as the automation of communication: Chatbots, Machine Translation, and Question Answering. It can substitute for human work such as Customer Service. Very interesting.

Chapter 2. 5 Steps of NLP Preprocessing before Modeling

If I simplify the entire NLP process as much as possible, it breaks down like this:

  1. Cleaning text data: from removing numbers and symbols to stemming
  2. Bag of words and vocabulary creation: from unstructured text to tokens
  3. Word embedding with semantic context: acquire numerical features
  4. Padding sequences to a fixed length (see the sketch right after this list)
  5. Reshaping the form into N dimensions
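
Steps 4 and 5 do not appear in the cheat kit below, so here is a minimal sketch of both, using Keras’s pad_sequences on toy integer-encoded sentences (all the numbers are made up for illustration):

import numpy as np
from keras.preprocessing.sequence import pad_sequences

# Step 4: pad sentences of different lengths to one fixed length
sequences = [[1, 5, 3], [7, 2], [4, 9, 8, 6]]
padded = pad_sequences(sequences, maxlen=4, padding="post")
print(padded)
# [[1 5 3 0]
#  [7 2 0 0]
#  [4 9 8 6]]

# Step 5: reshape into the N-dimensional form a model expects,
# e.g. (samples, timesteps, features) for an RNN
reshaped = np.expand_dims(padded, axis=-1)
print(reshaped.shape)  # (3, 4, 1)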

If you are not familiar with computer science, you should just keep this single biggest core idea in mind:

We human beings have succeeded in converting natural languages into a numeric form with “meanings”! As you know, a computer can only process numbers. So the key to an NLP implementation is how you turn text data into meaningful features before passing them to machine learning models or neural networks.
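
To make this concrete, here is a toy sketch of that core idea: give every word an integer id, and a sentence becomes a vector of numbers (the vocabulary is built on the spot, purely for illustration):

sentence = "mlb cincinnati reds t shirt".split()

# Build a toy vocabulary: each unique word gets an integer id
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)), start=1)}
encoded = [vocab[word] for word in sentence]
print(vocab)    # {'cincinnati': 1, 'mlb': 2, 'reds': 3, 'shirt': 4, 't': 5}
print(encoded)  # [2, 1, 3, 5, 4]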

Chapter 3. The Simplest Cheat Code-Kit

Don’t worry, the NLP process always follows a pretty similar format! Once you have implemented it 3 times, you’ll grasp a clear enough picture. So here is the original text which we will preprocess. Let’s get started!

  0  MLB Cincinnati Reds T Shirt Size XL
  1  Razer BlackWidow Chroma Keyboard This keyboard...
  2  AVA-VIV Blouse Adorable top with a hint of lac...
  3  Leather Horse Statues New with tags. Leather h...
  4  24K GOLD plated rose Complete with certificate...
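
All the snippets below assume this data sits in a pandas DataFrame named text_df with a single text column; a minimal setup, using the truncated previews above as stand-ins for the full descriptions, would be:

import pandas as pd

text_df = pd.DataFrame({"text": [
    "MLB Cincinnati Reds T Shirt Size XL",
    "Razer BlackWidow Chroma Keyboard This keyboard...",
    "AVA-VIV Blouse Adorable top with a hint of lac...",
    "Leather Horse Statues New with tags. Leather h...",
    "24K GOLD plated rose Complete with certificate...",
]})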

Lowercasing and Misspelling Normalization

text_df["text"].apply(lambda x: " ".join(x.lower() for x in x.split()))
print(text_df["text"].head())
  • output
    0  mlb cincinnati reds t shirt size xl
    1  razer blackwidow chroma keyboard this keyboard...
    2  ava-viv blouse adorable top with a hint of lac...
    3  leather horse statues new with tags. leather h...
    4  24k gold plated rose complete with certificate...
    

Removing Non-Alphanumeric Data: Numbers, Symbols, Emoji, HTML Tags…

text_df["text"].str.replace(r"\d+", "")
text_df["text"].str.replace('[^\w\s]','')
text_df["text"].str.replace(r"[︰-@]", "")
print(text_df["text"].head())
  • output
    0  mlb cincinnati reds t shirt size xl
    1  razer blackwidow chroma keyboard this keyboard...
    2  avaviv blouse adorable top with a hint of lace...
    3  leather horse statues new with tags leather ho...
    4  k gold plated rose complete with certificate o...
    

Tokenization: From Sentence to Word

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Split each sentence into a list of word tokens
text_df["text"] = text_df["text"].apply(word_tokenize)
print(text_df["text"].head())
  • output
    0  [mlb, cincinnati, reds, t, shirt, size, xl]
    1  [razer, blackwidow, chroma, keyboard, this, ke...
    2  [avaviv, blouse, adorable, top, with, a, hint,...
    3  [leather, horse, statues, new, with, tags, lea...
    4  [k, gold, plated, rose, complete, with, certif...
    

Stop Word Removal

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words("english")

# Drop common function words ("the", "with", ...) that carry little meaning
text_df["text"] = text_df["text"].apply(lambda tokens: [t for t in tokens if t not in stop_words])
print(text_df["text"].head())
  • output
    0  [mlb, cincinnati, reds, shirt, size, xl]
    1  [razer, blackwidow, chroma, keyboard, keyboard...
    2  [avaviv, blouse, adorable, top, hint, lace, ke...
    3  [leather, horse, statues, new, tags, leather, ...
    4  [k, gold, plated, rose, complete, certificate,...
    

Stemming and Lemmatization

Stemming

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# Cut each token down to its stem (e.g. "statues" -> "statu")
text_df["text"] = text_df["text"].apply(lambda tokens: [stemmer.stem(t) for t in tokens])
print(text_df["text"].head())

Lemmatization

nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Map each token to its dictionary base form (e.g. "reds" -> "red")
text_df["text"] = text_df["text"].apply(lambda tokens: [lemmatizer.lemmatize(t) for t in tokens])
print(text_df["text"].head())
  • output
    0  [mlb, cincinnati, red, shirt, size, xl]
    1  [razer, blackwidow, chroma, keyboard, keyboard...
    2  [avaviv, blous, ador, top, hint, lace, key, ho...
    3  [leather, hors, statu, new, tag, leather, hors...
    4  [k, gold, plate, rose, complet, certif, authent]
    

Vectorization and Feature Selection: Bag of Words, TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer expects raw strings, so join the token lists back together
docs = text_df["text"].apply(" ".join)
# min_df=0.03 drops terms that appear in less than 3% of the documents
vectorizer = TfidfVectorizer(min_df=0.03)
tfidf = vectorizer.fit_transform(docs)
tfidf.toarray()
  • output
    array([[ 0., 0., 0.70710678, 0.4554374, 0., 0.],
         [ 0., 0.62951441, 0., 0.70710678, 0., 0.],
         [ 0., 0., 0.4554374 , 0.62951441, 0.62951441, 0.],
         [ 0., 0.34567441, 0., 0.90121218, 0., 0.]])
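
To check which word each tf-idf column corresponds to, you can ask the vectorizer for its learned vocabulary (the method is get_feature_names_out in scikit-learn 1.0+, get_feature_names in older releases):

print(vectorizer.get_feature_names_out())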
    

Word Embedding: Word2Vec

from gensim.models import Word2Vec

# Word2Vec takes an iterable of token lists; `size` is the embedding
# dimensionality (renamed `vector_size` in gensim 4.x)
sentences = text_df["text"].tolist()
model = Word2Vec(sentences, size=150, window=10, min_count=1)
model.train(sentences, total_examples=len(sentences), epochs=10)
# Nearest neighbours of "home" (the sample output below comes from a larger corpus)
model.wv.most_similar(positive=["home"])
  • output
    [('depressed', 0.7399601936340332),
     ('marge', 0.6840592622756958),
     ('terrific', 0.6683255434036255),
     ('hammock', 0.6648209095001221),
     ('bongo', 0.6537353992462158),
     ('suspicious', 0.650877833366394),
     ('brad', 0.6480956077575684),
     ('dr_hibbert', 0.6459894180297852),
     ('crummy', 0.6455820798873901),
     ('tab', 0.6413768529891968)]
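
Once trained, the model also exposes the raw embedding of any word in its vocabulary; for example:

vector = model.wv["home"]  # the 150-dimensional embedding of "home"
print(vector.shape)        # (150,)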
    

Conclusion

“Better Data > Fancier Algorithms”

It’s a quote from EliteDataScience.com which you often hear in this field. Especially in NLP, we have to make sure the computer understands our language, its context, and its meaning. Otherwise it cannot take over everyday communication tasks the way a chatbot does.

Take your time and tidy up your data as much as possible!
