A Comprehensive Guide to NLP Preprocessing


Preprocessing is essential for natural language processing. Text is a series of characters and is unstructured, so it is difficult to process as-is. In particular, web text contains noise such as HTML tags and JavaScript code. Such noise should be removed during preprocessing to achieve the expected results.
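
For instance, stripping HTML and JavaScript noise from web text can look like this (a minimal sketch using BeautifulSoup; the sample markup is made up for illustration):

from bs4 import BeautifulSoup

raw_html = "<div><script>var x = 1;</script><p>MLB Cincinnati Reds T Shirt</p></div>"

# Drop <script>/<style> blocks entirely, then keep only the visible text
soup = BeautifulSoup(raw_html, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
print(soup.get_text())  # MLB Cincinnati Reds T Shirt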

Here I will describe the types of preprocessing and their power in natural language processing.

After you read this, you’ll get:

  • A strong understanding of preprocessing
  • A well-structured, ordered note of an NLP implementation
  • A cheat code-kit for NLP preprocessing

Chapter 1. 3 Types of NLP Tasks

It’s always better to start from the goal. The output of NLP can be classified into the following 3 types:

  • Continuous Value in a Regression problem
  • Class in a Classification problem (Sentiment Analysis, Spam Filter, Topic Model…)
  • Text in a Communication application (Chatbot, Machine Translator using Seq2Seq…)

The hottest topic is the third one, which can be defined as the automation of communication: Chatbots, Machine Translation, and Question Answering. It can substitute for human work such as Customer Service. Very interesting.

Chapter 2. 5 Steps of NLP Preprocessing before Modeling

If I simplify the entire NLP process as much as possible, it breaks down like this:

  1. Cleaning text data: from removing numbers and symbols to stemming
  2. Bag of words and vocabulary creation: from unstructured text to tokens
  3. Word embedding with semantic context: acquire numerical features
  4. Padding sequences to a fixed length (see the sketch right after this list)
  5. Reshaping the form into N dimensions
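
Steps 4 and 5 do not appear in the cheat kit below, so here is a minimal sketch of both, using Keras’s pad_sequences on toy integer-encoded sentences (all the numbers are made up for illustration):

import numpy as np
from keras.preprocessing.sequence import pad_sequences

# Step 4: pad sentences of different lengths to one fixed length
sequences = [[1, 5, 3], [7, 2], [4, 9, 8, 6]]
padded = pad_sequences(sequences, maxlen=4, padding="post")
print(padded)
# [[1 5 3 0]
#  [7 2 0 0]
#  [4 9 8 6]]

# Step 5: reshape into the N-dimensional form a model expects,
# e.g. (samples, timesteps, features) for an RNN
reshaped = np.expand_dims(padded, axis=-1)
print(reshaped.shape)  # (3, 4, 1)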

If you are not familiar with computer science, you should just keep this single biggest core idea in mind:

We human beings have succeeded in converting natural languages into a numeric form with “meanings”! As you know, a computer can only process numbers. So the key to an NLP implementation is how you turn text data into meaningful features before passing them to machine learning models or neural networks.
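
To make this concrete, here is a toy sketch of that core idea: give every word an integer id, and a sentence becomes a vector of numbers (the vocabulary is built on the spot, purely for illustration):

sentence = "mlb cincinnati reds t shirt".split()

# Build a toy vocabulary: each unique word gets an integer id
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)), start=1)}
encoded = [vocab[word] for word in sentence]
print(vocab)    # {'cincinnati': 1, 'mlb': 2, 'reds': 3, 'shirt': 4, 't': 5}
print(encoded)  # [2, 1, 3, 5, 4]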

Chapter 3. The Simplest Cheat Code-Kit

Don’t worry, the NLP process always follows a pretty similar format! Once you have implemented it 3 times, you’ll grasp a clear enough picture. So here is the original text which we will preprocess. Let’s get started!

  0  MLB Cincinnati Reds T Shirt Size XL
  1  Razer BlackWidow Chroma Keyboard This keyboard...
  2  AVA-VIV Blouse Adorable top with a hint of lac...
  3  Leather Horse Statues New with tags. Leather h...
  4  24K GOLD plated rose Complete with certificate...
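
All the snippets below assume this data sits in a pandas DataFrame named text_df with a single text column; a minimal setup, using the truncated previews above as stand-ins for the full descriptions, would be:

import pandas as pd

text_df = pd.DataFrame({"text": [
    "MLB Cincinnati Reds T Shirt Size XL",
    "Razer BlackWidow Chroma Keyboard This keyboard...",
    "AVA-VIV Blouse Adorable top with a hint of lac...",
    "Leather Horse Statues New with tags. Leather h...",
    "24K GOLD plated rose Complete with certificate...",
]})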

Lowercasing and Misspelling Normalization

text_df["text"].apply(lambda x: " ".join(x.lower() for x in x.split()))
print(text_df["text"].head())
  • output
    0  mlb cincinnati reds t shirt size xl
    1  razer blackwidow chroma keyboard this keyboard...
    2  ava-viv blouse adorable top with a hint of lac...
    3  leather horse statues new with tags. leather h...
    4  24k gold plated rose complete with certificate...
    

Removing Non-Alphanumeric Data: Numbers, Symbols, Emoji, HTML Tags…

text_df["text"].str.replace(r"\d+", "")
text_df["text"].str.replace('[^\w\s]','')
text_df["text"].str.replace(r"[︰-@]", "")
print(text_df["text"].head())
  • output
    0  mlb cincinnati reds t shirt size xl
    1  razer blackwidow chroma keyboard this keyboard...
    2  avaviv blouse adorable top with a hint of lace...
    3  leather horse statues new with tags leather ho...
    4  k gold plated rose complete with certificate o...
    

Tokenization: From Sentence to Word

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Split each sentence into a list of word tokens
text_df["text"] = text_df["text"].apply(word_tokenize)
print(text_df["text"].head())
  • output
    0  [mlb, cincinnati, reds, t, shirt, size, xl]
    1  [razer, blackwidow, chroma, keyboard, this, ke...
    2  [avaviv, blouse, adorable, top, with, a, hint,...
    3  [leather, horse, statues, new, with, tags, lea...
    4  [k, gold, plated, rose, complete, with, certif...
    

Stop Word Removal

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words("english")

# Drop common function words ("the", "with", ...) that carry little meaning
text_df["text"] = text_df["text"].apply(lambda tokens: [t for t in tokens if t not in stop_words])
print(text_df["text"].head())
  • output
    0  [mlb, cincinnati, reds, shirt, size, xl]
    1  [razer, blackwidow, chroma, keyboard, keyboard...
    2  [avaviv, blouse, adorable, top, hint, lace, ke...
    3  [leather, horse, statues, new, tags, leather, ...
    4  [k, gold, plated, rose, complete, certificate,...
    

Stemming and Lemmatization

Stemming

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# Cut each token down to its stem (e.g. "statues" -> "statu")
text_df["text"] = text_df["text"].apply(lambda tokens: [stemmer.stem(t) for t in tokens])
print(text_df["text"].head())

Lemmatization

nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Map each token to its dictionary base form (e.g. "reds" -> "red")
text_df["text"] = text_df["text"].apply(lambda tokens: [lemmatizer.lemmatize(t) for t in tokens])
print(text_df["text"].head())
  • output
    0  [mlb, cincinnati, red, shirt, size, xl]
    1  [razer, blackwidow, chroma, keyboard, keyboard...
    2  [avaviv, blous, ador, top, hint, lace, key, ho...
    3  [leather, hors, statu, new, tag, leather, hors...
    4  [k, gold, plate, rose, complet, certif, authent]
    

Vectorization and Feature Selection: Bag of Words, TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer expects raw strings, so join the token lists back together
docs = text_df["text"].apply(" ".join)
# min_df=0.03 drops terms that appear in less than 3% of the documents
vectorizer = TfidfVectorizer(min_df=0.03)
tfidf = vectorizer.fit_transform(docs)
tfidf.toarray()
  • output
    array([[ 0., 0., 0.70710678, 0.4554374, 0., 0.],
         [ 0., 0.62951441, 0., 0.70710678, 0., 0.],
         [ 0., 0., 0.4554374 , 0.62951441, 0.62951441, 0.],
         [ 0., 0.34567441, 0., 0.90121218, 0., 0.]])
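
To check which word each tf-idf column corresponds to, you can ask the vectorizer for its learned vocabulary (the method is get_feature_names_out in scikit-learn 1.0+, get_feature_names in older releases):

print(vectorizer.get_feature_names_out())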
    

Word Embedding: Word2Vec

from gensim.models import Word2Vec

# Word2Vec takes an iterable of token lists; `size` is the embedding
# dimensionality (renamed `vector_size` in gensim 4.x)
sentences = text_df["text"].tolist()
model = Word2Vec(sentences, size=150, window=10, min_count=1)
model.train(sentences, total_examples=len(sentences), epochs=10)
# Nearest neighbours of "home" (the sample output below comes from a larger corpus)
model.wv.most_similar(positive=["home"])
  • output
    [('depressed', 0.7399601936340332),
     ('marge', 0.6840592622756958),
     ('terrific', 0.6683255434036255),
     ('hammock', 0.6648209095001221),
     ('bongo', 0.6537353992462158),
     ('suspicious', 0.650877833366394),
     ('brad', 0.6480956077575684),
     ('dr_hibbert', 0.6459894180297852),
     ('crummy', 0.6455820798873901),
     ('tab', 0.6413768529891968)]
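
Once trained, the model also exposes the raw embedding of any word in its vocabulary; for example:

vector = model.wv["home"]  # the 150-dimensional embedding of "home"
print(vector.shape)        # (150,)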
    

Conclusion

“Better Data > Fancier Algorithms”

It’s a quote from EliteDataScience.com which you often hear in this field. Especially in NLP, we have to make sure the computer understands our language, its context, and its meaning. Otherwise it cannot take over everyday communication tasks the way a chatbot does.

Take your time and tidy up your data as much as possible!
