Pre-processing text

Remove punctuation

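import string

string.punctuation  # the ASCII punctuation characters that will be stripped

def remove_punct(text):
  # Keep only the characters that are not punctuation
  text_nopunct = "".join(char for char in text if char not in string.punctuation)
  return text_nopunct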
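
As a quick check, the function can be tried on a sample string. The sketch below restates remove_punct so that it runs on its own; the sample sentence is made up for illustration.

```python
import string

def remove_punct(text):
    # Keep only the characters that are not in string.punctuation
    return "".join(char for char in text if char not in string.punctuation)

sample = "Hello, world! It's 5 o'clock..."  # hypothetical input
cleaned = remove_punct(sample)
print(cleaned)  # Hello world Its 5 oclock
```

Note that apostrophes are punctuation too, so contractions such as "It's" are merged into a single word.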

Tokenize

Split a sentence into words

import re

def tokenize(text):
  # Split on runs of non-word characters (spaces, punctuation, ...)
  tokens = re.split(r'\W+', text)
  return tokens
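
One behaviour of splitting on \W+ that is easy to miss: text that ends (or starts) with a non-word character produces an empty token. A minimal sketch, with made-up sample sentences and an optional filtering step that is not part of the notes above:

```python
import re

def tokenize(text):
    # Split on runs of non-word characters (anything but letters, digits, underscore)
    return re.split(r'\W+', text)

tokens = tokenize("I have been searching for the right words")
print(tokens)
# ['I', 'have', 'been', 'searching', 'for', 'the', 'right', 'words']

# A trailing non-word character leaves an empty string at the end
with_trailing = tokenize("Is this it?")
print(with_trailing)  # ['Is', 'this', 'it', '']

# Optional clean-up: drop empty tokens
non_empty = [t for t in with_trailing if t]
print(non_empty)  # ['Is', 'this', 'it']
```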

Lowercase

Call .lower() on the text before tokenizing, so that words differing only in case become the same token.

data['body_text_token'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))
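
The same step can be sketched without a DataFrame. Here a plain list stands in for the data['body_text_clean'] column; the list name and its contents are assumptions made for illustration.

```python
import re

def tokenize(text):
    # Split on runs of non-word characters
    return re.split(r'\W+', text)

# Hypothetical stand-in for the data['body_text_clean'] column
body_text_clean = ["Free entry in a weekly comp", "FREE Entry today"]

# Lowercase first, then tokenize, mirroring the lambda above
body_text_token = [tokenize(x.lower()) for x in body_text_clean]
print(body_text_token)
# [['free', 'entry', 'in', 'a', 'weekly', 'comp'], ['free', 'entry', 'today']]

# The vocabulary holds one entry for 'free', not separate 'Free' and 'FREE'
vocab = {tok for tokens in body_text_token for tok in tokens}
```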

Remove stop words

import nltk

stopwords = set(nltk.corpus.stopwords.words('english'))  # needs nltk.download('stopwords')

def remove_stopwords(tokenized_list):
  return [word for word in tokenized_list if word not in stopwords]
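
To show the effect without downloading the NLTK corpus, the sketch below uses a small hand-picked set of stop words. The set is an assumption for illustration only; NLTK's English list is considerably longer.

```python
# Hypothetical, hand-picked subset of English stop words (not NLTK's list)
STOPWORDS = {"i", "have", "been", "for", "the", "a", "to", "is"}

def remove_stopwords(tokenized_list):
    # A set keeps each membership check fast, even for long token lists
    return [word for word in tokenized_list if word not in STOPWORDS]

tokens = ["i", "have", "been", "searching", "for", "the", "right", "words"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['searching', 'right', 'words']
```

The tokens should already be lowercased, because the comparison against the stop-word set is case-sensitive.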

Stemming

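Reduce each word to its root by chopping off the suffix

  • Problem: words with different meanings can be reduced to the same stem, e.g. Meanness, Meaning -> Mean

  • Benefit: explicitly correlates words with similar meanings by mapping them to one token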

    PorterStemmer

    import nltk

    ps = nltk.PorterStemmer()
    ps.stem('goose')  # 'goos'
    ps.stem('geese')  # 'gees'
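
To see how suffix chopping can conflate words, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases; the suffix list and the minimum stem length are assumptions chosen only for illustration.

```python
# Hypothetical suffix list, checked in order; this is not the Porter rule set
SUFFIXES = ("ness", "ing", "ed", "s")

def naive_stem(word, min_stem_len=3):
    # Strip the first matching suffix, as long as enough of the word remains
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["Meanness", "Meaning", "Mean", "jumped", "cats"]}
print(stems)
# {'Meanness': 'mean', 'Meaning': 'mean', 'Mean': 'mean', 'jumped': 'jump', 'cats': 'cat'}
```

"Meanness" and "Meaning" end up with the same stem even though their meanings differ, which is the problem noted in the list above.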

Lemmatizing

  • Groups together the inflected forms of a word; essentially the same goal as stemming

  • More accurate than stemming because it uses a more informed analysis, but it takes more time

  • If a word is not in WordNet, it is returned unchanged

    WordNet lemmatizer

    import nltk

    wn = nltk.WordNetLemmatizer()  # needs nltk.download('wordnet')
    wn.lemmatize('goose')  # 'goose'
    wn.lemmatize('geese')  # 'goose'
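
The "leave the word if it is unknown" behaviour can be sketched with a plain dictionary lookup. This is a simplification: the mini-dictionary below is hypothetical, and the WordNet lemmatizer does more than a single table lookup.

```python
# Hypothetical mini-dictionary of inflected form -> lemma; WordNet is far larger
LEMMAS = {"geese": "goose", "mice": "mouse", "feet": "foot"}

def lookup_lemmatize(word):
    # Return the known lemma, or the word itself when it is not in the dictionary
    return LEMMAS.get(word, word)

print(lookup_lemmatize("geese"))  # goose
print(lookup_lemmatize("goose"))  # goose
print(lookup_lemmatize("xyzzy"))  # xyzzy (unknown, left unchanged)
```

Unlike the stemming example, both "goose" and "geese" map to a real word, at the cost of needing a dictionary.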
