Pre-processing text

Remove punctuation

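import string

# Shows the characters that count as punctuation
string.punctuation

def remove_punct(text):
  # Keep every character that is not a punctuation mark
  text_nopunct = "".join([char for char in text if char not in string.punctuation])
  return text_nopunct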

Tokenize

Split a sentence into words

import re

def tokenize(text):
  # Split on runs of non-word characters (may leave empty strings at the ends)
  tokens = re.split(r'\W+', text)
  return tokens

Lowercase

Use .lower() so that tokens differing only in case are treated as the same word

data['body_text_token'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))

Remove StopWords
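
A minimal sketch of this step, assuming the text has already been tokenized into a list of lowercase words. The `STOPWORDS` set and the `remove_stopwords` name are hypothetical and deliberately tiny; in practice a fuller list is normally used, for example the one returned by `nltk.corpus.stopwords.words('english')`.

```python
# Hypothetical, deliberately tiny stop-word set for illustration only.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "to", "and"}

def remove_stopwords(tokens):
    """Return the tokens that are not in STOPWORDS."""
    return [word for word in tokens if word not in STOPWORDS]

tokens = ["the", "cat", "is", "in", "the", "garden"]
print(remove_stopwords(tokens))
```

A set is used for the stop words because membership checks on a set are fast, which matters when the function is applied to every row of a dataset.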

Stemming

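Reduce a word to its root by chopping off suffixes

  • Problem: crude rules can conflate unrelated words, e.g. Meanness, Meaning -> Mean

  • Benefit: explicitly correlates words with similar meanings, so variants of one word are counted together

  • Stemmer: PorterStemmer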
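
The suffix-chopping idea can be sketched with a toy stemmer. This is not the Porter algorithm, which applies a much longer sequence of rules; the `SUFFIXES` tuple and the `toy_stem` function are hypothetical names chosen only to show how "Meanness" and "Meaning" end up with the same root.

```python
# Toy suffix-stripping stemmer for illustration only; NOT the Porter algorithm.
SUFFIXES = ("ness", "ing", "ed", "s")  # hypothetical list, checked in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["Meanness", "Meaning", "means", "mean"]:
    print(w, "->", toy_stem(w))
```

All four words reduce to "mean", which shows both the benefit (variants are grouped) and the problem (words with different meanings are merged). For real use, NLTK provides the Porter stemmer as `nltk.stem.PorterStemmer`.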

Lemmatizing

  • Groups together the inflected forms of a word so they can be treated as one item; the goal is basically the same as stemming

  • Lemmatizing is more accurate because it uses more informed analysis, but it takes more time

  • If a word is not in WordNet, it is left unchanged

  • Lemmatizer: WordNet lemmatizer
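
The lookup-with-fallback behavior described above can be sketched as follows. The `LEMMAS` table is a hypothetical stand-in for WordNet with only a few entries, and `toy_lemmatize` is an illustrative name; a real lemmatizer consults a full dictionary and can also take the part of speech into account.

```python
# Hypothetical lookup table standing in for WordNet (illustration only).
LEMMAS = {"geese": "goose", "mice": "mouse", "cacti": "cactus", "meanings": "meaning"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

for w in ["geese", "meanings", "xyzzy"]:
    print(w, "->", toy_lemmatize(w))
```

Unlike a suffix-stripping stemmer, the lookup handles irregular forms such as "geese" and always returns a real word. NLTK exposes the WordNet lemmatizer as `nltk.stem.WordNetLemmatizer`, which requires the WordNet corpus to be downloaded first.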
