Pre-processing text

Remove punctuation

import string
string.punctuation  # the ASCII punctuation characters to strip
def remove_punct(text):
  # Keep every character that is not punctuation
  text_nopunct = "".join([char for char in text if char not in string.punctuation])
  return text_nopunct
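
A minimal sketch of the function on a sample string. The `str.translate` variant and the name `remove_punct_translate` are assumptions added for comparison, not part of the original notes; both versions should give the same result.

```python
import string

def remove_punct(text):
    # Keep only the characters that are not ASCII punctuation
    return "".join(char for char in text if char not in string.punctuation)

# Alternative (an assumption, not from the notes): a translation table
# that deletes every punctuation character in a single pass
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_translate(text):
    return text.translate(_PUNCT_TABLE)

sample = "Hello, world! It's 5 o'clock..."
print(remove_punct(sample))            # Hello world Its 5 oclock
print(remove_punct_translate(sample))  # Hello world Its 5 oclock
```

Note that apostrophes are removed too, so "It's" becomes "Its".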

Tokenize

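Split a sentence into words

import re
def tokenize(text):
  # Split on runs of non-word characters (anything other than letters, digits, underscore)
  tokens = re.split(r'\W+', text)
  return tokens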
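
One thing worth checking with this approach: when the text starts or ends with a non-word character, `re.split` returns an empty string at that edge. The sketch below shows this; `tokenize_nonempty` is a hypothetical helper, not from the notes, that filters those out.

```python
import re

def tokenize(text):
    # Split on runs of non-word characters
    return re.split(r"\W+", text)

def tokenize_nonempty(text):
    # Hypothetical variant: drop the empty strings produced at the edges
    return [tok for tok in re.split(r"\W+", text) if tok]

sentence = "Hello, world! How are you?"
print(tokenize(sentence))           # ['Hello', 'world', 'How', 'are', 'you', '']
print(tokenize_nonempty(sentence))  # ['Hello', 'world', 'How', 'are', 'you']
```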

Lowercase

Call .lower() on the text before tokenizing, so the same word in different cases maps to one token.

data['body_text_token'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))
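
The same chain (remove punctuation, lowercase, tokenize) can be sketched without pandas. The list `body_text` and its sample messages are made up for illustration and stand in for a text column of `data`.

```python
import re
import string

def remove_punct(text):
    # Drop ASCII punctuation characters
    return "".join(char for char in text if char not in string.punctuation)

def tokenize(text):
    # Split on runs of non-word characters
    return re.split(r"\W+", text)

# Hypothetical stand-in for a DataFrame text column
body_text = ["Free entry NOW!", "Ok, see you Later."]

body_text_clean = [remove_punct(t) for t in body_text]
body_text_token = [tokenize(t.lower()) for t in body_text_clean]
print(body_text_token)
# [['free', 'entry', 'now'], ['ok', 'see', 'you', 'later']]
```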

Remove StopWords

import nltk
def remove_stopwords(tokenized_list):
  # Keep non-stopword tokens (needs the NLTK 'stopwords' corpus downloaded)
  return [word for word in tokenized_list if word not in nltk.corpus.stopwords.words('english')]
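
A self-contained sketch of the same filter. `STOPWORDS` here is a tiny hand-picked set used only for illustration (an assumption; NLTK's English list is much longer). Building the collection once as a set also avoids re-reading the stopword list for every token.

```python
# Tiny hand-picked stopword set for illustration only (an assumption);
# in practice this could be set(nltk.corpus.stopwords.words('english'))
STOPWORDS = {"i", "a", "an", "the", "is", "to", "and", "of", "in"}

def remove_stopwords(tokenized_list, stopwords=STOPWORDS):
    # Keep only the tokens that are not in the stopword set
    return [word for word in tokenized_list if word not in stopwords]

tokens = ["the", "cat", "is", "in", "the", "garden"]
print(remove_stopwords(tokens))  # ['cat', 'garden']
```

The tokens should already be lowercased, since the comparison is case-sensitive.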

Stemming

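Reduce a word to its root by chopping off the suffix

Problem: words with different meanings can collapse to the same stem, e.g. Meanness, Meaning -> Mean
Benefit: explicitly correlates words with similar meanings, since they map to one stem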
PorterStemmer
```
import nltk
ps = nltk.PorterStemmer()
ps.stem('goose')  # 'goos'
ps.stem('geese')  # 'gees'
```
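
To make the suffix-chopping idea concrete without NLTK, here is a deliberately naive sketch. It is not the Porter algorithm: `toy_stem` and its `SUFFIXES` list are hypothetical and only strip a few endings, but they reproduce both the Meanness/Meaning collision and the goose/geese miss.

```python
# Hypothetical suffix list, checked in this order (not Porter's rules)
SUFFIXES = ("ness", "ing", "ed", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["meanness", "meaning", "jumps", "jumped", "goose", "geese"]:
    print(w, "->", toy_stem(w))
# meanness -> mean
# meaning -> mean
# jumps -> jump
# jumped -> jump
# goose -> goose
# geese -> geese
```

A rule-based stemmer has no way to know that geese is the plural of goose, which is the gap lemmatizing addresses.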
Lemmatizing
Groups together the inflected forms of a word so they are treated as one term; basically the same goal as stemming
Lemmatizing is more accurate because it uses a more informed, dictionary-based analysis, but it takes more time

If a word is not in WordNet, it is left unchanged

Wordnet lemmatizer

import nltk
wn = nltk.WordNetLemmatizer()
wn.lemmatize('goose')  # 'goose'
wn.lemmatize('geese')  # 'goose'
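
The lookup-with-fallback behaviour can be sketched with a plain dictionary. `LEMMAS` is a hypothetical miniature table standing in for WordNet, and `toy_lemmatize` is not the NLTK implementation; it only illustrates that unknown words pass through unchanged.

```python
# Hypothetical miniature lookup table standing in for WordNet
LEMMAS = {"geese": "goose", "mice": "mouse", "children": "child"}

def toy_lemmatize(word):
    # Return the dictionary form if known, otherwise leave the word as-is
    return LEMMAS.get(word, word)

print(toy_lemmatize("geese"))  # goose
print(toy_lemmatize("goose"))  # goose
print(toy_lemmatize("blorp"))  # blorp (not in the table, left unchanged)
```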

