Pre-processing text
Remove punctuation
import string
string.punctuation
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct
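For example, applied to a short string (the sample sentence is made up for illustration):

```python
import string

def remove_punct(text):
    # Keep only characters that are not in string.punctuation
    return "".join(char for char in text if char not in string.punctuation)

print(remove_punct("Hello, world! It's 2pm."))  # Hello world Its 2pm
```

Note that apostrophes are stripped too, so contractions like "It's" become "Its".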
Tokenize
Splitting a sentence into words
import re

def tokenize(text):
    tokens = re.split(r'\W+', text)
    return tokens
Lowercase
.lower()
data['body_text_token'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))
Remove StopWords
import nltk

def remove_stopwords(tokenized_list):
    stopwords = nltk.corpus.stopwords.words('english')
    return [word for word in tokenized_list if word not in stopwords]
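A minimal self-contained sketch of the same idea, using a tiny hand-picked stopword list in place of nltk's full English list (the real list is much longer and requires downloading the stopwords corpus):

```python
# Tiny stand-in for nltk.corpus.stopwords.words('english')
STOPWORDS = {'i', 'the', 'a', 'is', 'in', 'to', 'and'}

def remove_stopwords(tokenized_list):
    # Drop tokens that carry little meaning on their own
    return [word for word in tokenized_list if word not in STOPWORDS]

print(remove_stopwords(['the', 'cat', 'is', 'in', 'the', 'hat']))  # ['cat', 'hat']
```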
Stemming
Reduces a word to its root form by chopping off the suffix
Problem: over-stemming can conflate distinct words, e.g. Meanness, Meaning -> Mean
Benefit: explicitly correlates words with similar meanings
PorterStemmer
ps = nltk.PorterStemmer()
ps.stem('goose')  # goos
ps.stem('geese')  # gees
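The over-stemming problem mentioned above can be seen directly (this assumes nltk is installed; PorterStemmer needs no extra corpus downloads):

```python
import nltk

ps = nltk.PorterStemmer()
# Two words with different meanings collapse to the same stem
print(ps.stem('meanness'))  # mean
print(ps.stem('meaning'))   # mean
```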
Lemmatizing
Grouping together the inflected forms of a word so they can be analysed as one item; same goal as stemming
Lemmatizing is more accurate because it uses a more informed, dictionary-based analysis, but it takes more time
If a word is not in WordNet, it is returned unchanged
Wordnet lemmatizer
wn = nltk.WordNetLemmatizer()
wn.lemmatize('goose')  # goose
wn.lemmatize('geese')  # goose
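Putting the steps together: a minimal end-to-end sketch using only the standard library. A tiny hand-picked stopword set stands in for nltk's list, and the stemming/lemmatizing step is left out:

```python
import re
import string

STOPWORDS = {'the', 'is', 'a', 'in', 'of', 'and'}  # stand-in for nltk's list

def clean_text(text):
    # 1. Remove punctuation
    text = "".join(char for char in text if char not in string.punctuation)
    # 2. Lowercase and tokenize on non-word characters
    tokens = re.split(r'\W+', text.lower())
    # 3. Remove stopwords, dropping any empty tokens from the split
    return [t for t in tokens if t and t not in STOPWORDS]

print(clean_text("The cat, naturally, is in the hat!"))
# ['cat', 'naturally', 'hat']
```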