> For the complete documentation index, see [llms.txt](https://openai.gitbook.io/code-cheatsheets/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://openai.gitbook.io/code-cheatsheets/ml/nlp/1preprocessing.md).

# Pre-processing text

## Remove punctuation

```python
import string
string.punctuation
def remove_punct(text):
  text_nopunct = "".join[char for char in text if char not in string.punctuation]
  return text_nopunct
```

## Tokenize

Splitting sentence into words

```python
def tokenize(t):
  tokens = re.split('\W+', text)
  return tokens
```

## Lowercase

.lower()

```python
data['body_text_token'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))
```

## Remove StopWords

```python
def remove_stopwords(tokenized_list):
  [word for word in tokenized_list if word not in nltk.corpus.stopwords.words('english')]
```

## Stemming

Leave root word, chopping off suffix

* Problems: Meanness, Meaning -> Mean
* Explicitly correlates words with similar meanings

  **PorterStemmer**

  ```python
  ps = nltk.PorterStemmer()
  ps.stem('goose') #goos
  ps.stem('geese') #gees
  ```

  **Lemmatizing**
* Grouping together inflected form of words, basically has goal of stemming
* Lemmatizing is more accurate as uses more informed analysis, but takes more time
* If not in wordnet, just leave word

  **Wordnet lemmatizer**

  ```python
  wn = nltk.WordNetLemmatizer()
  wn.lemmatize('goose') #goose
  wn.lemmatize('geese') #goose
  ```
