2 Vectorizing

Vectorizing encodes text as integers to create feature vectors.

Hyperparameter: the n-gram range (unigrams + bigrams, bigrams only, or unigrams + bigrams + trigrams)

Overview

Each column is a word (or other element), each row is a document, and each cell is the count of that word in that document.
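
To make the row/column layout concrete, here is a minimal pure-Python sketch of a count matrix built by hand. The toy corpus and the whitespace tokenizer are assumptions for illustration only; they are not from these notes.

```python
from collections import Counter

# Toy corpus (hypothetical data, not from the original notes).
docs = ["nlp is fun", "nlp is hard", "fun fun fun"]

# Naive whitespace tokenization, purely for illustration.
tokenized = [doc.split() for doc in docs]

# Columns: the sorted vocabulary across all documents.
vocab = sorted({tok for toks in tokenized for tok in toks})

# Rows: one per document; each cell is the count of that word in that document.
matrix = []
for toks in tokenized:
    counts = Counter(toks)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Most cells in a realistic corpus would be zero, which is why the vectorizers below return sparse matrices.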

Count Vectorizer

Since the matrix is mostly zeros, it is stored as a sparse matrix, which only stores the non-zero elements.

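from sklearn.feature_extraction.text import CountVectorizer

def clean_text(text):
    ...  # cleaning/tokenizing logic elided here; a callable analyzer should return a list of tokens

count_vect = CountVectorizer(analyzer=clean_text)
# fit() only learns the vocabulary; fit_transform() learns it and returns the count matrix
X_counts = count_vect.fit_transform(data)  # data: an iterable of raw text documents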
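
A runnable sketch of the same pattern on a toy corpus. The `clean_text` body and the `data` list here are hypothetical stand-ins (the real cleaning function is elided in these notes); the point is only to show that the result is a sparse matrix storing just the non-zero counts.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def clean_text(text):
    # Hypothetical stand-in for the elided cleaning step:
    # lowercase, split on non-word characters, drop empty tokens.
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

# Toy corpus (assumption, not from the original notes).
data = ["NLP is fun!", "NLP is hard.", "Fun, fun, fun"]

count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(data)

print(type(X_counts))  # a SciPy sparse matrix type
print(X_counts.shape)  # (number of documents, vocabulary size)
print(X_counts.nnz)    # number of stored (non-zero) entries
```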

Analysis

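import pandas as pd
X_counts.shape  # (number of documents, vocabulary size)
count_vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
X_counts_df = pd.DataFrame(X_counts.toarray())  # X_counts is sparse, so convert to a dense array first
X_counts_df.columns = count_vect.get_feature_names_out()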

N-grams

Same as the count vectorizer, but now the columns represent all combinations of adjacent words of length n in your text.

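Example: "nlp is an interesting"
- bigrams => "nlp is", "is an", "an interesting"
- trigrams => "nlp is an", "is an interesting"

When using n-grams, the vectorizer wants each document passed in as a single string, not a list of tokens.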
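
The sliding-window idea behind the example above can be sketched in a few lines of plain Python. The `ngrams` helper is hypothetical and splits on whitespace only; it is not how scikit-learn tokenizes internally.

```python
def ngrams(text, n):
    """Return all runs of n adjacent whitespace-separated words in text."""
    tokens = text.split()
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "nlp is an interesting"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```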

CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
CountVectorizer(ngram_range=(1, 3))  # unigrams, bigrams, and trigrams
CountVectorizer(ngram_range=(2, 2))  # bigrams only
# see the Count Vectorizer section above for training/analysis

Term Frequency-Inverse Document Frequency (TF-IDF)

Each cell holds a weight for how important a word is to a document, taking into account how rare the word is across all documents and the size of the document.

cell(i, j) = (count of word i in doc j / total # of words in doc j) * log(total # of docs / # of docs containing word i)
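
A worked sketch of this weighting in plain Python: term frequency normalized by document length, multiplied by the log of (total documents / documents containing the word). The toy corpus, the helper names, and the choice of natural log are assumptions; the notes do not specify a log base.

```python
import math

# Toy corpus, already tokenized (assumption, for illustration only).
docs = [["nlp", "is", "fun"], ["nlp", "is", "hard"], ["fun", "fun", "fun"]]
N = len(docs)

def term_freq(term, doc):
    # Share of the document's words that are this term.
    return doc.count(term) / len(doc)

def inv_doc_freq(term):
    # Assumes the term occurs in at least one document (otherwise df would be 0).
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tfidf(term, doc):
    return term_freq(term, doc) * inv_doc_freq(term)

# "hard" appears in only one document, so it outweighs "nlp" within docs[1].
print(tfidf("hard", docs[1]))
print(tfidf("nlp", docs[1]))
```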

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
X = tf.fit_transform(data)  # sparse matrix of TF-IDF weights, one row per document


Last updated 7 years ago