Vectorizing
Encode text as integers to create feature vectors
Hyperparameters: n-gram range (uni+bigrams, bigrams only, uni+bi+trigrams)
Overview
Each column is a word/token, each row is a document, and each cell is the count of that word in that document
Count Vectorizer
Since the matrix is mostly zeros, it is stored as a sparse matrix, which only keeps the non-zero elements.
from sklearn.feature_extraction.text import CountVectorizer
def clean_text(text):
    ...  # tokenize/clean here; when used as analyzer, must return a list of tokens
count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(data)  # fit only learns the vocabulary; fit_transform also returns the document-term matrix
Analysis
X_counts.shape
count_vect.get_feature_names()  # get_feature_names_out() in newer scikit-learn
X_counts_df = pd.DataFrame(X_counts.toarray())  # X_counts is sparse right now, so convert to a dense array first
X_counts_df.columns = count_vect.get_feature_names()
N-grams
Same as a count vectorizer, but now columns represent all combinations of adjacent words of length n in the text
nlp is an interesting
=> bigrams: nlp is, is an, an interesting
=> trigrams: nlp is an, is an interesting
The default analyzer wants each document passed in as a string, not a list of tokens
CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
CountVectorizer(ngram_range=(1, 3))  # unigrams, bigrams, and trigrams
CountVectorizer(ngram_range=(2, 2))  # bigrams only
#see above for training/analysis
Term Frequency Inverse Document Frequency TF-IDF
Cells hold a weight for how important a word is to a document, taking into account the word's rarity across the corpus and the size of the document
cell(i, j) = (# of occurrences of word i in doc j / total # of words in doc j) * log(total # of docs / # of docs containing word i)
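A hand computation of that formula, with made-up counts (3 occurrences in a 100-word document, appearing in 10 of 1000 documents):

```python
import math

# hypothetical numbers for illustration
tf = 3 / 100               # term frequency within the document
idf = math.log(1000 / 10)  # inverse document frequency (natural log here)
weight = tf * idf
print(round(weight, 4))    # 0.03 * log(100) ≈ 0.1382
```

Note that scikit-learn's TfidfVectorizer uses a smoothed variant of this idf plus L2 normalization by default, so its values will differ slightly from this textbook form.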
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
X = tf.fit_transform(data)