2. Vectorizing
Vectorizing encodes text as integers to create feature vectors.
Hyperparameters: n-gram range (unigrams and bigrams; bigrams only; unigrams, bigrams, and trigrams)
Overview
Each column is a word (or other element), each row is a document, and each cell is the count of that word in that document.
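A minimal sketch of that layout, using only the Python standard library. The toy corpus and the whitespace tokenizer are assumptions made for illustration; a real pipeline would normally clean the text first and use a library vectorizer.

```python
from collections import Counter

# Hypothetical toy corpus: each document is one plain string.
docs = [
    "nlp is an interesting field",
    "nlp is fun",
    "an interesting field is fun",
]

# Tokenize by whitespace (an assumption; real cleaning is usually richer).
tokenized = [doc.split() for doc in docs]

# Vocabulary: one column per unique word, sorted for a stable column order.
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document; each cell is the count of that column's word.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each printed row lines up with the printed vocabulary, so a zero means the word in that column does not occur in that document.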
Count Vectorizer
Because most cells are zero, the result is stored as a sparse matrix, which keeps only the non-zero elements.
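The idea of storing only the non-zero cells can be sketched with a plain dictionary. This is an illustration of the concept, not how any particular library lays out its sparse matrices; the `dense` values and the `to_dense` helper are hypothetical.

```python
# Hypothetical dense document-term matrix (rows = documents, columns = words).
dense = [
    [1, 0, 0, 2, 0],
    [0, 0, 0, 0, 1],
    [0, 3, 0, 0, 0],
]

# Sparse form: keep only the non-zero cells, keyed by (row, column).
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}


def to_dense(sparse_cells, n_rows, n_cols):
    """Rebuild the full matrix, filling every missing cell with zero."""
    return [
        [sparse_cells.get((r, c), 0) for c in range(n_cols)]
        for r in range(n_rows)
    ]


print(len(sparse), "cells stored instead of", len(dense) * len(dense[0]))
```

Nothing is lost in the sparse form: `to_dense(sparse, 3, 5)` reproduces the original matrix.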
Analysis
N-grams
Same as the count vectorizer, but the columns now represent every combination of n adjacent words in your text.
"nlp is an interesting"
=> bigrams: nlp is, is an, an interesting
=> trigrams: nlp is an, is an interesting
The vectorizer expects each document to be passed in as a string, not a list of tokens.
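The n-gram example above can be reproduced with a short standard-library sketch. The `ngrams` helper is a hypothetical name introduced here, and the `join` step shows one way to turn a token list back into the single string the notes say is required.

```python
def ngrams(text, n):
    """Return every run of n adjacent words in text, joined by spaces."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# If the text was tokenized during cleaning, join it back into one string.
tokens = ["nlp", "is", "an", "interesting"]
sentence = " ".join(tokens)

bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

In an n-gram vectorizer, each of these strings becomes its own column in the document-term matrix.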
Term Frequency-Inverse Document Frequency (TF-IDF)
Each cell holds a weight for how important a word is to a document, taking into account how rare the word is across all documents and how long the document is.
cell = (count of word i in doc j / total # of words in doc j) * log(total # of docs / # of docs containing word i)
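A minimal sketch of that weighting, written directly from the formula with the standard library. The toy corpus, the whitespace tokenizer, and the `tf_idf` helper name are assumptions for illustration; library implementations often use smoothed or normalized variants, so their numbers may differ.

```python
import math

# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "nlp is an interesting field",
    "nlp is fun",
    "an interesting field is fun",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)


def tf_idf(word, tokens):
    """Weight of `word` in one document, following the formula above."""
    # Term frequency: share of this document's words that are `word`.
    tf = tokens.count(word) / len(tokens)
    # Document frequency: number of documents containing `word`.
    df = sum(1 for doc_tokens in tokenized if word in doc_tokens)
    if df == 0:
        return 0.0  # word never appears in the corpus
    return tf * math.log(n_docs / df)


print(tf_idf("nlp", tokenized[0]))  # appears in 2 of 3 docs: positive weight
print(tf_idf("is", tokenized[0]))   # appears in every doc: weight of zero
```

A word that occurs in every document gets log(1) = 0, which is how the weighting discounts words that carry no information about any particular document.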