Ensemble methods build on basic decision trees. AdaBoost weights the examples the previous tree got wrong more heavily, so each new tree focuses on the mistakes. Bagging draws random samples of the data for each tree, so the trees are independent and can be trained in parallel; boosting trains trees sequentially on the errors of the previous ones. Gradient boosting is usually more powerful than bagging, but it takes longer to train, is more likely to overfit, and is harder to tune.
from sklearn.ensemble import GradientBoostingClassifier
GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')
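To make the bagging vs. boosting contrast concrete, a minimal sketch comparing a bagged ensemble (RandomForestClassifier, whose independent trees can be fit in parallel via n_jobs) with a boosted one (GradientBoostingClassifier, whose trees are fit sequentially). The hyperparameter values are only illustrative.

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Bagging-style ensemble: each tree sees a random bootstrap sample of the data,
# so the trees are independent and can be built in parallel (n_jobs=-1).
rf = RandomForestClassifier(n_estimators=100, max_depth=None, n_jobs=-1)

# Boosting: each new tree is fit to correct the errors of the previous ones,
# so training is inherently sequential (no n_jobs parameter here).
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)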
# get metrics and do a train/test split (see the sketch below)
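A hedged sketch of what that step might look like: split the data, fit the gradient boosting model, and score with precision/recall. The names X_features, data['label'], the pos_label value 'spam', and the hyperparameter settings are assumptions carried over from the earlier feature-engineering steps, not the course's exact values.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

# Split features and labels into training and test sets (assumed names).
X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)

gb = GradientBoostingClassifier(n_estimators=150, max_depth=11, learning_rate=0.1)
gb_model = gb.fit(X_train, y_train)
y_pred = gb_model.predict(X_test)

# Precision / recall / f-score for the positive ("spam") class.
precision, recall, fscore, _ = precision_recall_fscore_support(
    y_test, y_pred, pos_label='spam', average='binary')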
Final Model Selection
Notice that the TF-IDF vectorizer is fit only on the training data, so new words that appear only in the test set won't be recognized (they are simply ignored at transform time).
# reset_index(drop=True) discards the old shuffled index left over from the train/test split,
# so the rows line up with the fresh 0..n-1 index of the TF-IDF DataFrame when concatenating
X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True),
pd.DataFrame(tfidf_train.toarray())], axis=1)
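A hedged sketch of the full step, assuming a TfidfVectorizer built with a cleaning function named clean_text and a raw-text column body_text: fit the vectorizer on the training text only, then build the matching test-set feature matrix the same way.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# clean_text is assumed to be the tokenizer/cleaning function defined earlier in the notes.
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])   # fit on training text only

tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])  # words unseen in training are ignored here

X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True),
                         pd.DataFrame(tfidf_test.toarray())], axis=1)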
Final Final Model Selection
import time

start = time.time()
# fit the chosen model here, e.g. gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = end - start
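A hedged sketch of how the timed fit and predict steps might come together for the final comparison; the model object gb, the vectorized feature matrices, and pos_label='spam' are assumptions carried over from the snippets above.

import time
from sklearn.metrics import precision_recall_fscore_support

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)      # time the training step
end = time.time()
fit_time = end - start

start = time.time()
y_pred = gb_model.predict(X_test_vect)        # time the prediction step
end = time.time()
pred_time = end - start

precision, recall, fscore, _ = precision_recall_fscore_support(
    y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} / Precision: {} / Recall: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3)))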