Uses basic decision trees and adaboost: weighs the mistaken ones higher. Bagging samples randomly and boosting does on the errored ones. Bagging can be done parallelly. Gradient boosting is usually more powerful, but takes longer to train more likely to overfit and harder to tune.
from sklearn.ensemble import GradientBoostingClassifierGradientBoostingClassifier(loss=’deviance’, learning_rate=0.1, n_estimators=100, subsample=1.0, criterion=’friedman_mse’, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort=’auto’)#get metrics and train test split
Final Model Selection
Notice new words won't be recognized.
#something about how the old order needs to be droppedX_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True), pd.DataFrame(tfidf_train.toarray())], axis=1)