4ml2

Boosting uses basic decision trees as weak learners; AdaBoost weights the misclassified examples more heavily on each round. Bagging samples the training data randomly, while boosting focuses each new learner on the examples the previous ones got wrong. Bagging can be trained in parallel; boosting cannot, since each learner depends on the one before it. Gradient boosting is usually more powerful, but it takes longer to train, is more likely to overfit, and is harder to tune.
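
For contrast, a minimal sketch of the two approaches on synthetic data (the dataset, estimator settings, and names below are illustrative stand-ins, not from these notes):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Bagging: each tree is trained on an independent bootstrap sample, so the
# trees can be built in parallel (n_jobs=-1)
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)

# Boosting: each tree focuses on the examples the previous trees got wrong,
# so the trees must be built sequentially
ada = AdaBoostClassifier(n_estimators=200, random_state=42)

print('RF accuracy: ', cross_val_score(rf, X, y, cv=5).mean())
print('Ada accuracy:', cross_val_score(ada, X, y, cv=5).mean())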

from sklearn.ensemble import GradientBoostingClassifier
GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0,
                           criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1,
                           min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0,
                           min_impurity_split=None, init=None, random_state=None, max_features=None,
                           verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')
#get metrics and train test split
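
A sketch of the elided "get metrics and train test split" step; the breast-cancer toy dataset and the hyperparameter values are stand-ins for illustration, not from these notes:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# stand-in data; in the notes this would be the vectorized text features
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gb = GradientBoostingClassifier(n_estimators=150, max_depth=7, learning_rate=0.1)
gb_model = gb.fit(X_train, y_train)
y_pred = gb_model.predict(X_test)

precision, recall, _, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')
print('Precision: {:.3f} / Recall: {:.3f}'.format(precision, recall))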

Final Model Selection

Note that the vectorizer is fit on the training set only, so new words that appear only in the test set won't be recognized; they are simply ignored when the test set is transformed.

import pandas as pd

# reset_index(drop=True) drops the shuffled index left over from the train/test
# split so the rows line up with the TF-IDF DataFrame's default 0..n-1 index
X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True),
                          pd.DataFrame(tfidf_train.toarray())], axis=1)
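
The matching test-set step, sketched under the assumption that the vectorizer (called tfidf_vect here) was fit on X_train['body_text'] only; transform() simply drops words it never saw during fitting, which is why new words won't be recognized:

# tfidf_vect and X_test['body_text'] are assumed names, mirroring the training step above
tfidf_test = tfidf_vect.transform(X_test['body_text'])

X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True),
                         pd.DataFrame(tfidf_test.toarray())], axis=1)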

Final Final Model Selection

import time

start = time.time()
# fitting (model fit call goes here)
end = time.time()
fit_time = end - start
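
Put together, the timing pattern usually looks like the sketch below; the model gb, the vectorized frames, and pos_label='spam' are assumptions carried over from the snippets above, not verbatim from these notes:

import time
from sklearn.metrics import precision_recall_fscore_support

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)   # assumed GradientBoostingClassifier from above
fit_time = time.time() - start

start = time.time()
y_pred = gb_model.predict(X_test_vect)
pred_time = time.time() - start

# pos_label='spam' assumes string labels 'spam' / 'ham'
precision, recall, _, _ = precision_recall_fscore_support(y_test, y_pred,
                                                          pos_label='spam', average='binary')
print('Fit time: {:.3f}s / Predict time: {:.3f}s'.format(fit_time, pred_time))
print('Precision: {:.3f} / Recall: {:.3f}'.format(precision, recall))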
