ML
K-Fold Cross-Validation: split the data into k subsets; one subset is the test data and the remaining k-1 are the training data. Repeat k times so each subset serves as the test set exactly once.
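A minimal sketch of the splitting described above, using scikit-learn's `KFold` (k=4 and the 8-sample toy array are arbitrary choices for illustration):

```py
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8).reshape(8, 1)  # 8 toy samples
kf = KFold(n_splits=4)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # each fold: 1 subset held out for testing, the other k-1 used for training
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each sample appears in exactly one test fold across the k iterations.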
Evaluation:
Accuracy = # predicted correctly / total
Precision = # predicted positive that are truly positive / # predicted positive
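Both metrics can be computed with `sklearn.metrics`; the toy labels below are made up for illustration:

```py
from sklearn.metrics import accuracy_score, precision_score

y_true = [1, 0, 1, 1, 0, 1]  # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 1, 1]  # hypothetical model predictions

acc = accuracy_score(y_true, y_pred)    # correct predictions / total = 4/6
prec = precision_score(y_true, y_pred)  # true positives / predicted positives = 3/4
print(acc, prec)
```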
Random Forest
Ensemble method that creates multiple simple models and combines them
Constructs a collection of decision trees and then aggregates the predictions of each tree to determine the final prediction.
Easily handles outliers, missing values, and different input types
Outputs feature importances
Can do classification and regression
```py
# Input is the TfidfVectorizer output of the data
from sklearn.ensemble import RandomForestClassifier

# RandomForestClassifier.feature_importances_ is great for understanding the model
RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None,
                       min_samples_split=2, min_samples_leaf=1,
                       min_weight_fraction_leaf=0.0, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, bootstrap=True, oob_score=False,
                       n_jobs=1, random_state=None, verbose=0, warm_start=False,
                       class_weight=None)
```
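A small sketch of reading `feature_importances_` after fitting; the dataset here is synthetic (`make_classification`), not the TF-IDF features from the notes:

```py
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data: 200 samples, 5 features, 2 of them informative
X, y = make_classification(n_samples=200, n_features=5,
                           n_informative=2, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# one non-negative importance score per feature; scores sum to 1
for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```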
Cross-Validation
```py
from sklearn.model_selection import KFold, cross_val_score

rf = RandomForestClassifier(n_jobs=-1)  # parallelize across all cores
k_fold = KFold(n_splits=5)
cross_val_score(rf, X_features, labels, cv=k_fold, scoring='accuracy', n_jobs=-1)
```
Holdout Test Set
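A holdout test set can be sketched with `train_test_split`; the array sizes and 20% split below are arbitrary for illustration:

```py
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.array([0, 1] * 5)

# hold out 20% of the data; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))
```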
Grid Search
Defining a grid of hyperparameters and exploring all combinations of them.
Grid Search and Cross-Validation
```py
from sklearn.model_selection import GridSearchCV
import pandas as pd

rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300],
         'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
gs.fit(X_tfidf_feat, labels)
pd.DataFrame(gs.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]
```
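A tiny self-contained grid-search run showing how to read out the winning hyperparameters; it uses a synthetic dataset and a deliberately reduced grid so it executes quickly:

```py
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data and a minimal grid, for illustration only
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
param = {'n_estimators': [5, 10]}

gs = GridSearchCV(RandomForestClassifier(random_state=0), param, cv=3)
gs.fit(X, y)

print(gs.best_params_)  # hyperparameter combination with the best mean CV score
print(gs.best_score_)
```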