
K-Fold Cross-Validation: the data is split into k subsets; one subset is held out as test data and the remaining k-1 are training data. Repeated k times so that each subset serves as the test set exactly once.
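As a quick illustration, here is a minimal sketch of the splits KFold produces (the 6-sample array is just a toy example):

```py
from sklearn.model_selection import KFold

data = list(range(6))  # toy dataset of 6 samples
for train_idx, test_idx in KFold(n_splits=3).split(data):
    print('train:', train_idx, 'test:', test_idx)
# Each of the 3 subsets serves as the test set exactly once
```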

Evaluation:

Accuracy = # predicted correctly / total

Precision = # predicted positive that are actually positive / # predicted positive
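A quick sketch of computing both metrics with scikit-learn (the toy labels here are made up for illustration):

```py
from sklearn.metrics import accuracy_score, precision_score

y_true = [1, 0, 1, 1, 0, 1]  # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 1, 1]  # hypothetical model predictions
print(accuracy_score(y_true, y_pred))   # 4 correct / 6 total = 0.667
print(precision_score(y_true, y_pred))  # 3 true positives / 4 predicted positive = 0.75
```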

Random Forest

  • Ensemble method that creates multiple simple models and combines them

  • Constructs a collection of decision trees and then aggregates the predictions of each tree to determine the final prediction.

  • Easily handles outliers, missing values, and different input types

  • Outputs feature importances (see the sketch below)

  • Can do classification and regression

    # Input is a TfidfVectorizer transform of the data
    from sklearn.ensemble import RandomForestClassifier
    # RandomForestClassifier.feature_importances_ is great for understanding the model
    # Note: min_impurity_split and max_features='auto' were removed in newer
    # scikit-learn; 'sqrt' matches the old classifier default
    RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None,
                           min_samples_split=2, min_samples_leaf=1,
                           min_weight_fraction_leaf=0.0, max_features='sqrt',
                           max_leaf_nodes=None, min_impurity_decrease=0.0,
                           bootstrap=True, oob_score=False, n_jobs=1,
                           random_state=None, verbose=0, warm_start=False,
                           class_weight=None)
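
A minimal sketch of reading feature importances from a fitted forest, assuming X_features/labels come from the TfidfVectorizer step and tfidf_vect is the fitted vectorizer (hypothetical names):

```py
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_features, labels)  # X_features/labels assumed from the TF-IDF step
# Pair each importance score with its feature name and show the top 10
top10 = sorted(zip(rf.feature_importances_, tfidf_vect.get_feature_names_out()),
               reverse=True)[:10]
print(top10)
```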

Cross-Validation

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold, cross_val_score

    rf = RandomForestClassifier(n_jobs=-1)  # n_jobs=-1 parallelizes across all cores
    k_fold = KFold(n_splits=5)
    cross_val_score(rf, X_features, labels, cv=k_fold, scoring='accuracy', n_jobs=-1)
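
cross_val_score returns one score per fold (an array of 5 accuracies here); averaging them gives a more stable estimate than any single train/test split.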

Holdout Test Set

Split off a holdout test set before tuning: hyperparameters (e.g., a grid of candidate values) are explored on the training data only, and the untouched holdout set gives the final, unbiased performance estimate.
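A minimal sketch of carving out the holdout set with train_test_split (the 80/20 split and random_state are arbitrary choices):

```py
from sklearn.model_selection import train_test_split

# Hold out 20% of the data; it is only used for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_features, labels, test_size=0.2, random_state=42)
```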

Grid Search and Cross-Validation

```py
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300], 'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
gs.fit(X_tfidf_feat, labels)

# Show the five best parameter combinations by mean test score
pd.DataFrame(gs.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]
```
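
After fitting, gs.best_params_ and gs.best_score_ hold the best parameter combination and its mean cross-validated score.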
