4ml

K-Fold Cross-Validation: the data is split into k subsets; in each round one subset is the test data and the remaining k-1 are training data. Repeated k times so every subset serves as the test set once.
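A minimal sketch of how k-fold splitting works (the 10-sample toy list is an assumption for illustration, not data from these notes):

```py
from sklearn.model_selection import KFold

data = list(range(10))  # 10 toy samples
k_fold = KFold(n_splits=5)

# Each of the 5 iterations holds out one fold (2 samples) as test data
# and trains on the remaining 4 folds (8 samples).
for train_idx, test_idx in k_fold.split(data):
    print(len(train_idx), len(test_idx))  # 8 2 on each of the 5 iterations
```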

Evaluation:

Accuracy = # predicted correctly / total
Precision = # predicted positive that are actually positive / # predicted positive
Recall = # actual positives that are predicted positive / # actual positives
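The formulas can be checked by hand on a few predictions (the toy labels below are hypothetical, chosen only to illustrate the arithmetic):

```py
# Hypothetical toy labels to illustrate accuracy and precision.
y_true = ['spam', 'spam', 'ham', 'ham', 'spam']
y_pred = ['spam', 'ham', 'spam', 'ham', 'spam']

# Accuracy: fraction of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)  # 3 correct out of 5 -> 0.6

# Precision: of the examples predicted 'spam', how many really are.
predicted_pos = [t for t, p in zip(y_true, y_pred) if p == 'spam']
precision = predicted_pos.count('spam') / len(predicted_pos)  # 2 of 3
```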

Random Forest

  • Ensemble method that creates multiple simple models and combines them

  • Constructs a collection of decision trees, then aggregates the predictions of each tree to determine the final prediction.

  • Easily handles outliers, missing values, and mixed input types

  • Outputs feature importances

  • Can do classification and regression

    # Input is a TfidfVectorizer of the data
    from sklearn.ensemble import RandomForestClassifier
    # RandomForestClassifier.feature_importances_ is great for understanding the model
    RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)

Cross-Validation

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rf = RandomForestClassifier(n_jobs=-1)  # n_jobs=-1 parallelizes across all cores
k_fold = KFold(n_splits=5)
cross_val_score(rf, X_features, labels, cv=k_fold, scoring='accuracy', n_jobs=-1)

Holdout Test Set

from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_features, labels, test_size=0.2)

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
rf_model = rf.fit(X_train, y_train)
# Top 10 most important features (assumes X_train is a DataFrame)
sorted(zip(rf_model.feature_importances_, X_train.columns), reverse=True)[0:10]
y_pred = rf_model.predict(X_test)
precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')

Defining a grid of hyperparameters and exploring them.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split
def train_RF(n_est, depth):
  rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)
  rf_model = rf.fit(X_train, y_train)
  y_pred = rf_model.predict(X_test)
  precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')
  print('Est: {} / Depth: {} -- Precision: {} / Recall: {} / Fscore: {}'.format(
      n_est, depth, precision, recall, fscore))

for n_est in [10,50,100]:
  for depth in [10,20,30,None]:
    train_RF(n_est, depth)

Grid Search And Cross Validation

```py
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300], 'max_depth': [30, 60, 90, None]}
gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
gs.fit(X_tfidf_feat, labels)
pd.DataFrame(gs.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]
```
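After fitting, `GridSearchCV` also exposes the winning combination directly via `best_params_` and `best_score_`. A self-contained sketch on toy data (the `make_classification` data is an assumption standing in for the TF-IDF features used above):

```py
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the TF-IDF features used in these notes.
X, y = make_classification(n_samples=200, random_state=0)

param = {'n_estimators': [10, 50], 'max_depth': [5, None]}
gs = GridSearchCV(RandomForestClassifier(random_state=0), param, cv=5, n_jobs=-1)
gs.fit(X, y)

print(gs.best_params_)  # best hyperparameter combination found
print(gs.best_score_)   # its mean cross-validated accuracy
```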
