3 Feature Engineering

Creating new features helps get the most out of the data. It can be a complex topic.

Creating New Features

Example: the length of a text field, such as a title
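A minimal sketch of such a length feature, assuming the raw text is available as plain Python strings. The function name `body_len` mirrors the column name used in the histogram code in these notes. Leaving spaces out of the count is an assumption, and the sample messages are made up for illustration.

```python
def body_len(text: str) -> int:
    """Length of a message, not counting spaces (an assumed convention)."""
    return len(text) - text.count(" ")


# Hypothetical sample messages, for illustration only
messages = ["Free entry in 2 a wkly comp", "ok see you later"]
lengths = [body_len(m) for m in messages]
print(lengths)
```

With a pandas DataFrame, the same function could be applied as `data['body_len'] = data['body_text'].apply(body_len)`, where `body_text` is a hypothetical column name.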

Check whether the feature is actually important

Let's make some histograms

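from matplotlib import pyplot
import numpy as np
%matplotlib inline

# 40 evenly spaced bin edges from 0 to 200
bins = np.linspace(0, 200, 40)
# density=True scales each histogram to unit area so the two classes are comparable
# (it replaces the older normed=True argument, which newer Matplotlib no longer accepts)
pyplot.hist(data[data['label'] == 'spam']['body_len'], bins, alpha=0.5, density=True, label='spam')
pyplot.hist(data[data['label'] == 'ham']['body_len'], bins, alpha=0.5, density=True, label='ham')
# the label= arguments above give the legend something to show
pyplot.legend(loc='upper left')
pyplot.show()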

Err on the side of leaving the feature in the model to see whether it helps

Transformations

Why?
If the data is skewed with a long tail of large values, a log transform pulls that tail toward the middle. Without it, the model might dig too much into the tail instead of exploring the differences within the majority.
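To illustrate the effect, here is a small sketch that measures skewness before and after a log transform. The numbers are made up, the `skewness` helper is written out by hand rather than taken from a library, and `log1p` is used as an assumed choice so that zeros would not break the transform.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by the std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up values with a long tail of large values (illustration only)
raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 40, 200]
# log1p(x) = log(1 + x), which also copes with zeros
logged = [math.log1p(v) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

The skewness drops after the transform because the log compresses large values far more than small ones.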

Where

Prime candidates: features with a dramatic skew, a long tail, or a few outliers
A bimodal feature that isn't heavily skewed and has no clear outliers is a weaker candidate

Box-Cox Power Transformation

Usually uses exponents: y^x, where y is the value and x is the exponent.
Aim for a normal distribution; don't worry about 0.
Test a range of exponents and compare them using a measurement criterion.
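The process above could be sketched roughly as follows. This uses the plain power y^x described in these notes, not the full Box-Cox formula. The data is made up, and absolute skewness is an assumed measurement criterion (a plot or a normality test would be other options). The exponent 0 is skipped because y^0 is 1 for every value.

```python
def skewness(values):
    """Population skewness: third central moment divided by the std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def power_transform(values, exponent):
    """Raise every (positive) value to the given exponent."""
    return [v ** exponent for v in values]


# Made-up positive values with a long tail (illustration only)
y = [1, 2, 2, 3, 3, 3, 4, 5, 8, 40, 200]
# Candidate exponents; 1 leaves the data unchanged, 0 is skipped
exponents = [-2, -1, -0.5, 0.5, 1, 2]

# Assumed criterion: the closer |skewness| is to 0, the more symmetric the result
scores = {x: abs(skewness(power_transform(y, x))) for x in exponents}
best = min(scores, key=scores.get)

for x in exponents:
    print(f"exponent {x:>4}: |skewness| = {scores[x]:.2f}")
print("closest to symmetric:", best)
```

For real work, SciPy's `scipy.stats.boxcox` can estimate the exponent directly on strictly positive data, which may be preferable to a manual search.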


Last updated 7 years ago