Regularization

L2 regularization adds (lambda / 2m) * ||W||² to the cost; its derivative with respect to W is (lambda / m) * W, so each gradient step subtracts an extra learning_rate * (lambda / m) * W from W. W shrinks a little on every update, hence *weight decay*.
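
A minimal NumPy sketch of that update, assuming `W`, the backprop gradient `dW`, the regularization strength `lambd`, the batch size `m`, and `learning_rate` come from the surrounding training loop (names are illustrative):

```python
import numpy as np

def update_with_weight_decay(W, dW, lambd, m, learning_rate):
    # The L2 term (lambd / (2*m)) * ||W||^2 adds (lambd / m) * W to the gradient,
    # so W is shrunk by a factor (1 - learning_rate * lambd / m) each step: weight decay.
    dW_reg = dW + (lambd / m) * W
    return W - learning_rate * dW_reg

# Example: W = np.random.randn(3, 2), dW from backprop, m = number of examples
```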

With regularization, plot the full regularized cost J (including the lambda term); if you only plot the original cross-entropy part, the curve may not decrease monotonically.
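
A sketch of the quantity to plot, assuming `cross_entropy_cost` is already computed and `weights` is a list of the layer W matrices (names are illustrative):

```python
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    # Add the L2 penalty (lambd / (2*m)) * sum of squared Frobenius norms of all W^[l].
    # This is the quantity gradient descent actually minimizes, so plot this one.
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_penalty
```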

Dropout regularization

  • Set a keep probability (keep_prob) per layer and randomly drop nodes during training (see the sketch after this list)

  • Mathematically similar to L2 regularization: because any node can be dropped, the network can't rely too heavily on any one node, which spreads out the weights

  • You should still check that J is decreasing, but dropout makes J ill-defined and the curve noisier. First run the code with dropout turned off (keep_prob = 1) and confirm the cost is decreasing, then turn dropout back on.
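
A minimal inverted-dropout sketch (function name and arguments are illustrative); scaling by 1/keep_prob keeps the expected activation unchanged, so no rescaling is needed at test time:

```python
import numpy as np

def dropout_forward(A, keep_prob, training=True):
    # Inverted dropout: zero out units at random, then scale the survivors
    # by 1/keep_prob so the expected value of the activations stays the same.
    if not training or keep_prob == 1.0:
        return A  # no dropout at test time or when keep_prob is 1
    mask = np.random.rand(*A.shape) < keep_prob
    return (A * mask) / keep_prob
```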

Early Stopping

Plot dev error vs. training error and stop training before the dev error starts to increase. This keeps W small, so it acts like L2 regularization without the effort of trying different lambda values, but it couples "optimize the cost" with "don't overfit", which goes against orthogonalization.
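
A rough early-stopping loop, assuming hypothetical `train_step()` and `dev_error()` helpers and a `patience` parameter (illustrative, not from the course):

```python
def train_with_early_stopping(train_step, dev_error, max_epochs=100, patience=5):
    # train_step() runs one epoch; dev_error() returns the current dev-set error.
    # Stop once the dev error has not improved for `patience` epochs in a row.
    best_error, best_epoch, epochs_since_improvement = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_step()
        err = dev_error()
        if err < best_error:
            best_error, best_epoch = err, epoch
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break  # dev error stopped improving: stop early
    return best_epoch, best_error
```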

Orthogonalization

The principle of handling one concern at a time: one knob to optimize the cost, a separate knob to reduce overfitting. There are so many hyperparameters that coupling them (as early stopping does) makes tuning harder.

Weight Initialization

The more inputs a neuron has, the smaller you want each weight: z = w1*x1 + … + wn*xn grows with the number of terms, so a large fan-in needs smaller weights to keep activations and gradients from exploding.

np.random.randn(...) * sqrt(2 / n), where n is the number of neurons feeding into the layer (He initialization, a good default for ReLU), is a good approximation that gives the weights a sensible variance and size.
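
A sketch of He initialization for a list of layer sizes (the `layer_dims` argument and the parameter-dict layout are illustrative):

```python
import numpy as np

def he_initialize(layer_dims):
    # layer_dims = [n_x, n_h1, ..., n_y]; scale each W^[l] by sqrt(2 / n^[l-1]),
    # i.e. by the number of units feeding into the layer.
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                * np.sqrt(2.0 / layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params
```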
