Heuristics

Performance

  • Stochastic gradient descent.

  • Standardize the inputs to zero mean and unit variance (see the preprocessing sketch after this list).

  • Center the hidden units, for example by replacing the sigmoid with the tanh activation.

  • Use a different learning rate for each layer of weights (illustrated in the optimizer sketch below).

  • Emphasize rare classes by sampling their examples more often (also in the preprocessing sketch below).

  • Second-order optimization:

    • Nonlinear conjugate gradient

    • Stochastic Levenberg-Marquardt

  • Acceleration schemes (an Adam update is sketched below):

    • RMSprop

    • Adam

    • AMSGrad
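
A minimal numpy sketch of the preprocessing items above (all names here are illustrative, not from the source): standardize the inputs to zero mean and unit variance, and oversample a rare class so it appears more often in mini-batches.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # raw features
y = (rng.random(1000) < 0.05).astype(int)           # rare positive class (~5%)

# Standardization: zero mean, unit variance per feature.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma

# Emphasize the rare class by sampling its examples more often.
weights = np.where(y == 1, 10.0, 1.0)
weights = weights / weights.sum()
batch_idx = rng.choice(len(X_std), size=32, p=weights)
X_batch, y_batch = X_std[batch_idx], y[batch_idx]
```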

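A sketch of the Adam update rule on one weight array, assuming the standard hyperparameters (beta1=0.9, beta2=0.999); the function and variable names are mine. A per-layer learning rate just means keeping a separate lr (and optimizer state) per weight matrix.

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the moment estimates and step count."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad       # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2  # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Per-layer learning rates: one lr and one optimizer state per weight matrix.
layers = {"W1": np.zeros((4, 8)), "W2": np.zeros((8, 1))}
lrs = {"W1": 1e-3, "W2": 1e-4}
states = {name: {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}
          for name, w in layers.items()}
```

RMSprop keeps only the second-moment average (no first moment, no bias correction); AMSGrad additionally maintains a running maximum of v_hat so the effective step size never increases.
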
Avoid Bad Local Minima

  • Stochastic gradient descent.

  • Momentum (see the sketch after this list).
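
A minimal sketch of SGD with heavy-ball momentum (names are illustrative):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """Accumulate a velocity across steps, then move the weights along it."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w = np.zeros(10)
v = np.zeros_like(w)
grad = np.ones_like(w)  # stand-in for a mini-batch gradient
w, v = sgd_momentum_step(w, grad, v)
```

The mini-batch noise of SGD plus the accumulated velocity help the iterate roll through small, poor basins rather than settling in the first minimum it reaches.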

Avoid Overfitting

  • Ensembles of neural nets trained with random initial weights and bagging.

  • L2 regularization (aka weight decay); included in the sketch after this list.

  • Dropout: randomly disable a subset of units in the hidden layers on each training pass (typically per mini-batch). A disabled unit keeps its stored weights; it simply takes no part in the forward pass and receives no weight update on that pass. This simulates training an ensemble of subnetworks (sketched below).
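
A sketch of inverted dropout on a hidden activation, plus an L2 weight-decay term added to the loss gradient; this is one common formulation, with illustrative names, not necessarily the source's exact one.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    """Inverted dropout: zero a random subset of units and rescale the
    survivors so the expected activation is unchanged; identity at test time."""
    if not train:
        return h
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

def l2_grad(grad_loss, w, lam=1e-4):
    """L2 regularization (weight decay): gradient of lam/2 * ||w||^2 is lam*w."""
    return grad_loss + lam * w

h = np.ones((32, 8))  # a batch of hidden activations
h_train = dropout(h, p_drop=0.5, train=True)
```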