Heuristics

Performance

  • Stochastic gradient descent.

  • Standardize the inputs to zero mean and unit variance (see the preprocessing sketch after this list).

  • Center the hidden units, for example by replacing the sigmoid with the tanh activation.

  • Use a different learning rate for each layer of weights (illustrated in the optimizer sketch below).

  • Emphasize rare classes by sampling their examples more often (also in the preprocessing sketch below).

  • Second-order optimization:

    • Nonlinear conjugate gradient

    • Stochastic Levenberg-Marquardt

  • Acceleration schemes (an Adam update is sketched below):

    • RMSprop

    • Adam

    • AMSGrad
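
A minimal numpy sketch of the preprocessing items above (all names here are illustrative, not from the source): standardize the inputs to zero mean and unit variance, and oversample a rare class so it appears more often in mini-batches.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # raw features
y = (rng.random(1000) < 0.05).astype(int)           # rare positive class (~5%)

# Standardization: zero mean, unit variance per feature.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma

# Emphasize the rare class by sampling its examples more often.
weights = np.where(y == 1, 10.0, 1.0)
weights = weights / weights.sum()
batch_idx = rng.choice(len(X_std), size=32, p=weights)
X_batch, y_batch = X_std[batch_idx], y[batch_idx]
```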

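A sketch of the Adam update rule on one weight array, assuming the standard hyperparameters (beta1=0.9, beta2=0.999); the function and variable names are mine. A per-layer learning rate just means keeping a separate lr (and optimizer state) per weight matrix.

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the moment estimates and step count."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad       # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2  # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Per-layer learning rates: one lr and one optimizer state per weight matrix.
layers = {"W1": np.zeros((4, 8)), "W2": np.zeros((8, 1))}
lrs = {"W1": 1e-3, "W2": 1e-4}
states = {name: {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}
          for name, w in layers.items()}
```

RMSprop keeps only the second-moment average (no first moment, no bias correction); AMSGrad additionally maintains a running maximum of v_hat so the effective step size never increases.
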
Avoid Bad Local Minima

  • Stochastic gradient descent.

  • Momentum (see the sketch after this list).
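
A minimal sketch of SGD with heavy-ball momentum (names are illustrative):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """Accumulate a velocity across steps, then move the weights along it."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w = np.zeros(10)
v = np.zeros_like(w)
grad = np.ones_like(w)  # stand-in for a mini-batch gradient
w, v = sgd_momentum_step(w, grad, v)
```

The mini-batch noise of SGD plus the accumulated velocity help the iterate roll through small, poor basins rather than settling in the first minimum it reaches.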

Avoid Overfitting

  • Ensembles of neural nets trained with random initial weights and bagging.

  • L2 regularization (aka weight decay); included in the sketch after this list.

  • Dropout: randomly disable a subset of units in the hidden layers on each training pass (typically per mini-batch). A disabled unit keeps its stored weights; it simply takes no part in the forward pass and receives no weight update on that pass. This simulates training an ensemble of subnetworks (sketched below).
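
A sketch of inverted dropout on a hidden activation, plus an L2 weight-decay term added to the loss gradient; this is one common formulation, with illustrative names, not necessarily the source's exact one.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    """Inverted dropout: zero a random subset of units and rescale the
    survivors so the expected activation is unchanged; identity at test time."""
    if not train:
        return h
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

def l2_grad(grad_loss, w, lam=1e-4):
    """L2 regularization (weight decay): gradient of lam/2 * ||w||^2 is lam*w."""
    return grad_loss + lam * w

h = np.ones((32, 8))  # a batch of hidden activations
h_train = dropout(h, p_drop=0.5, train=True)
```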