How does BN help optimization?

 

Batch Normalization

  • popular belief:

    BatchNorm works by controlling the change of the layers’ input distributions during training, thereby reducing the so-called “internal covariate shift”.

  • truth:

    BatchNorm makes the optimization landscape significantly smoother, which induces more predictive and stable gradient behavior and allows for faster training (the transform itself is sketched below).
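
As a reminder of what the BatchNorm transform itself computes, here is a minimal NumPy sketch (the function name and toy shapes are illustrative only):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance per feature
    return gamma * x_hat + beta              # learnable scale and shift

# toy usage: activations with arbitrary mean/scale come out normalized
x = np.random.randn(32, 4) * 3.0 + 5.0
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1
```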

ICS

  • ICS refers to the change in the distribution of layer inputs caused by updates to the preceding layers.
  • it is conjectured that this continual change negatively impacts training.
  • BatchNorm might not even be reducing internal covariate shift (one way to quantify this is given below).
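
To make the last point testable, ICS has to be quantified; roughly following the paper’s gradient-based definition, for layer $i$ at training step $t$:

$$\mathrm{ICS}_{t,i} = \big\lVert G_{t,i} - G'_{t,i} \big\rVert_2, \qquad G_{t,i} = \nabla_{W_i}\,\mathcal{L}\big(W_1^{(t)}, \dots, W_k^{(t)}\big),$$

where $G'_{t,i}$ is the same gradient re-evaluated after all preceding layers $W_1, \dots, W_{i-1}$ have taken their update. If BatchNorm worked by reducing ICS, this difference should be smaller in BatchNorm networks.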

Impact

  • makes the landscape of the corresponding optimization problem significantly smoother
  • gradients become more predictive, which allows for a larger range of learning rates and faster network convergence
  • under natural conditions, the Lipschitzness of both the loss and the gradients is improved in models with BatchNorm (a simple empirical probe is sketched after this list)
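
A rough way to probe this gradient predictiveness empirically is to take a step along the current gradient and check how far the gradient itself moves; below is a PyTorch sketch (the helper `gradient_predictiveness`, the toy model, and the step sizes are assumptions for illustration, not the paper’s measurement code):

```python
import torch

def gradient_predictiveness(model, loss_fn, x, y, step_sizes=(0.01, 0.05, 0.1)):
    """Take steps of different sizes along the negative gradient and measure how
    far the gradient moves (l2 distance); small distances indicate predictive
    gradients and a smoother landscape."""
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad():
        grads = torch.autograd.grad(loss_fn(model(x), y), params)
        return torch.cat([g.reshape(-1) for g in grads])

    g0 = flat_grad()
    originals = [p.detach().clone() for p in params]
    distances = []
    for eta in step_sizes:
        with torch.no_grad():
            offset = 0
            for p, p0 in zip(params, originals):
                n = p.numel()
                p.copy_(p0 - eta * g0[offset:offset + n].view_as(p))
                offset += n
        distances.append((flat_grad() - g0).norm().item())
    with torch.no_grad():                         # restore original parameters
        for p, p0 in zip(params, originals):
            p.copy_(p0)
    return distances

# usage with a tiny model (illustrative)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.BatchNorm1d(32),
                            torch.nn.ReLU(), torch.nn.Linear(32, 2))
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
print(gradient_predictiveness(model, torch.nn.functional.cross_entropy, x, y))
```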

Controlling internal covariate shift?

  • the paper trains networks with random noise injected after BatchNorm layers: each activation of each sample in the batch is perturbed with i.i.d. noise sampled from a distribution with non-zero mean and non-unit variance (a sketch of such a layer follows this list)
  • comparing the training behavior of standard, BatchNorm, and “noisy” BatchNorm networks shows that the “noisy” variant trains essentially as well as standard BatchNorm, despite its artificially inflated covariate shift
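
A sketch of such a “noisy” BatchNorm layer in PyTorch (the class name `NoisyBatchNorm2d` and the noise magnitudes are illustrative assumptions, not the paper’s exact setup):

```python
import torch
import torch.nn as nn

class NoisyBatchNorm2d(nn.Module):
    """BatchNorm2d followed by i.i.d. noise with non-zero mean and non-unit
    variance, drawn independently for every activation of every sample, so the
    layer's output distribution keeps shifting during training."""
    def __init__(self, num_features, mean_scale=0.5, std_scale=0.5):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features)
        self.mean_scale = mean_scale      # illustrative noise magnitudes,
        self.std_scale = std_scale        # not the paper's exact values

    def forward(self, x):
        out = self.bn(x)
        if self.training:
            noise_mean = self.mean_scale * torch.rand_like(out)       # non-zero mean
            noise_std = 1.0 + self.std_scale * torch.rand_like(out)   # non-unit variance
            out = out * noise_std + noise_mean
        return out

# drop-in replacement for nn.BatchNorm2d, e.g. after a convolution: NoisyBatchNorm2d(64)
```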

BatchNorm may even increase ICS

The smoothing effect of BatchNorm

  • in a deep network without BatchNorm, the loss function tends to have a large number of “kinks”, i.e., sharp, non-smooth regions
  • BatchNorm makes the gradients more reliable and predictive, enabling any (gradient-based) training algorithm to take larger steps without the loss changing abruptly (a simple probe is sketched below)
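
One simple probe is to evaluate the loss at points reached by gradient steps of different sizes: a wide spread of loss values indicates a “kinky” landscape, a narrow spread a smooth one that tolerates large steps. A sketch under the same illustrative assumptions as above (`loss_along_gradient` is not the paper’s code):

```python
import torch

def loss_along_gradient(model, loss_fn, x, y, step_sizes=(0.05, 0.1, 0.25)):
    """Evaluate the loss at points reached by gradient steps of different sizes."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss_fn(model(x), y), params)
    originals = [p.detach().clone() for p in params]
    losses = []
    with torch.no_grad():
        for eta in step_sizes:
            for p, p0, g in zip(params, originals, grads):
                p.copy_(p0 - eta * g)                 # step of size eta
            losses.append(loss_fn(model(x), y).item())
        for p, p0 in zip(params, originals):          # restore parameters
            p.copy_(p0)
    return losses
```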

$L_p$-Normalization

  • instead of using the batch mean and variance, the activations are normalized by the average of their $\ell_p$-norm (e.g., $p = 1, 2, \infty$); a rough sketch follows
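
A rough NumPy sketch of such an $\ell_p$-normalization step (the exact centering and normalization axes used in the paper may differ):

```python
import numpy as np

def lp_normalize(x, p=2, eps=1e-5):
    """Center the activations, then divide by the batch-average l_p norm
    instead of the per-feature standard deviation (p = 1, 2 or np.inf)."""
    x_centered = x - x.mean(axis=0)
    norms = np.linalg.norm(x_centered, ord=p, axis=1)   # one l_p norm per sample
    return x_centered / (norms.mean() + eps)

# usage: the same toy activations normalized with different norms
x = np.random.randn(32, 4)
for p in (1, 2, np.inf):
    print(p, lp_normalize(x, p=p).std().round(3))
```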