Optimizer that implements the AMSGrad optimization algorithm, presented in [On the Convergence of Adam and Beyond](https://openreview.net/pdf?id=ryQu7f-RZ).
Initialization:
```
m_0 = 0      // Initialize the 1st moment vector
v_0 = 0      // Initialize the 2nd moment vector
v_hat_0 = 0  // Initialize the 2nd moment max vector
t = 0        // Initialize the time step
```
The AMSGrad update for step `t` is as follows:

```
learningRate_t = initialLearningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient * gradient
v_hat_t = max(v_t, v_hat_{t-1})
variable -= learningRate_t * m_t / (sqrt(v_hat_t) + epsilon)
```
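As a concrete illustration, the update rule can be sketched in NumPy. The function name, state layout, and hyper-parameter defaults here are illustrative assumptions, not the library's actual API:

```python
import numpy as np

def amsgrad_step(variable, gradient, state, initial_lr=0.001,
                 beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one AMSGrad update; `state` holds m, v, v_hat and the step count t."""
    state["t"] += 1
    t = state["t"]
    # Bias-corrected effective learning rate for step t.
    lr_t = initial_lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    state["m"] = beta1 * state["m"] + (1 - beta1) * gradient
    state["v"] = beta2 * state["v"] + (1 - beta2) * gradient * gradient
    # AMSGrad's key difference from Adam: keep the element-wise maximum
    # of all second-moment estimates seen so far.
    state["v_hat"] = np.maximum(state["v"], state["v_hat"])
    return variable - lr_t * state["m"] / (np.sqrt(state["v_hat"]) + epsilon)
```

Because `v_hat` never decreases, the effective per-coordinate step size is non-increasing, which is the property the convergence analysis in the paper relies on.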
The default value of `1e-8` for `epsilon` might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is `1.0` or `0.1`.
The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of `tf.gather` or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (`beta1`) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations, which ignore momentum unless a variable slice was actually used).
For more information on this algorithm, please refer to this [paper](https://openreview.net/pdf?id=ryQu7f-RZ).
Optimizer that implements the AdaDelta optimization algorithm.
The AdaDelta update is as follows:
```
accumulator = rho * accumulator + (1 - rho) * gradient * gradient
update = sqrt(accumulatorUpdate + epsilon) * rsqrt(accumulator + epsilon) * gradient
accumulatorUpdate = rho * accumulatorUpdate + (1 - rho) * square(update)
variable -= update
```
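A minimal NumPy sketch of this update, with illustrative names and defaults (not the library's API), may make the accumulator bookkeeping clearer:

```python
import numpy as np

def adadelta_step(variable, gradient, state, rho=0.95, epsilon=1e-6):
    """Apply one AdaDelta update; `state` holds the two accumulators."""
    # Running average of squared gradients.
    state["acc"] = rho * state["acc"] + (1 - rho) * gradient * gradient
    # The step is scaled by the ratio of the RMS of past updates to the
    # RMS of past gradients, so no explicit learning rate is required.
    update = (np.sqrt(state["acc_update"] + epsilon)
              / np.sqrt(state["acc"] + epsilon) * gradient)
    # Running average of squared updates, using the update just computed.
    state["acc_update"] = rho * state["acc_update"] + (1 - rho) * update * update
    return variable - update
```

Note that `update` is computed from the *previous* value of the update accumulator, which is then refreshed afterwards, matching the order of the pseudocode above.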
For more information on this algorithm, please refer to this [paper](http://arxiv.org/abs/1212.5701) ([PDF](http://arxiv.org/pdf/1212.5701v1.pdf)).
Optimizer that implements the AdaGrad optimization algorithm.
The AdaGrad update is as follows:

```
accumulator += gradient * gradient
variable -= stepSize * gradient * (1 / sqrt(accumulator))
```
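The same update can be sketched in NumPy. The names and the small `epsilon` safety term are illustrative additions, not part of the pseudocode above:

```python
import numpy as np

def adagrad_step(variable, gradient, state, step_size=0.01, epsilon=1e-8):
    """Apply one AdaGrad update; `state["acc"]` accumulates squared gradients."""
    state["acc"] += gradient * gradient
    # epsilon guards against division by zero when the accumulator is
    # still zero for some coordinate; the pseudocode above omits it.
    return variable - step_size * gradient / (np.sqrt(state["acc"]) + epsilon)
```

Since the accumulator only grows, coordinates that receive frequent or large gradients see their effective step size shrink over time.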
For more information on this algorithm, please refer to this [paper](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf).
Optimizer that implements the Adam optimization algorithm.
Initialization:
```
m_0 = 0  // Initialize the 1st moment vector
v_0 = 0  // Initialize the 2nd moment vector
t = 0    // Initialize the time step
```
The Adam update for step `t` is as follows:

```
learningRate_t = initialLearningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient * gradient
variable -= learningRate_t * m_t / (sqrt(v_t) + epsilon)
```
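A NumPy sketch of one Adam step, using this "epsilon hat" formulation where the bias corrections are folded into the learning rate (names and defaults are illustrative assumptions):

```python
import numpy as np

def adam_step(variable, gradient, state, initial_lr=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam update; `state` holds m, v and the step count t."""
    state["t"] += 1
    t = state["t"]
    # Bias corrections folded into the learning rate ("epsilon hat" form).
    lr_t = initial_lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    state["m"] = beta1 * state["m"] + (1 - beta1) * gradient
    state["v"] = beta2 * state["v"] + (1 - beta2) * gradient * gradient
    return variable - lr_t * state["m"] / (np.sqrt(state["v"]) + epsilon)
```

With a negligible `epsilon`, the magnitude of the very first step is close to `initial_lr` regardless of the gradient's scale, since `m` and `sqrt(v)` are both proportional to the gradient at `t = 1`.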
The default value of `1e-8` for `epsilon` might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is `1.0` or `0.1`. Note that since the Adam optimizer uses the formulation just before Section 2.1 of the [Kingma and Ba paper](https://arxiv.org/abs/1412.6980) rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.
The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of `tf.gather` or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (`beta1`) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations, which ignore momentum unless a variable slice was actually used).
For more information on this algorithm, please refer to this [paper](https://arxiv.org/abs/1412.6980) ([PDF](https://arxiv.org/pdf/1412.6980.pdf)).
Optimizer that implements the gradient descent algorithm and includes support for learning rate decay, momentum, and Nesterov acceleration.
Optimizer that implements a variant of the AMSGrad optimization algorithm that handles sparse updates more efficiently.
The original AMSGrad algorithm maintains three moving-average accumulators for each trainable variable; the accumulators are updated at every step. This class provides lazier handling of gradient updates for sparse variables: it only updates the moving-average accumulators for sparse variable indices that appear in the current batch, rather than updating the accumulators for all indices. Compared with the original AMSGrad optimizer, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original AMSGrad algorithm, and may lead to different empirical results.
Initialization:
```
m_0 = 0      // Initialize the 1st moment vector
v_0 = 0      // Initialize the 2nd moment vector
v_hat_0 = 0  // Initialize the 2nd moment max vector
t = 0        // Initialize the time step
```
The AMSGrad update for step `t` is as follows:

```
learningRate_t = initialLearningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient * gradient
v_hat_t = max(v_t, v_hat_{t-1})
variable -= learningRate_t * m_t / (sqrt(v_hat_t) + epsilon)
```
The default value of `1e-8` for `epsilon` might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is `1.0` or `0.1`.
The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of `tf.gather` or an embedding lookup in the forward pass) does not apply momentum to variable slices if they are not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (`beta1`) is also not applied to the entire momentum accumulator. This means that the sparse behavior is not equivalent to the dense behavior.
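The index-restricted update described above can be sketched in NumPy; the function name and state layout are illustrative, and unique indices per batch are assumed:

```python
import numpy as np

def lazy_amsgrad_sparse_step(variable, grad_values, grad_indices, state,
                             initial_lr=0.001, beta1=0.9, beta2=0.999,
                             epsilon=1e-8):
    """Lazy sparse AMSGrad: only rows listed in `grad_indices` are touched.

    Assumes `grad_indices` contains no duplicates.
    """
    state["t"] += 1
    t = state["t"]
    lr_t = initial_lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    i = np.asarray(grad_indices)
    # Accumulators are updated (and decayed) only at the active rows; rows
    # absent from the batch keep stale m, v, and v_hat values.
    state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * grad_values
    state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * grad_values ** 2
    state["v_hat"][i] = np.maximum(state["v"][i], state["v_hat"][i])
    out = variable.copy()
    out[i] -= lr_t * state["m"][i] / (np.sqrt(state["v_hat"][i]) + epsilon)
    return out
```

Rows that never appear in a batch are never decayed, which is exactly the semantic difference from the dense behavior noted above.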
For more information on this algorithm, please refer to this [paper](https://openreview.net/pdf?id=ryQu7f-RZ).
Optimizer that implements a variant of the Adam optimization algorithm that handles sparse updates more efficiently.
The original Adam algorithm maintains two moving-average accumulators for each trainable variable; the accumulators are updated at every step. This class provides lazier handling of gradient updates for sparse variables. It only updates the moving-average accumulators for sparse variable indices that appear in the current batch, rather than updating the accumulators for all indices. Compared with the original Adam optimizer, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original Adam algorithm, and may lead to different empirical results.
Initialization:
```
m_0 = 0  // Initialize the 1st moment vector
v_0 = 0  // Initialize the 2nd moment vector
t = 0    // Initialize the time step
```
The Adam update for step `t` is as follows:

```
learningRate_t = initialLearningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient * gradient
variable -= learningRate_t * m_t / (sqrt(v_t) + epsilon)
```
The default value of `1e-8` for `epsilon` might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is `1.0` or `0.1`. Note that since the Adam optimizer uses the formulation just before Section 2.1 of the [Kingma and Ba paper](https://arxiv.org/abs/1412.6980) rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.
The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of `tf.gather` or an embedding lookup in the forward pass) does not apply momentum to variable slices if they are not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (`beta1`) is also not applied to the entire momentum accumulator. This means that the sparse behavior is not equivalent to the dense behavior.
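A NumPy sketch of this lazy sparse behavior (function name and state layout are illustrative assumptions; unique indices per batch are assumed):

```python
import numpy as np

def lazy_adam_sparse_step(variable, grad_values, grad_indices, state,
                          initial_lr=0.001, beta1=0.9, beta2=0.999,
                          epsilon=1e-8):
    """Lazy sparse Adam: only rows listed in `grad_indices` are touched.

    Assumes `grad_indices` contains no duplicates.
    """
    state["t"] += 1
    t = state["t"]
    lr_t = initial_lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    i = np.asarray(grad_indices)
    # Only the active rows of m and v are decayed and updated; a dense Adam
    # step would instead decay every row of both accumulators.
    state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * grad_values
    state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * grad_values ** 2
    out = variable.copy()
    out[i] -= lr_t * state["m"][i] / (np.sqrt(state["v"][i]) + epsilon)
    return out
```

This is the throughput win for embedding-style variables: the cost of a step scales with the number of active rows, not with the full variable size.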
For more information on the original Adam algorithm, please refer to this [paper](https://arxiv.org/abs/1412.6980) ([PDF](https://arxiv.org/pdf/1412.6980.pdf)).
Optimizer that implements the RMSProp optimization algorithm.
The RMSProp update is as follows:
```
rmsAcc = decay * rmsAcc + (1 - decay) * (gradient ^ 2)
momAcc = momentum * momAcc + learningRate * gradient / sqrt(rmsAcc + epsilon)
variable -= momAcc
```
This implementation of RMSProp uses plain momentum, not Nesterov momentum.
If the centered version is used, the algorithm additionally maintains a moving (discounted) average of the gradients, and uses that average to estimate the variance:
```
meanGradAcc = decay * meanGradAcc + (1 - decay) * gradient
rmsAcc = decay * rmsAcc + (1 - decay) * (gradient ^ 2)
momAcc = momentum * momAcc + learningRate * gradient / sqrt(rmsAcc - (meanGradAcc ^ 2) + epsilon)
variable -= momAcc
```
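Both variants can be sketched with a single NumPy function toggled by a `centered` flag; names and defaults are illustrative assumptions, not the library's API:

```python
import numpy as np

def rmsprop_step(variable, gradient, state, learning_rate=0.01,
                 decay=0.9, momentum=0.9, epsilon=1e-10, centered=False):
    """Apply one RMSProp update with plain (non-Nesterov) momentum."""
    state["rms"] = decay * state["rms"] + (1 - decay) * gradient ** 2
    denom = state["rms"]
    if centered:
        # Also track the mean gradient and subtract its square, so the
        # denominator estimates the gradient variance rather than the
        # uncentered second moment.
        state["mg"] = decay * state["mg"] + (1 - decay) * gradient
        denom = state["rms"] - state["mg"] ** 2
    state["mom"] = (momentum * state["mom"]
                    + learning_rate * gradient / np.sqrt(denom + epsilon))
    return variable - state["mom"]
```

The centered variant costs one extra accumulator per variable but can help when gradients have a large, slowly varying mean component.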
Optimizer that implements the YellowFin algorithm.
Please refer to [Zhang et al., 2017](https://arxiv.org/abs/1706.03471) for details.