Optimizer that implements the AMSGrad optimization algorithm, presented in [On the Convergence of Adam and Beyond](https://openreview.net/pdf?id=ryQu7f-RZ).
Initialization:
```
m_0 = 0      // Initialize the 1st moment vector
v_0 = 0      // Initialize the 2nd moment vector
v_hat_0 = 0  // Initialize the 2nd moment max vector
t = 0        // Initialize the time step
```
The AMSGrad update for step `t` is as follows:

```
learningRate_t = initialLearningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient * gradient
v_hat_t = max(v_t, v_hat_{t-1})
variable -= learningRate_t * m_t / (sqrt(v_hat_t) + epsilon)
```
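As a concrete illustration, the update rule can be sketched in NumPy. The function name, state layout, and hyper-parameter defaults here are illustrative assumptions, not the library's actual API:

```python
import numpy as np

def amsgrad_step(variable, gradient, state, initial_lr=0.001,
                 beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one AMSGrad update; `state` holds m, v, v_hat and the step count t."""
    state["t"] += 1
    t = state["t"]
    # Bias-corrected effective learning rate for step t.
    lr_t = initial_lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    state["m"] = beta1 * state["m"] + (1 - beta1) * gradient
    state["v"] = beta2 * state["v"] + (1 - beta2) * gradient * gradient
    # AMSGrad's key difference from Adam: keep the element-wise maximum
    # of all second-moment estimates seen so far.
    state["v_hat"] = np.maximum(state["v"], state["v_hat"])
    return variable - lr_t * state["m"] / (np.sqrt(state["v_hat"]) + epsilon)
```

Because `v_hat` never decreases, the effective per-coordinate step size is non-increasing, which is the property the convergence analysis in the paper relies on.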
The default value of `1e-8` for `epsilon` might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is `1.0` or `0.1`.
The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of `tf.gather` or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (`beta1`) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations, which ignore momentum unless a variable slice was actually used).
For more information on this algorithm, please refer to this [paper](https://openreview.net/pdf?id=ryQu7f-RZ).
Optimizer that implements the AdaDelta optimization algorithm.
The AdaDelta update is as follows:
```
accumulator = rho * accumulator + (1 - rho) * gradient * gradient
update = sqrt(accumulatorUpdate + epsilon) * rsqrt(accumulator + epsilon) * gradient
accumulatorUpdate = rho * accumulatorUpdate + (1 - rho) * square(update)
variable -= update
```
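A minimal NumPy sketch of this update, with illustrative names and defaults (not the library's API), may make the accumulator bookkeeping clearer:

```python
import numpy as np

def adadelta_step(variable, gradient, state, rho=0.95, epsilon=1e-6):
    """Apply one AdaDelta update; `state` holds the two accumulators."""
    # Running average of squared gradients.
    state["acc"] = rho * state["acc"] + (1 - rho) * gradient * gradient
    # The step is scaled by the ratio of the RMS of past updates to the
    # RMS of past gradients, so no explicit learning rate is required.
    update = (np.sqrt(state["acc_update"] + epsilon)
              / np.sqrt(state["acc"] + epsilon) * gradient)
    # Running average of squared updates, using the update just computed.
    state["acc_update"] = rho * state["acc_update"] + (1 - rho) * update * update
    return variable - update
```

Note that `update` is computed from the *previous* value of the update accumulator, which is then refreshed afterwards, matching the order of the pseudocode above.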
For more information on this algorithm, please refer to this [paper](http://arxiv.org/abs/1212.5701) ([PDF](http://arxiv.org/pdf/1212.5701v1.pdf)).
Optimizer that implements the AdaGrad optimization algorithm.
The AdaGrad update is as follows:

```
accumulator += gradient * gradient
variable -= stepSize * gradient * (1 / sqrt(accumulator))
```
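The same update can be sketched in NumPy. The names and the small `epsilon` safety term are illustrative additions, not part of the pseudocode above:

```python
import numpy as np

def adagrad_step(variable, gradient, state, step_size=0.01, epsilon=1e-8):
    """Apply one AdaGrad update; `state["acc"]` accumulates squared gradients."""
    state["acc"] += gradient * gradient
    # epsilon guards against division by zero when the accumulator is
    # still zero for some coordinate; the pseudocode above omits it.
    return variable - step_size * gradient / (np.sqrt(state["acc"]) + epsilon)
```

Since the accumulator only grows, coordinates that receive frequent or large gradients see their effective step size shrink over time.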
For more information on this algorithm, please refer to this [paper](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf).
Optimizer that implements the Adam optimization algorithm.
Initialization:
```
m_0 = 0  // Initialize the 1st moment vector
v_0 = 0  // Initialize the 2nd moment vector
t = 0    // Initialize the time step
```
The Adam update for step `t` is as follows:

```
learningRate_t = initialLearningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient * gradient
variable -= learningRate_t * m_t / (sqrt(v_t) + epsilon)
```
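A NumPy sketch of one Adam step, using this "epsilon hat" formulation where the bias corrections are folded into the learning rate (names and defaults are illustrative assumptions):

```python
import numpy as np

def adam_step(variable, gradient, state, initial_lr=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam update; `state` holds m, v and the step count t."""
    state["t"] += 1
    t = state["t"]
    # Bias corrections folded into the learning rate ("epsilon hat" form).
    lr_t = initial_lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    state["m"] = beta1 * state["m"] + (1 - beta1) * gradient
    state["v"] = beta2 * state["v"] + (1 - beta2) * gradient * gradient
    return variable - lr_t * state["m"] / (np.sqrt(state["v"]) + epsilon)
```

With a negligible `epsilon`, the magnitude of the very first step is close to `initial_lr` regardless of the gradient's scale, since `m` and `sqrt(v)` are both proportional to the gradient at `t = 1`.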
The default value of `1e-8` for `epsilon` might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is `1.0` or `0.1`. Note that since the Adam optimizer uses the formulation just before Section 2.1 of the [Kingma and Ba paper](https://arxiv.org/abs/1412.6980) rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.
The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of `tf.gather` or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (`beta1`) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations, which ignore momentum unless a variable slice was actually used).
For more information on this algorithm, please refer to this [paper](https://arxiv.org/abs/1412.6980) ([PDF](https://arxiv.org/pdf/1412.6980.pdf)).
Optimizer that implements the gradient descent algorithm and includes support for learning rate decay, momentum, and Nesterov acceleration.
Optimizer that implements a variant of the AMSGrad optimization algorithm that handles sparse updates more efficiently.
The original AMSGrad algorithm maintains three moving-average accumulators for each trainable variable; the accumulators are updated at every step. This class provides lazier handling of gradient updates for sparse variables: it only updates the moving-average accumulators for sparse variable indices that appear in the current batch, rather than updating the accumulators for all indices. Compared with the original AMSGrad optimizer, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original AMSGrad algorithm, and may lead to different empirical results.
Initialization:
```
m_0 = 0      // Initialize the 1st moment vector
v_0 = 0      // Initialize the 2nd moment vector
v_hat_0 = 0  // Initialize the 2nd moment max vector
t = 0        // Initialize the time step
```
The AMSGrad update for step `t` is as follows:

```
learningRate_t = initialLearningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient * gradient
v_hat_t = max(v_t, v_hat_{t-1})
variable -= learningRate_t * m_t / (sqrt(v_hat_t) + epsilon)
```
The default value of `1e-8` for `epsilon` might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is `1.0` or `0.1`.
The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of `tf.gather` or an embedding lookup in the forward pass) does not apply momentum to variable slices if they are not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (`beta1`) is also not applied to the entire momentum accumulator. This means that the sparse behavior is not equivalent to the dense behavior.
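The index-restricted update described above can be sketched in NumPy; the function name and state layout are illustrative, and unique indices per batch are assumed:

```python
import numpy as np

def lazy_amsgrad_sparse_step(variable, grad_values, grad_indices, state,
                             initial_lr=0.001, beta1=0.9, beta2=0.999,
                             epsilon=1e-8):
    """Lazy sparse AMSGrad: only rows listed in `grad_indices` are touched.

    Assumes `grad_indices` contains no duplicates.
    """
    state["t"] += 1
    t = state["t"]
    lr_t = initial_lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    i = np.asarray(grad_indices)
    # Accumulators are updated (and decayed) only at the active rows; rows
    # absent from the batch keep stale m, v, and v_hat values.
    state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * grad_values
    state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * grad_values ** 2
    state["v_hat"][i] = np.maximum(state["v"][i], state["v_hat"][i])
    out = variable.copy()
    out[i] -= lr_t * state["m"][i] / (np.sqrt(state["v_hat"][i]) + epsilon)
    return out
```

Rows that never appear in a batch are never decayed, which is exactly the semantic difference from the dense behavior noted above.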
For more information on this algorithm, please refer to this [paper](https://openreview.net/pdf?id=ryQu7f-RZ).
Optimizer that implements a variant of the Adam optimization algorithm that handles sparse updates more efficiently.
The original Adam algorithm maintains two moving-average accumulators for each trainable variable; the accumulators are updated at every step. This class provides lazier handling of gradient updates for sparse variables. It only updates the moving-average accumulators for sparse variable indices that appear in the current batch, rather than updating the accumulators for all indices. Compared with the original Adam optimizer, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original Adam algorithm, and may lead to different empirical results.
Initialization:
```
m_0 = 0  // Initialize the 1st moment vector
v_0 = 0  // Initialize the 2nd moment vector
t = 0    // Initialize the time step
```
The Adam update for step `t` is as follows:

```
learningRate_t = initialLearningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient * gradient
variable -= learningRate_t * m_t / (sqrt(v_t) + epsilon)
```
The default value of `1e-8` for `epsilon` might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is `1.0` or `0.1`. Note that since the Adam optimizer uses the formulation just before Section 2.1 of the [Kingma and Ba paper](https://arxiv.org/abs/1412.6980) rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.
The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of `tf.gather` or an embedding lookup in the forward pass) does not apply momentum to variable slices if they are not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (`beta1`) is also not applied to the entire momentum accumulator. This means that the sparse behavior is not equivalent to the dense behavior.
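A NumPy sketch of this lazy sparse behavior (function name and state layout are illustrative assumptions; unique indices per batch are assumed):

```python
import numpy as np

def lazy_adam_sparse_step(variable, grad_values, grad_indices, state,
                          initial_lr=0.001, beta1=0.9, beta2=0.999,
                          epsilon=1e-8):
    """Lazy sparse Adam: only rows listed in `grad_indices` are touched.

    Assumes `grad_indices` contains no duplicates.
    """
    state["t"] += 1
    t = state["t"]
    lr_t = initial_lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    i = np.asarray(grad_indices)
    # Only the active rows of m and v are decayed and updated; a dense Adam
    # step would instead decay every row of both accumulators.
    state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * grad_values
    state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * grad_values ** 2
    out = variable.copy()
    out[i] -= lr_t * state["m"][i] / (np.sqrt(state["v"][i]) + epsilon)
    return out
```

This is the throughput win for embedding-style variables: the cost of a step scales with the number of active rows, not with the full variable size.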
For more information on the original Adam algorithm, please refer to this [paper](https://arxiv.org/abs/1412.6980) ([PDF](https://arxiv.org/pdf/1412.6980.pdf)).
Optimizer that implements the RMSProp optimization algorithm.
The RMSProp update is as follows:
```
rmsAcc = decay * rmsAcc + (1 - decay) * (gradient ^ 2)
momAcc = momentum * momAcc + learningRate * gradient / sqrt(rmsAcc + epsilon)
variable -= momAcc
```
This implementation of RMSProp uses plain momentum, not Nesterov momentum.
If the centered version is used, the algorithm additionally maintains a moving (discounted) average of the gradients, and uses that average to estimate the variance:
```
meanGradAcc = decay * meanGradAcc + (1 - decay) * gradient
rmsAcc = decay * rmsAcc + (1 - decay) * (gradient ^ 2)
momAcc = momentum * momAcc + learningRate * gradient / sqrt(rmsAcc - (meanGradAcc ^ 2) + epsilon)
variable -= momAcc
```
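Both variants can be sketched with a single NumPy function toggled by a `centered` flag; names and defaults are illustrative assumptions, not the library's API:

```python
import numpy as np

def rmsprop_step(variable, gradient, state, learning_rate=0.01,
                 decay=0.9, momentum=0.9, epsilon=1e-10, centered=False):
    """Apply one RMSProp update with plain (non-Nesterov) momentum."""
    state["rms"] = decay * state["rms"] + (1 - decay) * gradient ** 2
    denom = state["rms"]
    if centered:
        # Also track the mean gradient and subtract its square, so the
        # denominator estimates the gradient variance rather than the
        # uncentered second moment.
        state["mg"] = decay * state["mg"] + (1 - decay) * gradient
        denom = state["rms"] - state["mg"] ** 2
    state["mom"] = (momentum * state["mom"]
                    + learning_rate * gradient / np.sqrt(denom + epsilon))
    return variable - state["mom"]
```

The centered variant costs one extra accumulator per variable but can help when gradients have a large, slowly varying mean component.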
Optimizer that implements the YellowFin algorithm.
Please refer to [Zhang et al., 2017](https://arxiv.org/abs/1706.03471) for details.