Package org.platanios.tensorflow.api.ops.training.optimizers

package optimizers

Linear Supertypes: AnyRef, Any

Type Members

  1. class AMSGrad extends Optimizer

    Optimizer that implements the AMSGrad optimization algorithm, presented in [On the Convergence of Adam and Beyond](https://openreview.net/pdf?id=ryQu7f-RZ).

    Initialization:

    m_0 = 0     // Initialize the 1st moment vector
    v_0 = 0     // Initialize the 2nd moment vector
    v_hat_0 = 0 // Initialize the 2nd moment max vector
    t = 0       // Initialize the time step

    The AMSGrad update for step t is as follows:

    learningRate_t = initialLearningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
    m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
    v_t = beta2 * v_{t-1} + (1 - beta2) * gradient * gradient
    v_hat_t = max(v_t, v_hat_{t-1})
    variable -= learningRate_t * m_t / (sqrt(v_hat_t) + epsilon)

    The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.

    The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).

    For more information on this algorithm, please refer to this [paper](https://openreview.net/pdf?id=ryQu7f-RZ).
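
    The following is a minimal scalar sketch of the update rules above, written in plain Scala rather than against this package's Optimizer API; the object, method, and parameter names (and the default hyperparameter values) are illustrative only:

    object AMSGradSketch {
      // Optimizer state for a single scalar parameter.
      final case class State(m: Double = 0.0, v: Double = 0.0, vHat: Double = 0.0, t: Int = 0)

      def step(
          variable: Double, gradient: Double, state: State,
          initialLearningRate: Double = 0.001, beta1: Double = 0.9,
          beta2: Double = 0.999, epsilon: Double = 1e-8
      ): (Double, State) = {
        val t = state.t + 1
        // Bias-corrected learning rate for step t.
        val learningRate = initialLearningRate * math.sqrt(1 - math.pow(beta2, t)) / (1 - math.pow(beta1, t))
        val m = beta1 * state.m + (1 - beta1) * gradient
        val v = beta2 * state.v + (1 - beta2) * gradient * gradient
        // AMSGrad's one change relative to Adam: keep the running maximum of v.
        val vHat = math.max(v, state.vHat)
        (variable - learningRate * m / (math.sqrt(vHat) + epsilon), State(m, v, vHat, t))
      }
    }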

  2. class AdaDelta extends Optimizer

    Optimizer that implements the AdaDelta optimization algorithm.

    The AdaDelta update is as follows:

    accumulator = rho * accumulator + (1 - rho) * square(gradient)
    update = sqrt(accumulatorUpdate + epsilon) * rsqrt(accumulator + epsilon) * gradient
    accumulatorUpdate = rho * accumulatorUpdate + (1 - rho) * square(update)
    variable -= update

    For more information on this algorithm, please refer to this [paper](http://arxiv.org/abs/1212.5701) ([PDF](http://arxiv.org/pdf/1212.5701v1.pdf)).
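
    As a rough illustration of these rules (and not of this class's actual API), here is a scalar Scala sketch; note that rsqrt(x) is simply 1 / sqrt(x), and the names and default values below are assumptions:

    object AdaDeltaSketch {
      final case class State(accumulator: Double = 0.0, accumulatorUpdate: Double = 0.0)

      def step(
          variable: Double, gradient: Double, state: State,
          rho: Double = 0.95, epsilon: Double = 1e-8
      ): (Double, State) = {
        // Decaying average of squared gradients.
        val accumulator = rho * state.accumulator + (1 - rho) * gradient * gradient
        // Step scaled by the ratio of the two RMS estimates (no global learning rate needed).
        val update = math.sqrt(state.accumulatorUpdate + epsilon) / math.sqrt(accumulator + epsilon) * gradient
        // Decaying average of squared updates.
        val accumulatorUpdate = rho * state.accumulatorUpdate + (1 - rho) * update * update
        (variable - update, State(accumulator, accumulatorUpdate))
      }
    }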

  3. class AdaFactor extends AnyRef

  4. class AdaGrad extends Optimizer

    Optimizer that implements the AdaGrad optimization algorithm.

    The AdaGrad update is as follows:

    accumulator += gradient * gradient
    variable -= stepSize * gradient * (1 / sqrt(accumulator))

    For more information on this algorithm, please refer to this [paper](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf).
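
    A scalar Scala sketch of this update (illustrative only; the stepSize default is an assumption, and in practice the accumulator is usually initialized to a small positive constant so the division is well defined):

    object AdaGradSketch {
      def step(
          variable: Double, gradient: Double, accumulator: Double,
          stepSize: Double = 0.01
      ): (Double, Double) = {
        // Accumulate the squared gradients seen so far.
        val newAccumulator = accumulator + gradient * gradient
        // The effective per-parameter learning rate shrinks as the accumulator grows.
        (variable - stepSize * gradient / math.sqrt(newAccumulator), newAccumulator)
      }
    }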

  5. class Adam extends Optimizer

    Optimizer that implements the Adam optimization algorithm.

    Initialization:

    m_0 = 0  // Initialize the 1st moment vector
    v_0 = 0  // Initialize the 2nd moment vector
    t = 0    // Initialize the time step

    The Adam update for step t is as follows:

    learningRate_t = initialLearningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
    m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
    v_t = beta2 * v_{t-1} + (1 - beta2) * gradient * gradient
    variable -= learningRate_t * m_t / (sqrt(v_t) + epsilon)

    The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since the Adam optimizer uses the formulation just before Section 2.1 of the [Kingma and Ba paper](https://arxiv.org/abs/1412.6980) rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.

    The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of tf.gather or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).

    For more information on this algorithm, please refer to this [paper](https://arxiv.org/abs/1412.6980) ([PDF](https://arxiv.org/pdf/1412.6980.pdf)).
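
    For concreteness, here is a vectorized Scala sketch of one Adam step over an Array[Double] parameter; it is not this class's API, and the names and default values are illustrative:

    object AdamSketch {
      final case class State(m: Array[Double], v: Array[Double], t: Int)

      def step(
          variable: Array[Double], gradient: Array[Double], state: State,
          initialLearningRate: Double = 0.001, beta1: Double = 0.9,
          beta2: Double = 0.999, epsilon: Double = 1e-8
      ): (Array[Double], State) = {
        val t = state.t + 1
        // The bias correction is folded into the step size, which is why the epsilon used
        // here corresponds to the paper's "epsilon hat" rather than Algorithm 1's epsilon.
        val learningRate = initialLearningRate * math.sqrt(1 - math.pow(beta2, t)) / (1 - math.pow(beta1, t))
        val m = Array.tabulate(variable.length)(i => beta1 * state.m(i) + (1 - beta1) * gradient(i))
        val v = Array.tabulate(variable.length)(i => beta2 * state.v(i) + (1 - beta2) * gradient(i) * gradient(i))
        val updated = Array.tabulate(variable.length)(i => variable(i) - learningRate * m(i) / (math.sqrt(v(i)) + epsilon))
        (updated, State(m, v, t))
      }
    }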

  6. class GradientDescent extends Optimizer

    Optimizer that implements the gradient descent algorithm and includes support for learning rate decay, momentum, and Nesterov acceleration.
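
    The exact update is not spelled out above, but the accumulator form of momentum and Nesterov momentum commonly used by TensorFlow-style optimizers looks roughly as follows (a hedged Scala sketch, not this class's implementation):

    object MomentumSketch {
      // One step of gradient descent with optional (Nesterov) momentum for a scalar parameter.
      def step(
          variable: Double, gradient: Double, accumulator: Double,
          learningRate: Double, momentum: Double, useNesterov: Boolean
      ): (Double, Double) = {
        val newAccumulator = momentum * accumulator + gradient
        val updated =
          if (useNesterov) variable - learningRate * (gradient + momentum * newAccumulator)
          else variable - learningRate * newAccumulator
        (updated, newAccumulator)
      }
    }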

  7. class LazyAMSGrad extends AMSGrad

    Optimizer that implements a variant of the AMSGrad optimization algorithm that handles sparse updates more efficiently.

    The original AMSGrad algorithm maintains three moving-average accumulators for each trainable variable; the accumulators are updated at every step. This class provides lazier handling of gradient updates for sparse variables. It only updates the moving-average accumulators for sparse variable indices that appear in the current batch, rather than updating the accumulators for all indices. Compared with the original AMSGrad optimizer, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original AMSGrad algorithm, and may lead to different empirical results.

    Initialization:

    m_0 = 0     // Initialize the 1st moment vector
    v_0 = 0     // Initialize the 2nd moment vector
    v_hat_0 = 0 // Initialize the 2nd moment max vector
    t = 0       // Initialize the time step

    The AMSGrad update for step t is as follows:

    learningRate_t = initialLearningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
    m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
    v_t = beta2 * v_{t-1} + (1 - beta2) * gradient * gradient
    v_hat_t = max(v_t, v_hat_{t-1})
    variable -= learningRate_t * m_t / (sqrt(v_hat_t) + epsilon)

    The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.

    The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of tf.gather or an embedding lookup in the forward pass) does not apply momentum to variable slices if they are not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also not applied to the entire momentum accumulator. This means that the sparse behavior is not equivalent to the dense behavior.

    For more information on this algorithm, please refer to this [paper](https://openreview.net/pdf?id=ryQu7f-RZ).
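
    The dense-versus-lazy difference can be sketched in Scala for the first-moment accumulator of an embedding-style variable whose sparse gradient arrives as (indices, values); each row is reduced to a single Double for brevity, and everything here is illustrative rather than the library's code:

    object LazySparseSketch {
      // Dense-equivalent behavior: momentum decay touches every row of m,
      // including rows whose gradient for this step is zero.
      def denseMomentumUpdate(m: Array[Double], indices: Array[Int], values: Array[Double], beta1: Double): Unit = {
        for (i <- m.indices) m(i) *= beta1
        for (k <- indices.indices) m(indices(k)) += (1 - beta1) * values(k)
      }

      // Lazy behavior: only rows that appear in the current batch are read or written;
      // all other accumulator rows are left exactly as they were.
      def lazyMomentumUpdate(m: Array[Double], indices: Array[Int], values: Array[Double], beta1: Double): Unit =
        for (k <- indices.indices)
          m(indices(k)) = beta1 * m(indices(k)) + (1 - beta1) * values(k)
    }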

  8. class LazyAdam extends Adam

    Optimizer that implements a variant of the Adam optimization algorithm that handles sparse updates more efficiently.

    The original Adam algorithm maintains two moving-average accumulators for each trainable variable; the accumulators are updated at every step. This class provides lazier handling of gradient updates for sparse variables. It only updates the moving-average accumulators for sparse variable indices that appear in the current batch, rather than updating the accumulators for all indices. Compared with the original Adam optimizer, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original Adam algorithm, and may lead to different empirical results.

    Initialization:

    m_0 = 0  // Initialize the 1st moment vector
    v_0 = 0  // Initialize the 2nd moment vector
    t = 0    // Initialize the time step

    The Adam update for step t is as follows:

    learningRate_t = initialLearningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
    m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
    v_t = beta2 * v_{t-1} + (1 - beta2) * gradient * gradient
    variable -= learningRate_t * m_t / (sqrt(v_t) + epsilon)

    The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since the Adam optimizer uses the formulation just before Section 2.1 of the [Kingma and Ba paper](https://arxiv.org/abs/1412.6980) rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.

    The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of tf.gather or an embedding lookup in the forward pass) does not apply momentum to variable slices if they are not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also not applied to the entire momentum accumulator. This means that the sparse behavior is not equivalent to the dense behavior.

    For more information on the original Adam algorithm, please refer to this [paper](https://arxiv.org/abs/1412.6980) ([PDF](https://arxiv.org/pdf/1412.6980.pdf)).

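    To make the lazy sparse step concrete, here is an illustrative Scala sketch that applies the Adam update only to the rows named by an (indices, values) gradient of an embedding-style variable; the names, shapes, and defaults are assumptions, not the library's API:

    object LazyAdamSketch {
      def lazyStep(
          variable: Array[Array[Double]], m: Array[Array[Double]], v: Array[Array[Double]],
          indices: Array[Int], values: Array[Array[Double]], t: Int, // t is the 1-based global step.
          initialLearningRate: Double = 0.001, beta1: Double = 0.9,
          beta2: Double = 0.999, epsilon: Double = 1e-8
      ): Unit = {
        val learningRate = initialLearningRate * math.sqrt(1 - math.pow(beta2, t)) / (1 - math.pow(beta1, t))
        for (k <- indices.indices) {
          val row = indices(k)
          // Rows not listed in `indices` are never read or written during this step.
          for (j <- variable(row).indices) {
            val g = values(k)(j)
            m(row)(j) = beta1 * m(row)(j) + (1 - beta1) * g
            v(row)(j) = beta2 * v(row)(j) + (1 - beta2) * g * g
            variable(row)(j) -= learningRate * m(row)(j) / (math.sqrt(v(row)(j)) + epsilon)
          }
        }
      }
    }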

  9. trait Optimizer extends AnyRef

  10. class RMSProp extends Optimizer

    Optimizer that implements the RMSProp optimization algorithm.

    The RMSProp update is as follows:

    rmsAcc = decay * rmsAcc + (1 - decay) * (gradient ^ 2)
    momAcc = momentum * momAcc + learningRate * gradient / sqrt(rmsAcc + epsilon)
    variable -= momAcc

    This implementation of RMSProp uses plain momentum, not Nesterov momentum.

    If the centered version is used, the algorithm additionally maintains a moving (discounted) average of the gradients, and uses that average to estimate the variance:

    meanGradAcc = decay * meanGradAcc + (1 - decay) * gradient
    rmsAcc = decay * rmsAcc + (1 - decay) * (gradient ^ 2)
    momAcc = momentum * momAcc + learningRate * gradient / sqrt(rmsAcc - (meanGradAcc ^ 2) + epsilon)
    variable -= momAcc
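
    Both variants can be summarized in a scalar Scala sketch (illustrative only; the parameter names and default values below, including the epsilon, are assumptions rather than this class's defaults):

    object RMSPropSketch {
      final case class State(rmsAcc: Double = 0.0, momAcc: Double = 0.0, meanGradAcc: Double = 0.0)

      def step(
          variable: Double, gradient: Double, state: State,
          learningRate: Double = 0.001, decay: Double = 0.9,
          momentum: Double = 0.0, epsilon: Double = 1e-10, centered: Boolean = false
      ): (Double, State) = {
        val rmsAcc = decay * state.rmsAcc + (1 - decay) * gradient * gradient
        val meanGradAcc =
          if (centered) decay * state.meanGradAcc + (1 - decay) * gradient else state.meanGradAcc
        // The centered variant subtracts the squared mean gradient to estimate the gradient variance.
        val denominator = if (centered) rmsAcc - meanGradAcc * meanGradAcc + epsilon else rmsAcc + epsilon
        val momAcc = momentum * state.momAcc + learningRate * gradient / math.sqrt(denominator)
        (variable - momAcc, State(rmsAcc, momAcc, meanGradAcc))
      }
    }
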
  11. class YellowFin extends GradientDescent

    Optimizer that implements the YellowFin algorithm.

    Please refer to [Zhang et al., 2017](https://arxiv.org/abs/1706.03471) for details.

Value Members

  1. object AMSGrad

  2. object AdaDelta

  3. object AdaGrad

  4. object Adam

  5. object GradientDescent

  6. object LazyAMSGrad

  7. object LazyAdam

  8. object RMSProp

  9. object YellowFin

  10. package schedules
