org.platanios.tensorflow.api.ops.training.optimizers
Learning rate. Must be > 0. If used with decay, then this argument specifies the initial value of the learning rate.
Learning rate decay method to use for each update.
Exponential decay rate for the first moment estimates.
Exponential decay rate for the second moment estimates.
If true, Nesterov momentum is used for the updates.
Small constant used for numerical stability. This epsilon corresponds to "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), and not to the epsilon in Algorithm 1 of the paper.
If true, the gradient descent updates will be protected by a lock. Otherwise, the behavior is undefined, but may exhibit less contention.
Optional summary tag name to use for the learning rate value. If null, no summary is created for the learning rate. Otherwise, a scalar summary is created which can be monitored using TensorBoard.
Name for this optimizer.
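The epsilon parameter above corresponds to "epsilon hat" in the Kingma and Ba paper. The two formulations are equivalent when epsilon hat is rescaled per step; the following plain-Python sketch (illustrative only, not the Scala API) checks that numerically:

```python
import math

# Algorithm 1 of Kingma & Ba: bias-correct m and v, then divide by
# (sqrt(v_hat) + epsilon).
def update_algorithm1(m, v, t, lr, beta1, beta2, eps):
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Formulation just before Section 2.1: fold the bias corrections into the
# learning rate and use "epsilon hat" in the denominator.
def update_epsilon_hat(m, v, t, lr, beta1, beta2, eps_hat):
    lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    return lr_t * m / (math.sqrt(v) + eps_hat)

m, v, t = 0.05, 0.002, 7
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
# The two updates coincide when eps_hat = eps * sqrt(1 - beta2^t).
eps_hat = eps * math.sqrt(1 - beta2 ** t)
a = update_algorithm1(m, v, t, lr, beta1, beta2, eps)
b = update_epsilon_hat(m, v, t, lr, beta1, beta2, eps_hat)
assert abs(a - b) < 1e-12
```

This is why a fixed epsilon here is not the same constant as a fixed epsilon in Algorithm 1 of the paper.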
Applies the updates corresponding to the provided gradient, to the provided variable.
Gradient tensor.
Variable.
Option containing current iteration in the optimization loop, if one has been provided.
Created op that applies the provided gradient to the provided variable.
Creates an op that applies the provided gradients to the provided variables.
Sequence with gradient-variable pairs.
Optional Variable to increment by one after the variables have been updated.
Name for the created op.
Created op.
Applies the updates corresponding to the provided gradient, to the provided variable.
The OutputIndexedSlices object specified by gradient in this function is by default pre-processed in applySparseDuplicateIndices to remove duplicate indices (refer to that function's documentation for details). Optimizers which can tolerate or have correct special cases for duplicate sparse indices may override applySparseDuplicateIndices instead of this function, avoiding that overhead.
Gradient tensor.
Variable.
Option containing current iteration in the optimization loop, if one has been provided.
Created op that applies the provided gradient to the provided variable.
Applies the updates corresponding to the provided gradient (with potentially duplicate indices), to the provided variable.
Optimizers which override this method must deal with OutputIndexedSlices objects such as the following: OutputIndexedSlices(indices=[0, 0], values=[1, 1], denseShape=[1]), which contain duplicate indices. The correct interpretation in that case should be: OutputIndexedSlices(values=[2], indices=[0], denseShape=[1]).
Many optimizers deal incorrectly with repeated indices when updating based on sparse gradients (e.g. summing squares rather than squaring the sum, or applying momentum terms multiple times). Adding first is always the correct behavior, so this is enforced here by reconstructing the OutputIndexedSlices to have only unique indices, and then calling applySparse.
Optimizers which deal correctly with repeated indices may instead override this method to avoid the induced overhead.
Gradient tensor.
Variable.
Option containing current iteration in the optimization loop, if one has been provided.
Created op that applies the provided gradient to the provided variable.
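The de-duplication step described above (summing the values of repeated indices before applying the update) can be sketched in plain Python; names here are illustrative, not the Scala API:

```python
# Collapse an indexed-slices gradient with possibly repeated indices into
# one with unique indices, summing the values of duplicates first.
def dedup_indexed_slices(indices, values):
    summed = {}
    for i, v in zip(indices, values):
        summed[i] = summed.get(i, 0) + v
    unique = sorted(summed)
    return unique, [summed[i] for i in unique]

# indices=[0, 0], values=[1, 1] collapses to indices=[0], values=[2].
indices, values = dedup_indexed_slices([0, 0], [1, 1])
assert indices == [0] and values == [2]

# Why summing first matters: an accumulator that squares the gradient must
# square the sum of duplicates, not sum their squares.
assert (1 + 1) ** 2 != 1 ** 2 + 1 ** 2
```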
Exponential decay rate for the first moment estimates.
Exponential decay rate for the second moment estimates.
Computes the gradients of loss with respect to the variables in variables, if provided, otherwise with respect to all the trainable variables in the graph where loss is defined.
Loss value whose gradients will be computed.
Optional gradients to back-propagate for loss.
Optional list of variables for which to compute the gradients. Defaults to the set of trainable variables in the graph where loss is defined.
Gating method for the gradients computation.
Aggregation method used to combine gradient terms.
Boolean value indicating whether to colocate the gradient ops with the original ops.
Sequence of gradient-variable pairs.
Create all slots needed by this optimizer.
Learning rate decay method to use for each update.
Small constant used for numerical stability.
Creates an op that finishes the gradients application. This function is called from within an op creation context that uses as its name scope the name that users have chosen for the application of gradients.
Set of ops needed to apply the gradients and update the variable values.
Name scope to use for all the ops created by this function.
Created op output.
Gets a non-slot variable that has been added to this optimizer (or throws an error if no such non-slot variable could be found in this optimizer).
Variable name.
Graph in which the variable is defined.
Obtained non-slot variable.
Gets all the non-slot variables that have been added to this optimizer.
Gets or creates (and adds to this optimizer) a non-slot variable.
Variable name.
Variable initial value.
Set of colocation ops for the non-slot variable.
Created non-slot variable.
Gets an existing slot.
Slot name.
Slot primary variable.
Requested slot variable, or null if it cannot be found.
Gets an existing slot or creates a new one if none exists, for the provided arguments.
Slot name.
Slot primary variable.
Slot variable initializer.
Slot variable shape.
Slot variable data type.
Name to use when scoping the variable that needs to be created for the slot.
Requested slot variable.
Boolean value indicating whether to ignore duplicate indices during sparse updates.
Learning rate.
Optional summary tag name to use for the learning rate value.
Creates an op that makes a step towards minimizing loss by updating the values of the variables in variables.
This method simply combines calls to computeGradients and applyGradients. If you want to process the gradients before applying them, call computeGradients and applyGradients explicitly instead of using this method.
Loss value whose gradients will be computed.
Optional gradients to back-propagate for loss.
Optional list of variables for which to compute the gradients. Defaults to the set of trainable variables in the graph where loss is defined.
Gating method for the gradients computation.
Aggregation method used to combine gradient terms.
Boolean value indicating whether to colocate the gradient ops with the original ops.
Optional Variable to increment by one after the variables have been updated.
Name for the created op.
Created op.
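Since minimize just chains computeGradients and applyGradients, the pattern can be sketched framework-free in plain Python; all names below are illustrative, not the Scala API signatures:

```python
# Compute (gradient, variable) pairs for each variable.
def compute_gradients(grad_fn, variables):
    return [(grad_fn(v), v) for v in variables]

# Apply a simple gradient descent update to each pair.
def apply_gradients(grads_and_vars, learning_rate):
    return [v - learning_rate * g for g, v in grads_and_vars]

# minimize is the two steps back to back, with no opportunity to
# process the gradients in between.
def minimize(grad_fn, variables, learning_rate):
    return apply_gradients(compute_gradients(grad_fn, variables),
                           learning_rate)

# To process gradients (e.g. clip them), call the two steps explicitly:
variables = [2.0, -3.0]
grad_fn = lambda v: 4.0 * v            # gradient of 2 * v^2
grads_and_vars = compute_gradients(grad_fn, variables)
clipped = [(max(min(g, 1.0), -1.0), v) for g, v in grads_and_vars]
updated = apply_gradients(clipped, learning_rate=0.1)
```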
Name for this optimizer.
Contains variables used by some optimizers that require no slots to be stored.
Creates all necessary tensors before applying the gradients.
Returns the names of all slots used by this optimizer.
Some Optimizer subclasses use additional variables.
Supported data types for the loss function, the variables, and the gradients. Subclasses should override this field to allow other float types.
If true, the gradient descent updates will be protected by a lock.
If true, Nesterov momentum is used for the updates.
Returns a sequence of variables which encode the current state of this optimizer. The returned variables include both slot variables and non-slot global variables created by this optimizer, in the current graph.
Gets an existing slot or creates a new one using an initial value of zeros, if none exists.
Slot name.
Slot primary variable.
Name to use when scoping the variable that needs to be created for the slot.
Requested slot variable.
Optimizer that implements a variant of the Adam optimization algorithm that handles sparse updates more efficiently.
The original Adam algorithm maintains two moving-average accumulators for each trainable variable; the accumulators are updated at every step. This class provides lazier handling of gradient updates for sparse variables. It only updates the moving-average accumulators for sparse variable indices that appear in the current batch, rather than updating the accumulators for all indices. Compared with the original Adam optimizer, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original Adam algorithm, and may lead to different empirical results.
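The difference between the lazy and the dense accumulator handling can be sketched in plain Python (illustrative only, not the Scala API): the lazy variant only touches accumulator rows whose indices appear in the current sparse gradient, while dense Adam decays every row on every step.

```python
beta1 = 0.9
m = [1.0, 1.0, 1.0]                  # first-moment accumulator, 3 rows

def dense_step(m, grad):             # grad is a dense list, one entry per row
    return [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]

def lazy_step(m, indices, values):   # sparse gradient touches only some rows
    out = list(m)
    for i, g in zip(indices, values):
        out[i] = beta1 * out[i] + (1 - beta1) * g
    return out

# A sparse gradient touching only row 0:
dense = dense_step(m, [0.5, 0.0, 0.0])
lazy = lazy_step(m, [0], [0.5])
assert dense[0] == lazy[0]           # the touched row matches
assert lazy[1] == 1.0                # untouched rows keep their value lazily...
assert dense[1] == beta1             # ...but are decayed by dense Adam
```

This is exactly why the two variants can produce different empirical results: the dense momentum decay applied to untouched rows is skipped by the lazy variant.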
Initialization:

m_0 = 0  (initialize the first moment accumulator)
v_0 = 0  (initialize the second moment accumulator)
t = 0    (initialize the time step)

The Adam update for step t is as follows:

t = t + 1
learningRate_t = learningRate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient^2
variable = variable - learningRate_t * m_t / (sqrt(v_t) + epsilon)

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is 1.0 or 0.1. Note that since the Adam optimizer uses the formulation just before Section 2.1 of the [Kingma and Ba paper](https://arxiv.org/abs/1412.6980) rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.

The sparse implementation of this algorithm (used when the gradient is an indexed slices object, typically because of tf.gather or an embedding lookup in the forward pass) does not apply momentum to variable slices if they are not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also not applied to the entire momentum accumulator. This means that the sparse behavior is not equivalent to the dense behavior.

For more information on the original Adam algorithm, please refer to this [paper](https://arxiv.org/abs/1412.6980) ([PDF](https://arxiv.org/pdf/1412.6980.pdf)).
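The dense Adam update described above can be exercised end to end in a few lines of plain Python (a sketch, not the Scala API), minimizing f(x) = x^2:

```python
import math

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
x, m, v = 3.0, 0.0, 0.0
for t in range(1, 501):
    g = 2.0 * x                                   # gradient of x^2
    m = beta1 * m + (1 - beta1) * g               # first moment
    v = beta2 * v + (1 - beta2) * g * g           # second moment
    lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    x -= lr_t * m / (math.sqrt(v) + eps)          # epsilon-hat formulation
assert abs(x) < 0.5                               # x has moved near the minimum
```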