Scheduling method helper for composing two existing learning rate scheduling methods.
The resulting learning rate is obtained by applying `schedule2` to the initial learning rate, and then `schedule1` to the result.
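As a minimal sketch (not the library's actual API), a schedule can be modeled as a function from an initial value and a step to a decayed value; composition then reduces to ordinary function composition. The `Schedule` type alias and `compose` helper below are illustrative assumptions:

```scala
// Sketch only: models a schedule as (value, step) => decayed value.
// The real library's types and signatures may differ.
type Schedule = (Double, Long) => Double

// Apply `schedule2` first, then `schedule1` to its result.
def compose(schedule1: Schedule, schedule2: Schedule): Schedule =
  (value, step) => schedule1(schedule2(value, step), step)
```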
Cosine decay method.
This method applies a cosine decay function to a provided initial learning rate (i.e., `value`). It requires a step value to be provided in its application function, in order to compute the decayed learning rate. You may simply pass a TensorFlow variable that you increment at each training step.
The decayed value is computed as follows:
cosineDecay = 0.5 * (1 + cos(pi * min(step, cycleSteps) / cycleSteps))
decayed = value * ((1 - alpha) * cosineDecay + alpha)
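For illustration, here is a hedged Scala sketch of this computation as a pure function (the function name and parameter defaults are assumptions; parameter names follow the formula above):

```scala
// Cosine decay as a pure function of the training step.
// `alpha` is the minimum fraction of `value` retained once the cycle ends.
def cosineDecay(value: Double, step: Long, cycleSteps: Long, alpha: Double = 0.0): Double = {
  val cosine = 0.5 * (1.0 + math.cos(math.Pi * math.min(step, cycleSteps).toDouble / cycleSteps))
  value * ((1.0 - alpha) * cosine + alpha)
}

// Example: with alpha = 0, the rate decays from 0.1 at step 0,
// through 0.05 at step 500, down to 0.0 at step 1000 and beyond.
// cosineDecay(0.1, 500, 1000) == 0.05
```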
Cycle-linear 10x decay method.
This method applies a cycle-linear decay function to a provided initial learning rate (i.e., `value`). It requires a step value to be provided in its application function, in order to compute the decayed learning rate. You may simply pass a TensorFlow variable that you increment at each training step.
The decayed value is computed as follows:
cyclePosition = 1 - abs(((step % (2 * cycleSteps)) - cycleSteps) / cycleSteps)
decayed = value * (0.1 + cyclePosition) * 3
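A hedged Scala sketch of the same computation (the function name is an assumption). Note that `cyclePosition` ramps linearly between 0 and 1 over each cycle of `2 * cycleSteps` steps, so the multiplier on `value` cycles between 0.3 and 3.3, roughly a 10x range:

```scala
// Cycle-linear 10x decay: the multiplier on `value` sweeps linearly
// between 0.3 and 3.3 over each cycle of 2 * cycleSteps steps.
def cycleLinear10xDecay(value: Double, step: Long, cycleSteps: Long): Double = {
  val cyclePosition =
    1.0 - math.abs(((step % (2 * cycleSteps)) - cycleSteps).toDouble / cycleSteps)
  value * (0.1 + cyclePosition) * 3.0
}
```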
Exponential decay method.
This method applies an exponential decay function to a provided initial learning rate (i.e., `value`). It requires a step value to be provided in its application function, in order to compute the decayed learning rate. You may simply pass a TensorFlow variable that you increment at each training step.
The decayed value is computed as follows:
decayed = value * decayRate ^ (step / decaySteps)
where if `staircase = true`, then `step / decaySteps` is an integer division and the decayed learning rate follows a staircase function.
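A hedged Scala sketch of this computation (the function name and parameter defaults are assumptions), showing how the staircase variant falls out of integer division:

```scala
// Exponential decay; with `staircase = true` the exponent uses integer
// division, so the rate drops in discrete jumps every `decaySteps` steps.
def exponentialDecay(value: Double, step: Long, decayRate: Double,
                     decaySteps: Long, staircase: Boolean = false): Double = {
  val exponent =
    if (staircase) (step / decaySteps).toDouble // integer (floor) division
    else step.toDouble / decaySteps             // smooth decay
  value * math.pow(decayRate, exponent)
}
```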
A particular instance of ExponentialDecay that was used in [Luong (2016)](https://github.com/lmthang/thesis).
Trait for implementing optimization learning rate scheduling methods.
When training a model, it is often recommended to lower the learning rate as training progresses. Scheduling methods serve that purpose: they define how the learning rate changes as a function of the training step.
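As a rough, simplified model of the idea (the actual trait in the library operates on tensor values rather than plain numbers, so this signature is an assumption for illustration only):

```scala
// Simplified sketch: a schedule maps the initial learning rate and the
// current training step to the scheduled learning rate.
trait Schedule {
  def apply(value: Double, step: Long): Double
}

// Example implementation: a fixed 10x drop after 10000 steps.
object DropAfter10k extends Schedule {
  def apply(value: Double, step: Long): Double =
    if (step < 10000) value else value * 0.1
}
```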
Square root decay method.
This method applies a square root decay function to a provided initial learning rate (i.e., `value`). It requires a step value to be provided in its application function, in order to compute the decayed learning rate. You may simply pass a TensorFlow variable that you increment at each training step.
The decayed value is computed as follows:
decayed = value * decayFactor / sqrt(max(step, decayThreshold))
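A hedged Scala sketch of this computation (the function name is an assumption): the rate stays constant at `value * decayFactor / sqrt(decayThreshold)` until `step` exceeds `decayThreshold`, after which it decays as `1 / sqrt(step)`:

```scala
// Square root decay: constant until `step` exceeds `decayThreshold`,
// then proportional to 1 / sqrt(step).
def sqrtDecay(value: Double, step: Long, decayFactor: Double, decayThreshold: Double): Double =
  value * decayFactor / math.sqrt(math.max(step.toDouble, decayThreshold))
```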
Learning rate schedule that implements a warm-up scheme, similar to the one proposed in [Attention is All You Need (Section 5.3)](https://arxiv.org/pdf/1706.03762.pdf).
For the first `warmUpSteps` steps, the learning rate is multiplied by `exp(log(warmUpFactor) / step) ^ (warmUpSteps - step)`.
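A literal transcription of the formula above into a plain Scala function (the function name, the guard for steps at or beyond `warmUpSteps`, and the requirement `step >= 1` are assumptions; the formula divides by `step`, so step 0 is not well defined):

```scala
// Exponential warm-up multiplier, applied for the first warmUpSteps steps;
// afterwards the learning rate is assumed to be left unchanged.
// Assumes step >= 1 during warm-up, since the formula divides by `step`.
def warmUpExponential(value: Double, step: Long, warmUpSteps: Long, warmUpFactor: Double): Double =
  if (step >= warmUpSteps) value
  else value * math.pow(math.exp(math.log(warmUpFactor) / step), warmUpSteps - step)
```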
Learning rate schedule that implements a warm-up scheme, similar to the one proposed in [Attention is All You Need (Section 5.3)](https://arxiv.org/pdf/1706.03762.pdf).
For the first `warmUpSteps` steps, the learning rate is multiplied by `start + ((1.0f - start) / warmUpSteps) * step`.
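A hedged Scala sketch of this computation (the function name and the post-warm-up behavior are assumptions): the multiplier ramps linearly from `start` at step 0 to 1.0 at step `warmUpSteps`:

```scala
// Linear warm-up: the multiplier on `value` grows linearly from `start`
// at step 0 to 1.0 at step warmUpSteps; afterwards the learning rate is
// assumed to be left unchanged.
def warmUpLinear(value: Double, step: Long, warmUpSteps: Long, start: Double): Double =
  if (step >= warmUpSteps) value
  else value * (start + ((1.0 - start) / warmUpSteps) * step)
```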
Dummy scheduling method representing no schedule being used. Useful as a default value for `Schedule`-valued function arguments.
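Under the simplified functional model sketched earlier, this is just the identity schedule (an illustration, not the library's actual definition):

```scala
// Identity schedule: returns the learning rate unchanged at every step.
val noSchedule: (Double, Long) => Double = (value, _) => value
```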