Initial weights (pre-trained or random). weights(i)(j) is the weight for evidence i based on input parameter j, with i = 0..nClasses and j = 0..nInputs, where nInputs = inputLength + 1 (the extra pseudo-input is used for bias handling). Modeled as a flat array of concatenated rows.
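A minimal sketch of this flat, row-major layout (the names and the weightAt helper are illustrative, not part of the actual code):

val nClasses = 3
val inputLength = 4
val nInputs = inputLength + 1                        // +1 pseudo-input used for the bias
val weights = Array.fill(nClasses * nInputs)(0.0)    // nClasses rows concatenated into one flat array

// weight for class/evidence i and input parameter j
def weightAt(i: Int, j: Int): Double = weights(i * nInputs + j)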
Between 0 and 1. As we use AdaGrad, the effective rate will gradually decrease. You can start relatively high: 0.01 - 0.1
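A minimal AdaGrad update sketch, assuming flat weight/gradient arrays (adagradUpdate and its signature are illustrative, not the library's API):

// Per-weight step: the accumulated squared gradients grow monotonically,
// so learningRate / sqrt(accumulator) shrinks as training progresses.
def adagradUpdate(weights: Array[Double],
                  gradients: Array[Double],
                  gradSquaredSum: Array[Double],
                  learningRate: Double,
                  eps: Double = 1e-8): Unit = {
  var k = 0
  while (k < weights.length) {
    gradSquaredSum(k) += gradients(k) * gradients(k)
    val effectiveRate = learningRate / math.sqrt(gradSquaredSum(k) + eps)
    weights(k) -= effectiveRate * gradients(k)
    k += 1
  }
}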
How many more samples to try if there is no improvement. Set based on input size, your patience, and target accuracy.
Mini-batch size for "mini-batch SGD". Most of the time, 1 is the best size.
Use the numerically stable softmax version, which is more tolerant of broad input ranges. Recommended most of the time.
Gradients of the loss function used for backprop weight updates. See ref.
We have nClasses gradient vectors of the form:
grad(w(j)) = -(1/m) * sum(x * (target(j) - predicted(j))) + lambda * w(j), for j = 0..nClasses, where:
  the sum runs over a batch of m examples;
  w(j) = vector of weights for class j;
  x = input(i) = input vector (iterates over the batch);
  target(j) = known value (0 or 1) indicating whether input x belongs to class j (iterates over the batch);
  predicted(j) = predicted likelihood that input x belongs to class j (iterates over the batch);
  lambda > 0 is the weight decay parameter, necessary for convergence.
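A sketch of this gradient for a single class j (the classGradient name and signature are illustrative, not the library's API; inputs are assumed to already include the bias pseudo-input):

def classGradient(batch: Seq[(Array[Double], Int)],      // (input vector x, target class index)
                  predicted: Seq[Array[Double]],         // predicted(k)(j) = likelihood of class j for batch(k)
                  classWeights: Array[Double],           // w(j): weight vector of class j
                  j: Int,
                  lambda: Double): Array[Double] = {
  val m = batch.size
  val grad = Array.fill(classWeights.length)(0.0)
  for (((x, target), p) <- batch.zip(predicted)) {
    val diff = (if (target == j) 1.0 else 0.0) - p(j)    // target(j) - predicted(j)
    var k = 0
    while (k < x.length) { grad(k) -= x(k) * diff / m; k += 1 }   // -(1/m) * sum(x * diff)
  }
  var k = 0
  while (k < grad.length) { grad(k) += lambda * classWeights(k); k += 1 }   // + lambda * w(j)
  grad
}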
Convenience shortcut for feeding a sequence of examples, splitting it into suitable batches.
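A minimal sketch of such a shortcut (trainBatch is a hypothetical per-batch training callback):

def trainAll(examples: Seq[(Array[Double], Int)],
             batchSize: Int)
            (trainBatch: Seq[(Array[Double], Int)] => Unit): Unit =
  examples.grouped(batchSize).foreach(trainBatch)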
Predict the likelihoods of each class given the inputs.
y = softmax(x) = normalize(exp(x)), i.e. y(i) = exp(x(i)) / sum(exp(x))
Naive "by the book" version; only works with normalized, stable input.
The index of the currently processed example from the mini-batch. Used to save memory by writing the result directly to predicted(idx).
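A minimal "by the book" sketch (illustrative; unlike the real method it returns a fresh array instead of writing into predicted(idx)):

def softmaxNaive(x: Array[Double]): Array[Double] = {
  val exps = x.map(math.exp)    // exp(x(i)); can overflow/underflow for large-magnitude inputs
  val sum = exps.sum
  exps.map(_ / sum)             // normalize so the outputs sum to 1
}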
This version works around some numeric issues (overflow/underflow).
Original form: y(i) = exp(x(i)) / sum (exp(x))
Stable form: y(i) = exp( x(i) - logSumExp(x) ) where logSumExp(x) = max(x) + log(sum(exp(x - max(x))))
The index of the currently processed example from the mini-batch. Used to save memory by writing the result directly to predicted(idx).
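A sketch of the stable form above (illustrative; again it returns a new array rather than filling predicted(idx)):

def softmaxStable(x: Array[Double]): Array[Double] = {
  val maxX = x.max
  // logSumExp(x) = max(x) + log(sum(exp(x - max(x)))); shifting by max(x) avoids overflow
  val logSumExp = maxX + math.log(x.map(v => math.exp(v - maxX)).sum)
  x.map(v => math.exp(v - logSumExp))
}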
Softmax (multinomial logistic) regression with SGD and AdaGrad
Ref:
https://en.wikipedia.org/wiki/Multinomial_logistic_regression
http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression
http://ufldl.stanford.edu/wiki/index.php/Exercise:Softmax_Regression
http://blog.datumbox.com/machine-learning-tutorial-the-multinomial-logistic-regression-softmax-regression/
https://xcorr.net/2014/01/23/adagrad-eliminating-learning-rates-in-stochastic-gradient-descent/
https://en.wikipedia.org/wiki/Stochastic_gradient_descent#AdaGrad