ConvolutionMode defines how convolution operations should be executed for Convolutional and Subsampling layers,
for a given input size and network configuration (specifically stride/padding/kernel sizes).
Currently, four modes are provided:
Strict: The output size for Convolutional and Subsampling layers is calculated as follows, in each dimension:
outputSize = (inputSize - kernelSize + 2*padding) / stride + 1. If outputSize is not an integer, an exception will
be thrown during network initialization or forward pass.
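The Strict rule above can be sketched in plain Java as follows. This is an illustration only, not the library's implementation: the class name StrictOutputSizeSketch and the method strictOutputSize are hypothetical, and IllegalStateException is a stand-in for whatever exception the library actually throws.

```java
// Sketch only: names are hypothetical and not part of the library.
final class StrictOutputSizeSketch {

    /** Output size in one dimension under the Strict formula. */
    static int strictOutputSize(int inputSize, int kernelSize, int padding, int stride) {
        int span = inputSize - kernelSize + 2 * padding;
        if (stride <= 0 || span < 0) {
            // Sketch-only guard: the kernel must fit inside the padded input.
            throw new IllegalArgumentException("kernel does not fit in padded input, or stride <= 0");
        }
        if (span % stride != 0) {
            // Strict mode rejects configurations where the division is not exact.
            // IllegalStateException is a stand-in for the library's own exception type.
            throw new IllegalStateException(
                    "(inputSize - kernelSize + 2*padding) / stride is not an integer");
        }
        return span / stride + 1;
    }

    public static void main(String[] args) {
        // (28 - 5 + 0) / 1 + 1 = 24
        System.out.println(strictOutputSize(28, 5, 0, 1));
        try {
            // (28 - 5 + 0) / 2 = 11.5, so Strict rejects this configuration.
            strictOutputSize(28, 5, 0, 2);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```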
Truncate: The output size for Convolutional and Subsampling layers is calculated in the same way as in Strict (that
is, outputSize = (inputSize - kernelSize + 2*padding) / stride + 1) in each dimension.
If outputSize is an integer, then Strict and Truncate are identical. However, if outputSize is
not an integer, the output size will be rounded down to an integer value.
Specifically, ConvolutionMode.Truncate implements the following:
output height = floor((inputHeight - kernelHeight + 2*paddingHeight) / strideHeight) + 1.
output width = floor((inputWidth - kernelWidth + 2*paddingWidth) / strideWidth) + 1.
where 'floor' is the floor operation (i.e., round down to the nearest integer).
The major consequence of this rounding down: a border/edge effect will be seen if/when rounding down is required.
In effect, some number of inputs along the given dimension (height or width) will not be used as input and hence
some input activations can be lost/ignored. This can be problematic higher in the network (where the cropped activations
may represent a significant proportion of the original input), or with large kernel sizes and strides.
In the given dimension (height or width), the number of truncated/cropped input values is equal to
(inputSize - kernelSize + 2*padding) % stride, where % is the modulus/remainder operation.
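As a rough illustration of the Truncate formulas above, the following Java sketch computes the floored output size and the number of cropped input values in one dimension. The class and method names are hypothetical, and the guard against a kernel larger than the padded input is an assumption of this sketch rather than documented library behavior.

```java
// Sketch only: names are hypothetical and not part of the library.
final class TruncateOutputSizeSketch {

    /** floor((inputSize - kernelSize + 2*padding) / stride) + 1 */
    static int truncateOutputSize(int inputSize, int kernelSize, int padding, int stride) {
        int span = checkedSpan(inputSize, kernelSize, padding, stride);
        // span is non-negative here, so Java integer division equals floor.
        return span / stride + 1;
    }

    /** Number of trailing input values that no kernel window covers. */
    static int croppedInputs(int inputSize, int kernelSize, int padding, int stride) {
        return checkedSpan(inputSize, kernelSize, padding, stride) % stride;
    }

    private static int checkedSpan(int inputSize, int kernelSize, int padding, int stride) {
        int span = inputSize - kernelSize + 2 * padding;
        if (stride <= 0 || span < 0) {
            // Sketch-only guard: the kernel must fit inside the padded input.
            throw new IllegalArgumentException("kernel does not fit in padded input, or stride <= 0");
        }
        return span;
    }

    public static void main(String[] args) {
        // floor(23 / 2) + 1 = 12 outputs; 23 % 2 = 1 input value is cropped.
        System.out.println(truncateOutputSize(28, 5, 0, 2)); // 12
        System.out.println(croppedInputs(28, 5, 0, 2));      // 1
    }
}
```

In the example, the last kernel window starts at index 22 and ends at index 26, so the input value at index 27 never contributes to the output.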
Same: Same mode operates differently from Strict/Truncate, in three key ways:
(a) Manual padding values in the convolution/subsampling layer configuration are not used; padding values are instead
calculated automatically based on the input size, kernel size and strides.
(b) The output sizes are calculated differently (see below) compared to Strict/Truncate. Most notably, when stride = 1
the output size is the same as the input size.
(c) The calculated padding values may differ for top/bottom, and left/right (when they do differ: bottom and right
padding may be 1 row/column larger than top/left padding)
The output size of a Convolutional/Subsampling layer using ConvolutionMode.Same is calculated as follows:
output height = ceil( inputHeight / strideHeight )
output width = ceil( inputWidth / strideWidth )
where 'ceil' is the ceiling operation (i.e., round up to the nearest integer).
The padding for top/bottom and left/right is automatically calculated as follows:
totalHeightPadding = (outputHeight - 1) * strideHeight + kernelHeight - inputHeight
totalWidthPadding = (outputWidth - 1) * strideWidth + kernelWidth - inputWidth
topPadding = totalHeightPadding / 2 (note: integer division)
bottomPadding = totalHeightPadding - topPadding
leftPadding = totalWidthPadding / 2 (note: integer division)
rightPadding = totalWidthPadding - leftPadding
Note that if top/bottom padding differ, then bottomPadding = topPadding + 1
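The Same-mode formulas above can be sketched for a single dimension as shown here. The names are hypothetical and the code follows the formulas literally; it is not the library's implementation, and it does not handle configurations where the total padding would come out negative.

```java
// Sketch only: names are hypothetical and not part of the library.
final class SameModePaddingSketch {

    /** ceil(inputSize / stride), computed with integer arithmetic. */
    static int sameOutputSize(int inputSize, int stride) {
        return (inputSize + stride - 1) / stride;
    }

    /**
     * Returns {before, after} padding for one dimension:
     * {top, bottom} for height or {left, right} for width.
     * Assumes the total padding is non-negative.
     */
    static int[] samePadding(int inputSize, int kernelSize, int stride) {
        int outputSize = sameOutputSize(inputSize, stride);
        int totalPadding = (outputSize - 1) * stride + kernelSize - inputSize;
        int before = totalPadding / 2;       // integer division
        int after = totalPadding - before;   // gets the extra 1 when totalPadding is odd
        return new int[] {before, after};
    }

    public static void main(String[] args) {
        // Stride 1: output size equals input size, padding is symmetric.
        System.out.println(sameOutputSize(28, 1));                           // 28
        System.out.println(java.util.Arrays.toString(samePadding(28, 3, 1))); // [1, 1]
        // Stride 2, kernel 5: total padding is 3, so bottom/right gets one more.
        System.out.println(sameOutputSize(28, 2));                           // 14
        System.out.println(java.util.Arrays.toString(samePadding(28, 5, 2))); // [1, 2]
    }
}
```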
Causal: Causal padding mode can only be used for 1D convolutional neural networks.
The motivation behind causal padding mode is that the output time steps depend only on current and past time steps.
That is, out[t] (for time t) depends only on values in[T] for T <= t.
The output size of 1D convolution/subsampling layers is the same as with Same convolution mode -
i.e., outSize = ceil( inputSize / stride )
The total padding is also the same as in Same mode, but all padding is on the left (start of sequence) instead of
being split between the left and right of the input.
For more details on causal convolutions, see "WaveNet: A Generative Model for Raw Audio",
section 2.1.
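The causal padding rule described above can be sketched as follows, under the assumption that the total padding is computed with the Same-mode formula and placed entirely on the left. The names are hypothetical, dilation is not considered, and the lastInputIndexSeen helper only illustrates the stride 1 case.

```java
// Sketch only: names are hypothetical and not part of the library.
final class CausalPaddingSketch {

    /** Same output size as Same mode: ceil(inputSize / stride). */
    static int causalOutputSize(int inputSize, int stride) {
        return (inputSize + stride - 1) / stride;
    }

    /** Total Same-mode padding, all placed at the start of the sequence. */
    static int causalLeftPadding(int inputSize, int kernelSize, int stride) {
        int outputSize = causalOutputSize(inputSize, stride);
        return (outputSize - 1) * stride + kernelSize - inputSize;
    }

    /**
     * For stride 1 only: index of the latest input step that the kernel
     * window producing output step t can see, given the left padding.
     */
    static int lastInputIndexSeen(int t, int kernelSize, int leftPadding) {
        return t + kernelSize - 1 - leftPadding;
    }

    public static void main(String[] args) {
        int inputSize = 10, kernelSize = 3, stride = 1;
        int leftPadding = causalLeftPadding(inputSize, kernelSize, stride);
        System.out.println(causalOutputSize(inputSize, stride)); // 10
        System.out.println(leftPadding);                         // 2 (right padding is 0)
        // Output step 4 sees input steps 2..4, never a future step.
        System.out.println(lastInputIndexSeen(4, kernelSize, leftPadding)); // 4
    }
}
```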
For further information on output sizes for convolutional neural networks, see the "Spatial arrangement" section at
http://cs231n.github.io/convolutional-networks/