com.eharmony.aloha.dataset.vw.multilabel
all labels in the training set. This is a sequence because order matters. Order here can be chosen arbitrarily, but it must be consistent in the training and test formulation.
features to extract from the data of type A.
list of feature indices in the default VW namespace.
a mapping from VW namespace name to feature indices in that namespace.
can modify VW output (currently unused)
A method that can extract positive class labels.
the namespace name for class information.
the namespace name for dummy class information. 2 dummy classes are added to make the predicted probabilities work.
a positive value representing the number of negative labels to include in each row. If this value is greater than or equal to the number of negative labels for a given row, then no downsampling of negatives will take place for that row.
a "function" that creates a seed that will be used for randomness. The implementation of this function is important. It should create a unique value for each unit of parallelism. If, for example, row creation is parallelized across multiple threads on one machine, the unit of parallelism is threads and seedCreator should produce unique values for each thread. If row creation is parallelized across multiple machines, the seedCreator should produce a unique value for each machine. If row creation is parallelized across machines and threads on each machine, the seedCreator should create unique values for each thread on each machine. Otherwise, randomness will be striped: units of parallelism that share a seed will draw the same random sequence.
include zero values in VW input?
all labels in the training set. This is a sequence because order matters. Order here can be chosen arbitrarily, but it must be consistent in the training and test formulation.
Given an a and some seed, produce output, including a new seed.
When using this function, the user is responsible for keeping track of, and providing the seeds.
The implementation of this function should be referentially transparent.
input
the random seed which is updated on each call.
a tuple whose first element is a Tuple2 containing the missing and error information and an optional result, and whose second element is the new state.
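The calling convention described above (the caller supplies a seed and receives a new one with each result) can be sketched as follows. This is a minimal illustration, not the library's implementation: `Info`, the `String` input and output types, and the `Long` seed are hypothetical stand-ins for the real types.

```scala
object StatefulApplySketch {
  // Hypothetical stand-in for the missing and error information.
  final case class Info(missing: Seq[String])

  // Referentially transparent: the same (a, seed) always yields the same result.
  // Returns ((info, optional result), new seed), mirroring the documented shape.
  def apply(a: String, seed: Long): ((Info, Option[String]), Long) = {
    val rng = new scala.util.Random(seed)
    val draw = rng.nextInt(100)
    val nextSeed = rng.nextLong()
    val result = if (a.isEmpty) None else Some(s"$a:$draw")
    val info = Info(if (a.isEmpty) Seq("a") else Nil)
    ((info, result), nextSeed)
  }

  def main(args: Array[String]): Unit = {
    // The caller threads the seed from one call into the next.
    val ((_, row1), seed1) = apply("x", 42L)
    val ((_, row2), _) = apply("y", seed1)
    println(row1)
    println(row2)
  }
}
```

Because the function keeps no hidden state, replaying a call with the same seed reproduces the same row, which makes the output repeatable.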
the namespace name for class information.
Issue a debug logging message, with an exception.
the message object. toString() is called to convert it to a loggable string.
the exception to include with the logged message.
Issue a debug logging message.
the message object. toString() is called to convert it to a loggable string.
list of feature indices in the default VW namespace.
the namespace name for dummy class information. 2 dummy classes are added to make the predicted probabilities work.
Issue an error logging message, with an exception.
the message object. toString() is called to convert it to a loggable string.
the exception to include with the logged message.
Issue an error logging message.
the message object. toString() is called to convert it to a loggable string.
features to extract from the data of type A.
include zero values in VW input?
Issue an info logging message, with an exception.
the message object. toString() is called to convert it to a loggable string.
the exception to include with the logged message.
Issue an info logging message.
the message object. toString() is called to convert it to a loggable string.
Some initial state that can be used on the very first call to apply(A, S).
some state.
Determine whether debug logging is enabled.
Determine whether error logging is enabled.
Determine whether info logging is enabled.
Determine whether trace logging is enabled.
Determine whether warn logging is enabled.
The logger is a @transient lazy val so that it works properly with Spark. The logger will not be serialized with the rest of the class into which this trait is mixed.
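A minimal sketch of the @transient lazy val pattern described above. It uses java.util.logging.Logger only to stay self-contained; the logging backend, the trait name `LoggingSketch`, the `Job` class, and the `roundTrip` helper are assumptions made for illustration and are not this library's API.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.logging.Logger

// Hypothetical trait illustrating the pattern.
trait LoggingSketch {
  // Can be overridden in a derived class to change the logger name.
  protected def loggerInitName(): String = getClass.getName

  // @transient: the field is skipped by Java serialization.
  // lazy: the logger is created again on first use after deserialization.
  @transient protected lazy val logger: Logger = Logger.getLogger(loggerInitName())
}

// Hypothetical serializable class that mixes in the trait.
final case class Job(name: String) extends LoggingSketch {
  def run(): String = {
    logger.fine(s"running $name")
    logger.getName
  }
}

object LoggingSketchDemo {
  // Round-trips a value through plain Java serialization.
  def roundTrip[T <: java.io.Serializable](t: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(t)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

In this sketch, calling `run()` before `roundTrip` initializes the logger, yet serialization still succeeds because the field is transient. The deserialized copy builds its own logger the first time it logs.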
The name with which the logger is initialized. This can be overridden in a derived class.
Get the name associated with this logger.
the name.
a mapping from VW namespace name to feature indices in that namespace.
can modify VW output (currently unused)
a positive value representing the number of negative labels to include in each row. If this value is greater than or equal to the number of negative labels for a given row, then no downsampling of negatives will take place for that row.
A method that can extract positive class labels.
a "function" that creates a seed that will be used for randomness. The implementation of this function is important. It should create a unique value for each unit of parallelism. If, for example, row creation is parallelized across multiple threads on one machine, the unit of parallelism is threads and seedCreator should produce unique values for each thread. If row creation is parallelized across multiple machines, the seedCreator should produce a unique value for each machine. If row creation is parallelized across machines and threads on each machine, the seedCreator should create unique values for each thread on each machine. Otherwise, randomness will be striped: units of parallelism that share a seed will draw the same random sequence.
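One way to satisfy the uniqueness requirement above is sketched below. It assumes the seed creator has the shape `() => Long` and that each machine is given a distinct integer id by the caller; both are assumptions for illustration, as are the names `seedFor` and `seedCreator` in this object.

```scala
object SeedCreatorSketch {
  // Packs the machine id into the high 32 bits and the thread id into the
  // low 32 bits, then mixes in a base seed. For a fixed baseSeed, distinct
  // (machineId, threadId) pairs give distinct seeds as long as thread ids
  // fit in 32 bits.
  def seedFor(baseSeed: Long, machineId: Int, threadId: Long): Long =
    baseSeed ^ ((machineId.toLong << 32) | (threadId & 0xFFFFFFFFL))

  // Hypothetical seed creator: evaluated on the worker thread, so each
  // thread on each machine obtains its own seed.
  def seedCreator(baseSeed: Long, machineId: Int): () => Long =
    () => seedFor(baseSeed, machineId, Thread.currentThread().getId)
}
```

A constant such as `() => 0L` would be the failure case the text warns about: every thread and machine would start from the same seed and draw the same random sequence.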
Apply the apply(A, S) method to the elements of the sequence. In the first application of apply(A, S), state will be used as the state. In subsequent applications, the state will come from the state generated in the output of the previous application of apply(A, S).
NOTE: This method isn't really parallelizable via chunking. The way to parallelize this method is to provide a separate starting state for each unit of parallelism.
For more information, see com.eharmony.aloha.util.StatefulMapOps
input to map.
the initial state to use at the start of mapping.
object responsible for building the output collection.
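The state chaining described above for sequences can be sketched with a left fold. This is a simplified illustration, not the library's implementation: the helper `statefulMap` below, its type parameters, and the `Vector` result type are assumptions.

```scala
object StatefulMapSketch {
  // Strict stateful map: each call to f receives the state produced by the
  // previous call; the first call receives the supplied initial state.
  def statefulMap[A, B, S](as: Seq[A], state: S)(f: (A, S) => (B, S)): Vector[(B, S)] = {
    val (out, _) = as.foldLeft((Vector.empty[(B, S)], state)) {
      case ((acc, s), a) =>
        val (b, s1) = f(a, s)
        (acc :+ ((b, s1)), s1)
    }
    out
  }

  def main(args: Array[String]): Unit = {
    // The state here is a running sum of the inputs seen so far.
    val f = (a: Int, s: Int) => (a * 10, s + a)
    println(statefulMap(Seq(1, 2, 3), 0)(f))

    // Parallelizing means one independent starting state per chunk.
    val chunks = Seq(Seq(1, 2), Seq(3, 4))
    val startStates = Seq(0, 100)
    println(chunks.zip(startStates).map { case (c, s) => statefulMap(c, s)(f) })
  }
}
```

Each element depends on the state left by its predecessor, so a single sequence cannot simply be split into chunks. Giving every chunk its own starting state, as in the second call, is what makes the work independent.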
Apply the apply(A, S) method to the elements of the iterator. In the first application of apply(A, S), state will be used as the state. In subsequent applications, the state will come from the state generated in the output of the previous application of apply(A, S).
For more information, see com.eharmony.aloha.util.StatefulMapOps
Note the first element of as will be forced in this method in order to construct the output.
the initial state to use at the start of the iterator.
an iterator containing each a mapped to a (MissingAndErroneousFeatureInfo, Option[B]) along with the resulting state that is created in the process.
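The non-strict behavior for iterators, including the eager evaluation of the first element, can be sketched with `scanLeft`. This is an illustration under assumptions, not the library's implementation: the helper name and signature below are hypothetical.

```scala
object LazyStatefulMapSketch {
  // Non-strict stateful map over an iterator.
  def statefulMap[A, B, S](as: Iterator[A], state: S)(f: (A, S) => (B, S)): Iterator[(B, S)] =
    if (!as.hasNext) Iterator.empty
    else {
      // The first element is consumed eagerly to seed the scan.
      val first = f(as.next(), state)
      // Later elements are computed on demand from the previous output's state.
      as.scanLeft(first) { case ((_, s), a) => f(a, s) }
    }

  def main(args: Array[String]): Unit = {
    var calls = 0
    val f = (a: Int, s: Int) => { calls += 1; (a * 10, s + a) }
    val it = statefulMap(Iterator(1, 2, 3), 0)(f)
    println(s"calls before iterating: $calls")
    println(it.toList)
    println(s"calls after iterating: $calls")
  }
}
```

Only the first application of `f` happens when the iterator is built. The rest run as the result is consumed, so large inputs need not be held in memory at once.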
Issue a trace logging message, with an exception.
the message object. toString() is called to convert it to a loggable string.
the exception to include with the logged message.
Issue a trace logging message.
the message object. toString() is called to convert it to a loggable string.
Issue a warn logging message, with an exception.
the message object. toString() is called to convert it to a loggable string.
the exception to include with the logged message.
Issue a warn logging message.
the message object. toString() is called to convert it to a loggable string.
Creates training data for multilabel models in Vowpal Wabbit's CSOAA LDF and WAP LDF format for the JNI. In this row creator, negative labels are downsampled and costs for the downsampled labels are adjusted to produce an unbiased estimator. It is assumed that negative labels are in the majority. Downsampling negatives can improve both training time and possibly model performance. See the following resources for intuition:
Because this row creator is stateful, the caller must maintain the state. If, however, it is only called via an iterator or sequence, then this row creator can maintain the state during iteration over the iterator or sequence. In the case of iterators, the mapping is non-strict and in the case of sequences (Seq), it is strict.
the input type
the label or class type
all labels in the training set. This is a sequence because order matters. Order here can be chosen arbitrarily, but it must be consistent in the training and test formulation.
features to extract from the data of type A.
list of feature indices in the default VW namespace.
a mapping from VW namespace name to feature indices in that namespace.
can modify VW output (currently unused)
A method that can extract positive class labels.
the namespace name for class information.
the namespace name for dummy class information. 2 dummy classes are added to make the predicted probabilities work.
a positive value representing the number of negative labels to include in each row. If this value is greater than or equal to the number of negative labels for a given row, then no downsampling of negatives will take place for that row.
a "function" that creates a seed that will be used for randomness. The implementation of this function is important. It should create a unique value for each unit of parallelism. If, for example, row creation is parallelized across multiple threads on one machine, the unit of parallelism is threads and seedCreator should produce unique values for each thread. If row creation is parallelized across multiple machines, the seedCreator should produce a unique value for each machine. If row creation is parallelized across machines and threads on each machine, the seedCreator should create unique values for each thread on each machine. Otherwise, randomness will be striped: units of parallelism that share a seed will draw the same random sequence.
include zero values in VW input?
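The class description says that costs for downsampled negative labels are adjusted so that the estimator stays unbiased, but does not give the formula. The sketch below shows one common way to do this, inverse-probability weighting, and should be read as an assumption rather than this class's actual computation. The names `downsample`, `negIndices`, and `k` are hypothetical.

```scala
object DownsampleSketch {
  // Samples k of the n negative label indices without replacement.
  // Each kept negative is weighted by n / k, the inverse of its inclusion
  // probability k / n, so the expected total weight of negatives is still n.
  // Returns the weighted sample and a new seed for the next row.
  def downsample(negIndices: Vector[Int], k: Int, seed: Long): (Vector[(Int, Double)], Long) = {
    require(k > 0, "k must be positive")
    val rng = new scala.util.Random(seed)
    val n = negIndices.size
    val sampled =
      if (n <= k) negIndices.map(i => (i, 1.0)) // nothing to downsample
      else {
        val weight = n.toDouble / k
        rng.shuffle(negIndices).take(k).sorted.map(i => (i, weight))
      }
    (sampled, rng.nextLong())
  }
}
```

With 10 negatives and `k = 4`, each kept negative carries weight 2.5 in this sketch, so the four sampled labels stand in for all ten.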
11/6/2017