TopNOneHotEncoder
Transform a collection of categorical features to binary columns, with at most a single
one-value. Only the top N items are tracked.
The list of top N is estimated with Algebird's SketchMap data structure. With probability
at least 1 - delta, this estimate is within eps * N of the true frequency
(i.e., true frequency <= estimate <= true frequency + eps * N), where N is the total size
of the input collection.
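To make the bound concrete, the following sketch computes the interval that should contain a frequency estimate with probability at least 1 - delta. It is plain Scala and does not call Algebird; the names `SketchBound`, `maxOverestimate`, and `estimateInterval` are hypothetical, and the numbers are illustrative only.

```scala
object SketchBound {
  // Largest overestimate permitted by the bound: eps * N,
  // where N (totalSize) is the total size of the input collection.
  def maxOverestimate(eps: Double, totalSize: Long): Double =
    eps * totalSize

  // Interval [true frequency, true frequency + eps * N] in which the
  // estimate lies with probability at least 1 - delta.
  def estimateInterval(trueFrequency: Long, eps: Double, totalSize: Long): (Double, Double) =
    (trueFrequency.toDouble, trueFrequency + maxOverestimate(eps, totalSize))

  def main(args: Array[String]): Unit = {
    // Illustrative values: eps = 0.001 and one million input items.
    val (lo, hi) = estimateInterval(trueFrequency = 5000L, eps = 0.001, totalSize = 1000000L)
    println(s"estimate expected in [$lo, $hi]")
  }
}
```

A smaller eps tightens the interval. In a count-min-style sketch this usually costs more memory, so the values above should not be read as recommended defaults.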
Missing values are either transformed to zero vectors or encoded as __unknown__.
Value members
Methods
def apply(name: String, n: Int, eps: Double, delta: Double, seed: Int, encodeMissingValue: Boolean): Transformer[String, SketchMap[String, Long], SortedMap[String, Int]]
Create a new TopNOneHotEncoder instance.
- Value Params
  - delta: a bound on the probability that a query estimate does not lie within some small
    interval (an interval that depends on eps) around the truth
  - encodeMissingValue: whether to encode items outside of the top n set as __unknown__
  - eps: one-sided error bound on the error of each point query, i.e. frequency estimate
  - n: number of items to keep track of
  - seed: a seed to initialize the random number generator used to create the pairwise
    independent hash functions
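The effect of n and encodeMissingValue can be illustrated with a small stand-alone sketch. It is not the library's implementation: exact counting replaces the SketchMap estimate, the names `TopNOneHotSketch`, `topN`, and `encode` are hypothetical, and placing the `__unknown__` column last is an assumption made only for this example.

```scala
import scala.collection.immutable.SortedMap

object TopNOneHotSketch {
  val Unknown = "__unknown__"

  // Build a value -> column index map for the n most frequent values.
  // Assumption: exact counts stand in for the approximate SketchMap counts.
  def topN(values: Seq[String], n: Int): SortedMap[String, Int] = {
    val counts = values.groupBy(identity).map { case (k, vs) => k -> vs.size }
    // Highest count first; ties broken alphabetically for determinism.
    val kept = counts.toSeq.sortBy { case (k, c) => (-c, k) }.take(n).map(_._1).sorted
    SortedMap(kept.zipWithIndex: _*)
  }

  // Encode one value as a row with at most a single 1.0.
  // Values outside the top n become a zero vector, or set the extra
  // __unknown__ column (last, by assumption) when encodeMissingValue is true.
  def encode(value: String, index: SortedMap[String, Int], encodeMissingValue: Boolean): Array[Double] = {
    val width = index.size + (if (encodeMissingValue) 1 else 0)
    val row = Array.fill(width)(0.0)
    index.get(value) match {
      case Some(i)                    => row(i) = 1.0
      case None if encodeMissingValue => row(index.size) = 1.0
      case None                       => ()
    }
    row
  }
}
```

For example, with the input `Seq("a", "a", "a", "b", "b", "c")` and n = 2, only "a" and "b" get their own columns. "c" then becomes a zero vector, or sets the `__unknown__` column when encodeMissingValue is true.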
def fromSettings(setting: Settings): Transformer[String, SketchMap[String, Long], SortedMap[String, Int]]
Create a new TopNOneHotEncoder from a settings object.
- Value Params
  - setting: Settings object