TopNOneHotEncoder

Transform a collection of categorical features to binary columns, with at most one column
set to 1 for each input value. Only the top n items are tracked.
The top n items are estimated with Algebird's SketchMap data structure. With probability
at least 1 - delta, each frequency estimate is within eps * N of the true frequency (i.e.,
true frequency <= estimate <= true frequency + eps * N), where N is the total size of the
input collection (not the number of tracked items n).
Missing values are either transformed to zero vectors or encoded as __unknown__.
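
The encoding described above can be sketched in plain Scala. This is only an illustration under stated assumptions, not Featran's implementation: exact counts stand in for the SketchMap estimate, the object name `TopNOneHotSketch` and its helpers are hypothetical, and placing the `__unknown__` column last is an assumption made for the example.

```scala
import scala.collection.immutable.SortedMap

// Hypothetical helper, not part of Featran.
object TopNOneHotSketch {
  val Unknown = "__unknown__"

  // Map each of the n most frequent values to a column index.
  // Exact counting is used here in place of the SketchMap estimate.
  def topN(values: Seq[String], n: Int): SortedMap[String, Int] = {
    val counts = values.groupBy(identity).map { case (k, v) => (k, v.size) }
    // Highest count first; ties are broken alphabetically.
    val kept = counts.toSeq.sortBy { case (k, c) => (-c, k) }.take(n).map(_._1).sorted
    SortedMap(kept.zipWithIndex: _*)
  }

  // Encode one value as a vector with at most a single 1.0.
  // Absent values and values outside the top n are treated the same way.
  def encode(
    value: Option[String],
    index: SortedMap[String, Int],
    encodeMissingValue: Boolean
  ): Vector[Double] = {
    // Assumption: the __unknown__ column, when present, is the last column.
    val size = index.size + (if (encodeMissingValue) 1 else 0)
    val zeros = Vector.fill(size)(0.0)
    value.flatMap(index.get) match {
      case Some(i)                    => zeros.updated(i, 1.0)
      case None if encodeMissingValue => zeros.updated(index.size, 1.0)
      case None                       => zeros
    }
  }
}
```

With the input `Seq("a", "b", "a", "c", "a", "b")` and n = 2, the tracked values are "a" and "b". In this sketch, "c" then encodes to a zero vector when `encodeMissingValue` is false, and to a vector whose only 1.0 is in the extra `__unknown__` column when it is true.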
Supertypes
class Object
trait Matchable
class Any

Value members

Methods

def apply(name: String, n: Int, eps: Double, delta: Double, seed: Int, encodeMissingValue: Boolean): Transformer[String, SketchMap[String, Long], SortedMap[String, Int]]
Create a new TopNOneHotEncoder instance.
Value Params
delta
a bound on the probability that a query estimate does not lie within some small
interval (an interval that depends on eps) around the truth
encodeMissingValue
whether to encode items outside of the top n set as
__unknown__
eps
one-sided bound on the error of each point query, i.e. each frequency estimate
n
number of items to keep track of
seed
a seed to initialize the random number generator used to create the pairwise
independent hash functions
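
To make the roles of eps, delta and seed more concrete, here is a minimal Count-Min sketch in plain Scala. It is a simplified stand-in, not Algebird's SketchMap, whose internals may differ; the class name `CountMin` and the sizing formulas (width = ceil(e / eps), depth = ceil(ln(1 / delta))) follow the textbook Count-Min construction and are assumptions for this example. It assumes 0 < eps and 0 < delta < 1.

```scala
import scala.util.Random

// Illustrative Count-Min sketch; hypothetical, not Algebird's SketchMap.
final class CountMin(eps: Double, delta: Double, seed: Int) {
  // Textbook sizing: smaller eps means wider rows, smaller delta means more rows.
  private val width = math.ceil(math.E / eps).toInt
  private val depth = math.ceil(math.log(1.0 / delta)).toInt
  private val prime = 2147483647L // 2^31 - 1

  // The seed fixes the coefficients of one hash function per row:
  // h(x) = ((a * x + b) mod prime) mod width
  private val rng = new Random(seed)
  private val coeffs: Array[(Long, Long)] =
    Array.fill(depth)((1L + rng.nextInt(Int.MaxValue - 1), rng.nextInt(Int.MaxValue).toLong))

  private val table = Array.ofDim[Long](depth, width)

  private def bucket(row: Int, key: String): Int = {
    val (a, b) = coeffs(row)
    val x = key.hashCode.toLong & 0x7fffffffL // non-negative 31-bit value
    (((a * x + b) % prime) % width).toInt
  }

  // Increment one counter per row.
  def add(key: String): Unit =
    for (row <- 0 until depth) table(row)(bucket(row, key)) += 1L

  // Collisions can only inflate a counter, so the minimum over rows
  // never underestimates the true frequency (the one-sided error).
  def estimate(key: String): Long =
    (0 until depth).map(row => table(row)(bucket(row, key))).min
}
```

In this sketch an estimate can never fall below the true count. The upper bound of true frequency + eps * N is only probabilistic: it holds with probability at least 1 - delta rather than always.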
def fromSettings(setting: Settings): Transformer[String, SketchMap[String, Long], SortedMap[String, Int]]
Create a new TopNOneHotEncoder from a settings object.
Value Params
setting
Settings object