HashNHotEncoder
Transform a collection of categorical features to binary columns, with at most N one-values.
Similar to NHotEncoder but uses MurmurHash3 to hash features into buckets to reduce CPU
and memory overhead.
Missing values are transformed to zero vectors.
If hashBucketSize is inferred with HLL, the estimate is scaled by sizeScalingFactor to reduce
the number of collisions.
Rough relationship between scaling factor and collision rate, measured on a corpus of 466544
English words:
{{{
sizeScalingFactor % Collisions
----------------- ------------
2 17.9934%
4 10.5686%
8 5.7236%
16 3.0019%
32 1.5313%
64 0.7864%
128 0.3920%
256 0.1998%
512 0.0975%
1024 0.0478%
2048 0.0236%
4096 0.0071%
}}}
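The bucketing idea behind the collision figures above can be sketched in a few lines. This is not featran's internal code, just a minimal illustration of how MurmurHash3 maps a categorical feature into one of hashBucketSize buckets; the bucket function is hypothetical.

{{{
import scala.util.hashing.MurmurHash3

// Hash a categorical feature into one of `hashBucketSize` buckets.
// floorMod keeps the result non-negative even for negative hashes.
def bucket(feature: String, hashBucketSize: Int): Int =
  Math.floorMod(MurmurHash3.stringHash(feature), hashBucketSize)

// Distinct features may land in the same bucket (a collision);
// a larger bucket count lowers the collision rate, as in the table above.
val b1 = bucket("sci-fi", 1024)
val b2 = bucket("horror", 1024)
}}}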
Value members
Methods
def apply(name: String, hashBucketSize: Int, sizeScalingFactor: Double): Transformer[Seq[String], HLL, Int]
Create a new HashNHotEncoder instance.
Value Params
- hashBucketSize: number of buckets, or 0 to infer from data with HyperLogLog
- sizeScalingFactor: when hashBucketSize is 0, scale the HLL estimate by this amount
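A hedged usage sketch, assuming featran's FeatureSpec API; the Movie case class and its genres field are hypothetical, and the parameter values shown are just examples.

{{{
import com.spotify.featran._
import com.spotify.featran.transformers._

case class Movie(genres: Seq[String])

// Build a spec that n-hot encodes the genres field into hashed buckets;
// hashBucketSize = 0 infers the bucket count with HyperLogLog,
// scaled by sizeScalingFactor to reduce collisions.
val spec = FeatureSpec.of[Movie]
  .required(_.genres)(
    HashNHotEncoder("genres", hashBucketSize = 0, sizeScalingFactor = 8.0))

val movies = Seq(Movie(Seq("action", "sci-fi")), Movie(Seq("drama")))
val extracted = spec.extract(movies)
val vectors = extracted.featureValues[Seq[Double]]
}}}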