HashNHotWeightedEncoder

Transforms a collection of weighted categorical features into columns of weight sums, with at
most N values. Similar to NHotWeightedEncoder, but uses MurmurHash3 to hash features into
buckets, reducing CPU and memory overhead.
Unlike the plain NHotEncoder, which emits 1.0 per label, the weights of identical labels
within a row are summed.
If hashBucketSize is inferred with HLL (HyperLogLog), the estimate is scaled by
sizeScalingFactor to reduce the number of collisions.
A rough mapping from scaling factor to collision rate, measured on a corpus of 466544
English words:
{{{
sizeScalingFactor % Collisions
----------------- ------------
           2     17.9934%
           4     10.5686%
           8      5.7236%
          16      3.0019%
          32      1.5313%
          64      0.7864%
         128      0.3920%
         256      0.1998%
         512      0.0975%
        1024      0.0478%
        2048      0.0236%
        4096      0.0071%
}}}
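The bucketing and weight-summing behavior described above can be sketched in plain Scala. This is an illustrative stand-in, not Featran's implementation: `WeightedLabel` and `encode` are hypothetical simplifications, though `scala.util.hashing.MurmurHash3` is the real hashing utility.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical stand-in for Featran's WeightedLabel: a label name plus a weight.
case class WeightedLabel(name: String, value: Double)

// Minimal sketch of the encoder's core semantics, assuming MurmurHash3 string
// hashing and a fixed bucket count: each label hashes to a bucket, and the
// weights of labels landing in the same bucket are summed.
def encode(row: Seq[WeightedLabel], hashBucketSize: Int): Array[Double] = {
  val out = Array.fill(hashBucketSize)(0.0)
  row.foreach { wl =>
    val bucket = Math.floorMod(MurmurHash3.stringHash(wl.name), hashBucketSize)
    out(bucket) += wl.value // duplicate labels accumulate, not overwrite
  }
  out
}

// "a" appears twice in the row, so its weights 0.5 and 1.5 sum to 2.0.
val row = Seq(WeightedLabel("a", 0.5), WeightedLabel("a", 1.5), WeightedLabel("b", 2.0))
val vec = encode(row, 8)
println(vec.sum) // total weight across all buckets is preserved
```

Note that distinct labels may still hash to the same bucket; the collision table above quantifies how often that happens at various scaling factors.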
Supertypes: class Object, trait Matchable, class Any

Value members

Methods

def apply(name: String, hashBucketSize: Int, sizeScalingFactor: Double): Transformer[Seq[WeightedLabel], HLL, Int]
Create a new HashNHotWeightedEncoder instance.
Value Params
hashBucketSize
number of buckets, or 0 to infer from data with HyperLogLog
sizeScalingFactor
when hashBucketSize is 0, scale HLL estimate by this amount
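To see why scaling the inferred bucket count matters, the collision trend from the table can be reproduced with a small self-contained experiment. This is a sketch, not Featran code: it uses a synthetic vocabulary of 1000 strings in place of the 466544-word English corpus, and an exact distinct count in place of an HLL estimate.

```scala
import scala.util.hashing.MurmurHash3

// Synthetic stand-in for a real vocabulary.
val words = (1 to 1000).map(i => s"word$i")

// Count how many values land in an already-occupied bucket when the bucket
// count is (number of distinct values) * sizeScalingFactor.
def collisions(sizeScalingFactor: Int): Int = {
  val buckets = words.size * sizeScalingFactor
  val occupied =
    words.map(w => Math.floorMod(MurmurHash3.stringHash(w), buckets)).distinct.size
  words.size - occupied
}

// Larger scaling factors leave more empty buckets, so fewer collisions.
Seq(2, 16, 128, 1024).foreach { f =>
  println(s"factor $f -> ${collisions(f)} collisions")
}
```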
def fromSettings(setting: Settings): Transformer[Seq[WeightedLabel], HLL, Int]
Create a new HashNHotWeightedEncoder from a settings object.
Value Params
setting
Settings object