com.twitter.finatra.kafkastreams.dsl.FinatraDslSampling
Counts and samples an attribute of a stream of records.
This transformer uses two state stores:
numCountsStore is a KeyValueStore[SampleKey, Long] which stores a SampleKey and the total number of times that SampleKey was seen.
sampleStore is a KeyValueStore[IndexedSampleKey[SampleKey], SampleValue] which stores the samples themselves. The Key is an IndexedSampleKey, which is your sample key wrapped with an index of 0..sampleSize. The value is the SampleValue that you want to sample.
Example: if you had a stream of Interaction(engagingUserId, engagementType) and you wanted a sample of users who performed each engagement type, then your sampleKey would be engagementType and your sampleValue would be userId.
Incoming stream:
(engagingUserId = 12, engagementType = Displayed)
(engagingUserId = 100, engagementType = Favorited)
(engagingUserId = 101, engagementType = Favorited)
(engagingUserId = 12, engagementType = Favorited)
This is what the numCountsStore table would look like. SampleKey is the EngagementType.
|-----------|-------|
| SampleKey | Count |
|-----------|-------|
| Displayed | 1     |
| Favorited | 3     |
|-----------|-------|
This is what the sampleStore table would look like. SampleKey is the EngagementType; SampleValue is the engaging user id.
|-----------------------------|-------------|
| IndexedSampleKey[SampleKey] | SampleValue |
|-----------------------------|-------------|
| (Displayed, index = 0)      | 12          |
| (Favorited, index = 0)      | 100         |
| (Favorited, index = 1)      | 101         |
| (Favorited, index = 2)      | 12          |
|-----------------------------|-------------|
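The bookkeeping described above can be sketched with plain Scala collections. This is a minimal model, not the transformer's implementation: it has no Kafka Streams or Finatra dependency, and the names Interaction, StoreModel and SamplingSketch, as well as the sampleSize of 10, are assumptions made for illustration. It also assumes free slots are filled in arrival order; what happens once a key has been seen more than sampleSize times is not described here, so the sketch simply stops adding samples at that point.

```scala
import scala.collection.mutable

// Hypothetical record type matching the Interaction example above.
final case class Interaction(engagingUserId: Long, engagementType: String)

// Stand-in for the real IndexedSampleKey: a sample key plus a slot index.
final case class IndexedSampleKey[K](sampleKey: K, index: Int)

// In-memory model of the two state stores (hypothetical helper class).
final class StoreModel[K, V](sampleSize: Int) {
  val numCountsStore: mutable.Map[K, Long] = mutable.Map.empty
  val sampleStore: mutable.Map[IndexedSampleKey[K], V] = mutable.Map.empty

  def add(key: K, value: V): Unit = {
    val seen = numCountsStore.getOrElse(key, 0L)
    // Assumption: fill the next free slot while fewer than sampleSize
    // values have been seen for this key; later values are not stored.
    if (seen < sampleSize) {
      sampleStore(IndexedSampleKey(key, seen.toInt)) = value
    }
    // The count is always incremented, whether or not a sample was stored.
    numCountsStore(key) = seen + 1
  }
}

object SamplingSketch {
  // Feeds the four example interactions through the model.
  def run(): StoreModel[String, Long] = {
    val stores = new StoreModel[String, Long](sampleSize = 10)
    val incoming = Seq(
      Interaction(12L, "Displayed"),
      Interaction(100L, "Favorited"),
      Interaction(101L, "Favorited"),
      Interaction(12L, "Favorited")
    )
    // sampleKey = engagementType, sampleValue = engagingUserId
    incoming.foreach(i => stores.add(i.engagementType, i.engagingUserId))
    stores
  }
}
```

Calling SamplingSketch.run() and inspecting numCountsStore and sampleStore shows one count per engagement type and one indexed entry per stored sample.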
If you want to reference the sample store (so that you can query it), the name of the store can be found by calling SamplingUtils.getSampleStoreName(sampleName). You can reference the name of the count store by calling SamplingUtils.getNumCountsStoreName(sampleName).
*Note* This method will create the state stores for you.
the type of the SampleValue
returns the key of the sample
returns the value that you want to sample
the size of the sample
the amount of time after creation that a sample should be expired
the name of the sample
the serde for the SampleKey
the serde for the SampleValue
a stream of SampleKey and SampleValue