Container for categorical stats coming from a single group (and therefore a single contingency matrix)
Categories of feature vector columns to exclude from the feature-label correlation matrix (or just the array of feature-label correlations) calculated in SanityChecker.
Settings for feature-feature correlations
Represents a kind of correlation coefficient.
Correlations between features and the label from SanityChecker
names of features
correlation of feature with label
correlations between features
type of correlation done on
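As an illustration of how the feature names and label correlations line up, here is a minimal sketch in plain Scala. It assumes Pearson correlation and in-memory columns; `LabelCorrelationSketch`, `pearson` and `correlationsWithLabel` are hypothetical names used for this example only and are not part of SanityChecker.

```scala
object LabelCorrelationSketch {
  /** Pearson correlation of two equally long sequences; NaN if either one is constant. */
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.length == ys.length, "inputs must be non-empty and of equal length")
    val n = xs.length.toDouble
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
    cov / math.sqrt(varX * varY)
  }

  /** Pairs each feature name with the correlation of its column against the label. */
  def correlationsWithLabel(
    names: Seq[String],
    columns: Seq[Seq[Double]],
    label: Seq[Double]
  ): Seq[(String, Double)] =
    names.zip(columns).map { case (name, col) => (name, pearson(col, label)) }
}
```

The result keeps the order of the input names, so it can be compared position by position with any other per-feature array.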
The MinVarianceFilter checks that computed features have a minimum variance
Like SanityChecker, the Estimator step outputs statistics on incoming data, as well as the names of features which should be dropped from the feature vector. The transformer step then removes the low-variance features from the feature vector.
Two distinctions from SanityChecker: (1) no label column as input; and (2) only filters features by variance
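The estimator/transformer split described above can be sketched in plain Scala as follows. This is not the MinVarianceFilter implementation: the population variance, the strict `<` comparison, and the names `MinVarianceSketch`, `fit`, `transform` and `minVariance` are all assumptions made for illustration.

```scala
object MinVarianceSketch {
  /** Population variance of one feature column (assumed; the library may use another estimator). */
  def variance(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.length
    xs.map(x => (x - mean) * (x - mean)).sum / xs.length
  }

  /** Estimator-like step: names of the columns whose variance is below the threshold. */
  def fit(names: Seq[String], columns: Seq[Seq[Double]], minVariance: Double): Set[String] =
    names.zip(columns).collect { case (name, col) if variance(col) < minVariance => name }.toSet

  /** Transformer-like step: keep only the columns that were not marked for dropping. */
  def transform(
    names: Seq[String],
    columns: Seq[Seq[Double]],
    dropped: Set[String]
  ): Seq[(String, Seq[Double])] =
    names.zip(columns).filterNot { case (name, _) => dropped.contains(name) }
}
```

Note that no label column appears anywhere in the sketch, which matches the first distinction listed above.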
Case class to store metadata from MinVarianceFilter
features dropped by minimum variance filter
stats on features
names of features passed in
Estimator which takes the response feature and the prediction feature as inputs. It deindexes the prediction by using the response's metadata.
Input 1: response feature. Input 2: prediction feature.
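Deindexing can be pictured as a lookup from a numeric prediction back to the original label string. The sketch below assumes the response's metadata has already been reduced to an array of labels ordered by index; `DeindexSketch`, `deindex` and the `"<unseen>"` placeholder are hypothetical and not taken from the library.

```scala
object DeindexSketch {
  /**
   * Maps an indexed prediction back to its original label.
   * `labels` stands in for the label array read from the response's metadata.
   * Indices outside the array map to a placeholder string.
   */
  def deindex(prediction: Double, labels: Array[String], unseen: String = "<unseen>"): String = {
    val idx = prediction.toInt
    if (idx >= 0 && idx < labels.length) labels(idx) else unseen
  }
}
```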
The SanityChecker checks for potential problems with computed features in a supervised learning setting.
There is an Estimator step, which outputs statistics on the incoming data, as well as the names of features which should be dropped from the feature vector. The transformer step then removes the offending features from the feature vector.
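A rough sketch of the two steps, working from precomputed per-column statistics, might look like this. The two drop rules shown (low variance, and a very high absolute correlation with the label as a sign of label leakage) and the thresholds `minVariance` and `maxCorrelation` are assumptions for illustration; the actual SanityChecker applies its own set of checks and parameters.

```scala
object SanityCheckSketch {
  /** Hypothetical per-column statistics gathered by the estimator step. */
  final case class ColumnStats(name: String, variance: Double, corrWithLabel: Double)

  /** Estimator-like step: decide which columns to drop and record why. */
  def reasonsToDrop(
    stats: Seq[ColumnStats],
    minVariance: Double,
    maxCorrelation: Double
  ): Map[String, String] =
    stats.flatMap { s =>
      if (s.variance < minVariance) Some((s.name, "variance too low"))
      else if (math.abs(s.corrWithLabel) > maxCorrelation) Some((s.name, "possible label leakage"))
      else None
    }.toMap

  /** Transformer-like step: remove the flagged columns from a row keyed by feature name. */
  def removeDropped(row: Map[String, Double], dropped: Set[String]): Map[String, Double] =
    row.filter { case (name, _) => !dropped.contains(name) }
}
```

Keeping the decision (which columns, and why) separate from the removal mirrors the split between summary metadata and the transformed feature vector.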
Case class to convert to and from SanityChecker summary metadata
feature correlations with label
features dropped for label leakage
stats on features
names of features passed in
Statistics on features (zip the arrays with the names in SanityCheckerSummary to associate each value with its feature)
count of data in sample used to calculate stats
fraction of total data used in calculation
max value seen
min value
mean value
variance of value
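Because the statistics are stored as parallel arrays, associating them with features is a matter of zipping by position. The sketch below uses a hypothetical `Stats` case class as a stand-in for the real summary type; only the zip-by-index idea comes from the description above.

```scala
object ZipStatsSketch {
  /** Hypothetical stand-in for the per-feature statistics: one array entry per feature. */
  final case class Stats(
    max: Array[Double],
    min: Array[Double],
    mean: Array[Double],
    variance: Array[Double]
  )

  /** Pairs each feature name with its (min, max, mean, variance), matched by array position. */
  def byFeature(names: Array[String], stats: Stats): Map[String, (Double, Double, Double, Double)] =
    names.indices.map { i =>
      (names(i), (stats.min(i), stats.max(i), stats.mean(i), stats.variance(i)))
    }.toMap
}
```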
Container class for statistics calculated from contingency tables constructed from categorical variables
Names of features that we performed categorical tests on
Values of cramersV for each feature (should be the same for everything coming from the same contingency matrix)
Map from label value (as a string) to an Array (over features) of PMI values
Values of MI for each feature (should be the same for everything coming from the same contingency matrix)
Counts of occurrence for categoricals (n x m array of arrays where n = number of labels and m = number of features + 1, with the last element being the occurrence count of labels)
(Since version 3.3.0) Functionality replaced by Array[CategoricalGroupStats]
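Cramér's V and mutual information are both functions of a whole contingency matrix, which is why every feature drawn from the same matrix shares the same value. The following sketch uses the standard textbook formulas; it assumes rows are categorical choices, columns are label values, and mutual information is measured in bits. `ContingencySketch` and its methods are illustrative names, not the library's code.

```scala
object ContingencySketch {
  type Matrix = Array[Array[Double]]

  private def rowSums(m: Matrix): Array[Double] = m.map(_.sum)
  private def colSums(m: Matrix): Array[Double] = m.transpose.map(_.sum)

  /** Cramér's V: sqrt(chi2 / (n * min(rows - 1, cols - 1))). */
  def cramersV(m: Matrix): Double = {
    val rs = rowSums(m)
    val cs = colSums(m)
    val n = rs.sum
    val chi2 = (for {
      i <- m.indices
      j <- m(i).indices
      expected = rs(i) * cs(j) / n
      if expected > 0
    } yield (m(i)(j) - expected) * (m(i)(j) - expected) / expected).sum
    math.sqrt(chi2 / (n * math.min(m.length - 1, cs.length - 1)))
  }

  /** Mutual information in bits; empty cells contribute nothing to the sum. */
  def mutualInfo(m: Matrix): Double = {
    val rs = rowSums(m)
    val cs = colSums(m)
    val n = rs.sum
    (for {
      i <- m.indices
      j <- m(i).indices
      if m(i)(j) > 0
    } yield (m(i)(j) / n) * (math.log(m(i)(j) * n / (rs(i) * cs(j))) / math.log(2))).sum
  }
}
```

A matrix with all counts on the diagonal gives the maximum association, while a matrix with identical rows gives zero for both measures.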
Contains all names for sanity checker metadata
Container for categorical stats coming from a single group (and therefore a single contingency matrix)
Indicator group for this contingency matrix
Array of categorical features belonging to this group
Contingency matrix for this feature group
Matrix of PMI values in Map form (label -> PMI values)
Cramér's V value for this feature group (how strongly it is correlated with the label)
Mutual info value for this feature group
Array (one value per contingency matrix row) containing the largest association rule confidence for that row (over all the labels)
Array (one value per contingency matrix row) containing the supports for each categorical choice (fraction of data in which it is chosen)
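The per-row quantities above can be sketched from a contingency matrix in plain Scala. The sketch assumes rows are categorical choices, columns are label values, every observation falls into exactly one cell, and PMI is measured in bits; `AssociationSketch` and its method names are hypothetical and not taken from the library.

```scala
object AssociationSketch {
  type Matrix = Array[Array[Double]]

  /** Support of each categorical choice: its share of all observations. */
  def supports(m: Matrix): Array[Double] = {
    val n = m.map(_.sum).sum
    m.map(_.sum / n)
  }

  /** Largest rule confidence per row: max over labels of P(label | choice). */
  def maxRuleConfidences(m: Matrix): Array[Double] =
    m.map { row =>
      val total = row.sum
      if (total == 0) 0.0 else row.max / total
    }

  /**
   * Pointwise mutual information in bits, keyed by label name, one value per row.
   * Empty cells are not specially handled in this sketch (log of zero is -Infinity).
   */
  def pmi(m: Matrix, labels: Seq[String]): Map[String, Array[Double]] = {
    val rs = m.map(_.sum)
    val cs = m.transpose.map(_.sum)
    val n = rs.sum
    labels.zipWithIndex.map { case (label, j) =>
      val column = m.indices.map(i => math.log(m(i)(j) * n / (rs(i) * cs(j))) / math.log(2)).toArray
      (label, column)
    }.toMap
  }
}
```

For a row with counts (8, 2) out of 20 observations in total, the support is 0.5 and the largest rule confidence is 0.8.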