Package com.cobber.fta
Class TextAnalysisResult
- Object
-
- TextAnalysisResult
-
public class TextAnalysisResult extends Object
TextAnalysisResult is the result of a TextAnalyzer analysis of a data stream.
-
-
Method Summary
Modifier and Type Method
Description
String asJSON(boolean pretty, int verbose)
A JSON representation of the Analysis.
String asPlugin()
A plugin definition to use to match this type.
long getBlankCount()
Get the count of all blank samples.
Set<String> getBottomK()
Get the bottomK values.
int getCardinality()
Get the cardinality for the current data stream.
NavigableMap<String,Long> getCardinalityDetails()
Get the cardinality details for the current data stream.
double getConfidence()
Confidence in the type classification.
AnalysisConfig getConfig()
Get the configuration associated with this TextAnalysisResult.
String getDataRegExp()
Get the Regular Expression that reflects the non-white space element in the data stream.
String getDataSignature()
A SHA-1 hash that reflects the data stream contents.
com.cobber.fta.dates.DateTimeParser.DateResolutionMode getDateResolutionMode()
Get the DateResolutionMode actually used to process Dates.
char getDecimalSeparator()
Get the Decimal Separator used to interpret Doubles.
long getDistinctCount()
Return the distinct number of valid values in this stream.
Histogram.Entry[] getHistogram(int buckets)
Get the histogram with the supplied number of buckets.
int getInvalidCount()
Get the number of distinct invalid entries for the current data stream.
Map<String,Long> getInvalidDetails()
Get the invalid entry details for the current data stream.
double getKeyConfidence()
Is this field a key?
boolean getLeadingWhiteSpace()
Does the set of elements contain any elements with leading White Space?
long getLeadingZeroCount()
Get the count of all samples with leading zeros (Type long only).
long getMatchCount()
Get the count of all (non-blank/non-null) samples that matched the determined type.
int getMaxLength()
Get the maximum length for Numeric, Boolean and String.
String getMaxValue()
Get the maximum value for Numeric, Boolean and String.
Double getMean()
Get the mean for Numeric types (Long, Double).
int getMinLength()
Get the minimum length for Numeric, Boolean and String.
String getMinValue()
Get the minimum value for Numeric, Boolean and String types.
boolean getMultiline()
Does the set of elements contain any multi-line elements?
String getName()
Name of the data stream being analyzed.
long getNullCount()
Get the count of all null samples.
int getOutlierCount()
Get the number of distinct outliers for the current data stream.
Map<String,Long> getOutlierDetails()
Get the outlier details for the current data stream.
String getRegExp()
Get the Regular Expression that reflects the data stream.
long getSampleCount()
Get the count of all samples observed.
String getSemanticType()
The Semantic Type detected.
int getShapeCount()
Get the number of distinct shapes for the current data stream.
Map<String,Long> getShapeDetails()
Get the shape details for the current data stream.
Double getStandardDeviation()
Get the Standard Deviation for Numeric types (Long, Double).
String getStructureSignature()
A SHA-1 hash that reflects the data stream structure.
Set<String> getTopK()
Get the topK values.
long getTotalBlankCount()
Get the count of all blank elements in the entire data stream (if known).
long getTotalCount()
Get the total number of elements in the entire data stream (if known).
int getTotalMaxLength()
Get the maximum length for Numeric, Boolean and String across the entire data stream (if known).
String getTotalMaxValue()
Get the maximum value for Numeric, Boolean and String across the entire data stream (if known).
Double getTotalMean()
Get the mean for Numeric types (Long, Double) across the entire data stream (if known).
int getTotalMinLength()
Get the minimum length for Numeric, Boolean and String across the entire data stream (if known).
String getTotalMinValue()
Get the minimum value for Numeric, Boolean and String types across the entire data stream (if known).
long getTotalNullCount()
Get the count of all null elements in the entire data stream (if known).
Double getTotalStandardDeviation()
Get the standard deviation for Numeric types (Long, Double) across the entire data stream (if known).
boolean getTrailingWhiteSpace()
Does the set of elements contain any elements with trailing White Space?
FTAType getType()
Get 'Type' as determined by training to date.
String getTypeModifier()
Get the optional Type Modifier (which modifies the Base Type; see getType()). Predefined qualifiers are: Type: BOOLEAN - "TRUE_FALSE", "YES_NO", "Y_N", "ONE_ZERO" Type: STRING - "BLANK", "BLANKORNULL", "NULL" Type: LONG - "GROUPING", "SIGNED", "SIGNED_TRAILING".
double getUniqueness()
How unique is this field, i.e. the number of elements in the set with a cardinality of one / cardinality.
String getValueAtQuantile(double quantile)
Get the value at the requested quantile.
String[] getValuesAtQuantiles(double[] quantiles)
Get the values at the requested quantiles.
boolean isSemanticType()
Is this a Semantic Type?
boolean statisticsEnabled()
Was statistics collection enabled for this analysis?
String toString()
A String representation of the Analysis.
-
-
-
Method Detail
-
getName
public String getName()
Name of the data stream being analyzed.
Returns:
- Name of data stream.
-
getConfig
public AnalysisConfig getConfig()
Get the configuration associated with this TextAnalysisResult.
Returns:
- The AnalysisConfig of the TextAnalysisResult.
-
getConfidence
public double getConfidence()
Confidence in the type classification. Typically this will be the number of matches divided by the number of real samples, where a real sample is one that is neither null nor blank.
Returns:
- Confidence as a percentage.
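The typical calculation described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `ConfidenceSketch` and its `confidence` method are hypothetical names, and returning 0.0 when there are no real samples is an assumption.

```java
// Hypothetical helper for illustration only; not part of the FTA API.
final class ConfidenceSketch {
    /**
     * Matches divided by real samples, where real samples exclude nulls and blanks.
     * Assumption: returns 0.0 when there are no real samples.
     */
    static double confidence(long matchCount, long sampleCount, long nullCount, long blankCount) {
        long realSamples = sampleCount - nullCount - blankCount;
        return realSamples <= 0 ? 0.0 : (double) matchCount / realSamples;
    }

    public static void main(String[] args) {
        // 100 samples of which 5 are null and 5 are blank; 81 of the remaining 90 matched.
        System.out.println(confidence(81, 100, 5, 5));
    }
}
```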
-
getType
public FTAType getType()
Get 'Type' as determined by training to date.
Returns:
- The Type of the data stream.
-
getTypeModifier
public String getTypeModifier()
Get the optional Type Modifier (which modifies the Base Type; see getType()).
Predefined qualifiers are:
- Type: BOOLEAN - "TRUE_FALSE", "YES_NO", "Y_N", "ONE_ZERO"
- Type: STRING - "BLANK", "BLANKORNULL", "NULL"
- Type: LONG - "GROUPING", "SIGNED", "SIGNED_TRAILING". Note: "GROUPING" and "SIGNED" are independent and can both be present.
- Type: DOUBLE - "GROUPING", "SIGNED", "SIGNED_TRAILING", "NON_LOCALIZED". Note: "GROUPING" and "SIGNED" are independent and can both be present.
- Type: DATE, TIME, DATETIME, ZONEDDATETIME, OFFSETDATETIME - The qualifier is the detailed date format string.
Returns:
- The Type Modifier for the Type.
-
isSemanticType
public boolean isSemanticType()
Is this a Semantic Type?
Returns:
- True if this is a Semantic Type.
-
getSemanticType
public String getSemanticType()
The Semantic Type detected. Note: The Semantic Types detected are based on the set of plugins installed. For example: If the Month Abbreviation plugin is installed, the Base Type will be STRING, and the Semantic Type will be "MONTHABBR".
Returns:
- The Semantic Type detected - only valid if isSemanticType() is true.
-
getMinValue
public String getMinValue()
Get the minimum value for Numeric, Boolean and String types.
Returns:
- The minimum value as a String.
-
getMaxValue
public String getMaxValue()
Get the maximum value for Numeric, Boolean and String.
Returns:
- The maximum value as a String.
-
getMinLength
public int getMinLength()
Get the minimum length for Numeric, Boolean and String. Note: For String and Boolean types this length includes any whitespace.
Returns:
- The minimum length.
-
getMaxLength
public int getMaxLength()
Get the maximum length for Numeric, Boolean and String. Note: For String and Boolean types this length includes any whitespace.
Returns:
- The maximum length.
-
getDecimalSeparator
public char getDecimalSeparator()
Get the Decimal Separator used to interpret Doubles. Note: This will either be the Decimal Separator as per the locale or possibly a period.
Returns:
- The Decimal Separator.
-
getDateResolutionMode
public com.cobber.fta.dates.DateTimeParser.DateResolutionMode getDateResolutionMode()
Get the DateResolutionMode actually used to process Dates.
Returns:
- The DateResolution mode used to process Dates.
-
getMean
public Double getMean()
Get the mean for Numeric types (Long, Double).
Returns:
- The mean.
-
getStandardDeviation
public Double getStandardDeviation()
Get the Standard Deviation for Numeric types (Long, Double).
Returns:
- The Standard Deviation.
-
getValueAtQuantile
public String getValueAtQuantile(double quantile)
Get the value at the requested quantile.
Parameters:
quantile - a double between 0.0 and 1.0 (both included)
Returns:
- the value at the specified quantile
-
getValuesAtQuantiles
public String[] getValuesAtQuantiles(double[] quantiles)
Get the values at the requested quantiles. Note: The input array must be ordered.
Parameters:
quantiles - an array of doubles between 0.0 and 1.0 (both included)
Returns:
- the values at the specified quantiles
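Because the input array must be ordered and every entry must lie between 0.0 and 1.0, a caller may want to check the array before passing it in. The sketch below shows one way to do that. `QuantileArgsSketch` is a hypothetical helper, and it assumes "ordered" means ascending, which the description above does not state explicitly.

```java
// Hypothetical pre-check for illustration only; not part of the FTA API.
final class QuantileArgsSketch {
    /**
     * True if every quantile is within [0.0, 1.0] and the array is in
     * ascending order (assumption: "ordered" means ascending).
     */
    static boolean isValidQuantileArray(double[] quantiles) {
        double previous = 0.0;
        for (double q : quantiles) {
            if (Double.isNaN(q) || q < 0.0 || q > 1.0 || q < previous) {
                return false;
            }
            previous = q;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidQuantileArray(new double[] {0.25, 0.5, 0.75}));
        System.out.println(isValidQuantileArray(new double[] {0.5, 0.25}));
    }
}
```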
-
getHistogram
public Histogram.Entry[] getHistogram(int buckets)
Get the histogram with the supplied number of buckets.
Parameters:
buckets - the number of buckets in the Histogram
Returns:
- An array of length 'buckets' that constitutes the Histogram
-
getTopK
public Set<String> getTopK()
Get the topK values.
Returns:
- The top K values (default: 10).
-
getBottomK
public Set<String> getBottomK()
Get the bottomK values.
Returns:
- The bottom K values (default: 10).
-
getRegExp
public String getRegExp()
Get the Regular Expression that reflects the data stream. All valid inputs should match this Regular Expression, however in some instances, not all inputs that match this RE are necessarily valid. For example, 28/13/2017 will match the RE (\d{2}/\d{2}/\d{4}) however this is not a valid date with pattern dd/MM/yyyy (there is no 13th month).
Returns:
- The Regular Expression.
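The gap between matching the Regular Expression and being valid can be demonstrated with the standard library alone. This is a minimal sketch using the example from the description: `RegExpVersusValiditySketch` is a hypothetical class, and java.time parsing stands in for whatever validation the analyzer actually performs.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

// Hypothetical demonstration only; not part of the FTA API.
final class RegExpVersusValiditySketch {
    private static final Pattern SHAPE = Pattern.compile("\\d{2}/\\d{2}/\\d{4}");
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    /** True if the input has the shape described by the Regular Expression. */
    static boolean matchesShape(String input) {
        return SHAPE.matcher(input).matches();
    }

    /** True if java.time can parse the input using the dd/MM/yyyy pattern. */
    static boolean isValidDate(String input) {
        try {
            LocalDate.parse(input, FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String input = "28/13/2017";
        // The shape matches, but month 13 is rejected by the date parser.
        System.out.println(matchesShape(input) + " " + isValidDate(input));
    }
}
```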
-
getDataRegExp
public String getDataRegExp()
Get the Regular Expression that reflects the non-white space element in the data stream. For example, if a stream contains ' hello' and 'world ' this would return '(?i)(HELLO|WORLD)'.
Returns:
- The Regular Expression reflecting the non-white space data.
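To see how the example expression relates to the raw samples, the sketch below trims each sample before matching. `DataRegExpSketch` is a hypothetical helper. It only checks the example values from the description against the quoted expression and does not reproduce how the expression is derived.

```java
import java.util.regex.Pattern;

// Hypothetical demonstration only; not part of the FTA API.
final class DataRegExpSketch {
    /** True if the sample, with surrounding white space removed, matches the expression. */
    static boolean matchesTrimmed(String regExp, String sample) {
        return Pattern.compile(regExp).matcher(sample.trim()).matches();
    }

    public static void main(String[] args) {
        String dataRegExp = "(?i)(HELLO|WORLD)";
        System.out.println(matchesTrimmed(dataRegExp, " hello"));
        System.out.println(matchesTrimmed(dataRegExp, "world "));
    }
}
```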
-
getMatchCount
public long getMatchCount()
Get the count of all (non-blank/non-null) samples that matched the determined type. More formally, the SampleCount is equal to the MatchCount + BlankCount + NullCount.
Returns:
- Count of all matches.
-
getLeadingWhiteSpace
public boolean getLeadingWhiteSpace()
Does the set of elements contain any elements with leading White Space?
Returns:
- True if any elements matched have leading White Space.
-
getTrailingWhiteSpace
public boolean getTrailingWhiteSpace()
Does the set of elements contain any elements with trailing White Space?
Returns:
- True if any elements matched have trailing White Space.
-
getMultiline
public boolean getMultiline()
Does the set of elements contain any multi-line elements?
Returns:
- True if any elements matched are multi-line.
-
getTotalCount
public long getTotalCount()
Get the total number of elements in the entire data stream (if known).
Returns:
- total number of elements in the entire data stream (-1 if not known).
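Since -1 signals "not known", callers need to test for the sentinel before using the value in arithmetic. One way to make that explicit is to wrap the value in an OptionalLong, as sketched below. `TotalSketch` and `known` are hypothetical names and not part of the library.

```java
import java.util.OptionalLong;

// Hypothetical wrapper for illustration only; not part of the FTA API.
final class TotalSketch {
    /** Converts the -1 "not known" sentinel into an empty OptionalLong. */
    static OptionalLong known(long total) {
        return total == -1 ? OptionalLong.empty() : OptionalLong.of(total);
    }

    public static void main(String[] args) {
        System.out.println(known(-1).isPresent());
        System.out.println(known(1_000_000L).getAsLong());
    }
}
```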
-
getTotalNullCount
public long getTotalNullCount()
Get the count of all null elements in the entire data stream (if known). Use getNullCount() for the equivalent on the sample set.
Returns:
- Count of all null elements in the entire data stream (-1 if not known).
-
getTotalBlankCount
public long getTotalBlankCount()
Get the count of all blank elements in the entire data stream (if known). Note: any number of spaces (including zero) is Blank. Use getBlankCount() for the equivalent on the sample set.
Returns:
- Count of all blank samples in the entire data stream (-1 if not known).
-
getTotalMean
public Double getTotalMean()
Get the mean for Numeric types (Long, Double) across the entire data stream (if known). Use getMean() for the equivalent on the sample set.
Returns:
- The mean across the entire data stream (null if not known).
-
getTotalStandardDeviation
public Double getTotalStandardDeviation()
Get the standard deviation for Numeric types (Long, Double) across the entire data stream (if known). Use getStandardDeviation() for the equivalent on the sample set.
Returns:
- The Standard Deviation across the entire data stream (null if not known).
-
getTotalMinValue
public String getTotalMinValue()
Get the minimum value for Numeric, Boolean and String types across the entire data stream (if known). Use getMinValue() for the equivalent on the sample set.
Returns:
- The minimum value as a String (null if not known).
-
getTotalMaxValue
public String getTotalMaxValue()
Get the maximum value for Numeric, Boolean and String across the entire data stream (if known). Use getMaxValue() for the equivalent on the sample set.
Returns:
- The maximum value as a String (null if not known).
-
getTotalMinLength
public int getTotalMinLength()
Get the minimum length for Numeric, Boolean and String across the entire data stream (if known). Note: For String and Boolean types this length includes any whitespace. Use getMinLength() for the equivalent on the sample set.
Returns:
- The minimum length in the entire Data Stream (-1 if not known).
-
getTotalMaxLength
public int getTotalMaxLength()
Get the maximum length for Numeric, Boolean and String across the entire data stream (if known). Note: For String and Boolean types this length includes any whitespace. Use getMaxLength() for the equivalent on the sample set.
Returns:
- The maximum length in the entire Data Stream (-1 if not known).
-
getSampleCount
public long getSampleCount()
Get the count of all samples observed.
Returns:
- Count of all samples observed.
-
getNullCount
public long getNullCount()
Get the count of all null samples.
Returns:
- Count of all null samples.
-
getBlankCount
public long getBlankCount()
Get the count of all blank samples. Note: any number of spaces (including zero) is Blank.
Returns:
- Count of all blank samples.
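The definition of Blank in the note can be expressed as a small predicate. This is a sketch under stated assumptions: `BlankSketch` is a hypothetical helper, only the space character is treated as blank because the note mentions only spaces, and null is excluded because nulls are counted separately.

```java
// Hypothetical predicate for illustration only; not part of the FTA API.
final class BlankSketch {
    /**
     * True for the empty string and for strings made up solely of spaces.
     * Assumption: other white space such as tabs is not treated as blank here.
     */
    static boolean isBlank(String sample) {
        if (sample == null) {
            return false; // nulls are counted separately from blanks
        }
        for (int i = 0; i < sample.length(); i++) {
            if (sample.charAt(i) != ' ') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isBlank(""));
        System.out.println(isBlank("   "));
        System.out.println(isBlank(" a "));
    }
}
```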
-
getLeadingZeroCount
public long getLeadingZeroCount()
Get the count of all samples with leading zeros (Type long only). Note: a single '0' does not constitute a sample with a leading zero.
Returns:
- Count of all leading zero samples.
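The rule in the note, that a lone '0' is not a leading zero, can be sketched as a predicate. `LeadingZeroSketch` is a hypothetical helper, and it assumes the sample is already trimmed and unsigned, which the description above does not specify.

```java
// Hypothetical predicate for illustration only; not part of the FTA API.
final class LeadingZeroSketch {
    /**
     * True if the sample starts with '0' and has more than one character.
     * Assumption: the sample is trimmed and carries no sign.
     */
    static boolean hasLeadingZero(String sample) {
        return sample.length() > 1 && sample.charAt(0) == '0';
    }

    public static void main(String[] args) {
        System.out.println(hasLeadingZero("007"));
        System.out.println(hasLeadingZero("0"));
    }
}
```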
-
getCardinality
public int getCardinality()
Get the cardinality for the current data stream. See the setMaxCardinality() method in TextAnalyzer. Note: The cardinality returned is the cardinality of the valid samples. For example, if a date is invalid it will not be included in the cardinality. Note: This is not a complete cardinality analysis unless the cardinality of the data stream is less than the maximum cardinality (Default: 12000).
Returns:
- The cardinality of the valid samples.
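The effect of a maximum cardinality can be illustrated with a small capped counter. This is a sketch of the general idea and not the analyzer's implementation: `CappedCardinalitySketch` and all of its members are hypothetical, and dropping values that arrive once the cap is reached is an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical capped counter for illustration only; not part of the FTA API.
final class CappedCardinalitySketch {
    private final int maxCardinality;
    private final Map<String, Long> counts = new TreeMap<>();

    CappedCardinalitySketch(int maxCardinality) {
        this.maxCardinality = maxCardinality;
    }

    /** Counts the value if it is already tracked or if there is still room. */
    void observe(String value) {
        Long current = counts.get(value);
        if (current != null) {
            counts.put(value, current + 1);
        } else if (counts.size() < maxCardinality) {
            counts.put(value, 1L);
        }
        // Assumption: new values arriving after the cap is reached are dropped.
    }

    int cardinality() {
        return counts.size();
    }

    /** The analysis is only known to be complete while below the cap. */
    boolean isComplete() {
        return counts.size() < maxCardinality;
    }
}
```

With a cap of 2, observing "a", "b", "a", "c" leaves a cardinality of 2 and `isComplete()` false, so the reported cardinality should be read as a lower bound.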
-
getCardinalityDetails
public NavigableMap<String,Long> getCardinalityDetails()
Get the cardinality details for the current data stream. This is a Map of Strings and the count of occurrences.
Returns:
- A Map of values and their occurrence frequency of the data stream to date.
-
getOutlierCount
public int getOutlierCount()
Get the number of distinct outliers for the current data stream. See the setMaxOutliers() method in TextAnalyzer. Note: This is not a complete outlier analysis unless the outlier count of the data stream is less than the maximum outlier count (Default: 50).
Returns:
- Count of the distinct outliers.
-
getOutlierDetails
public Map<String,Long> getOutlierDetails()
Get the outlier details for the current data stream. This is a Map of Strings and the count of occurrences.
Returns:
- A Map of values and their occurrence frequency of the data stream to date.
-
getInvalidCount
public int getInvalidCount()
Get the number of distinct invalid entries for the current data stream. See the setMaxOutliers() method in TextAnalyzer. Note: This is not a complete invalid analysis unless the invalid count of the data stream is less than the maximum invalid count (Default: 50).
Returns:
- Count of the distinct invalid entries.
-
getInvalidDetails
public Map<String,Long> getInvalidDetails()
Get the invalid entry details for the current data stream. This is a Map of Strings and the count of occurrences.
Returns:
- A Map of values and their occurrence frequency of the data stream to date.
-
getShapeCount
public int getShapeCount()
Get the number of distinct shapes for the current data stream. Note: This is not a complete shape analysis unless the shape count of the data stream is less than the maximum shape count (Default: 400).
Returns:
- Count of the distinct shapes.
-
getShapeDetails
public Map<String,Long> getShapeDetails()
Get the shape details for the current data stream. This is a Map of Strings and the count of occurrences.
Returns:
- A Map of shapes and their occurrence frequency of the data stream to date.
-
getKeyConfidence
public double getKeyConfidence()
Is this field a key?
Returns:
- A Double (0.0 ... 1.0) representing our confidence that this field is a key.
-
getUniqueness
public double getUniqueness()
How unique is this field, i.e. the number of elements in the set with a cardinality of one / cardinality. Note: Only supported if the cardinality presented is less than Max Cardinality.
Returns:
- A Double (0.0 ... 1.0) representing the uniqueness of this field.
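The ratio described above can be computed from a map of values to occurrence counts, such as the one returned by getCardinalityDetails(). The following is a minimal sketch: `UniquenessSketch` is a hypothetical helper, and returning 0.0 for an empty map is an assumption.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of the FTA API.
final class UniquenessSketch {
    /**
     * Number of values seen exactly once divided by the number of distinct values.
     * Assumption: an empty map yields 0.0.
     */
    static double uniqueness(Map<String, Long> cardinalityDetails) {
        if (cardinalityDetails.isEmpty()) {
            return 0.0;
        }
        long seenOnce = cardinalityDetails.values().stream()
                .filter(count -> count == 1L)
                .count();
        return (double) seenOnce / cardinalityDetails.size();
    }

    public static void main(String[] args) {
        // Two of the four distinct values occur exactly once.
        System.out.println(uniqueness(Map.of("a", 1L, "b", 1L, "c", 5L, "d", 2L)));
    }
}
```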
-
getDistinctCount
public long getDistinctCount()
Return the distinct number of valid values in this stream. Note: Typically only supported if the cardinality presented is less than Max Cardinality. May be set by an external source.
Returns:
- A long with the number of distinct values in this stream or -1 if unknown.
-
statisticsEnabled
public boolean statisticsEnabled()
Was statistics collection enabled for this analysis?
Returns:
- True if statistics were collected.
-
toString
public String toString()
A String representation of the Analysis.
Overrides:
toString in class Object
Returns:
- A String representation of the analysis to date.
-
getStructureSignature
public String getStructureSignature()
A SHA-1 hash that reflects the data stream structure. Note: If a Semantic type is detected then the SHA-1 hash will reflect this.
Returns:
- A String SHA-1 hash that reflects the structure of the data stream.
-
getDataSignature
public String getDataSignature()
A SHA-1 hash that reflects the data stream contents. Note: The order of the data stream is not considered.
Returns:
- A String SHA-1 hash that reflects the data stream contents.
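One way to obtain a hash that ignores the order of the data is to sort the values before digesting them. The sketch below illustrates that idea only. It is not the algorithm the library uses: `OrderIndependentSignatureSketch` is hypothetical, and the sort step, the zero-byte separator, and the hexadecimal encoding are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not the signature algorithm used by FTA.
final class OrderIndependentSignatureSketch {
    /** SHA-1 over the sorted values, so input order does not affect the result. */
    static String signature(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            for (String value : sorted) {
                sha1.update(value.getBytes(StandardCharsets.UTF_8));
                sha1.update((byte) 0); // separator so ["ab", "c"] differs from ["a", "bc"]
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    public static void main(String[] args) {
        // The same values in a different order produce the same signature.
        System.out.println(signature(List.of("a", "b")).equals(signature(List.of("b", "a"))));
    }
}
```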
-
asPlugin
public String asPlugin()
A plugin definition to use to match this type.
Returns:
- A JSON representation of the analysis.
-
asJSON
public String asJSON(boolean pretty, int verbose)
A JSON representation of the Analysis.
Parameters:
pretty - If set, add minimal whitespace formatting.
verbose - If > 0 provides additional details on the core, Outlier, and Shapes sets. A value of 1 will output the first 100 elements, a value > 1 will output the full set.
Returns:
- A JSON representation of the analysis.
-
-