Class TextAnalysisResult


  • public class TextAnalysisResult
    extends Object
    TextAnalysisResult is the result of a TextAnalyzer analysis of a data stream.
    • Method Summary

      Modifier and Type Method Description
      String asJSON(boolean pretty, int verbose)
      A JSON representation of the Analysis.
      String asPlugin()
      A plugin definition to use to match this type.
      long getBlankCount()
      Get the count of all blank samples.
      Set<String> getBottomK()
      Get the bottomK values.
      int getCardinality()
      Get the cardinality for the current data stream.
      NavigableMap<String, Long> getCardinalityDetails()
      Get the cardinality details for the current data stream.
      double getConfidence()
      Confidence in the type classification.
      AnalysisConfig getConfig()
      Get the configuration associated with this TextAnalysisResult.
      String getDataRegExp()
      Get the Regular Expression that reflects the non-white space element in the data stream.
      String getDataSignature()
      A SHA-1 hash that reflects the data stream contents.
      com.cobber.fta.dates.DateTimeParser.DateResolutionMode getDateResolutionMode()
      Get the DateResolutionMode actually used to process Dates.
      char getDecimalSeparator()
      Get the Decimal Separator used to interpret Doubles.
      long getDistinctCount()
      Return the distinct number of valid values in this stream.
      Histogram.Entry[] getHistogram(int buckets)
      Get the histogram with the supplied number of buckets.
      int getInvalidCount()
      Get the number of distinct invalid entries for the current data stream.
      Map<String, Long> getInvalidDetails()
      Get the invalid entry details for the current data stream.
      double getKeyConfidence()
      Is this field a key?
      boolean getLeadingWhiteSpace()
      Does the set of elements contain any elements with leading White Space?
      long getLeadingZeroCount()
      Get the count of all samples with leading zeros (Type long only).
      long getMatchCount()
      Get the count of all (non-blank/non-null/non-outlier/non-invalid) samples that matched the determined type.
      int getMaxLength()
      Get the maximum length for Numeric, Boolean and String.
      String getMaxValue()
      Get the maximum value for Numeric, Boolean and String.
      Double getMean()
      Get the mean for Numeric types (Long, Double).
      int getMinLength()
      Get the minimum length for Numeric, Boolean and String.
      String getMinValue()
      Get the minimum value for Numeric, Boolean and String types.
      boolean getMultiline()
      Does the set of elements contain any multi-line elements?
      String getName()
      Name of the data stream being analyzed.
      long getNullCount()
      Get the count of all null samples.
      int getOutlierCount()
      Get the number of distinct outliers for the current data stream.
      Map<String, Long> getOutlierDetails()
      Get the outlier details for the current data stream.
      String getRegExp()
      Get the Regular Expression that reflects the data stream.
      long getSampleCount()
      Get the count of all samples observed.
      String getSemanticType()
      The Semantic Type detected.
      int getShapeCount()
      Get the number of distinct shapes for the current data stream.
      Map<String, Long> getShapeDetails()
      Get the shape details for the current data stream.
      Double getStandardDeviation()
      Get the Standard Deviation for Numeric types (Long, Double).
      String getStructureSignature()
      A SHA-1 hash that reflects the data stream structure.
      Set<String> getTopK()
      Get the topK values.
      long getTotalBlankCount()
      Get the count of all blank elements in the entire data stream (if known).
      long getTotalCount()
      Get the total number of elements in the entire data stream (if known).
      int getTotalMaxLength()
      Get the maximum length for Numeric, Boolean and String across the entire data stream (if known).
      String getTotalMaxValue()
      Get the maximum value for Numeric, Boolean and String across the entire data stream (if known).
      Double getTotalMean()
      Get the mean for Numeric types (Long, Double) across the entire data stream (if known).
      int getTotalMinLength()
      Get the minimum length for Numeric, Boolean and String across the entire data stream (if known).
      String getTotalMinValue()
      Get the minimum value for Numeric, Boolean and String types across the entire data stream (if known).
      long getTotalNullCount()
      Get the count of all null elements in the entire data stream (if known).
      Double getTotalStandardDeviation()
      Get the standard deviation for Numeric types (Long, Double) across the entire data stream (if known).
      boolean getTrailingWhiteSpace()
      Does the set of elements contain any elements with trailing White Space?
      FTAType getType()
      Get 'Type' as determined by training to date.
      String getTypeModifier()
      Get the optional Type Modifier, which modifies the Base Type (see getType()).
      double getUniqueness()
      How unique is this field, i.e. the number of elements with a cardinality of one divided by the cardinality.
      String getValueAtQuantile(double quantile)
      Get the value at the requested quantile.
      String[] getValuesAtQuantiles(double[] quantiles)
      Get the values at the requested quantiles.
      boolean isSemanticType()
      Is this a Semantic Type?
      boolean statisticsEnabled()
      Was statistics collection enabled for this analysis?
      String toString()
      A String representation of the Analysis.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
    • Method Detail

      • getName

        public String getName()
        Name of the data stream being analyzed.
        Returns:
        Name of data stream.
      • getConfig

        public AnalysisConfig getConfig()
        Get the configuration associated with this TextAnalysisResult.
        Returns:
        The AnalysisConfig of the TextAnalysisResult.
      • getConfidence

        public double getConfidence()
        Confidence in the type classification. Typically this will be the number of matches divided by the number of real samples, where a real sample is one that is neither null nor blank.
        Returns:
        Confidence as a percentage.
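        The ratio described above can be sketched in plain Java. This is a minimal illustration, not the library's implementation: the class and method names are hypothetical, the inputs stand in for the values of getMatchCount(), getSampleCount(), getNullCount() and getBlankCount(), and returning 0.0 when there are no real samples is an assumption.

```java
final class ConfidenceSketch {
    // matches / (samples - nulls - blanks).
    // Assumption: 0.0 is returned when no real samples exist.
    static double confidence(long matchCount, long sampleCount, long nullCount, long blankCount) {
        long realSamples = sampleCount - nullCount - blankCount;
        if (realSamples <= 0) {
            return 0.0;
        }
        return (double) matchCount / realSamples;
    }

    public static void main(String[] args) {
        // 1000 samples, 50 nulls, 50 blanks, 855 matches: 855 / 900
        System.out.println(confidence(855, 1000, 50, 50));
    }
}
```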
      • getType

        public FTAType getType()
        Get 'Type' as determined by training to date.
        Returns:
        The Type of the data stream.
      • getTypeModifier

        public String getTypeModifier()
        Get the optional Type Modifier, which modifies the Base Type (see getType()). Predefined qualifiers are:
        • Type: BOOLEAN - "TRUE_FALSE", "YES_NO", "Y_N", "ONE_ZERO"
        • Type: STRING - "BLANK", "BLANKORNULL", "NULL"
        • Type: LONG - "GROUPING", "SIGNED", "SIGNED_TRAILING". Note: "GROUPING" and "SIGNED" are independent and can both be present.
        • Type: DOUBLE - "GROUPING", "SIGNED", "SIGNED_TRAILING", "NON_LOCALIZED". Note: "GROUPING" and "SIGNED" are independent and can both be present.
        • Type: DATE, TIME, DATETIME, ZONEDDATETIME, OFFSETDATETIME - The qualifier is the detailed date format string
        Note: Boolean TRUE_FALSE is not localized, i.e. it is only detected if the field contains the literal values true/false.
        Returns:
        The Type Modifier for the Type.
      • isSemanticType

        public boolean isSemanticType()
        Is this a Semantic Type?
        Returns:
        True if this is a Semantic Type.
      • getSemanticType

        public String getSemanticType()
        The Semantic Type detected. Note: The Semantic Types detected depend on the set of plugins that are installed. For example: If the Month Abbreviation plugin is installed, the Base Type will be STRING, and the Semantic Type will be "MONTHABBR".
        Returns:
        The Semantic Type detected - only valid if isSemanticType() is true.
      • getMinValue

        public String getMinValue()
        Get the minimum value for Numeric, Boolean and String types.
        Returns:
        The minimum value as a String.
      • getMaxValue

        public String getMaxValue()
        Get the maximum value for Numeric, Boolean and String.
        Returns:
        The maximum value as a String.
      • getMinLength

        public int getMinLength()
        Get the minimum length for Numeric, Boolean and String. Note: For String and Boolean types this length includes any whitespace.
        Returns:
        The minimum length.
      • getMaxLength

        public int getMaxLength()
        Get the maximum length for Numeric, Boolean and String. Note: For String and Boolean types this length includes any whitespace.
        Returns:
        The maximum length.
      • getDecimalSeparator

        public char getDecimalSeparator()
        Get the Decimal Separator used to interpret Doubles. Note: This will either be the Decimal Separator as per the locale or possibly a period.
        Returns:
        The Decimal Separator.
      • getDateResolutionMode

        public com.cobber.fta.dates.DateTimeParser.DateResolutionMode getDateResolutionMode()
        Get the DateResolutionMode actually used to process Dates.
        Returns:
        The DateResolution mode used to process Dates.
      • getMean

        public Double getMean()
        Get the mean for Numeric types (Long, Double).
        Returns:
        The mean.
      • getStandardDeviation

        public Double getStandardDeviation()
        Get the Standard Deviation for Numeric types (Long, Double).
        Returns:
        The Standard Deviation.
      • getValueAtQuantile

        public String getValueAtQuantile(double quantile)
        Get the value at the requested quantile.
        Parameters:
        quantile - a double between 0.0 and 1.0 (both included)
        Returns:
        the value at the specified quantile
      • getValuesAtQuantiles

        public String[] getValuesAtQuantiles(double[] quantiles)
        Get the values at the requested quantiles. Note: The input array must be ordered.
        Parameters:
        quantiles - an array of doubles between 0.0 and 1.0 (both included)
        Returns:
        the values at the specified quantiles
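        Because the input array must be ordered and each entry must lie between 0.0 and 1.0, a caller may want to normalize the argument first. The helper below is a hypothetical sketch of such a pre-check (it is not part of TextAnalysisResult); its result could then be passed to getValuesAtQuantiles().

```java
import java.util.Arrays;

final class QuantileArgsSketch {
    // Returns a sorted copy of the requested quantiles, rejecting any value
    // outside [0.0, 1.0]. Hypothetical helper, not part of the library.
    static double[] orderedQuantiles(double... quantiles) {
        double[] sorted = quantiles.clone();
        Arrays.sort(sorted);
        for (double q : sorted) {
            if (Double.isNaN(q) || q < 0.0 || q > 1.0) {
                throw new IllegalArgumentException("quantile out of range [0.0, 1.0]: " + q);
            }
        }
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(orderedQuantiles(0.75, 0.25, 0.5)));
    }
}
```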
      • getHistogram

        public Histogram.Entry[] getHistogram(int buckets)
        Get the histogram with the supplied number of buckets.
        Parameters:
        buckets - the number of buckets in the Histogram
        Returns:
        An array of length 'buckets' that constitutes the Histogram
      • getTopK

        public Set<String> getTopK()
        Get the topK values.
        Returns:
        The top K values (default: 10).
      • getBottomK

        public Set<String> getBottomK()
        Get the bottomK values.
        Returns:
        The bottom K values (default: 10).
      • getRegExp

        public String getRegExp()
        Get the Regular Expression that reflects the data stream. All valid inputs should match this Regular Expression, however in some instances, not all inputs that match this RE are necessarily valid. For example, 28/13/2017 will match the RE (\d{2}/\d{2}/\d{4}) however this is not a valid date with pattern dd/MM/yyyy (there is no 13th month).
        Returns:
        The Regular Expression.
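        The 28/13/2017 example can be reproduced with the JDK alone. The sketch below uses java.util.regex and java.time only (no library classes); the class and method names are illustrative. It shows that matching the Regular Expression is necessary but not sufficient for validity.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

final class RegExpVersusValiditySketch {
    private static final Pattern SHAPE = Pattern.compile("\\d{2}/\\d{2}/\\d{4}");
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    // True if the input has the shape described by the Regular Expression.
    static boolean matchesRegExp(String input) {
        return SHAPE.matcher(input).matches();
    }

    // True only if the input is also a real date under dd/MM/yyyy.
    static boolean isValidDate(String input) {
        try {
            LocalDate.parse(input, FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String input = "28/13/2017";
        System.out.println(matchesRegExp(input) + " " + isValidDate(input));
    }
}
```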
      • getDataRegExp

        public String getDataRegExp()
        Get the Regular Expression that reflects the non-white space element in the data stream. For example, if a stream contains ' hello' and 'world ' this would return '(?i)(HELLO|WORLD)'.
        Returns:
        The Regular Expression reflecting the non-white space data.
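        As a small check of the example above, the following sketch applies the quoted expression '(?i)(HELLO|WORLD)' to the two sample values. Trimming the input before matching is an assumption about how a caller might use the expression; the names are illustrative.

```java
import java.util.regex.Pattern;

final class DataRegExpSketch {
    private static final Pattern DATA = Pattern.compile("(?i)(HELLO|WORLD)");

    // Matches the non-white space portion of the value (the value is trimmed first).
    static boolean matchesData(String raw) {
        return DATA.matcher(raw.trim()).matches();
    }

    // Matches the value exactly as it appeared, surrounding spaces included.
    static boolean matchesRaw(String raw) {
        return DATA.matcher(raw).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesData(" hello") + " " + matchesData("world ") + " " + matchesRaw(" hello"));
    }
}
```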
      • getMatchCount

        public long getMatchCount()
        Get the count of all (non-blank/non-null/non-outlier/non-invalid) samples that matched the determined type.
        Returns:
        Count of all matches.
      • getLeadingWhiteSpace

        public boolean getLeadingWhiteSpace()
        Does the set of elements contain any elements with leading White Space?
        Returns:
        True if any elements matched have leading White Space.
      • getTrailingWhiteSpace

        public boolean getTrailingWhiteSpace()
        Does the set of elements contain any elements with trailing White Space?
        Returns:
        True if any elements matched have trailing White Space.
      • getMultiline

        public boolean getMultiline()
        Does the set of elements contain any multi-line elements?
        Returns:
        True if any elements matched are multi-line.
      • getTotalCount

        public long getTotalCount()
        Get the total number of elements in the entire data stream (if known).
        Returns:
        total number of elements in the entire data stream (-1 if not known).
      • getTotalNullCount

        public long getTotalNullCount()
        Get the count of all null elements in the entire data stream (if known). Use getNullCount() for the equivalent on the sample set.
        Returns:
        Count of all null elements in the entire data stream (-1 if not known).
      • getTotalBlankCount

        public long getTotalBlankCount()
        Get the count of all blank elements in the entire data stream (if known). Note: an element consisting of any number of spaces (including zero) is Blank. Use getBlankCount() for the equivalent on the sample set.
        Returns:
        Count of all blank samples in the entire data stream (-1 if not known).
      • getTotalMean

        public Double getTotalMean()
        Get the mean for Numeric types (Long, Double) across the entire data stream (if known). Use getMean() for the equivalent on the sample set.
        Returns:
        The mean across the entire data stream (null if not known).
      • getTotalStandardDeviation

        public Double getTotalStandardDeviation()
        Get the standard deviation for Numeric types (Long, Double) across the entire data stream (if known). Use getStandardDeviation() for the equivalent on the sample set.
        Returns:
        The Standard Deviation across the entire data stream (null if not known).
      • getTotalMinValue

        public String getTotalMinValue()
        Get the minimum value for Numeric, Boolean and String types across the entire data stream (if known). Use getMinValue() for the equivalent on the sample set.
        Returns:
        The minimum value as a String (null if not known).
      • getTotalMaxValue

        public String getTotalMaxValue()
        Get the maximum value for Numeric, Boolean and String across the entire data stream (if known). Use getMaxValue() for the equivalent on the sample set.
        Returns:
        The maximum value as a String (null if not known).
      • getTotalMinLength

        public int getTotalMinLength()
        Get the minimum length for Numeric, Boolean and String across the entire data stream (if known). Note: For String and Boolean types this length includes any whitespace. Use getMinLength() for the equivalent on the sample set.
        Returns:
        The minimum length in the entire Data Stream (-1 if not known).
      • getTotalMaxLength

        public int getTotalMaxLength()
        Get the maximum length for Numeric, Boolean and String across the entire data stream (if known). Note: For String and Boolean types this length includes any whitespace. Use getMaxLength() for the equivalent on the sample set.
        Returns:
        The maximum length in the entire Data Stream (-1 if not known).
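        The getTotal...() methods above report "not known" as -1 (for long and int results) or null (for Double and String results). A caller that wants the whole-stream figure when available and the sample-set figure otherwise could use helpers like the hypothetical ones below; they are a sketch and not part of the library.

```java
final class TotalOrSampleSketch {
    // For counts and lengths: -1 means the whole-stream value is not known.
    static long countOrFallback(long total, long sample) {
        return total == -1 ? sample : total;
    }

    // For Double statistics: null means the whole-stream value is not known.
    static Double statOrFallback(Double total, Double sample) {
        return total == null ? sample : total;
    }

    public static void main(String[] args) {
        // Stand-ins for getTotalNullCount()/getNullCount() and getTotalMean()/getMean().
        System.out.println(countOrFallback(-1, 12));
        System.out.println(statOrFallback(null, 3.5));
    }
}
```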
      • getSampleCount

        public long getSampleCount()
        Get the count of all samples observed.
        Returns:
        Count of all samples observed.
      • getNullCount

        public long getNullCount()
        Get the count of all null samples.
        Returns:
        Count of all null samples.
      • getBlankCount

        public long getBlankCount()
        Get the count of all blank samples. Note: a sample consisting of any number of spaces (including zero) is Blank.
        Returns:
        Count of all blank samples.
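        Read literally, the note on blank samples corresponds to a predicate like the one sketched below. This is an assumption-laden illustration: the library may also treat other whitespace characters as blank, and the names are hypothetical.

```java
final class BlankSketch {
    // True for the empty string and for strings made up solely of spaces.
    // Nulls are not blank here; they are counted separately (see getNullCount()).
    static boolean isBlank(String sample) {
        if (sample == null) {
            return false;
        }
        for (int i = 0; i < sample.length(); i++) {
            if (sample.charAt(i) != ' ') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isBlank("") + " " + isBlank("   ") + " " + isBlank(" a "));
    }
}
```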
      • getLeadingZeroCount

        public long getLeadingZeroCount()
        Get the count of all samples with leading zeros (Type long only). Note: a single '0' does not constitute a sample with a leading zero.
        Returns:
        Count of all leading zero samples.
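        The rule that a single '0' is not a leading-zero sample can be expressed as a one-line predicate. The sketch below is hypothetical and deliberately ignores signs and grouping separators, which the library may handle differently.

```java
final class LeadingZeroSketch {
    // "007" and "00" have a leading zero; a lone "0" does not.
    static boolean hasLeadingZero(String sample) {
        return sample.length() > 1 && sample.charAt(0) == '0';
    }

    public static void main(String[] args) {
        System.out.println(hasLeadingZero("007") + " " + hasLeadingZero("0") + " " + hasLeadingZero("70"));
    }
}
```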
      • getCardinality

        public int getCardinality()
        Get the cardinality for the current data stream. Note: The cardinality returned is the cardinality of the valid samples. For example, if a date is invalid it will not be included in the cardinality. Note: This is not a complete cardinality analysis unless the cardinality of the data stream is less than the maximum cardinality (Default: 12000). See also setMaxCardinality() method in TextAnalyzer.
        Returns:
        The cardinality of the valid samples.
      • getCardinalityDetails

        public NavigableMap<String, Long> getCardinalityDetails()
        Get the cardinality details for the current data stream. This is a Map of Strings and the count of occurrences.
        Returns:
        A Map of values and their occurrence frequency of the data stream to date.
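        A map of this shape can be walked like any other NavigableMap. The sketch below uses a hand-built TreeMap as a stand-in for the value returned by getCardinalityDetails() (the sample data is invented for illustration) and picks the most frequent value; on ties the first key in map order wins.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class CardinalityDetailsSketch {
    // Returns the value with the highest occurrence count, or null for an empty map.
    static String mostFrequent(NavigableMap<String, Long> details) {
        String best = null;
        long bestCount = -1;
        for (Map.Entry<String, Long> entry : details.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in data; a real map would come from getCardinalityDetails().
        NavigableMap<String, Long> details = new TreeMap<>();
        details.put("FEMALE", 3L);
        details.put("MALE", 5L);
        details.put("UNKNOWN", 1L);
        System.out.println(mostFrequent(details));
    }
}
```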
      • getOutlierCount

        public int getOutlierCount()
        Get the number of distinct outliers for the current data stream. Note: This is not a complete outlier analysis unless the outlier count of the data stream is less than the maximum outlier count (Default: 50). See also setMaxOutliers() method in TextAnalyzer.
        Returns:
        Count of the distinct outliers.
      • getOutlierDetails

        public Map<String, Long> getOutlierDetails()
        Get the outlier details for the current data stream. This is a Map of Strings and the count of occurrences.
        Returns:
        A Map of values and their occurrence frequency of the data stream to date.
      • getInvalidCount

        public int getInvalidCount()
        Get the number of distinct invalid entries for the current data stream. See setMaxOutliers() method in TextAnalyzer. Note: This is not a complete invalid analysis unless the invalid count of the data stream is less than the maximum invalid count (Default: 50).
        Returns:
        Count of the distinct invalid entries.
      • getInvalidDetails

        public Map<String, Long> getInvalidDetails()
        Get the invalid entry details for the current data stream. This is a Map of Strings and the count of occurrences.
        Returns:
        A Map of values and their occurrence frequency of the data stream to date.
      • getShapeCount

        public int getShapeCount()
        Get the number of distinct shapes for the current data stream. Note: This is not a complete shape analysis unless the shape count of the data stream is less than the maximum shape count (Default: 400).
        Returns:
        Count of the distinct shapes.
      • getShapeDetails

        public Map<String, Long> getShapeDetails()
        Get the shape details for the current data stream. This is a Map of Strings and the count of occurrences.
        Returns:
        A Map of shapes and their occurrence frequency of the data stream to date.
      • getKeyConfidence

        public double getKeyConfidence()
        Is this field a key?
        Returns:
        A Double (0.0 ... 1.0) representing our confidence that this field is a key.
      • getUniqueness

        public double getUniqueness()
        How unique is this field, i.e. the number of elements with a cardinality of one divided by the cardinality. Note: Only supported if the cardinality presented is less than Max Cardinality.
        Returns:
        A Double (0.0 ... 1.0) representing the uniqueness of this field.
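        One reading of this definition is: the number of distinct values that occur exactly once, divided by the number of distinct values. The sketch below computes that from a value-to-count map of the kind returned by getCardinalityDetails(). It is an interpretation, not the library's code; the names are hypothetical and the 0.0 result for an empty map is an assumption.

```java
import java.util.Map;

final class UniquenessSketch {
    // (values seen exactly once) / (distinct values).
    // Assumption: an empty map yields 0.0.
    static double uniqueness(Map<String, Long> details) {
        if (details.isEmpty()) {
            return 0.0;
        }
        long singletons = details.values().stream().filter(count -> count == 1L).count();
        return (double) singletons / details.size();
    }

    public static void main(String[] args) {
        // Two of the four distinct values occur exactly once.
        System.out.println(uniqueness(Map.of("a", 1L, "b", 1L, "c", 2L, "d", 5L)));
    }
}
```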
      • getDistinctCount

        public long getDistinctCount()
        Return the distinct number of valid values in this stream. Note: Typically only supported if the cardinality presented is less than Max Cardinality. May be set by an external source.
        Returns:
        A long with the number of distinct values in this stream or -1 if unknown.
      • statisticsEnabled

        public boolean statisticsEnabled()
        Was statistics collection enabled for this analysis?
        Returns:
        True if statistics were collected.
      • toString

        public String toString()
        A String representation of the Analysis.
        Overrides:
        toString in class Object
        Returns:
        A String representation of the analysis to date.
      • getStructureSignature

        public String getStructureSignature()
        A SHA-1 hash that reflects the data stream structure. Note: If a Semantic type is detected then the SHA-1 hash will reflect this.
        Returns:
        A String SHA-1 hash that reflects the structure of the data stream.
      • getDataSignature

        public String getDataSignature()
        A SHA-1 hash that reflects the data stream contents. Note: The order of the data stream is not considered.
        Returns:
        A String SHA-1 hash that reflects the data stream contents.
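        To illustrate what an order-insensitive content hash means, the sketch below sorts the values before feeding them to a SHA-1 digest, so any permutation of the same values produces the same result. This is only one way to obtain that property: the inputs the library actually hashes and the encoding of its signature are not described here, and the hex output, the separator byte and all names are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class OrderInsensitiveSignatureSketch {
    // Hashes the values in sorted order so that input order does not matter.
    static String signature(String... values) {
        String[] sorted = values.clone();
        Arrays.sort(sorted);
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            for (String value : sorted) {
                sha1.update(value.getBytes(StandardCharsets.UTF_8));
                sha1.update((byte) 0); // separator, so {"ab","c"} differs from {"a","bc"}
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 digest unavailable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(signature("a", "b").equals(signature("b", "a")));
    }
}
```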
      • asPlugin

        public String asPlugin()
        A plugin definition to use to match this type.
        Returns:
        A plugin definition that can be used to match this type.
      • asJSON

        public String asJSON(boolean pretty, int verbose)
        A JSON representation of the Analysis.
        Parameters:
        pretty - If set, add minimal whitespace formatting.
        verbose - If > 0, provides additional details on the core, Outlier, and Shapes sets. A value of 1 outputs the first 100 elements; a value > 1 outputs the full set.
        Returns:
        A JSON representation of the analysis.