Class TextAnalyzer


  • public class TextAnalyzer
    extends java.lang.Object
    Analyze Text data to determine type information and other key metrics associated with a text stream. A key objective is that the analysis be fast enough to run in-line, i.e. as data is input from some source it should be possible to stream it through this class without undue performance degradation.

    Typical usage is:

     
     		TextAnalyzer analysis = new TextAnalyzer("Age");
    
     		analysis.train("12");
     		analysis.train("62");
     		analysis.train("21");
     		analysis.train("37");
     		...
    
     		TextAnalysisResult result = analysis.getResult();
     
     
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected static int REFLECTION_SAMPLES  
    • Constructor Summary

      Constructors 
      Constructor Description
      TextAnalyzer()
      Construct an anonymous Text Analyzer for a data stream.
      TextAnalyzer​(AnalyzerContext context)
      Construct a Text Analyzer using the supplied context.
      TextAnalyzer​(java.lang.String name)
      Construct a Text Analyzer for the named data stream.
      TextAnalyzer​(java.lang.String name, DateTimeParser.DateResolutionMode resolutionMode)
      Construct a Text Analyzer for the named data stream with the supplied DateResolutionMode.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      static int distanceLevenshtein​(java.lang.String source, java.util.Set<java.lang.String> universe)  
      boolean getCollectStatistics()
      Indicates whether to collect statistics or not.
      boolean getDefaultLogicalTypes()
      Indicates whether to enable default Logical Type processing or not.
      int getDetectWindow()
      Get the size of the Detect Window (i.e. number of samples) collected before attempting to determine the type.
      boolean getLengthQualifier()
      Indicates whether the size of the RegExp pattern is being defined.
      int getMaxCardinality()
      Get the maximum cardinality that will be tracked.
      int getMaxOutliers()
      Get the maximum number of outliers that will be tracked.
      boolean getNumericWidening()
      Get the current value for numeric widening.
      Plugins getPlugins()  
      int getPluginThreshold()
      Get the current detection Threshold for Logical Type plugins.
      int getReflectionSampleSize()
      Get the number of Samples required before we will 'reflect' on the analysis and potentially change determination.
      TextAnalysisResult getResult()
      Determine the result of the training completed to date.
      java.lang.String getStreamName()
      Get the name of the Data Stream.
      int getThreshold()
      Get the current detection Threshold.
      java.util.List<java.lang.String> getTrainingSet()
      Access the training set - this will typically be the first AnalysisConfig.DETECT_WINDOW_DEFAULT records.
      void registerDefaultPlugins​(java.util.Locale locale)
      Register the default set of plugins for Logical Type detection.
      boolean setCollectStatistics​(boolean collectStatistics)
      Indicate whether to collect statistics or not.
      void setDebug​(int debug)
      Internal Only.
      boolean setDefaultLogicalTypes​(boolean logicalTypeDetection)
      Indicate whether to enable default Logical Type processing.
      int setDetectWindow​(int detectWindow)
      Set the size of the Detect Window (i.e. number of samples) to collect before attempting to determine the type.
      void setKeyConfidence​(double keyConfidence)
      Set the Key Confidence - this is typically used where we have an external source that indicated definitively that this is a key.
      boolean setLengthQualifier​(boolean newLengthQualifier)
      Indicate whether we should qualify the size of the RegExp.
      void setLocale​(java.util.Locale locale)
      Override the default Locale.
      int setMaxCardinality​(int newCardinality)
      Set the maximum cardinality that will be tracked.
      int setMaxOutliers​(int newMaxOutliers)
      Set the maximum number of outliers that will be tracked.
      void setNumericWidening​(boolean numericWidening)
      If true, enable Numeric widening, i.e. if we see lots of integers followed by some doubles, call it a double.
      void setPluginThreshold​(int threshold)
      The percentage (0 - 100) at which we declare success for Logical Type plugins.
      void setThreshold​(int threshold)
      The percentage (0 - 100) at which we declare success.
      void setTotalCount​(long totalCount)
      Set the total number of elements in the Data Stream (if known).
      void setTrace​(java.lang.String traceOptions)
      Set tracing options.
      void setUniqueness​(double uniqueness)
      Set the Uniqueness - this is typically used where we have an external source that has visibility into the entire data set and 'knows' the uniqueness of the set as a whole.
      boolean train​(java.lang.String rawInput)
      Train is the streaming entry point used to supply input to the Text Analyzer.
      void trainBulk​(java.util.Map<java.lang.String,​java.lang.Long> observed)
      TrainBulk is the core bulk entry point used to supply input to the Text Analyzer.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • TextAnalyzer

        public TextAnalyzer​(AnalyzerContext context)
        Construct a Text Analyzer using the supplied context.
        Parameters:
        context - The context used to interpret the stream.
      • TextAnalyzer

        public TextAnalyzer​(java.lang.String name)
        Construct a Text Analyzer for the named data stream. Note: The resolution mode will be 'None'.
        Parameters:
        name - The name of the data stream (e.g. the column of the CSV file)
      • TextAnalyzer

        public TextAnalyzer()
        Construct an anonymous Text Analyzer for a data stream. Note: The resolution mode will be 'None'.
      • TextAnalyzer

        public TextAnalyzer​(java.lang.String name,
                            DateTimeParser.DateResolutionMode resolutionMode)
        Construct a Text Analyzer for the named data stream with the supplied DateResolutionMode.
        Parameters:
        name - The name of the data stream (e.g. the column of the CSV file)
        resolutionMode - Determines what to do when the Date field is ambiguous (i.e. we cannot determine which of the fields is the day or the month). If resolutionMode is DayFirst, then assume day is first; if it is MonthFirst, then assume month is first; if it is Auto, then choose either DayFirst or MonthFirst based on the locale; if it is None, then the pattern returned will contain '?' to represent any ambiguity present.
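        The four resolution modes described above can be illustrated with a minimal sketch. It does not use the library: the Mode enum, the class name, and the exact pattern strings (including the "??/??/yyyy" form) are assumptions made for illustration, and the locale test shown for Auto is only one plausible way to derive day/month order.

```java
import java.time.chrono.IsoChronology;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.FormatStyle;
import java.util.Locale;

// Hypothetical sketch, not the library's implementation.
final class DateAmbiguitySketch {
    // Illustrative stand-in for the four modes described in the documentation.
    enum Mode { None, DayFirst, MonthFirst, Auto }

    // Assumption: a locale is "day first" if 'd' precedes 'M' in its short date pattern.
    static boolean localeIsDayFirst(Locale locale) {
        String pattern = DateTimeFormatterBuilder.getLocalizedDateTimePattern(
                FormatStyle.SHORT, null, IsoChronology.INSTANCE, locale);
        int day = pattern.indexOf('d');
        int month = pattern.indexOf('M');
        return day >= 0 && month >= 0 && day < month;
    }

    // Pattern chosen for an ambiguous input such as "01/02/2020".
    static String resolve(Mode mode, Locale locale) {
        switch (mode) {
            case DayFirst:
                return "dd/MM/yyyy";
            case MonthFirst:
                return "MM/dd/yyyy";
            case Auto:
                return localeIsDayFirst(locale) ? "dd/MM/yyyy" : "MM/dd/yyyy";
            default:
                // None: leave the ambiguous fields marked with '?'.
                return "??/??/yyyy";
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(Mode.None, Locale.US));
        System.out.println(resolve(Mode.Auto, Locale.US));
        System.out.println(resolve(Mode.Auto, Locale.UK));
    }
}
```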
    • Method Detail

      • getStreamName

        public java.lang.String getStreamName()
        Get the name of the Data Stream.
        Returns:
        The name of the Data Stream.
      • setCollectStatistics

        public boolean setCollectStatistics​(boolean collectStatistics)
        Indicate whether to collect statistics or not.
        Parameters:
        collectStatistics - A boolean indicating the desired state
        Returns:
        The previous value of this parameter.
      • setDebug

        public void setDebug​(int debug)
        Internal Only. Enable internal debugging.
        Parameters:
        debug - The debug level.
      • setTrace

        public void setTrace​(java.lang.String traceOptions)
        Set tracing options. The general form of the options is <attribute1>=<value1>,<attribute2>=<value2> ... Supported attributes are: enabled=true/false; stream=<name of stream> (defaults to all); directory=<directory for trace file> (defaults to java.io.tmpdir); samples=<# samples to trace> (defaults to 1000).
        Parameters:
        traceOptions - The trace options.
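        To make the option format concrete, the following sketch parses a string of the documented form into a map, applying the two defaults that have concrete values. The class and method names are hypothetical and the parsing rules (trimming, rejecting malformed pairs) are assumptions; the library's own parser may behave differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of parsing "<attribute1>=<value1>,<attribute2>=<value2>".
final class TraceOptionsSketch {
    static Map<String, String> parse(String traceOptions) {
        Map<String, String> options = new LinkedHashMap<>();
        // Defaults as stated in the documentation above.
        options.put("directory", System.getProperty("java.io.tmpdir"));
        options.put("samples", "1000");

        for (String pair : traceOptions.split(",")) {
            // Split on the first '=' only, so values may themselves contain '='.
            String[] parts = pair.trim().split("=", 2);
            if (parts.length != 2 || parts[0].trim().isEmpty())
                throw new IllegalArgumentException("Expected <attribute>=<value>, got: " + pair);
            options.put(parts[0].trim(), parts[1].trim());
        }
        return options;
    }

    public static void main(String[] args) {
        System.out.println(parse("enabled=true,stream=Age,samples=500"));
    }
}
```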
      • getCollectStatistics

        public boolean getCollectStatistics()
        Indicates whether to collect statistics or not.
        Returns:
        Whether Statistics collection is enabled.
      • setDefaultLogicalTypes

        public boolean setDefaultLogicalTypes​(boolean logicalTypeDetection)
        Indicate whether to enable default Logical Type processing.
        Parameters:
        logicalTypeDetection - A boolean indicating the desired state
        Returns:
        The previous value of this parameter.
      • getDefaultLogicalTypes

        public boolean getDefaultLogicalTypes()
        Indicates whether to enable default Logical Type processing or not.
        Returns:
        Whether default Logical Type processing is enabled.
      • setThreshold

        public void setThreshold​(int threshold)
        The percentage (0 - 100) at which we declare success. Typically this should not be adjusted. To run in Strict mode, set this to 100.
        Parameters:
        threshold - The new threshold for detection.
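        As a rough illustration of how a percentage threshold of this kind can be applied, the sketch below checks whether the share of matching samples reaches the threshold. The class and method names are hypothetical, and the comparison rule is an assumption rather than the library's actual logic.

```java
// Hypothetical sketch of a 0 - 100 detection threshold.
final class ThresholdSketch {
    // Returns true when matches / total, expressed as a percentage, reaches the threshold.
    static boolean meetsThreshold(long matches, long total, int threshold) {
        if (threshold < 0 || threshold > 100)
            throw new IllegalArgumentException("threshold must be in the range 0 - 100");
        if (total <= 0)
            return false;
        // Cross-multiply rather than divide, to avoid integer truncation.
        return matches * 100 >= (long) threshold * total;
    }

    public static void main(String[] args) {
        // With a threshold of 100 (Strict mode) a single non-matching sample fails the check.
        System.out.println(meetsThreshold(99, 100, 100));
        System.out.println(meetsThreshold(99, 100, 95));
    }
}
```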
      • getThreshold

        public int getThreshold()
        Get the current detection Threshold.
        Returns:
        The current threshold.
      • setPluginThreshold

        public void setPluginThreshold​(int threshold)
        The percentage (0 - 100) at which we declare success for Logical Type plugins. Typically this should not be adjusted. To run in Strict mode, set this to 100.
        Parameters:
        threshold - The new threshold used for detection.
      • getPluginThreshold

        public int getPluginThreshold()
        Get the current detection Threshold for Logical Type plugins. If not set, this will return -1, which means that each plugin is using its own default threshold.
        Returns:
        The current threshold.
      • setNumericWidening

        public void setNumericWidening​(boolean numericWidening)
        If true, enable Numeric widening, i.e. if we see lots of integers followed by some doubles, call it a double.
        Parameters:
        numericWidening - The new value for numericWidening.
      • getNumericWidening

        public boolean getNumericWidening()
        Get the current value for numeric widening.
        Returns:
        The current value.
      • setLocale

        public void setLocale​(java.util.Locale locale)
        Override the default Locale.
        Parameters:
        locale - The new Locale used to determine separators in numbers, date processing, default plugins, etc. Note: There is no support for Locales that do not use the Gregorian Calendar.
      • setDetectWindow

        public int setDetectWindow​(int detectWindow)
        Set the size of the Detect Window (i.e. number of samples) to collect before attempting to determine the type. Note: It is not possible to change the Sample Size once training has started.
        Parameters:
        detectWindow - The number of samples to collect
        Returns:
        The previous value of this parameter.
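        The interaction between the Detect Window and training can be sketched as follows. This is a standalone illustration, not the library's code: the class name, the default of 20, and the use of IllegalStateException when the window is changed after training has started are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of buffering samples until a detect window is full.
final class DetectWindowSketch {
    private final List<String> window = new ArrayList<>();
    private int detectWindow = 20; // assumed default, not taken from the library
    private boolean trainingStarted = false;

    // Returns the previous value, following the setter convention documented above.
    int setDetectWindow(int newDetectWindow) {
        if (trainingStarted)
            throw new IllegalStateException("Cannot change the Detect Window once training has started");
        int previous = detectWindow;
        detectWindow = newDetectWindow;
        return previous;
    }

    // Buffers the sample; returns true once enough samples exist to attempt a determination.
    boolean train(String rawInput) {
        trainingStarted = true;
        if (window.size() < detectWindow)
            window.add(rawInput);
        return window.size() >= detectWindow;
    }

    // The buffered samples, i.e. the first detectWindow inputs.
    List<String> getTrainingSet() {
        return window;
    }
}
```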
      • getDetectWindow

        public int getDetectWindow()
        Get the size of the Detect Window (i.e. number of samples) collected before attempting to determine the type.
        Returns:
        The current size of the Detect Window.
      • getReflectionSampleSize

        public int getReflectionSampleSize()
        Get the number of Samples required before we will 'reflect' on the analysis and potentially change determination.
        Returns:
        The current size of the reflection window.
      • setMaxCardinality

        public int setMaxCardinality​(int newCardinality)
        Set the maximum cardinality that will be tracked. Note:
        - The Cardinality must be larger than the Cardinality of the largest Finite Logical type.
        - It is not possible to change the cardinality once training has started.
        Parameters:
        newCardinality - The maximum Cardinality that will be tracked (0 implies no tracking)
        Returns:
        The previous value of this parameter.
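        One way to picture a cardinality cap is a frequency map that stops admitting new distinct values once the cap is reached. The sketch below is an assumption about the general technique, with hypothetical names; it is not taken from the library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of tracking distinct values up to a maximum cardinality.
final class CardinalitySketch {
    private final Map<String, Long> cardinality = new HashMap<>();
    private final int maxCardinality;
    private boolean overflowed = false;

    CardinalitySketch(int maxCardinality) {
        this.maxCardinality = maxCardinality;
    }

    void observe(String value) {
        if (cardinality.containsKey(value))
            cardinality.merge(value, 1L, Long::sum); // already tracked: bump its count
        else if (cardinality.size() < maxCardinality)
            cardinality.put(value, 1L);              // room left: start tracking it
        else
            overflowed = true;                       // cap reached (a cap of 0 tracks nothing)
    }

    Map<String, Long> tracked() {
        return cardinality;
    }

    boolean overflowed() {
        return overflowed;
    }
}
```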
      • getMaxCardinality

        public int getMaxCardinality()
        Get the maximum cardinality that will be tracked. See setMaxCardinality() method.
        Returns:
        The maximum cardinality.
      • setMaxOutliers

        public int setMaxOutliers​(int newMaxOutliers)
        Set the maximum number of outliers that will be tracked. Note: It is not possible to change the outlier count once training has started.
        Parameters:
        newMaxOutliers - The maximum number of outliers that will be tracked (0 implies no tracking)
        Returns:
        The previous value of this parameter.
      • getMaxOutliers

        public int getMaxOutliers()
        Get the maximum number of outliers that will be tracked. See setMaxOutliers() method.
        Returns:
        The maximum number of outliers.
      • setLengthQualifier

        public boolean setLengthQualifier​(boolean newLengthQualifier)
        Indicate whether we should qualify the size of the RegExp, for example "\d{3,6}" vs. "\d+". Note: This only impacts simple Numerics/Alphas/AlphaNumerics.
        Parameters:
        newLengthQualifier - The new value.
        Returns:
        The previous value of this parameter.
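        The difference between a qualified and an unqualified pattern can be shown with a small sketch that derives a digit pattern from the observed minimum and maximum lengths. The class and method names are hypothetical, and the sketch assumes every sample consists only of digits.

```java
import java.util.List;

// Hypothetical sketch: build "\d{min,max}" or "\d+" for all-digit samples.
final class LengthQualifierSketch {
    static String digitPattern(List<String> samples, boolean lengthQualifier) {
        if (!lengthQualifier || samples.isEmpty())
            return "\\d+";
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (String sample : samples) {
            min = Math.min(min, sample.length());
            max = Math.max(max, sample.length());
        }
        // Collapse to a single count when all samples have the same length.
        return min == max ? "\\d{" + min + "}" : "\\d{" + min + "," + max + "}";
    }

    public static void main(String[] args) {
        List<String> samples = List.of("123", "123456", "4567");
        System.out.println(digitPattern(samples, true));
        System.out.println(digitPattern(samples, false));
    }
}
```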
      • getLengthQualifier

        public boolean getLengthQualifier()
        Indicates whether the size of the RegExp pattern is being defined.
        Returns:
        True if lengths are being qualified.
      • setKeyConfidence

        public void setKeyConfidence​(double keyConfidence)
        Set the Key Confidence - this is typically used where we have an external source that indicated definitively that this is a key.
        Parameters:
        keyConfidence - The new keyConfidence
      • setUniqueness

        public void setUniqueness​(double uniqueness)
        Set the Uniqueness - this is typically used where we have an external source that has visibility into the entire data set and 'knows' the uniqueness of the set as a whole.
        Parameters:
        uniqueness - The new Uniqueness
      • setTotalCount

        public void setTotalCount​(long totalCount)
        Set the total number of elements in the Data Stream (if known).
        Parameters:
        totalCount - The total number of elements, as opposed to the number sampled.
      • getPlugins

        public Plugins getPlugins()
      • registerDefaultPlugins

        public void registerDefaultPlugins​(java.util.Locale locale)
        Register the default set of plugins for Logical Type detection.
        Parameters:
        locale - The Locale used for analysis; this will impact both the set of plugins registered and the behavior of the individual plugins. Note: If the locale is null it will default to the Default locale.
      • trainBulk

        public void trainBulk​(java.util.Map<java.lang.String,​java.lang.Long> observed)
                       throws FTAPluginException,
                              FTAUnsupportedLocaleException
        TrainBulk is the core bulk entry point used to supply input to the Text Analyzer. This routine is commonly used to support training using the results aggregated from a database query.
        Parameters:
        observed - A Map containing the observed items and the corresponding count
        Throws:
        FTAPluginException - Thrown when a registered plugin has detected an issue
        FTAUnsupportedLocaleException - Thrown when a requested locale is not supported
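        The Map expected by trainBulk has the same shape as the result of a grouped count (for example a SQL GROUP BY with COUNT(*)). The sketch below builds such a map in memory from raw values; the class and method names are hypothetical, and the resulting map is only an example of what might be passed as the observed argument.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: aggregate raw values into value -> count.
final class BulkObservationSketch {
    // Note: Collectors.groupingBy rejects null keys, so nulls must be handled by the caller.
    static Map<String, Long> aggregate(List<String> values) {
        return values.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> observed = aggregate(List.of("M", "F", "M", "M", "F"));
        // A map of this shape is what trainBulk(observed) is documented to accept.
        System.out.println(observed);
    }
}
```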
      • train

        public boolean train​(java.lang.String rawInput)
                      throws FTAPluginException,
                             FTAUnsupportedLocaleException
        Train is the streaming entry point used to supply input to the Text Analyzer.
        Parameters:
        rawInput - The raw input as a String
        Returns:
        A boolean indicating if the resultant type is currently known.
        Throws:
        FTAPluginException - Thrown when a registered plugin has detected an issue
        FTAUnsupportedLocaleException - Thrown when a requested locale is not supported
      • distanceLevenshtein

        public static int distanceLevenshtein​(java.lang.String source,
                                              java.util.Set<java.lang.String> universe)
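        This method carries no description in the source. Purely as an assumption based on its name and signature, the sketch below computes the standard Levenshtein edit distance and returns the smallest distance from the source to any member of the universe. The class and method names are hypothetical, and the real method may define its result differently.

```java
import java.util.Set;

// Hypothetical sketch: minimum Levenshtein distance from a string to a set of candidates.
final class LevenshteinSketch {
    // Classic two-row dynamic programming edit distance (insert, delete, substitute all cost 1).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++)
            prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // Assumption: the result is the best (smallest) distance over the whole universe.
    // Returns Integer.MAX_VALUE for an empty universe.
    static int distanceToUniverse(String source, Set<String> universe) {
        int best = Integer.MAX_VALUE;
        for (String candidate : universe)
            best = Math.min(best, distance(source, candidate));
        return best;
    }
}
```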
      • getTrainingSet

        public java.util.List<java.lang.String> getTrainingSet()
        Access the training set - this will typically be the first AnalysisConfig.DETECT_WINDOW_DEFAULT records.
        Returns:
        A List of the raw input strings.