Package com.cobber.fta
Class TextAnalyzer
- java.lang.Object
-
- com.cobber.fta.TextAnalyzer
-
public class TextAnalyzer extends java.lang.Object
Analyze Text data to determine type information and other key metrics associated with a text stream. A key objective of the analysis is that it should be sufficiently fast to be in-line (i.e. as the data is input from some source it should be possible to stream the data through this class without undue performance degradation). Typical usage is:
TextAnalyzer analysis = new TextAnalyzer("Age");
analysis.train("12");
analysis.train("62");
analysis.train("21");
analysis.train("37");
...
TextAnalysisResult result = analysis.getResult();
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
protected static int
REFLECTION_SAMPLES
-
Constructor Summary
Constructors (Constructor, Description):
TextAnalyzer()
Construct an anonymous Text Analyzer for a data stream.
TextAnalyzer(AnalyzerContext context)
Construct a Text Analyzer using the supplied context.
TextAnalyzer(java.lang.String name)
Construct a Text Analyzer for the named data stream.
TextAnalyzer(java.lang.String name, DateTimeParser.DateResolutionMode resolutionMode)
Construct a Text Analyzer for the named data stream with the supplied DateResolutionMode.
-
Method Summary
Methods (Modifier and Type, Method, Description):
static int
distanceLevenshtein(java.lang.String source, java.util.Set<java.lang.String> universe)
boolean
getCollectStatistics()
Indicates whether to collect statistics or not.
boolean
getDefaultLogicalTypes()
Indicates whether to enable default Logical Type processing or not.
int
getDetectWindow()
Get the size of the Detect Window (i.e. the number of samples collected before attempting to determine the type).
boolean
getLengthQualifier()
Indicates whether the size of the RegExp pattern is being defined.
int
getMaxCardinality()
Get the maximum cardinality that will be tracked.
int
getMaxOutliers()
Get the maximum number of outliers that will be tracked.
boolean
getNumericWidening()
Get the current value for numeric widening.
Plugins
getPlugins()
int
getPluginThreshold()
Get the current detection Threshold for Logical Type plugins.
int
getReflectionSampleSize()
Get the number of Samples required before we will 'reflect' on the analysis and potentially change determination.
TextAnalysisResult
getResult()
Determine the result of the training complete to date.
java.lang.String
getStreamName()
Get the name of the Data Stream.
int
getThreshold()
Get the current detection Threshold.
java.util.List<java.lang.String>
getTrainingSet()
Access the training set - this will typically be the first AnalysisConfig.DETECT_WINDOW_DEFAULT records.
void
registerDefaultPlugins(java.util.Locale locale)
Register the default set of plugins for Logical Type detection.
boolean
setCollectStatistics(boolean collectStatistics)
Indicate whether to collect statistics or not.
void
setDebug(int debug)
Internal Only.
boolean
setDefaultLogicalTypes(boolean logicalTypeDetection)
Indicate whether to enable default Logical Type processing.
int
setDetectWindow(int detectWindow)
Set the size of the Detect Window (i.e. number of samples) to collect before attempting to determine the type.
void
setKeyConfidence(double keyConfidence)
Set the Key Confidence - this is typically used where we have an external source that indicated definitively that this is a key.
boolean
setLengthQualifier(boolean newLengthQualifier)
Indicate whether we should qualify the size of the RegExp.
void
setLocale(java.util.Locale locale)
Override the default Locale.
int
setMaxCardinality(int newCardinality)
Set the maximum cardinality that will be tracked.
int
setMaxOutliers(int newMaxOutliers)
Set the maximum number of outliers that will be tracked.
void
setNumericWidening(boolean numericWidening)
If true, enable numeric widening, i.e. if we see lots of integers followed by some doubles, call the type a double.
void
setPluginThreshold(int threshold)
The percentage when we declare success 0 - 100 for Logical Type plugins.
void
setThreshold(int threshold)
The percentage when we declare success 0 - 100.
void
setTotalCount(long totalCount)
Set the total number of elements in the Data Stream (if known).
void
setTrace(java.lang.String traceOptions)
Set tracing options.
void
setUniqueness(double uniqueness)
Set the Uniqueness - this is typically used where we have an external source that has visibility into the entire data set and 'knows' the uniqueness of the set as a whole.
boolean
train(java.lang.String rawInput)
Train is the streaming entry point used to supply input to the Text Analyzer.
void
trainBulk(java.util.Map<java.lang.String,java.lang.Long> observed)
TrainBulk is the core bulk entry point used to supply input to the Text Analyzer.
-
-
-
Field Detail
-
REFLECTION_SAMPLES
protected static final int REFLECTION_SAMPLES
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
TextAnalyzer
public TextAnalyzer(AnalyzerContext context)
Construct a Text Analyzer using the supplied context.
- Parameters:
context
- The context used to interpret the stream.
-
TextAnalyzer
public TextAnalyzer(java.lang.String name)
Construct a Text Analyzer for the named data stream. Note: The resolution mode will be 'None'.
- Parameters:
name
- The name of the data stream (e.g. the column of the CSV file)
-
TextAnalyzer
public TextAnalyzer()
Construct an anonymous Text Analyzer for a data stream. Note: The resolution mode will be 'None'.
-
TextAnalyzer
public TextAnalyzer(java.lang.String name, DateTimeParser.DateResolutionMode resolutionMode)
Construct a Text Analyzer for the named data stream with the supplied DateResolutionMode.
- Parameters:
name
- The name of the data stream (e.g. the column of the CSV file)
resolutionMode
- Determines what to do when the Date field is ambiguous (i.e. we cannot determine which of the fields is the day or the month). If resolutionMode is DayFirst, assume the day is first; if it is MonthFirst, assume the month is first; if it is Auto, choose either DayFirst or MonthFirst based on the locale; if it is None, the pattern returned will contain '?' to represent any ambiguity present.
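The ambiguity that resolutionMode addresses can be illustrated with the standard java.time API alone. The sketch below is not part of this class; the class name DateAmbiguitySketch and its two helper methods are hypothetical. It only shows that one input string yields two different dates depending on whether the day or the month is assumed to come first.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration, independent of TextAnalyzer: the same text
// is a valid date under both a day-first and a month-first reading.
final class DateAmbiguitySketch {
    // Day-first reading, analogous in spirit to DayFirst.
    static LocalDate parseDayFirst(String text) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern("dd/MM/yyyy"));
    }

    // Month-first reading, analogous in spirit to MonthFirst.
    static LocalDate parseMonthFirst(String text) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern("MM/dd/yyyy"));
    }

    public static void main(String[] args) {
        String ambiguous = "01/02/2020";
        System.out.println(parseDayFirst(ambiguous));   // 2020-02-01
        System.out.println(parseMonthFirst(ambiguous)); // 2020-01-02
    }
}
```

An input such as "25/02/2020" is not ambiguous, since 25 cannot be a month; the month-first parse above would throw a DateTimeParseException for it.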
-
-
Method Detail
-
getStreamName
public java.lang.String getStreamName()
Get the name of the Data Stream.
- Returns:
- The name of the Data Stream.
-
setCollectStatistics
public boolean setCollectStatistics(boolean collectStatistics)
Indicate whether to collect statistics or not.
- Parameters:
collectStatistics
- A boolean indicating the desired state
- Returns:
- The previous value of this parameter.
-
setDebug
public void setDebug(int debug)
Internal Only. Enable internal debugging.
- Parameters:
debug
- The debug level.
-
setTrace
public void setTrace(java.lang.String traceOptions)
Set tracing options. General form of options is <attribute1>=<value1>,<attribute2>=<value2> ... Supported attributes are:
enabled=true/false
stream=<name of stream> (defaults to all)
directory=<directory for trace file> (defaults to java.io.tmpdir)
samples=<# samples to trace> (defaults to 1000)
- Parameters:
traceOptions
- The trace options.
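To make the option format concrete, here is a minimal sketch that splits a string of the form <attribute>=<value>,<attribute>=<value> into a map. The class TraceOptionsSketch and its parse method are hypothetical and are not the parser this library uses; the sketch only demonstrates the shape of a string that could be passed to setTrace.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the library: splits a trace option
// string of the documented general form into attribute/value pairs.
final class TraceOptionsSketch {
    static Map<String, String> parse(String traceOptions) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String pair : traceOptions.split(",")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected <attribute>=<value>, got: " + pair);
            }
            // Everything before the first '=' is the attribute, the rest is the value.
            options.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return options;
    }

    public static void main(String[] args) {
        // The same string is what would be handed to setTrace(String).
        String traceOptions = "enabled=true,stream=Age,samples=500";
        System.out.println(parse(traceOptions));
    }
}
```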
-
getCollectStatistics
public boolean getCollectStatistics()
Indicates whether to collect statistics or not.
- Returns:
- Whether Statistics collection is enabled.
-
setDefaultLogicalTypes
public boolean setDefaultLogicalTypes(boolean logicalTypeDetection)
Indicate whether to enable default Logical Type processing.
- Parameters:
logicalTypeDetection
- A boolean indicating the desired state
- Returns:
- The previous value of this parameter.
-
getDefaultLogicalTypes
public boolean getDefaultLogicalTypes()
Indicates whether to enable default Logical Type processing or not.
- Returns:
- Whether default Logical Type processing is enabled.
-
setThreshold
public void setThreshold(int threshold)
The percentage when we declare success 0 - 100. Typically this should not be adjusted; if you want to run in Strict mode, set this to 100.
- Parameters:
threshold
- The new threshold for detection.
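As a rough illustration of what a percentage threshold means, the sketch below declares success when the share of matching samples reaches the threshold. How this class applies the threshold internally is not described here, so the class ThresholdSketch, the method meetsThreshold, and the exact comparison are assumptions made purely for illustration.

```java
// Hypothetical illustration of a 0 - 100 percentage threshold. The real
// decision logic of the analyzer is not documented in this section.
final class ThresholdSketch {
    static boolean meetsThreshold(long matches, long total, int thresholdPercent) {
        if (thresholdPercent < 0 || thresholdPercent > 100) {
            throw new IllegalArgumentException("Threshold must be between 0 and 100");
        }
        if (total == 0) {
            return false; // Nothing observed yet, so nothing can be declared.
        }
        // Integer arithmetic avoids rounding: matches / total >= threshold / 100.
        return matches * 100 >= (long) thresholdPercent * total;
    }

    public static void main(String[] args) {
        System.out.println(meetsThreshold(95, 100, 95));  // true
        System.out.println(meetsThreshold(99, 100, 100)); // false: Strict mode needs every sample
    }
}
```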
-
getThreshold
public int getThreshold()
Get the current detection Threshold.
- Returns:
- The current threshold.
-
setPluginThreshold
public void setPluginThreshold(int threshold)
The percentage when we declare success 0 - 100 for Logical Type plugins. Typically this should not be adjusted; if you want to run in Strict mode, set this to 100.
- Parameters:
threshold
- The new threshold used for detection.
-
getPluginThreshold
public int getPluginThreshold()
Get the current detection Threshold for Logical Type plugins. If not set, this returns -1, which means that each plugin is using its own default threshold.
- Returns:
- The current threshold.
-
setNumericWidening
public void setNumericWidening(boolean numericWidening)
If true, enable numeric widening, i.e. if we see lots of integers followed by some doubles, call the type a double.
- Parameters:
numericWidening
- The new value for numericWidening.
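The idea of numeric widening can be sketched without this library. In the hypothetical NumericWideningSketch below, a mix of integer and floating-point samples is reported as DOUBLE when widening is on. What the analyzer reports when widening is off is not stated in this section; the fallback to STRING here is an arbitrary assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of numeric widening, not the library's logic.
final class NumericWideningSketch {
    enum Kind { LONG, DOUBLE, STRING }

    // Classify a single sample by attempting the narrowest parse first.
    static Kind classify(String sample) {
        try {
            Long.parseLong(sample);
            return Kind.LONG;
        } catch (NumberFormatException ignored) {
            // Not an integer, try a double next.
        }
        try {
            Double.parseDouble(sample);
            return Kind.DOUBLE;
        } catch (NumberFormatException ignored) {
            return Kind.STRING;
        }
    }

    static Kind infer(List<String> samples, boolean numericWidening) {
        boolean sawLong = false;
        boolean sawDouble = false;
        for (String sample : samples) {
            Kind kind = classify(sample);
            if (kind == Kind.STRING) {
                return Kind.STRING;
            }
            sawLong |= kind == Kind.LONG;
            sawDouble |= kind == Kind.DOUBLE;
        }
        if (sawLong && sawDouble) {
            // Assumption: without widening the mixed stream is left as STRING.
            return numericWidening ? Kind.DOUBLE : Kind.STRING;
        }
        if (sawDouble) {
            return Kind.DOUBLE;
        }
        return sawLong ? Kind.LONG : Kind.STRING;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("12", "62", "21", "37.5");
        System.out.println(infer(samples, true));  // DOUBLE
        System.out.println(infer(samples, false)); // STRING
    }
}
```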
-
getNumericWidening
public boolean getNumericWidening()
Get the current value for numeric widening.
- Returns:
- The current value.
-
setLocale
public void setLocale(java.util.Locale locale)
Override the default Locale.
- Parameters:
locale
- The new Locale used to determine separators in numbers, date processing, default plugins, etc. Note: There is no support for Locales that do not use the Gregorian Calendar.
-
setDetectWindow
public int setDetectWindow(int detectWindow)
Set the size of the Detect Window (i.e. number of samples) to collect before attempting to determine the type. Note: It is not possible to change the Sample Size once training has started.
- Parameters:
detectWindow
- The number of samples to collect
- Returns:
- The previous value of this parameter.
-
getDetectWindow
public int getDetectWindow()
Get the size of the Detect Window (i.e. the number of samples collected before attempting to determine the type).
- Returns:
- The current size of the Detect Window.
-
getReflectionSampleSize
public int getReflectionSampleSize()
Get the number of Samples required before we will 'reflect' on the analysis and potentially change determination.
- Returns:
- The current size of the reflection window.
-
setMaxCardinality
public int setMaxCardinality(int newCardinality)
Set the maximum cardinality that will be tracked. Note:
- The Cardinality must be larger than the Cardinality of the largest Finite Logical type.
- It is not possible to change the cardinality once training has started.
- Parameters:
newCardinality
- The maximum Cardinality that will be tracked (0 implies no tracking)
- Returns:
- The previous value of this parameter.
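One way to picture a bounded cardinality is a frequency map that stops admitting new distinct values once it reaches the maximum. The CardinalityTrackerSketch class below is hypothetical and does not reflect how this library stores its counts; it only shows why a limit of 0 amounts to no tracking.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of tracking at most maxCardinality distinct values.
final class CardinalityTrackerSketch {
    private final int maxCardinality;
    private final Map<String, Long> counts = new HashMap<>();
    private boolean overflowed;

    CardinalityTrackerSketch(int maxCardinality) {
        if (maxCardinality < 0) {
            throw new IllegalArgumentException("maxCardinality must not be negative");
        }
        this.maxCardinality = maxCardinality;
    }

    void observe(String value) {
        Long existing = counts.get(value);
        if (existing != null) {
            counts.put(value, existing + 1);  // Already tracked: bump its count.
        } else if (counts.size() < maxCardinality) {
            counts.put(value, 1L);            // Room left: start tracking it.
        } else {
            overflowed = true;                // Limit reached: value is not tracked.
        }
    }

    int distinctTracked() {
        return counts.size();
    }

    boolean overflowed() {
        return overflowed;
    }

    public static void main(String[] args) {
        CardinalityTrackerSketch tracker = new CardinalityTrackerSketch(2);
        for (String value : new String[] {"M", "F", "M", "U"}) {
            tracker.observe(value);
        }
        System.out.println(tracker.distinctTracked() + " tracked, overflowed=" + tracker.overflowed());
    }
}
```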
-
getMaxCardinality
public int getMaxCardinality()
Get the maximum cardinality that will be tracked. See the setMaxCardinality() method.
- Returns:
- The maximum cardinality.
-
setMaxOutliers
public int setMaxOutliers(int newMaxOutliers)
Set the maximum number of outliers that will be tracked. Note: It is not possible to change the outlier count once training has started.
- Parameters:
newMaxOutliers
- The maximum number of outliers that will be tracked (0 implies no tracking)
- Returns:
- The previous value of this parameter.
-
getMaxOutliers
public int getMaxOutliers()
Get the maximum number of outliers that will be tracked. See the setMaxOutliers() method.
- Returns:
- The maximum number of outliers.
-
setLengthQualifier
public boolean setLengthQualifier(boolean newLengthQualifier)
Indicate whether we should qualify the size of the RegExp, for example "\d{3,6}" vs. "\d+". Note: This only impacts simple Numerics/Alphas/AlphaNumerics.
- Parameters:
newLengthQualifier
- The new value.
- Returns:
- The previous value of this parameter.
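The practical difference between the two example patterns can be checked with java.util.regex. The sketch below is independent of this class, and the name LengthQualifierSketch is hypothetical. It shows that the length-qualified form rejects inputs that the unqualified form accepts.

```java
import java.util.regex.Pattern;

// Hypothetical illustration comparing the two patterns quoted above.
final class LengthQualifierSketch {
    static final Pattern QUALIFIED = Pattern.compile("\\d{3,6}");
    static final Pattern UNQUALIFIED = Pattern.compile("\\d+");

    // True when the whole input is between 3 and 6 digits long.
    static boolean matchesQualified(String input) {
        return QUALIFIED.matcher(input).matches();
    }

    // True when the whole input is one or more digits, of any length.
    static boolean matchesUnqualified(String input) {
        return UNQUALIFIED.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String input : new String[] {"12", "1234", "1234567"}) {
            System.out.println(input + " qualified=" + matchesQualified(input)
                    + " unqualified=" + matchesUnqualified(input));
        }
    }
}
```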
-
getLengthQualifier
public boolean getLengthQualifier()
Indicates whether the size of the RegExp pattern is being defined.
- Returns:
- True if lengths are being qualified.
-
setKeyConfidence
public void setKeyConfidence(double keyConfidence)
Set the Key Confidence - this is typically used where we have an external source that indicated definitively that this is a key.
- Parameters:
keyConfidence
- The new keyConfidence
-
setUniqueness
public void setUniqueness(double uniqueness)
Set the Uniqueness - this is typically used where we have an external source that has visibility into the entire data set and 'knows' the uniqueness of the set as a whole.
- Parameters:
uniqueness
- The new Uniqueness
-
setTotalCount
public void setTotalCount(long totalCount)
Set the total number of elements in the Data Stream (if known).
- Parameters:
totalCount
- The total number of elements, as opposed to the number sampled.
-
getPlugins
public Plugins getPlugins()
-
registerDefaultPlugins
public void registerDefaultPlugins(java.util.Locale locale)
Register the default set of plugins for Logical Type detection.
- Parameters:
locale
- The Locale used for analysis; this will impact both the set of plugins registered and the behavior of the individual plugins. Note: If the locale is null it will default to the Default locale.
-
trainBulk
public void trainBulk(java.util.Map<java.lang.String,java.lang.Long> observed) throws FTAPluginException, FTAUnsupportedLocaleException
TrainBulk is the core bulk entry point used to supply input to the Text Analyzer. This routine is commonly used to support training using the results aggregated from a database query.
- Parameters:
observed
- A Map containing the observed items and the corresponding count
- Throws:
FTAPluginException
- Thrown when a registered plugin has detected an issue
FTAUnsupportedLocaleException
- Thrown when a requested locale is not supported
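The observed argument is a map from each distinct value to the number of times it occurred, which is the shape a GROUP BY query would return. When the values are only available as a list, an equivalent map can be built with the standard streams API, as in the sketch below. The class BulkCountsSketch and its method countObserved are hypothetical helpers and not part of this library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper: aggregates raw values into value -> count, the
// same shape as the Map<String, Long> accepted by trainBulk.
final class BulkCountsSketch {
    static Map<String, Long> countObserved(List<String> values) {
        // Note: groupingBy rejects null keys, so nulls must be filtered out first.
        return values.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> observed = countObserved(Arrays.asList("M", "F", "M", "M"));
        // This map is what would be handed to trainBulk(observed).
        System.out.println(observed);
    }
}
```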
-
train
public boolean train(java.lang.String rawInput) throws FTAPluginException, FTAUnsupportedLocaleException
Train is the streaming entry point used to supply input to the Text Analyzer.
- Parameters:
rawInput
- The raw input as a String
- Returns:
- A boolean indicating if the resultant type is currently known.
- Throws:
FTAPluginException
- Thrown when a registered plugin has detected an issue
FTAUnsupportedLocaleException
- Thrown when a requested locale is not supported
-
distanceLevenshtein
public static int distanceLevenshtein(java.lang.String source, java.util.Set<java.lang.String> universe)
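No description is given for this method. One plausible reading of the signature, offered only as an assumption, is that it returns the smallest Levenshtein (edit) distance between source and any member of universe. The sketch below implements that reading with a hypothetical class LevenshteinSketch; it is not the library's implementation and may differ from it in edge cases such as an empty universe.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of one possible meaning of the signature above.
final class LevenshteinSketch {
    // Classic dynamic-programming edit distance using two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // Turning "" into b[0..j) takes j insertions.
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // Turning a[0..i) into "" takes i deletions.
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // Assumption: the best (smallest) distance to any candidate is wanted.
    // Returns Integer.MAX_VALUE for an empty universe.
    static int minDistance(String source, Set<String> universe) {
        int best = Integer.MAX_VALUE;
        for (String candidate : universe) {
            best = Math.min(best, distance(source, candidate));
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> universe = new HashSet<>(Arrays.asList("sitting", "mitten"));
        System.out.println(minDistance("kitten", universe)); // 1
    }
}
```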
-
getResult
public TextAnalysisResult getResult() throws FTAPluginException, FTAUnsupportedLocaleException
Determine the result of the training complete to date. Typically invoked after all training is complete, but may be invoked at any stage.
- Returns:
- A TextAnalysisResult with the analysis of any training completed.
- Throws:
FTAPluginException
- Thrown when a registered plugin has detected an issue
FTAUnsupportedLocaleException
- Thrown when a requested locale is not supported
-
getTrainingSet
public java.util.List<java.lang.String> getTrainingSet()
Access the training set - this will typically be the first AnalysisConfig.DETECT_WINDOW_DEFAULT records.
- Returns:
- A List of the raw input strings.
-
-