Class TextConfig

Object
com.cobber.fta.text.TextConfig

public final class TextConfig extends Object
Captures a set of key metrics for a given language.
  • Constructor Details

    • TextConfig

      public TextConfig(int longWord, double averageLow, double averageHigh, int alphaSpacePercentage, int simplePercentage, int maxLength, String sentenceBreak, String wordBreak, String punctuation, String[] starts)
  • Method Details

    • getLongWord

      public int getLongWord()
      The maximum length we expect any likely word to be in the target language.
      Returns:
      The maximum length we expect any likely word to be in the target language.
    • getAverageLow

      public double getAverageLow()
      A reasonable lower bound for the average word length in the language.
      Returns:
      A reasonable lower bound for the average word length in the language.
    • getAverageHigh

      public double getAverageHigh()
      A reasonable upper bound for the average word length in the language.
      Returns:
      A reasonable upper bound for the average word length in the language.
    • getAlphaSpacePercentage

      public int getAlphaSpacePercentage()
      An estimate of the percentage of 'alpha' or space (Character.isWhitespace()) characters that we expect to be present.
      Returns:
      An estimate of the percentage of 'alpha' or space (Character.isWhitespace()) characters that we expect to be present.
    • getSimplePercentage

      public int getSimplePercentage()
      An estimate of the percentage of 'reasonable' characters that we expect to be present. Note: the 'reasonable' characters are the sum of alphas, digits (in digit-only words), word breaks, spaces, and punctuation.
      Returns:
      An estimate of the percentage of 'reasonable' characters that we expect to be present.
    • getMaxLength

      public int getMaxLength()
      The maximum number of characters to analyze in the input.
      Returns:
      The maximum number of characters to analyze in the input.
    • getSentenceBreak

      public String getSentenceBreak()
      The Sentence Break characters.
      Returns:
      The set of characters used to break paragraphs into sentences.
    • getWordBreak

      public String getWordBreak()
      The Word Break characters.
      Returns:
      The set of characters used to break sentences into words.
    • getPunctuation

      public String getPunctuation()
      The Punctuation characters.
      Returns:
      The set of characters recognized as punctuation.
    • getStarts

      public Set<String> getStarts()
      The 'likely' set of two-character initial stems. For example, in English 'fo' is reasonable (for, form, foot, ...) whereas 'xz' is not, as no words start with 'xz'.
      Returns:
      The 'likely' set of two-character initial stems.
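
Example (illustrative only): the sketch below shows one way the metrics described above could be applied to a piece of input text. It is not the library's implementation. The class TextMetricsSketch, its method names, and the threshold values are all hypothetical, and the code does not depend on com.cobber.fta. For simplicity it splits words on whitespace rather than on a configurable set of word-break characters, and it counts UTF-16 code units rather than code points.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of com.cobber.fta: shows how values such as
// those returned by getAlphaSpacePercentage(), getAverageLow()/getAverageHigh()
// and getStarts() might be compared against measurements of the input.
final class TextMetricsSketch {

    // Percentage (0-100) of characters that are letters or whitespace.
    static int alphaSpacePercentage(String input) {
        if (input.isEmpty()) {
            return 0;
        }
        long matching = input.chars()
                .filter(ch -> Character.isLetter(ch) || Character.isWhitespace(ch))
                .count();
        return (int) (matching * 100 / input.length());
    }

    // Average length of the words obtained by splitting on whitespace.
    static double averageWordLength(String input) {
        return Arrays.stream(input.trim().split("\\s+"))
                .filter(word -> !word.isEmpty())
                .mapToInt(String::length)
                .average()
                .orElse(0.0);
    }

    // True if the word's first two characters (lower-cased) are a likely stem.
    static boolean hasLikelyStart(String word, Set<String> starts) {
        return word.length() >= 2
                && starts.contains(word.substring(0, 2).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String text = "The forest was quiet";

        // Assumed values for illustration; real ones would come from a TextConfig.
        int alphaSpaceThreshold = 80;
        double averageLow = 3.0;
        double averageHigh = 9.0;
        Set<String> starts = Set.of("th", "fo", "wa", "qu");

        double average = averageWordLength(text);
        System.out.println("alpha/space ok: "
                + (alphaSpacePercentage(text) >= alphaSpaceThreshold));
        System.out.println("average in range: "
                + (average >= averageLow && average <= averageHigh));
        System.out.println("likely start: " + hasLikelyStart("forest", starts));
    }
}
```

In real use, the thresholds and the set of stems would be read from a TextConfig instance through the getters documented above instead of being hard-coded.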