Package org.clulab.processors.clu.tokenizer

package tokenizer

Type Members

  1. class EnglishLemmatizer extends Lemmatizer
  2. class EnglishSentenceSplitter extends RuleBasedSentenceSplitter

     Splits a sequence of tokens into sentences.

  3. trait Lemmatizer extends AnyRef
  4. class OpenDomainEnglishLexer extends TokenizerLexer

     Tokenizer using the OpenDomainLexer.g grammar.

  5. class OpenDomainEnglishTokenizer extends Tokenizer

     English open-domain tokenizer.

  6. class OpenDomainLexer extends Lexer
  7. class OpenDomainPortugueseLexer extends Lexer
  8. class OpenDomainPortugueseTokenizer extends Tokenizer

     Portuguese open-domain tokenizer.

  9. class OpenDomainPortugueseTokenizerLexer extends TokenizerLexer

     Tokenizer using the OpenDomainLexer.g grammar.

  10. class OpenDomainSpanishLexer extends Lexer
  11. class OpenDomainSpanishTokenizer extends Tokenizer

     Spanish open-domain tokenizer.

  12. class OpenDomainSpanishTokenizerLexer extends TokenizerLexer

     Tokenizer using the OpenDomainLexer.g grammar.

  13. class PortugueseLemmatizer extends Lemmatizer
  14. class PortugueseSentenceSplitter extends RuleBasedSentenceSplitter

     Splits a sequence of Portuguese tokens into sentences.

  15. case class RawToken(raw: String, beginPosition: Int, endPosition: Int, word: String) extends Product with Serializable

     Stores a token as produced by a tokenizer.

     raw: the EXACT text that was tokenized
     beginPosition: beginning character offset of raw
     endPosition: end character offset of raw
     word: normalized form of raw, e.g., "'m" becomes "am". Note: these are NOT lemmas.
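
     To make the field semantics concrete, here is a minimal sketch based only on the constructor signature above (the example input and offsets are hypothetical):

     ```scala
     import org.clulab.processors.clu.tokenizer.RawToken

     // For the input text "I'm", a tokenizer could emit two tokens.
     // raw preserves the exact surface text; word holds the normalized form.
     val first  = RawToken("I", 0, 1, "I")
     val second = RawToken("'m", 1, 3, "am") // normalized to "am", but NOT lemmatized

     // raw plus the offsets reconstructs the original span exactly:
     assert("I'm".substring(second.beginPosition, second.endPosition) == second.raw)
     ```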

  16. abstract class RuleBasedSentenceSplitter extends SentenceSplitter
  17. trait SentenceSplitter extends AnyRef
  18. class SpanishLemmatizer extends Lemmatizer
  19. class SpanishSentenceSplitter extends RuleBasedSentenceSplitter

     Splits a sequence of Spanish tokens into sentences.

  20. class Tokenizer extends AnyRef

     Generic tokenizer. Author: mihais. Date: 3/15/17.
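
     A usage sketch with the English implementation; the method name and signature (`tokenize` returning an array of sentences) are assumptions based on typical tokenizer APIs, not guaranteed by this page:

     ```scala
     import org.clulab.processors.clu.tokenizer.OpenDomainEnglishTokenizer

     // Assumed API: tokenize() segments raw text into sentences of tokens.
     val tokenizer = new OpenDomainEnglishTokenizer
     val sentences = tokenizer.tokenize("I'm here. Don't leave.")

     // Each sentence would carry the raw tokens and their normalized words.
     sentences.foreach(sentence => println(sentence.words.mkString(" ")))
     ```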

  21. trait TokenizerLexer extends AnyRef

     Thin wrapper over the ANTLR lexer. Author: mihais. Date: 3/21/17.

  22. trait TokenizerStep extends AnyRef

     Implements one step of a tokenization algorithm, which takes in a sequence of tokens and produces another. For example, contractions such as "don't" are handled here, as are domain-specific operations. Note: one constraint that must be obeyed by any TokenizerStep is that RawToken.raw and the corresponding character positions must preserve the original text.
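
     A hypothetical custom step illustrating this contract; the method name `process` and its signature are assumptions for this sketch:

     ```scala
     import org.clulab.processors.clu.tokenizer.{RawToken, TokenizerStep}

     // Hypothetical step: split tokens ending in "n't", e.g. "don't" -> "do" + "n't".
     class NegationContractionStep extends TokenizerStep {
       override def process(inputs: Array[RawToken]): Array[RawToken] =
         inputs.flatMap { token =>
           if (token.raw.length > 3 && token.raw.endsWith("n't")) {
             val stemLen = token.raw.length - 3
             val stem = token.raw.substring(0, stemLen)
             Array(
               // Both tokens keep raw text and offsets aligned with the original
               // input, satisfying the constraint noted above.
               RawToken(stem, token.beginPosition, token.beginPosition + stemLen, stem),
               RawToken("n't", token.beginPosition + stemLen, token.endPosition, "not")
             )
           } else Array(token)
         }
     }
     ```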

  23. class TokenizerStepAccentedNormalization extends TokenizerStepNormalization

     Normalizes text while keeping crucial accented characters, e.g., 'á'.

  24. class TokenizerStepContractions extends TokenizerStep

     Resolves English contractions. Author: mihais. Date: 3/21/17.

  25. class TokenizerStepHyphens extends TokenizerStep

     Tokenizes some hyphenated prefixes that are better handled downstream as separate tokens. For example, "mid-July" is separated into "mid" and "July", which is better for date recognition.
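
     In terms of RawToken, the "mid-July" example might look like the sketch below (offsets assume the token starts at character 0; how the hyphen itself is represented is an assumption here):

     ```scala
     import org.clulab.processors.clu.tokenizer.RawToken

     // Input: one token covering "mid-July" (characters 0-8).
     val input = RawToken("mid-July", 0, 8, "mid-July")

     // Plausible output after the hyphen step: the prefix and the remainder
     // become separate tokens, which helps downstream date recognition.
     val output = Array(
       RawToken("mid", 0, 3, "mid"),
       RawToken("July", 4, 8, "July")
     )
     ```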

  26. class TokenizerStepNormalization extends TokenizerStep
  27. class TokenizerStepPortugueseContractions extends TokenizerStep

     Resolves Portuguese contractions. Author: dane, mihais. Date: 7/10/2018.

  28. class TokenizerStepSpanishContractions extends TokenizerStep

     Resolves Spanish contractions. Author: dane, mihais. Date: 7/23/2018.

Value Members

  1. object EnglishLemmatizer
  2. object RawToken extends Serializable
  3. object SentenceSplitter
  4. object TokenizerStepNormalization
