Package epic.preprocess

TODO

Linear Supertypes
AnyRef, Any

Type Members

  1. class JavaSentenceSegmenter extends SentenceSegmenter

    A Sentence Segmenter backed by Java's BreakIterator. Given an input string, it returns an iterator over the sentences.

  2. class JavaWordTokenizer extends Tokenizer

    A Word Segmenter backed by Java's BreakIterator. Given an input string, it returns the sequence of word tokens. It does not return whitespace, but it does return punctuation.

  3. class MLSentenceSegmenter extends SentenceSegmenter with Serializable

    Annotations
    @SerialVersionUID()
  4. class NewLineSentenceSegmenter extends SentenceSegmenter

  5. case class RegexSearchTokenizer(pattern: String) extends Tokenizer with Product with Serializable

    Finds all occurrences of the given pattern in the document.

  6. case class RegexSplitTokenizer(pattern: String) extends Tokenizer with Product with Serializable

    Splits the input document according to the given pattern. The text matched by the pattern (the split points) is not returned.

  7. class SegmentingIterator extends Iterator[Span]

  8. trait SentenceSegmenter extends StringAnalysisFunction[Any, Sentence] with (String) ⇒ Iterable[String] with Serializable

  9. class StreamSentenceSegmenter extends AnyRef

    TODO

  10. trait Tokenizer extends StringAnalysisFunction[Sentence, Token] with Serializable with (String) ⇒ IndexedSeq[String]

    Abstract trait for tokenizers, which annotate sentence-segmented text with tokens. Tokenizers work with both raw strings and epic.slab.StringSlabs.

    Annotations
    @SerialVersionUID()
  11. class TreebankTokenizer extends Tokenizer with Serializable

    Annotations
    @SerialVersionUID()
  12. class WhitespaceTokenizer extends RegexSplitTokenizer

    Tokenizes by splitting on the regular expression \s+.
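
The BreakIterator-backed and regex-backed classes listed above can be approximated with the JDK and the Scala standard library alone. The following is a minimal sketch, not epic's implementation: the object `PreprocessSketch` and all of its method names are hypothetical, and the locale, the trimming of sentences, and the dropping of empty split results are assumptions made for illustration.

```scala
import java.text.BreakIterator
import java.util.Locale

// Hypothetical stand-ins for the classes above; not epic's actual code.
object PreprocessSketch {

  // Collects the substrings between successive BreakIterator boundaries.
  private def pieces(it: BreakIterator, text: String): IndexedSeq[String] = {
    it.setText(text)
    val out = Vector.newBuilder[String]
    var start = it.first()
    var end = it.next()
    while (end != BreakIterator.DONE) {
      out += text.substring(start, end)
      start = end
      end = it.next()
    }
    out.result()
  }

  // In the spirit of JavaSentenceSegmenter.
  // Assumption: surrounding whitespace is trimmed from each sentence.
  def sentences(text: String): IndexedSeq[String] =
    pieces(BreakIterator.getSentenceInstance(Locale.US), text)
      .map(_.trim)
      .filter(_.nonEmpty)

  // In the spirit of JavaWordTokenizer:
  // whitespace-only pieces are dropped, punctuation is kept.
  def words(text: String): IndexedSeq[String] =
    pieces(BreakIterator.getWordInstance(Locale.US), text)
      .filter(_.exists(c => !c.isWhitespace))

  // RegexSearchTokenizer-style: every match of the pattern is a token.
  def regexSearch(pattern: String)(text: String): IndexedSeq[String] =
    pattern.r.findAllIn(text).toIndexedSeq

  // RegexSplitTokenizer-style: the text between matches is kept,
  // the matches themselves are discarded.
  // Assumption: empty pieces (e.g. from leading delimiters) are dropped.
  def regexSplit(pattern: String)(text: String): IndexedSeq[String] =
    text.split(pattern).toIndexedSeq.filter(_.nonEmpty)

  // WhitespaceTokenizer-style: split on the regular expression \s+.
  def whitespace(text: String): IndexedSeq[String] =
    regexSplit("\\s+")(text)
}
```

With these definitions, `PreprocessSketch.words("Hello, world!")` yields the four tokens `Hello`, `,`, `world`, `!`, which illustrates the "no spaces, but punctuation" behaviour described for JavaWordTokenizer.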

Value Members

  1. object JavaSentenceSegmenter extends JavaSentenceSegmenter

  2. object JavaWordTokenizer extends JavaWordTokenizer

  3. object MLSentenceSegmenter extends Serializable

  4. object RegexSentenceSegmenter extends SentenceSegmenter

    A simple regex sentence segmenter.

  5. object SegmentSentences

  6. object StreamSentenceSegmenter

  7. object TextExtractor

    Just a simple thing for me to learn Tika

  8. object Textify

    TODO

  9. object TreebankTokenizer extends TreebankTokenizer

  10. object WhitespaceTokenizer extends Serializable

  11. def loadContent(url: URL): String

  12. def preprocess(file: File): IndexedSeq[IndexedSeq[String]]

  13. def preprocess(text: String): IndexedSeq[IndexedSeq[String]]

  14. def preprocess(url: URL): IndexedSeq[IndexedSeq[String]]

  15. def tokenize(sentence: String): IndexedSeq[String]
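
The `preprocess` overloads return `IndexedSeq[IndexedSeq[String]]`, which reads naturally as one inner sequence of tokens per sentence. The sketch below shows a pipeline of that shape under stated assumptions: `PipelineSketch`, its `segment` method, and both regular expressions are hypothetical and are not taken from epic, whose actual segmenter and tokenizer choices are not documented on this page.

```scala
// Hypothetical sketch of a preprocess-style pipeline; not epic's implementation.
object PipelineSketch {

  // Assumed sentence rule: split on whitespace that follows '.', '!' or '?'.
  def segment(text: String): IndexedSeq[String] =
    text.trim.split("(?<=[.!?])\\s+").toIndexedSeq.filter(_.nonEmpty)

  // Assumed token rule: runs of word characters, or single punctuation marks.
  def tokenize(sentence: String): IndexedSeq[String] =
    "\\w+|[^\\w\\s]".r.findAllIn(sentence).toIndexedSeq

  // Segment first, then tokenize each sentence independently.
  def preprocess(text: String): IndexedSeq[IndexedSeq[String]] =
    segment(text).map(tokenize)
}
```

Under these assumed rules, `PipelineSketch.preprocess("Hi there. Bye!")` produces two inner sequences, one per sentence, with the punctuation marks as separate tokens.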