package preprocess
TODO
Type Members
- class JavaSentenceSegmenter extends SentenceSegmenter
A sentence segmenter backed by Java's BreakIterator. Given an input string, it returns an iterator over sentences.
- class JavaWordTokenizer extends Tokenizer
A word segmenter backed by Java's BreakIterator. Given an input string, it returns the words it contains. Spaces are not returned; punctuation is.
- case class MLSentenceSegmenter(inf: ClassificationInference) extends SentenceSegmenter with Serializable with Product with Serializable
- Annotations
- @SerialVersionUID()
- class NewLineSentenceSegmenter extends SentenceSegmenter
- case class RegexSearchTokenizer(pattern: String) extends Tokenizer with Product with Serializable
Finds all occurrences of the given pattern in the document.
- case class RegexSplitTokenizer(pattern: String) extends Tokenizer with Product with Serializable
Splits the input document according to the given pattern. The text matched by the pattern (the separators) is not returned.
- class SegmentingIterator extends Iterator[Span]
- trait SentenceSegmenter extends StringAnalysisFunction[Any, Sentence] with (String) ⇒ Iterable[String] with Serializable
- class StreamSentenceSegmenter extends AnyRef
TODO
- trait Tokenizer extends StringAnalysisFunction[Sentence, Token] with Serializable with (String) ⇒ IndexedSeq[String]
Abstract trait for tokenizers, which annotate sentence-segmented text with tokens. Tokenizers work with both raw strings and epic.slab.StringSlabs.
- Annotations
- @SerialVersionUID()
- class TreebankTokenizer extends Tokenizer with Serializable
- Annotations
- @SerialVersionUID()
- class WhitespaceTokenizer extends RegexSplitTokenizer
Tokenizes by splitting on the regular expression \s+.
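The difference between the split-style and search-style regex tokenizers listed above can be illustrated with a small standalone sketch. It uses only the Scala standard library and is not the package's implementation: the object `RegexTokenizerSketch` and the methods `splitTokens` and `searchTokens` are hypothetical names, and dropping empty strings after a split is an assumption of this sketch.

```scala
// Hypothetical illustration only; not the preprocess package's own code.
object RegexTokenizerSketch {

  // Split-style: the pattern describes the separators.
  // The separators are discarded; empty pieces are dropped (an assumption here).
  def splitTokens(pattern: String, text: String): IndexedSeq[String] =
    text.split(pattern).filter(_.nonEmpty).toIndexedSeq

  // Search-style: the pattern describes the tokens themselves.
  // Every non-overlapping match is returned, in order.
  def searchTokens(pattern: String, text: String): IndexedSeq[String] =
    pattern.r.findAllIn(text).toIndexedSeq
}
```

With the whitespace pattern `\s+`, `RegexTokenizerSketch.splitTokens("\\s+", "  Hello,   world!  ")` yields `Hello,` and `world!`, while `RegexTokenizerSketch.searchTokens("\\w+", "Hello, world!")` yields `Hello` and `world`. In the split style punctuation stays attached to neighbouring words; in the search style anything the pattern does not match is skipped.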
Value Members
- def loadContent(url: URL): String
- def preprocess(file: File): IndexedSeq[IndexedSeq[String]]
- def preprocess(text: String): IndexedSeq[IndexedSeq[String]]
- def preprocess(url: URL): IndexedSeq[IndexedSeq[String]]
- def tokenize(sentence: String): IndexedSeq[String]
- object JavaSentenceSegmenter extends JavaSentenceSegmenter
- object JavaWordTokenizer extends JavaWordTokenizer
- object MLSentenceSegmenter extends Serializable
- object RegexSentenceSegmenter extends SentenceSegmenter
A simple regex sentence segmenter.
- object SegmentSentences
- object StreamSentenceSegmenter
- object TextExtractor
Just a simple thing for me to learn Tika
- object Textify
TODO
- object TreebankTokenizer extends TreebankTokenizer
- object WhitespaceTokenizer extends Serializable
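The `preprocess` methods above return `IndexedSeq[IndexedSeq[String]]`, which reads as one sequence of tokens per sentence. The sketch below shows how such a shape can be produced with `java.text.BreakIterator`, the class the Java-backed segmenter and tokenizer are described as using. It is a minimal standalone approximation, not the package's code: `BreakIteratorSketch`, `sentences`, `words` and `preprocessSketch` are hypothetical names, and the fixed `Locale.US` is an assumption.

```scala
import java.text.BreakIterator
import java.util.Locale
import scala.collection.mutable.ArrayBuffer

// Hypothetical illustration only; not the preprocess package's own code.
object BreakIteratorSketch {

  // Collect the text between each pair of consecutive boundaries
  // reported by the given BreakIterator.
  private def pieces(it: BreakIterator, text: String): IndexedSeq[String] = {
    it.setText(text)
    val out = ArrayBuffer.empty[String]
    var start = it.first()
    var end = it.next()
    while (end != BreakIterator.DONE) {
      out += text.substring(start, end)
      start = end
      end = it.next()
    }
    out.toIndexedSeq
  }

  // Sentence segmentation: trim each piece and drop empty ones.
  def sentences(text: String): IndexedSeq[String] =
    pieces(BreakIterator.getSentenceInstance(Locale.US), text)
      .map(_.trim)
      .filter(_.nonEmpty)

  // Word tokenization: whitespace-only pieces are dropped, punctuation is kept.
  def words(sentence: String): IndexedSeq[String] =
    pieces(BreakIterator.getWordInstance(Locale.US), sentence)
      .filter(_.trim.nonEmpty)

  // One token sequence per sentence, mirroring the documented return type.
  def preprocessSketch(text: String): IndexedSeq[IndexedSeq[String]] =
    sentences(text).map(words)
}
```

Calling `BreakIteratorSketch.preprocessSketch("Hello, world! How are you?")` returns one inner sequence per detected sentence. Exact boundaries depend on the JDK's locale rules, so results may differ from what the package's own segmenters and tokenizers produce.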