Package com.worksap.nlp.sudachi
Interface Tokenizer
-
public interface Tokenizer
A tokenizer for morphological analysis.
-
-
Nested Class Summary
static class Tokenizer.SplitMode
A mode of splitting.
-
Method Summary
String dumpInternalStructures(String text)
Tokenize a text and dump the internal structures into a JSON string.

void setDumpOutput(PrintStream output)
Prints the lattice structure of the analysis.

List<Morpheme> tokenize(Tokenizer.SplitMode mode, String text)
Tokenize a text.

default List<Morpheme> tokenize(String text)
Tokenize a text.

Iterable<List<Morpheme>> tokenizeSentences(Tokenizer.SplitMode mode, Reader input)
Tokenize sentences.

Iterable<List<Morpheme>> tokenizeSentences(Tokenizer.SplitMode mode, String text)
Tokenize sentences.

default Iterable<List<Morpheme>> tokenizeSentences(Reader input)
Tokenize sentences.

default Iterable<List<Morpheme>> tokenizeSentences(String text)
Tokenize sentences.
-
-
-
Method Detail
-
tokenize
List<Morpheme> tokenize(Tokenizer.SplitMode mode, String text)
Tokenize a text. This method tokenizes the input text as a single sentence. When the text is long, it requires a lot of memory.
Parameters:
mode - a mode of splitting
text - input text
Returns:
a result of tokenizing
-
tokenize
default List<Morpheme> tokenize(String text)
Tokenize a text. This method tokenizes the text with Tokenizer.SplitMode.C.
Parameters:
text - input text
Returns:
a result of tokenizing
See Also:
tokenize(SplitMode, String)
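
The relationship between the two tokenize overloads can be illustrated with a minimal, self-contained sketch. It does not use the library itself: MiniTokenizer, WhitespaceTokenizer, and the String tokens are hypothetical stand-ins for Tokenizer, a real implementation, and Morpheme. The enum values A and B are assumed for illustration; only C is named on this page.

```java
import java.util.Arrays;
import java.util.List;

public class DefaultModeSketch {
    // Stand-in for Tokenizer.SplitMode. Only C is named on this page;
    // A and B are assumptions added for illustration.
    enum SplitMode { A, B, C }

    // Hypothetical stand-in for the Tokenizer interface, using String
    // tokens instead of Morpheme objects.
    interface MiniTokenizer {
        List<String> tokenize(SplitMode mode, String text);

        // Mirrors the documented default: delegate with SplitMode.C.
        default List<String> tokenize(String text) {
            return tokenize(SplitMode.C, text);
        }
    }

    // Toy implementation: remembers the mode it received and splits on whitespace.
    static class WhitespaceTokenizer implements MiniTokenizer {
        SplitMode lastMode;

        @Override
        public List<String> tokenize(SplitMode mode, String text) {
            lastMode = mode;
            return Arrays.asList(text.trim().split("\\s+"));
        }
    }

    public static void main(String[] args) {
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        System.out.println(tokenizer.tokenize("tokenize this text"));
        System.out.println(tokenizer.lastMode); // the mode chosen by the default overload
    }
}
```

Callers that need a different mode would use the two-argument overload directly.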
-
tokenizeSentences
Iterable<List<Morpheme>> tokenizeSentences(Tokenizer.SplitMode mode, String text)
Tokenize sentences. This method divides the input text into sentences and tokenizes them.
Parameters:
mode - a mode of splitting
text - input text
Returns:
a result of tokenizing
-
tokenizeSentences
default Iterable<List<Morpheme>> tokenizeSentences(String text)
Tokenize sentences. This method divides the input text into sentences and tokenizes them with Tokenizer.SplitMode.C.
Parameters:
text - input text
Returns:
a result of tokenizing
See Also:
tokenizeSentences(SplitMode, String)
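
Because tokenizeSentences returns an Iterable of per-sentence lists, callers typically loop over it once per sentence. The helpers below are a hedged sketch of two common ways to consume such a result; SentenceResults, flatten, and sizes are hypothetical names, and the sample data uses String tokens in place of Morpheme so that the example runs without the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SentenceResults {
    // Flattens per-sentence token lists into one list, preserving order.
    static <T> List<T> flatten(Iterable<List<T>> sentences) {
        List<T> all = new ArrayList<>();
        for (List<T> sentence : sentences) {
            all.addAll(sentence);
        }
        return all;
    }

    // Returns the number of tokens in each sentence, in order.
    static <T> List<Integer> sizes(Iterable<List<T>> sentences) {
        List<Integer> counts = new ArrayList<>();
        for (List<T> sentence : sentences) {
            counts.add(sentence.size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample data standing in for a tokenizeSentences result.
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("first", "sentence"),
                Arrays.asList("second"));
        System.out.println(flatten(sentences));
        System.out.println(sizes(sentences));
    }
}
```

Since the helpers are generic, the same calls should accept an Iterable<List<Morpheme>> as returned by the methods documented here.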
-
tokenizeSentences
Iterable<List<Morpheme>> tokenizeSentences(Tokenizer.SplitMode mode, Reader input) throws IOException
Tokenize sentences. This method reads the input text from input, divides it into sentences, and tokenizes them.
Parameters:
mode - a mode of splitting
input - a reader of input text
Returns:
a result of tokenizing
Throws:
IOException - if reading from the stream fails
-
tokenizeSentences
default Iterable<List<Morpheme>> tokenizeSentences(Reader input) throws IOException
Tokenize sentences. This method reads the input text from input, divides it into sentences, and tokenizes them with Tokenizer.SplitMode.C.
Parameters:
input - a reader of input text
Returns:
a result of tokenizing
Throws:
IOException - if reading from the stream fails
See Also:
tokenizeSentences(SplitMode, Reader)
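
The Reader overloads declare IOException, so callers need to handle it and should close the reader when done. The sketch below shows that calling pattern with try-with-resources. The static tokenizeSentences method here is a toy stand-in, not the library's implementation: it ends a sentence at each '.' and splits on whitespace, whereas the real sentence-boundary rules are not described on this page.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReaderSentencesSketch {
    // Toy stand-in for tokenizeSentences(Reader): reads every character,
    // ends a sentence at '.', and splits each sentence on whitespace.
    static Iterable<List<String>> tokenizeSentences(Reader input) throws IOException {
        List<List<String>> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            current.append((char) c);
            if (c == '.') {
                addSentence(sentences, current);
            }
        }
        addSentence(sentences, current); // trailing text without a final '.'
        return sentences;
    }

    // Moves the buffered text into the result list if it is not blank.
    private static void addSentence(List<List<String>> out, StringBuilder buffer) {
        String sentence = buffer.toString().trim();
        buffer.setLength(0);
        if (!sentence.isEmpty()) {
            out.add(Arrays.asList(sentence.split("\\s+")));
        }
    }

    public static void main(String[] args) {
        // try-with-resources closes the reader even if reading fails.
        try (Reader reader = new StringReader("First one. Second one.")) {
            for (List<String> sentence : tokenizeSentences(reader)) {
                System.out.println(sentence);
            }
        } catch (IOException e) {
            System.err.println("reading failed: " + e.getMessage());
        }
    }
}
```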
-
setDumpOutput
void setDumpOutput(PrintStream output)
Prints the lattice structure of the analysis to the given output.
Parameters:
output - the output to print to
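
Since setDumpOutput takes a PrintStream, one way to inspect the dump programmatically is to pass a stream backed by an in-memory buffer. The sketch below shows only the standard-library part; DumpCapture and its helper names are hypothetical, and the call on a real tokenizer appears as a comment because the dump format is not described on this page.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class DumpCapture {
    // Creates an auto-flushing UTF-8 PrintStream that writes into the buffer.
    static PrintStream utf8PrintStream(ByteArrayOutputStream buffer) {
        try {
            return new PrintStream(buffer, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Decodes everything written to the buffer so far.
    static String captured(ByteArrayOutputStream buffer) {
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = utf8PrintStream(buffer);

        // With a real tokenizer, the stream would be handed over like this:
        //   tokenizer.setDumpOutput(out);
        //   tokenizer.tokenize(text);
        out.println("placeholder for the lattice dump");
        out.flush();

        System.out.println(captured(buffer).trim());
    }
}
```

Passing System.err instead would send the dump directly to the console.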
-
-