A word tokenizer that tokenizes English sentences using the conventions
used by the Penn Treebank. Most punctuation is split from adjoining words.
Verb contractions and the Anglo-Saxon genitive of nouns are split into their
component morphemes, and each morpheme becomes a separate token. Examples:
- children's --> children 's
- parents' --> parents '
- won't --> wo n't
- can't --> ca n't
- weren't --> were n't
- cannot --> can not
- 'tisn't --> 't is n't
- 'tis --> 't is
- gonna --> gon na
- I'm --> I 'm
- he'll --> he 'll
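The splits above can be sketched with a few regular-expression substitutions. This is a simplified illustration, not the tokenizer's actual implementation: the function name `tokenize` and the exact patterns are assumptions, and some rules (e.g. the 'tis forms) are omitted.

```python
import re

# Lexicalized splits that are not plain contractions (assumed list).
SPECIAL_SPLITS = [
    (re.compile(r"\b(can)(not)\b", re.IGNORECASE), r"\1 \2"),
    (re.compile(r"\b(gon)(na)\b", re.IGNORECASE), r"\1 \2"),
]

def tokenize(sentence):
    # Split most punctuation from adjoining words.
    text = re.sub(r"([,;:!?\"()\[\]])", r" \1 ", sentence)
    # Split verb contractions and the genitive 's.
    text = re.sub(r"(\w)('s|'m|'ll|'re|'ve|'d)\b", r"\1 \2", text,
                  flags=re.IGNORECASE)
    # Split n't, keeping the stem (won't --> wo n't).
    text = re.sub(r"\b(\w+)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # Split the bare genitive apostrophe of plural nouns (parents' --> parents ').
    text = re.sub(r"(\w)(')(\s|$)", r"\1 \2\3", text)
    for pattern, repl in SPECIAL_SPLITS:
        text = pattern.sub(repl, text)
    # Split only a sentence-final period.
    text = re.sub(r"\.(\s*)$", r" .\1", text)
    return text.split()
```

For instance, `tokenize("children's toys weren't cheap.")` yields `['children', "'s", 'toys', 'were', "n't", 'cheap', '.']`.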
This tokenizer assumes that the text has already been segmented into
sentences. Any periods -- apart from those at the end of a string or before a
newline -- are assumed to be part of the word they are attached to (e.g. in
abbreviations) and are not separately tokenized.
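The period convention can be illustrated with a small sketch (the helper name `split_final_period` is hypothetical): only a period at the very end of the string, or one directly before a newline, is split off as its own token, while internal periods stay attached to their word.

```python
import re

def split_final_period(sentence):
    # A period followed only by optional whitespace and then a newline or
    # the end of the string is split off; abbreviation periods (e.g. "Dr.",
    # "p.m.") elsewhere in the sentence remain attached.
    return re.sub(r"\.(?=\s*\n|\s*$)", " .", sentence).split()
```

Here `"Dr."` keeps its period because it is sentence-internal, while the final period of `"p.m."` is split off as the sentence terminator: `split_final_period("Dr. Smith arrived at 5 p.m.")` yields `['Dr.', 'Smith', 'arrived', 'at', '5', 'p.m', '.']`.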