A word tokenizer that tokenizes English sentences using the conventions
used by the Penn Treebank. Most punctuation is split from adjoining words.
Verb contractions and the Anglo-Saxon genitive of nouns are split into their
component morphemes, and each morpheme becomes a separate token. Examples:
- children's --> children 's
- parents' --> parents '
- won't --> wo n't
- can't --> ca n't
- weren't --> were n't
- cannot --> can not
- 'tisn't --> 't is n't
- 'tis --> 't is
- gonna --> gon na
- I'm --> I 'm
- he'll --> he 'll
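The splits above can be sketched with a few regular-expression substitutions. This is a simplified illustration, not the tokenizer's actual implementation: the function name `tokenize` and the exact patterns are assumptions, and some rules (e.g. the 'tis forms) are omitted.

```python
import re

# Lexicalized splits that are not plain contractions (assumed list).
SPECIAL_SPLITS = [
    (re.compile(r"\b(can)(not)\b", re.IGNORECASE), r"\1 \2"),
    (re.compile(r"\b(gon)(na)\b", re.IGNORECASE), r"\1 \2"),
]

def tokenize(sentence):
    # Split most punctuation from adjoining words.
    text = re.sub(r"([,;:!?\"()\[\]])", r" \1 ", sentence)
    # Split verb contractions and the genitive 's.
    text = re.sub(r"(\w)('s|'m|'ll|'re|'ve|'d)\b", r"\1 \2", text,
                  flags=re.IGNORECASE)
    # Split n't, keeping the stem (won't --> wo n't).
    text = re.sub(r"\b(\w+)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # Split the bare genitive apostrophe of plural nouns (parents' --> parents ').
    text = re.sub(r"(\w)(')(\s|$)", r"\1 \2\3", text)
    for pattern, repl in SPECIAL_SPLITS:
        text = pattern.sub(repl, text)
    # Split only a sentence-final period.
    text = re.sub(r"\.(\s*)$", r" .\1", text)
    return text.split()
```

For instance, `tokenize("children's toys weren't cheap.")` yields `['children', "'s", 'toys', 'were', "n't", 'cheap', '.']`.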
This tokenizer assumes that the text has already been segmented into
sentences. Any periods -- apart from those at the end of a string or before a
newline -- are assumed to be part of the word they are attached to (e.g. in
abbreviations) and are not separately tokenized.
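The period convention can be illustrated with a small sketch (the helper name `split_final_period` is hypothetical): only a period at the very end of the string, or one directly before a newline, is split off as its own token, while internal periods stay attached to their word.

```python
import re

def split_final_period(sentence):
    # A period followed only by optional whitespace and then a newline or
    # the end of the string is split off; abbreviation periods (e.g. "Dr.",
    # "p.m.") elsewhere in the sentence remain attached.
    return re.sub(r"\.(?=\s*\n|\s*$)", " .", sentence).split()
```

Here `"Dr."` keeps its period because it is sentence-internal, while the final period of `"p.m."` is split off as the sentence terminator: `split_final_period("Dr. Smith arrived at 5 p.m.")` yields `['Dr.', 'Smith', 'arrived', 'at', '5', 'p.m', '.']`.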