object WaterlooSegmentScript extends App with Logging
Script that segments the Waterloo corpus at the sentence level.
Splits the input into docs based on <DOC> ... </DOC> tags, decides whether each doc is in
English by counting the fraction of its tokens that are stop words, and discards the doc if it
is not. Segments each remaining doc into sentences using the nlp stack, wraps each sentence in
<SENT> ... </SENT> tags, and then writes the entire doc back to file.
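The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation: the object name SegmentSketch, the stop-word list, and the 0.1 threshold are all hypothetical, and a naive punctuation-based splitter stands in for the nlp stack segmenter. File I/O is left out; the sketch works on strings.

```scala
object SegmentSketch {
  // Hypothetical stop-word list and threshold; the real values are not given in the source.
  val stopWords: Set[String] =
    Set("the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "was", "for", "on", "with")
  val minStopWordFraction: Double = 0.1

  // (?s) lets '.' match newlines so a doc may span several lines.
  private val docPattern = "(?s)<DOC>(.*?)</DOC>".r

  /** Extract the body of every <DOC> ... </DOC> block. */
  def splitDocs(corpus: String): List[String] =
    docPattern.findAllMatchIn(corpus).map(_.group(1).trim).toList

  /** Fraction of word tokens that appear in the stop-word list. */
  def stopWordFraction(doc: String): Double = {
    val tokens = doc.toLowerCase.split("\\W+").filter(_.nonEmpty)
    if (tokens.isEmpty) 0.0
    else tokens.count(t => stopWords.contains(t)).toDouble / tokens.length
  }

  /** Treat a doc as English when enough of its tokens are stop words. */
  def looksEnglish(doc: String): Boolean =
    stopWordFraction(doc) >= minStopWordFraction

  /** Naive stand-in for the nlp stack segmenter: split after '.', '!' or '?' plus whitespace. */
  def segment(doc: String): List[String] =
    doc.split("(?<=[.!?])\\s+").map(_.trim).filter(_.nonEmpty).toList

  /** Drop non-English docs and wrap each sentence of the rest in <SENT> tags. */
  def process(corpus: String): String =
    splitDocs(corpus)
      .filter(looksEnglish)
      .map { doc =>
        segment(doc).map(s => s"<SENT>$s</SENT>").mkString("<DOC>\n", "\n", "\n</DOC>")
      }
      .mkString("\n")
}
```

Calling SegmentSketch.process on a string holding one English doc and one French doc keeps only the English doc, with each of its sentences on its own line inside <SENT> ... </SENT> tags. A real run would read the corpus from disk and write the result back, and would rely on the nlp stack for segmentation, since a punctuation-based split mishandles abbreviations and quoted text.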