public class NYTCorpusDocumentParser extends Object
Class for parsing New York Times articles from NITF files.
The original version contained a possible memory leak: the BufferedReader object was not closed.
Additionally, an API to read from an InputStream was added.
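The leak described above is the kind that try-with-resources prevents. The following is a minimal sketch of that pattern, not the parser's actual code: the class `ReaderClosingSketch` and the helper `readAll` are hypothetical names introduced only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration; not part of NYTCorpusDocumentParser.
public class ReaderClosingSketch {

    // Reads every line from the reader and closes it on exit,
    // whether reading finishes normally or throws.
    static String readAll(BufferedReader source) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = source) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch's signature simple.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        BufferedReader reader = new BufferedReader(new StringReader("<nitf>\n</nitf>"));
        System.out.print(readAll(reader));
    }
}
```

Declaring the reader in the `try` header means `close()` is called automatically, so no explicit `finally` block is needed.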
Modifier and Type | Field and Description |
---|---|
static String | DATE_PUBLICATION_ATTRIBUTE: NITF constant. |
Constructor and Description |
---|
NYTCorpusDocumentParser() |
Modifier and Type | Method and Description |
---|---|
NYTCorpusDocument | fromByteArray(byte[] bytes, boolean validating) |
NYTCorpusDocument | parseNYTCorpusDocumentFromDOMDocument(File file, Document document) |
NYTCorpusDocument | parseNYTCorpusDocumentFromDOMDocument(InputStream is, Document document) |
NYTCorpusDocument | parseNYTCorpusDocumentFromFile(File file, boolean validating): Parse a New York Times document from a file. |
NYTCorpusDocument | parseNYTCorpusDocumentFromFile(InputStream is, boolean validating): Parse a New York Times document from an InputStream. |
public static final String DATE_PUBLICATION_ATTRIBUTE
public NYTCorpusDocument fromByteArray(byte[] bytes, boolean validating)
public NYTCorpusDocument parseNYTCorpusDocumentFromFile(File file, boolean validating)
Parse a New York Times document from a file.
file - The file from which to parse the document.
validating - True if the file is to be validated against the NITF DTD and
false if it is not. It is recommended that validation be
disabled, as all documents in the corpus have previously been
validated against the NITF DTD.
public NYTCorpusDocument parseNYTCorpusDocumentFromDOMDocument(File file, Document document)
public NYTCorpusDocument parseNYTCorpusDocumentFromFile(InputStream is, boolean validating)
Parse a New York Times document from an InputStream.
is - The InputStream from which to parse the document.
validating - True if the document is to be validated against the NITF DTD and
false if it is not. It is recommended that validation be
disabled, as all documents in the corpus have previously been
validated against the NITF DTD.
public NYTCorpusDocument parseNYTCorpusDocumentFromDOMDocument(InputStream is, Document document)
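To illustrate what the `validating` flag typically controls, here is a rough sketch of loading a DOM `Document` from an `InputStream` with the standard `javax.xml.parsers` API. This is an assumption about the general approach, not the parser's confirmed implementation; `NitfDomLoadingSketch` and `loadDom` are hypothetical names, and the sample XML is a stand-in rather than a real corpus article.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical illustration; not part of NYTCorpusDocumentParser.
public class NitfDomLoadingSketch {

    // Builds a DOM Document from the stream; DTD validation is
    // requested only when 'validating' is true. The stream is
    // closed when parsing ends.
    static Document loadDom(InputStream is, boolean validating) {
        try (InputStream in = is) {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not load DOM document", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in input, not a real NITF article.
        byte[] xml = "<nitf><body/></nitf>".getBytes(StandardCharsets.UTF_8);
        Document document = loadDom(new ByteArrayInputStream(xml), false);
        System.out.println(document.getDocumentElement().getTagName());
    }
}
```

Passing `false` matches the recommendation above: the corpus documents are stated to be already validated, so validating again would only add parsing overhead.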
Copyright © 2015 Johns Hopkins University HLTCOE. All rights reserved.