gate.corpora
Class NekoHtmlDocumentFormat
java.lang.Object
gate.util.AbstractFeatureBearer
gate.creole.AbstractResource
gate.creole.AbstractLanguageResource
gate.DocumentFormat
gate.corpora.TextualDocumentFormat
gate.corpora.NekoHtmlDocumentFormat
- All Implemented Interfaces:
- LanguageResource, Resource, FeatureBearer, NameBearer, Serializable
@CreoleResource(name="GATE HTML Document Format",
isPrivate=true,
autoinstances=)
public class NekoHtmlDocumentFormat
- extends TextualDocumentFormat
DocumentFormat that uses Andy Clark's NekoHTML
parser to parse HTML documents. It tries to render HTML in a similar
way to a web browser, i.e. whitespace is normalized, paragraphs are
separated by a blank line, etc. By default the text content of style
and script tags is ignored completely, though the set of tags treated
in this way is configurable via a CREOLE parameter.
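As a rough illustration of the browser-like rendering described above, the following self-contained sketch collapses runs of whitespace and separates paragraphs with a blank line. It is not GATE's or NekoHTML's implementation; the class name WhitespaceSketch and its methods are hypothetical and use only the Java standard library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of GATE: mimics the described text rendering.
final class WhitespaceSketch {

    /** Collapses each run of whitespace into one space and trims both ends. */
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    /** Normalizes each paragraph, drops empty ones, and joins the rest with a blank line. */
    static String render(List<String> paragraphs) {
        List<String> cleaned = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String normalized = normalize(paragraph);
            if (!normalized.isEmpty()) {
                cleaned.add(normalized);
            }
        }
        return String.join("\n\n", cleaned);
    }

    public static void main(String[] args) {
        System.out.println(render(Arrays.asList("  Hello \n  world ", "Second\tparagraph")));
    }
}
```

The real format works on the parsed HTML element stream rather than on pre-split paragraph strings; the sketch only shows the whitespace rules.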
- See Also:
- Serialized Form
Methods inherited from class gate.DocumentFormat
addStatusListener, areEqual, decideBetweenThreeMimeTypes, decideBetweenTwoMimeTypes, fireStatusChanged, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getMarkupElementsMap, getMimeType, getMimeTypeForString, getShouldCollectRepositioning, getSupportedFileSuffixes, guessTypeUsingMagicNumbers, removeStatusListener, runMagicNumbers, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType, setShouldCollectRepositioning, unpackMarkup
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getInitParameterValues, getInitParameterValues, getName, getParameterValue, getParameterValue, getParameterValues, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
NekoHtmlDocumentFormat
public NekoHtmlDocumentFormat()
- Default constructor.
setIgnorableTags
@CreoleParameter(comment="HTML tags whose text content should be ignored",
defaultValue="script;style")
public void setIgnorableTags(Set<String> newTags)
getIgnorableTags
public Set<String> getIgnorableTags()
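The default value "script;style" above is a semicolon-separated list that ends up as a Set<String>. A minimal sketch of that conversion follows, using only the standard library. The class IgnorableTagsSketch and its methods are hypothetical; the trimming and the case-insensitive matching are assumptions made for illustration, not confirmed GATE behaviour.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of GATE.
final class IgnorableTagsSketch {

    /** Splits a semicolon-separated value such as "script;style" into a set of tag names. */
    static Set<String> parseTags(String value) {
        Set<String> tags = new LinkedHashSet<>();
        for (String part : value.split(";")) {
            // Assumption: surrounding spaces are ignored and names are lower-cased.
            String tag = part.trim().toLowerCase(Locale.ROOT);
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    /** Returns true if text inside the given element should be skipped. */
    static boolean isIgnorable(Set<String> tags, String elementName) {
        return tags.contains(elementName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> tags = parseTags("script;style");
        System.out.println(tags + " ignores <SCRIPT>: " + isIgnorable(tags, "SCRIPT"));
    }
}
```

With a real GATE installation the set would normally be supplied through the ignorableTags CREOLE parameter rather than parsed by hand.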
supportsRepositioning
public Boolean supportsRepositioning()
- We support repositioning info for HTML files.
- Overrides:
supportsRepositioning
in class DocumentFormat
unpackMarkup
public void unpackMarkup(Document doc)
throws DocumentFormatException
- Old-style unpackMarkup, without repositioning info.
- Overrides:
unpackMarkup
in class TextualDocumentFormat
- Throws:
DocumentFormatException
unpackMarkup
public void unpackMarkup(Document doc,
RepositioningInfo repInfo,
RepositioningInfo ampCodingInfo)
throws DocumentFormatException
- Unpack the markup in the document. This converts markup from the
native format into annotations in GATE format. If the document was
created from a String, it is advisable to set the doc's
sourceUrl to null. If the document has a valid URL,
the parser will try to parse the XML document the URL points to.
If the URL is not valid, or is null, the doc's content
will be parsed instead. If the doc's content is not valid XML, the
parser might crash.
- Overrides:
unpackMarkup
in class TextualDocumentFormat
- Parameters:
doc
- The GATE document you want to parse. If
doc.getSourceUrl()
returns null
then the content of doc will be parsed. Using a URL is
recommended because the parser will report errors correctly
if the document is not well formed.
- Throws:
DocumentFormatException
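The source selection described for unpackMarkup (read from the URL when it is usable, otherwise fall back to the document's content) can be sketched with the standard library alone. This is an illustration of the described fallback, not GATE's actual code: the class SourceFallbackSketch and the method textToParse are hypothetical, "not valid" is interpreted here as "cannot be opened", and UTF-8 decoding is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of GATE.
final class SourceFallbackSketch {

    /** Returns the URL's text if it can be read, otherwise the supplied document content. */
    static String textToParse(URL sourceUrl, String docContent) {
        if (sourceUrl == null) {
            return docContent; // no URL: parse the in-memory content
        }
        try (InputStream in = sourceUrl.openStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            // Assumption: the remote document is UTF-8 encoded.
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return docContent; // URL could not be read: fall back to the content
        }
    }

    public static void main(String[] args) {
        System.out.println(textToParse(null, "<p>created from a String</p>"));
    }
}
```

This also shows why a document created from a String is best given a null sourceUrl: with no URL to consult, the in-memory content is the only source considered.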
init
public Resource init()
throws ResourceInstantiationException
- Initialise this resource, and return it.
- Specified by:
init
in interface Resource
- Overrides:
init
in class TextualDocumentFormat
- Throws:
ResourceInstantiationException