gate.corpora
Class NekoHtmlDocumentFormat

java.lang.Object
  extended by gate.util.AbstractFeatureBearer
      extended by gate.creole.AbstractResource
          extended by gate.creole.AbstractLanguageResource
              extended by gate.DocumentFormat
                  extended by gate.corpora.TextualDocumentFormat
                      extended by gate.corpora.NekoHtmlDocumentFormat
All Implemented Interfaces:
LanguageResource, Resource, FeatureBearer, NameBearer, Serializable

@CreoleResource(name="GATE HTML Document Format",
                isPrivate=true,
                autoinstances=)
public class NekoHtmlDocumentFormat
extends TextualDocumentFormat

DocumentFormat that uses Andy Clark's NekoHTML parser to parse HTML documents. It tries to render HTML in a similar way to a web browser, i.e. whitespace is normalized, paragraphs are separated by a blank line, etc. By default the text content of style and script tags is ignored completely, though the set of tags treated in this way is configurable via a CREOLE parameter.
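
The rendering behaviour described above can be illustrated with a small standalone sketch. It does not use NekoHTML or any GATE class and is not how this class is implemented. The class name RenderSketch, the method render, and the regular expressions are assumptions made purely for illustration. It only shows the intended effect on a simple input: text inside ignorable tags is dropped, runs of whitespace collapse to a single space, and paragraphs end up separated by a blank line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration only: a toy approximation of browser-like text
// rendering. Regular expressions are not a real HTML parser.
class RenderSketch {

    static String render(String html, Set<String> ignorableTags) {
        String s = html;
        // Drop each ignorable element together with its text content.
        for (String tag : ignorableTags) {
            String q = Pattern.quote(tag);
            s = s.replaceAll("(?is)<" + q + "\\b[^>]*>.*?</" + q + "\\s*>", "");
        }
        // Treat <p> and </p> as paragraph boundaries.
        List<String> paragraphs = new ArrayList<>();
        for (String piece : s.split("(?i)</?p\\b[^>]*>")) {
            // Strip any remaining tags, then normalize whitespace.
            String text = piece.replaceAll("<[^>]+>", "")
                               .replaceAll("\\s+", " ")
                               .trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        // Separate paragraphs with a blank line.
        return String.join("\n\n", paragraphs);
    }

    public static void main(String[] args) {
        String html = "<p>Hello   world</p>"
                    + "<script>var x = 1;</script>"
                    + "<p>Second\n paragraph</p>";
        System.out.println(render(html, Set.of("script", "style")));
    }
}
```

For the sample input, the script content disappears and the two paragraphs are printed with one blank line between them.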

See Also:
Serialized Form

Field Summary
 
Fields inherited from class gate.DocumentFormat
element2StringMap, magic2mimeTypeMap, markupElementsMap, mimeString2ClassHandlerMap, mimeString2mimeTypeMap, suffixes2mimeTypeMap
 
Fields inherited from class gate.creole.AbstractLanguageResource
dataStore, lrPersistentId
 
Fields inherited from class gate.creole.AbstractResource
name
 
Constructor Summary
NekoHtmlDocumentFormat()
          Default constructor.
 
Method Summary
 Set<String> getIgnorableTags()
           
 Resource init()
          Initialise this resource, and return it.
 void setIgnorableTags(Set<String> newTags)
           
 Boolean supportsRepositioning()
          We support repositioning info for HTML files.
 void unpackMarkup(Document doc)
          Old-style unpackMarkup, without repositioning info.
 void unpackMarkup(Document doc, RepositioningInfo repInfo, RepositioningInfo ampCodingInfo)
          Unpack the markup in the document.
 
Methods inherited from class gate.corpora.TextualDocumentFormat
annotateParagraphs, getDataStore, hasContentButNoValidUrl, setNewLineProperty
 
Methods inherited from class gate.DocumentFormat
addStatusListener, areEqual, decideBetweenThreeMimeTypes, decideBetweenTwoMimeTypes, fireStatusChanged, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getMarkupElementsMap, getMimeType, getMimeTypeForString, getShouldCollectRepositioning, getSupportedFileSuffixes, guessTypeUsingMagicNumbers, removeStatusListener, runMagicNumbers, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType, setShouldCollectRepositioning, unpackMarkup
 
Methods inherited from class gate.creole.AbstractLanguageResource
cleanup, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getInitParameterValues, getInitParameterValues, getName, getParameterValue, getParameterValue, getParameterValues, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.LanguageResource
getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from interface gate.Resource
cleanup, getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 

Constructor Detail

NekoHtmlDocumentFormat

public NekoHtmlDocumentFormat()
Default constructor.

Method Detail

setIgnorableTags

@CreoleParameter(comment="HTML tags whose text content should be ignored",
                 defaultValue="script;style")
public void setIgnorableTags(Set<String> newTags)
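
The default value "script;style" corresponds to a set containing the two tag names script and style. The sketch below shows that correspondence, assuming the semicolon acts as the item separator. The CREOLE framework performs this conversion itself when it instantiates the resource, so the helper parseTagList and the class IgnorableTagsSketch are hypothetical and not part of GATE. A set built this way has the shape that setIgnorableTags expects.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of GATE: shows how a semicolon-separated
// default such as "script;style" maps to a Set<String> of tag names.
class IgnorableTagsSketch {

    static Set<String> parseTagList(String value) {
        Set<String> tags = new LinkedHashSet<>();
        for (String part : value.split(";")) {
            String tag = part.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        Set<String> tags = parseTagList("script;style");
        // Extend the default, e.g. to also ignore text inside <noscript>.
        tags.add("noscript");
        System.out.println(tags); // prints [script, style, noscript]
    }
}
```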

getIgnorableTags

public Set<String> getIgnorableTags()

supportsRepositioning

public Boolean supportsRepositioning()
We support repositioning info for HTML files.

Overrides:
supportsRepositioning in class DocumentFormat

unpackMarkup

public void unpackMarkup(Document doc)
                  throws DocumentFormatException
Old-style unpackMarkup, without repositioning info.

Overrides:
unpackMarkup in class TextualDocumentFormat
Throws:
DocumentFormatException

unpackMarkup

public void unpackMarkup(Document doc,
                         RepositioningInfo repInfo,
                         RepositioningInfo ampCodingInfo)
                  throws DocumentFormatException
Unpack the markup in the document. This converts markup from the native format into annotations in GATE format. If the document was created from a String, it is advisable to set the document's sourceUrl to null. If the document has a valid URL, the parser will try to parse the XML document that the URL points to. If the URL is invalid or null, the document's content is parsed instead. If that content is not valid XML, the parser might crash.

Overrides:
unpackMarkup in class TextualDocumentFormat
Parameters:
doc - The GATE document you want to parse. If doc.getSourceUrl() returns null, the content of doc is parsed. Using a URL is recommended because the parser reports errors correctly if the document is not well formed.
Throws:
DocumentFormatException
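
The choice between parsing from the source URL and parsing the document content, as described for this method, can be sketched as a small decision function. This is a simplified illustration and not the class's actual code. The names SourceSelectionSketch, Source and chooseSource are hypothetical, and "valid" is interpreted here only as "syntactically usable as an absolute URL", which may differ from the check the real implementation performs.

```java
import java.net.MalformedURLException;
import java.net.URI;

// Hypothetical illustration of the rule: use the URL when it is present and
// valid, otherwise fall back to the document's own content.
class SourceSelectionSketch {

    enum Source { FROM_URL, FROM_CONTENT }

    static Source chooseSource(String sourceUrl) {
        if (sourceUrl == null) {
            return Source.FROM_CONTENT; // e.g. document created from a String
        }
        try {
            // Assumption: validity means the string converts to a URL.
            URI.create(sourceUrl).toURL();
            return Source.FROM_URL;
        } catch (IllegalArgumentException | MalformedURLException e) {
            return Source.FROM_CONTENT;
        }
    }

    public static void main(String[] args) {
        System.out.println(chooseSource("http://example.org/page.html"));
        System.out.println(chooseSource(null));
    }
}
```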

init

public Resource init()
              throws ResourceInstantiationException
Initialise this resource, and return it.

Specified by:
init in interface Resource
Overrides:
init in class TextualDocumentFormat
Throws:
ResourceInstantiationException