gate.corpora
Class TextualDocumentFormat

java.lang.Object
  extended by gate.util.AbstractFeatureBearer
      extended by gate.creole.AbstractResource
          extended by gate.creole.AbstractLanguageResource
              extended by gate.DocumentFormat
                  extended by gate.corpora.TextualDocumentFormat
All Implemented Interfaces:
LanguageResource, Resource, FeatureBearer, NameBearer, Serializable
Direct Known Subclasses:
EmailDocumentFormat, NekoHtmlDocumentFormat, SgmlDocumentFormat, XmlDocumentFormat

@CreoleResource(name="GATE Textual Document Format",
                isPrivate=true,
                autoinstances=)
public class TextualDocumentFormat
extends DocumentFormat

The format of Documents. Subclasses of DocumentFormat know about particular MIME types and how to unpack the information in any markup or formatting they contain into GATE annotations. Each MIME type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat, RtfDocumentFormat, MpegDocumentFormat. These classes register themselves with a static index residing here when they are constructed. Static getDocumentFormat methods can then be used to get the appropriate format class for a particular document.
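
The self-registration pattern described above can be illustrated with a minimal, GATE-independent sketch. The class FormatSketch and its lookup method are hypothetical names invented for this illustration; they are not part of the GATE API, and the real index in DocumentFormat may be organised differently.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a format class; not part of GATE. */
class FormatSketch {
    // Static index keyed by MIME type string, e.g. "text/xml".
    private static final Map<String, FormatSketch> INDEX = new HashMap<>();

    private final String mimeType;

    FormatSketch(String mimeType) {
        this.mimeType = mimeType.toLowerCase(Locale.ROOT);
        // Self-registration on construction, as the description outlines.
        INDEX.put(this.mimeType, this);
    }

    String getMimeType() {
        return mimeType;
    }

    /** Returns the registered format for a MIME type, or null if none. */
    static FormatSketch lookup(String mimeType) {
        return INDEX.get(mimeType.toLowerCase(Locale.ROOT));
    }
}
```

In this sketch, constructing new FormatSketch("text/xml") makes the instance retrievable through FormatSketch.lookup("text/xml"), which mirrors the role the static getDocumentFormat methods play for real format classes.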

See Also:
Serialized Form

Field Summary
 
Fields inherited from class gate.DocumentFormat
element2StringMap, magic2mimeTypeMap, markupElementsMap, mimeString2ClassHandlerMap, mimeString2mimeTypeMap, suffixes2mimeTypeMap
 
Fields inherited from class gate.creole.AbstractLanguageResource
dataStore, lrPersistentId
 
Fields inherited from class gate.creole.AbstractResource
name
 
Constructor Summary
TextualDocumentFormat()
          Default construction
 
Method Summary
 void annotateParagraphs(Document aDoc, int startOffset, int endOffset, String annotSetName)
          This method annotates paragraphs in a GATE document.
 DataStore getDataStore()
          Get the data store that this LR lives in.
protected static boolean hasContentButNoValidUrl(Document doc)
          Tests whether the GATE document has a valid URL or valid content.
 Resource init()
          Initialise this resource, and return it.
protected  void setNewLineProperty(Document doc)
          Check the new line sequence and set document property.
 void unpackMarkup(Document doc)
          Unpack the markup in the document.
 void unpackMarkup(Document doc, RepositioningInfo repInfo, RepositioningInfo ampCodingInfo)
           
 
Methods inherited from class gate.DocumentFormat
addStatusListener, areEqual, decideBetweenThreeMimeTypes, decideBetweenTwoMimeTypes, fireStatusChanged, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getMarkupElementsMap, getMimeType, getMimeTypeForString, getShouldCollectRepositioning, getSupportedFileSuffixes, guessTypeUsingMagicNumbers, removeStatusListener, runMagicNumbers, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType, setShouldCollectRepositioning, supportsRepositioning, unpackMarkup
 
Methods inherited from class gate.creole.AbstractLanguageResource
cleanup, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getInitParameterValues, getInitParameterValues, getName, getParameterValue, getParameterValue, getParameterValues, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.LanguageResource
getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from interface gate.Resource
cleanup, getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 

Constructor Detail

TextualDocumentFormat

public TextualDocumentFormat()
Default construction

Method Detail

init

public Resource init()
              throws ResourceInstantiationException
Initialise this resource, and return it.

Specified by:
init in interface Resource
Overrides:
init in class AbstractResource
Throws:
ResourceInstantiationException

unpackMarkup

public void unpackMarkup(Document doc)
                  throws DocumentFormatException
Unpack the markup in the document. This converts markup from the native format (e.g. XML, RTF) into annotations in GATE format. Uses the markupElementsMap to determine which elements to convert, and what annotation type names to use.

Specified by:
unpackMarkup in class DocumentFormat
Throws:
DocumentFormatException
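
As a rough illustration of how a map such as markupElementsMap could drive the conversion in unpackMarkup(Document), the sketch below filters element names through a map and returns the annotation type names to create. MarkupMapSketch and annotationTypes are hypothetical names, and the rule that only elements present as keys are converted is an assumption made for this sketch, not confirmed GATE behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of GATE. */
class MarkupMapSketch {
    /**
     * Returns the annotation type names for the given element names.
     * Assumption: an element is converted only if it is a key in the map,
     * and the mapped value is the annotation type name to use.
     */
    static List<String> annotationTypes(List<String> elements,
                                        Map<String, String> markupElementsMap) {
        List<String> types = new ArrayList<>();
        for (String element : elements) {
            String type = markupElementsMap.get(element);
            if (type != null) {
                types.add(type); // element is mapped, so it would be converted
            }
        }
        return types;
    }
}
```

With a map of "p" to "paragraph" and "h1" to "heading", the elements p, b, h1 would yield the types paragraph and heading; the unmapped b element is skipped.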

unpackMarkup

public void unpackMarkup(Document doc,
                         RepositioningInfo repInfo,
                         RepositioningInfo ampCodingInfo)
                  throws DocumentFormatException
Specified by:
unpackMarkup in class DocumentFormat
Throws:
DocumentFormatException

hasContentButNoValidUrl

protected static boolean hasContentButNoValidUrl(Document doc)
                                          throws DocumentFormatException
Tests whether the GATE document has a valid URL or valid content.

Parameters:
doc -
Throws:
DocumentFormatException

setNewLineProperty

protected void setNewLineProperty(Document doc)
Checks the newline sequence used in the document and sets the corresponding document property.
Possible values are CRLF, LFCR, CR and LF.
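
One way such a check could work is sketched below: find the first line break in the text and inspect the character that follows it. NewLineSketch and detect are hypothetical names; this is a standalone illustration under that assumption, not the code GATE uses, and it returns the value as a string rather than setting a document property.

```java
/** Hypothetical helper; not part of GATE. */
class NewLineSketch {
    /**
     * Returns "CRLF", "LFCR", "CR" or "LF" based on the first line break
     * found in the text, or null if the text contains no line break.
     */
    static String detect(String content) {
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c != '\r' && c != '\n') {
                continue; // ordinary character, keep scanning
            }
            // Look one character ahead to spot two-character sequences.
            char next = (i + 1 < content.length()) ? content.charAt(i + 1) : 0;
            if (c == '\r') {
                return next == '\n' ? "CRLF" : "CR";
            }
            return next == '\r' ? "LFCR" : "LF";
        }
        return null;
    }
}
```

For example, NewLineSketch.detect("a\r\nb") returns "CRLF" and NewLineSketch.detect("a\nb") returns "LF".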


annotateParagraphs

public void annotateParagraphs(Document aDoc,
                               int startOffset,
                               int endOffset,
                               String annotSetName)
                        throws DocumentFormatException
This method annotates paragraphs in a GATE document. The text examined spans from the start offset to the end offset, and the paragraph annotations are created in the annotation set named annotSetName. If annotSetName is null, they are created in the default annotation set.

Parameters:
aDoc - the GATE document on which paragraph detection is performed. If it is null or its content is null, the method simply returns without doing anything.
startOffset - the offset in the document content at which paragraph detection starts.
endOffset - the offset at which detection ends.
annotSetName - the name of the annotation set in which the paragraph annotations are created. The annotation type created is "paragraph".
Throws:
DocumentFormatException
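
The offset-bounded detection described for annotateParagraphs can be sketched without GATE as follows. ParagraphSketch and findParagraphs are hypothetical names, and the rule that paragraphs are separated by blank lines (two or more consecutive '\n' characters) is an assumption of this sketch; the actual detection rule is not stated in this documentation. The sketch returns offset pairs instead of creating "paragraph" annotations.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of GATE. */
class ParagraphSketch {
    /**
     * Returns {start, end} offset pairs (end exclusive) for each paragraph
     * found between startOffset and endOffset.
     * Assumption: paragraphs are separated by two or more consecutive '\n'.
     */
    static List<int[]> findParagraphs(String content, int startOffset, int endOffset) {
        List<int[]> spans = new ArrayList<>();
        if (content == null) {
            return spans; // null content: do nothing, as the parameter notes describe
        }
        int end = Math.min(endOffset, content.length());
        int paraStart = -1; // start of the current paragraph, -1 if none is open
        int paraEnd = -1;   // offset just past its last non-newline character
        int i = Math.max(startOffset, 0);
        while (i < end) {
            if (content.charAt(i) == '\n') {
                int run = i;
                while (run < end && content.charAt(run) == '\n') {
                    run++;
                }
                // A run of two or more newlines closes the open paragraph.
                if (run - i >= 2 && paraStart >= 0) {
                    spans.add(new int[] {paraStart, paraEnd});
                    paraStart = -1;
                }
                i = run;
            } else {
                if (paraStart < 0) {
                    paraStart = i;
                }
                i++;
                paraEnd = i;
            }
        }
        if (paraStart >= 0) {
            spans.add(new int[] {paraStart, paraEnd}); // close the last paragraph
        }
        return spans;
    }
}
```

Each returned pair corresponds to the span over which a "paragraph" annotation would be created in the chosen annotation set.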

getDataStore

public DataStore getDataStore()
Description copied from class: AbstractLanguageResource
Get the data store that this LR lives in. Null for transient LRs.

Specified by:
getDataStore in interface LanguageResource
Overrides:
getDataStore in class AbstractLanguageResource