gate.corpora
Class CorpusImpl

java.lang.Object
  extended by gate.util.AbstractFeatureBearer
      extended by gate.creole.AbstractResource
          extended by gate.creole.AbstractLanguageResource
              extended by gate.corpora.CorpusImpl
All Implemented Interfaces:
Corpus, CustomDuplication, CreoleListener, LanguageResource, Resource, SimpleCorpus, FeatureBearer, NameBearer, Serializable, Iterable<Document>, Collection<Document>, EventListener, List<Document>

@CreoleResource(name="GATE Corpus",
                comment="GATE transient corpus.",
                interfaceName="gate.Corpus",
                icon="corpus-trans",
                helpURL="http://gate.ac.uk/userguide/sec:developer:loadlr")
public class CorpusImpl
extends AbstractLanguageResource
implements Corpus, CreoleListener, CustomDuplication

Corpora are sets of Documents. They are ordered by lexicographic collation on URL.

See Also:
Serialized Form

Nested Class Summary
protected  class CorpusImpl.VerboseList
          A proxy list that stores the actual data in an internal list and forwards all operations to it, firing the appropriate corpus events when necessary.
 
Field Summary
protected  List documentsList
           
protected  List<Document> supportList
          The underlying list that holds the documents in this corpus.
 
Fields inherited from class gate.creole.AbstractLanguageResource
dataStore, lrPersistentId
 
Fields inherited from class gate.creole.AbstractResource
name
 
Fields inherited from class gate.util.AbstractFeatureBearer
features
 
Fields inherited from interface gate.SimpleCorpus
CORPUS_DOCLIST_PARAMETER_NAME, CORPUS_NAME_PARAMETER_NAME
 
Constructor Summary
CorpusImpl()
           
 
Method Summary
 boolean add(Document o)
           
 void add(int index, Document element)
           
 boolean addAll(Collection c)
           
 boolean addAll(int index, Collection c)
           
 void addCorpusListener(CorpusListener l)
          Registers a new CorpusListener with this corpus.
 void cleanup()
          Construction
 void clear()
           
protected  void clearDocList()
           
 boolean contains(Object o)
           
 boolean containsAll(Collection c)
           
 void datastoreClosed(CreoleEvent e)
          Called when a DataStore has been closed
 void datastoreCreated(CreoleEvent e)
          Called when a DataStore has been created
 void datastoreOpened(CreoleEvent e)
          Called when a DataStore has been opened
 Resource duplicate(Factory.DuplicationContext ctx)
          Custom duplication for a corpus - duplicate this corpus in the usual way, then duplicate the documents in this corpus and add them to the duplicate.
 boolean equals(Object o)
           
protected  void fireDocumentAdded(CorpusEvent e)
           
protected  void fireDocumentRemoved(CorpusEvent e)
           
 Document get(int index)
           
 String getDocumentName(int index)
          Gets the name of a document in this corpus.
 List<String> getDocumentNames()
          Gets the names of the documents in this corpus.
 List getDocumentsList()
           
 int hashCode()
           
 int indexOf(Object o)
           
 Resource init()
          Initialise this resource, and return it.
 boolean isDocumentLoaded(int index)
          This method returns true when the document is already loaded in memory
 boolean isEmpty()
           
 Iterator iterator()
           
 int lastIndexOf(Object o)
           
 ListIterator listIterator()
           
 ListIterator listIterator(int index)
           
static void populate(Corpus corpus, URL directory, FileFilter filter, String encoding, boolean recurseDirectories)
          Fills the provided corpus with documents created on the fly from selected files in a directory.
static void populate(Corpus corpus, URL directory, FileFilter filter, String encoding, String mimeType, boolean recurseDirectories)
          Fills the provided corpus with documents created on the fly from selected files in a directory.
static long populate(Corpus corpus, URL singleConcatenatedFile, String documentRootElement, String encoding, int numberOfDocumentsToExtract, String documentNamePrefix, DocType documentType)
          Fills the provided corpus with documents extracted from the provided TREC file.
 void populate(URL directory, FileFilter filter, String encoding, boolean recurseDirectories)
          Fills this corpus with documents created from files in a directory.
 void populate(URL directory, FileFilter filter, String encoding, String mimeType, boolean recurseDirectories)
          Fills this corpus with documents created from files in a directory.
 long populate(URL singleConcatenatedFile, String documentRootElement, String encoding, int numberOfFilesToExtract, String documentNamePrefix, DocType documentType)
          Fills this corpus with documents extracted from the provided single concatenated file.
 Document remove(int index)
           
 boolean remove(Object o)
           
 boolean removeAll(Collection c)
           
 void removeCorpusListener(CorpusListener l)
          Removes one of the listeners registered with this corpus.
 void resourceLoaded(CreoleEvent e)
          Called when a new Resource has been loaded into the system
 void resourceRenamed(Resource resource, String oldName, String newName)
          Called when the creole register has renamed a resource.
 void resourceUnloaded(CreoleEvent e)
          Called when a Resource has been removed from the system
 boolean retainAll(Collection c)
           
 Document set(int index, Document element)
           
 void setDocumentsList(List documentsList)
           
 int size()
           
 List subList(int fromIndex, int toIndex)
           
 Object[] toArray()
           
 Object[] toArray(Object[] a)
           
 void unloadDocument(Document doc)
          This method does not make sense for transient corpora, so it does nothing.
 
Methods inherited from class gate.creole.AbstractLanguageResource
getDataStore, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getInitParameterValues, getInitParameterValues, getName, getParameterValue, getParameterValue, getParameterValues, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class gate.util.AbstractFeatureBearer
getFeatures, setFeatures
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.LanguageResource
getDataStore, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from interface gate.Resource
getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.FeatureBearer
getFeatures, setFeatures
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 

Field Detail

supportList

protected List<Document> supportList
The underlying list that holds the documents in this corpus.


documentsList

protected transient List documentsList
Constructor Detail

CorpusImpl

public CorpusImpl()
Method Detail

getDocumentNames

public List<String> getDocumentNames()
Gets the names of the documents in this corpus.

Specified by:
getDocumentNames in interface SimpleCorpus
Returns:
a List of Strings representing the names of the documents in this corpus.

getDocumentName

public String getDocumentName(int index)
Gets the name of a document in this corpus.

Specified by:
getDocumentName in interface SimpleCorpus
Parameters:
index - the index of the document
Returns:
a String value representing the name of the document at index in this corpus.

unloadDocument

public void unloadDocument(Document doc)
This method does not make sense for transient corpora, so it does nothing.

Specified by:
unloadDocument in interface Corpus
Parameters:
doc - Document to be unloaded from memory.

isDocumentLoaded

public boolean isDocumentLoaded(int index)
This method returns true when the document is already loaded in memory

Specified by:
isDocumentLoaded in interface Corpus

clearDocList

protected void clearDocList()

size

public int size()
Specified by:
size in interface Collection<Document>
Specified by:
size in interface List<Document>

isEmpty

public boolean isEmpty()
Specified by:
isEmpty in interface Collection<Document>
Specified by:
isEmpty in interface List<Document>

contains

public boolean contains(Object o)
Specified by:
contains in interface Collection<Document>
Specified by:
contains in interface List<Document>

iterator

public Iterator iterator()
Specified by:
iterator in interface Iterable<Document>
Specified by:
iterator in interface Collection<Document>
Specified by:
iterator in interface List<Document>

toArray

public Object[] toArray()
Specified by:
toArray in interface Collection<Document>
Specified by:
toArray in interface List<Document>

toArray

public Object[] toArray(Object[] a)
Specified by:
toArray in interface Collection<Document>
Specified by:
toArray in interface List<Document>

add

public boolean add(Document o)
Specified by:
add in interface Collection<Document>
Specified by:
add in interface List<Document>

remove

public boolean remove(Object o)
Specified by:
remove in interface Collection<Document>
Specified by:
remove in interface List<Document>

containsAll

public boolean containsAll(Collection c)
Specified by:
containsAll in interface Collection<Document>
Specified by:
containsAll in interface List<Document>

addAll

public boolean addAll(Collection c)
Specified by:
addAll in interface Collection<Document>
Specified by:
addAll in interface List<Document>

addAll

public boolean addAll(int index,
                      Collection c)
Specified by:
addAll in interface List<Document>

removeAll

public boolean removeAll(Collection c)
Specified by:
removeAll in interface Collection<Document>
Specified by:
removeAll in interface List<Document>

retainAll

public boolean retainAll(Collection c)
Specified by:
retainAll in interface Collection<Document>
Specified by:
retainAll in interface List<Document>

clear

public void clear()
Specified by:
clear in interface Collection<Document>
Specified by:
clear in interface List<Document>

equals

public boolean equals(Object o)
Specified by:
equals in interface Collection<Document>
Specified by:
equals in interface List<Document>
Overrides:
equals in class Object

hashCode

public int hashCode()
Specified by:
hashCode in interface Collection<Document>
Specified by:
hashCode in interface List<Document>
Overrides:
hashCode in class Object

get

public Document get(int index)
Specified by:
get in interface List<Document>

set

public Document set(int index,
                    Document element)
Specified by:
set in interface List<Document>

add

public void add(int index,
                Document element)
Specified by:
add in interface List<Document>

remove

public Document remove(int index)
Specified by:
remove in interface List<Document>

indexOf

public int indexOf(Object o)
Specified by:
indexOf in interface List<Document>

lastIndexOf

public int lastIndexOf(Object o)
Specified by:
lastIndexOf in interface List<Document>

listIterator

public ListIterator listIterator()
Specified by:
listIterator in interface List<Document>

listIterator

public ListIterator listIterator(int index)
Specified by:
listIterator in interface List<Document>

subList

public List subList(int fromIndex,
                    int toIndex)
Specified by:
subList in interface List<Document>

cleanup

public void cleanup()
Construction

Specified by:
cleanup in interface Resource
Overrides:
cleanup in class AbstractLanguageResource

init

public Resource init()
Initialise this resource, and return it.

Specified by:
init in interface Resource
Overrides:
init in class AbstractResource

populate

public static void populate(Corpus corpus,
                            URL directory,
                            FileFilter filter,
                            String encoding,
                            boolean recurseDirectories)
                     throws IOException
Fills the provided corpus with documents created on the fly from selected files in a directory. Uses a FileFilter to select which files will be used and which will be ignored. A simple file filter based on extensions is provided in the GATE distribution (ExtensionFileFilter).

Parameters:
corpus - the corpus to be populated
directory - the directory from which the files will be picked. This parameter is a URL for uniformity. It needs to be a URL of type file; otherwise an InvalidArgumentException will be thrown.
filter - the file filter used to select files from the target directory. If the filter is null, all the files will be accepted.
encoding - the encoding to be used for reading the documents
recurseDirectories - whether the directory should be traversed recursively. If true, all the files from the provided directory and all its child directories (on as many levels as necessary) will be picked if accepted by the filter; otherwise the child directories will be ignored.
Throws:
IOException - if a file doesn't exist
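
The selection rules above (file-only URL, null filter accepts everything, optional recursion) can be sketched with plain JDK classes. This is only an illustration under stated assumptions, not the GATE implementation: FileSelectionSketch and selectFiles are hypothetical names, and a standard IllegalArgumentException stands in for the exception named in the parameter description.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of GATE: it only mimics the file selection
// described in the parameter list above.
class FileSelectionSketch {

  // Returns the files that a populate-style call would consider.
  static List<File> selectFiles(URL directory, FileFilter filter,
                                boolean recurseDirectories) throws IOException {
    if (!"file".equals(directory.getProtocol())) {
      // Assumption: a standard JDK exception stands in for the documented one.
      throw new IllegalArgumentException("Not a file URL: " + directory);
    }
    File dir;
    try {
      dir = new File(directory.toURI());
    } catch (URISyntaxException e) {
      throw new IOException("Malformed directory URL: " + directory, e);
    }
    if (!dir.isDirectory()) {
      throw new IOException(dir + " is not an existing directory");
    }
    List<File> selected = new ArrayList<>();
    collect(dir, filter, recurseDirectories, selected);
    return selected;
  }

  private static void collect(File dir, FileFilter filter,
                              boolean recurse, List<File> out) {
    File[] children = dir.listFiles();
    if (children == null) {
      return; // unreadable directory
    }
    Arrays.sort(children); // fixed order so the example is repeatable
    for (File child : children) {
      if (child.isDirectory()) {
        if (recurse) {
          collect(child, filter, recurse, out);
        }
      } else if (filter == null || filter.accept(child)) {
        out.add(child); // a null filter accepts every file
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // Build a small throwaway directory tree: a.txt, b.xml, sub/c.txt
    File root = Files.createTempDirectory("corpus").toFile();
    new File(root, "a.txt").createNewFile();
    new File(root, "b.xml").createNewFile();
    File sub = new File(root, "sub");
    sub.mkdir();
    new File(sub, "c.txt").createNewFile();

    FileFilter txtOnly = f -> f.getName().endsWith(".txt");
    URL url = root.toURI().toURL();
    System.out.println(selectFiles(url, txtOnly, false)); // top level only
    System.out.println(selectFiles(url, txtOnly, true));  // also descends into sub/
    System.out.println(selectFiles(url, null, true));     // no filter: every file
  }
}
```

In this sketch the filter is applied to files only, and directories are always descended into when recursion is on. Whether GATE also passes directories through the filter is not stated here, so treat that as a design choice of the example.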

populate

public static void populate(Corpus corpus,
                            URL directory,
                            FileFilter filter,
                            String encoding,
                            String mimeType,
                            boolean recurseDirectories)
                     throws IOException
Fills the provided corpus with documents created on the fly from selected files in a directory. Uses a FileFilter to select which files will be used and which will be ignored. A simple file filter based on extensions is provided in the GATE distribution (ExtensionFileFilter).

Parameters:
corpus - the corpus to be populated
directory - the directory from which the files will be picked. This parameter is a URL for uniformity. It needs to be a URL of type file; otherwise an InvalidArgumentException will be thrown.
filter - the file filter used to select files from the target directory. If the filter is null, all the files will be accepted.
encoding - the encoding to be used for reading the documents
mimeType - the mime type to be used when loading documents. If null, then the mime type will be detected automatically.
recurseDirectories - whether the directory should be traversed recursively. If true, all the files from the provided directory and all its child directories (on as many levels as necessary) will be picked if accepted by the filter; otherwise the child directories will be ignored.
Throws:
IOException - if a file doesn't exist

populate

public void populate(URL directory,
                     FileFilter filter,
                     String encoding,
                     boolean recurseDirectories)
              throws IOException,
                     ResourceInstantiationException
Fills this corpus with documents created from files in a directory.

Specified by:
populate in interface SimpleCorpus
Parameters:
filter - the file filter used to select files from the target directory. If the filter is null, all the files will be accepted.
directory - the directory from which the files will be picked. This parameter is a URL for uniformity. It needs to be a URL of type file; otherwise an InvalidArgumentException will be thrown. An implementation for this method is provided as a static method at populate(Corpus, URL, FileFilter, String, boolean).
encoding - the encoding to be used for reading the documents
recurseDirectories - whether the directory should be traversed recursively. If true, all the files from the provided directory and all its child directories (on as many levels as necessary) will be picked if accepted by the filter; otherwise the child directories will be ignored.
Throws:
IOException
ResourceInstantiationException

populate

public void populate(URL directory,
                     FileFilter filter,
                     String encoding,
                     String mimeType,
                     boolean recurseDirectories)
              throws IOException,
                     ResourceInstantiationException
Fills this corpus with documents created from files in a directory.

Specified by:
populate in interface SimpleCorpus
Parameters:
filter - the file filter used to select files from the target directory. If the filter is null, all the files will be accepted.
directory - the directory from which the files will be picked. This parameter is a URL for uniformity. It needs to be a URL of type file; otherwise an InvalidArgumentException will be thrown. An implementation for this method is provided as a static method at populate(Corpus, URL, FileFilter, String, boolean).
encoding - the encoding to be used for reading the documents
mimeType - the mime type to be used when loading documents. If null, then the mime type will be detected automatically.
recurseDirectories - whether the directory should be traversed recursively. If true, all the files from the provided directory and all its child directories (on as many levels as necessary) will be picked if accepted by the filter; otherwise the child directories will be ignored.
Throws:
IOException
ResourceInstantiationException

populate

public static long populate(Corpus corpus,
                            URL singleConcatenatedFile,
                            String documentRootElement,
                            String encoding,
                            int numberOfDocumentsToExtract,
                            String documentNamePrefix,
                            DocType documentType)
                     throws IOException
Fills the provided corpus with documents extracted from the provided TREC file.

Parameters:
corpus - the corpus to be populated.
singleConcatenatedFile - the TREC file.
documentRootElement - text between the start and end of this element is used to create a new document.
encoding - the encoding of the TREC file.
numberOfDocumentsToExtract - the number of documents to extract from the TREC file; -1 to extract all documents.
documentNamePrefix - the prefix to use for document names.
documentType - the type of the documents (e.g. XML, HTML)
Returns:
total length of populated documents in the corpus in number of bytes
Throws:
IOException
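
The role of documentRootElement and of the -1 convention can be illustrated with a small JDK-only sketch that splits already-decoded text into documents. ConcatenatedFileSketch, extract and totalBytes are hypothetical names, the "prefix plus counter" naming in main is an assumption, and the real method may parse its input differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not GATE code: splits decoded text into documents.
class ConcatenatedFileSketch {

  // Returns each <rootElement ...>...</rootElement> span, in order of appearance.
  // A negative limit means "extract everything", mirroring the documented -1.
  static List<String> extract(String content, String rootElement, int limit) {
    String open = "<" + rootElement;
    String close = "</" + rootElement + ">";
    List<String> docs = new ArrayList<>();
    int from = 0;
    while (limit < 0 || docs.size() < limit) {
      int start = content.indexOf(open, from);
      if (start < 0) {
        break; // no further start tag
      }
      int afterName = start + open.length();
      if (afterName >= content.length()) {
        break; // truncated start tag at end of input
      }
      char next = content.charAt(afterName);
      if (next != '>' && !Character.isWhitespace(next)) {
        from = afterName; // a longer name sharing the prefix, e.g. <DOCNO>
        continue;
      }
      int end = content.indexOf(close, afterName);
      if (end < 0) {
        break; // unterminated document: stop rather than guess
      }
      end += close.length();
      docs.add(content.substring(start, end));
      from = end;
    }
    return docs;
  }

  // Total size of the extracted documents, in bytes, for the given encoding.
  static long totalBytes(List<String> docs, Charset encoding) {
    long total = 0;
    for (String doc : docs) {
      total += doc.getBytes(encoding).length;
    }
    return total;
  }

  public static void main(String[] args) {
    String content = "<DOC><DOCNO>1</DOCNO>first</DOC>\n"
        + "<DOC id=\"2\">second</DOC>\n"
        + "<DOC>third</DOC>\n";
    List<String> all = extract(content, "DOC", -1);
    String prefix = "doc_"; // assumed naming scheme: prefix followed by a counter
    for (int i = 0; i < all.size(); i++) {
      System.out.println(prefix + (i + 1) + ": " + all.get(i));
    }
    System.out.println(extract(content, "DOC", 2).size() + " of " + all.size());
    System.out.println(totalBytes(all, StandardCharsets.UTF_8) + " bytes");
  }
}
```

Text outside the root element (such as the newlines between documents) is dropped, which matches the description of documentRootElement above.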

populate

public long populate(URL singleConcatenatedFile,
                     String documentRootElement,
                     String encoding,
                     int numberOfFilesToExtract,
                     String documentNamePrefix,
                     DocType documentType)
              throws IOException,
                     ResourceInstantiationException
Fills this corpus with documents extracted from the provided single concatenated file.

Specified by:
populate in interface SimpleCorpus
Parameters:
singleConcatenatedFile - the single concatenated file to load.
documentRootElement - content between the start and end of this element is considered for documents.
encoding - the encoding of the TREC file.
numberOfFilesToExtract - the number of documents to extract from the TREC file.
documentNamePrefix - the prefix to use for document names.
documentType - the type of the documents (e.g. HTML, XML)
Returns:
total length of populated documents in the corpus in number of bytes
Throws:
IOException
ResourceInstantiationException

removeCorpusListener

public void removeCorpusListener(CorpusListener l)
Description copied from interface: Corpus
Removes one of the listeners registered with this corpus.

Specified by:
removeCorpusListener in interface Corpus
Parameters:
l - the listener to be removed.

addCorpusListener

public void addCorpusListener(CorpusListener l)
Description copied from interface: Corpus
Registers a new CorpusListener with this corpus.

Specified by:
addCorpusListener in interface Corpus
Parameters:
l - the listener to be added.
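
Listener registration works together with the event-firing proxy list described for CorpusImpl.VerboseList: data lives in an internal list, and every addition or removal is reported to the registered listeners. The sketch below shows that pattern with JDK classes only. NotifyingList and ChangeListener are hypothetical stand-ins, not the GATE CorpusListener or CorpusEvent types.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical types, not GATE classes: a list that reports additions and removals.
class NotifyingListSketch {

  interface ChangeListener<E> {
    void added(E element, int index);
    void removed(E element, int index);
  }

  static class NotifyingList<E> extends AbstractList<E> {
    private final List<E> data = new ArrayList<>(); // the actual storage
    private final List<ChangeListener<E>> listeners = new CopyOnWriteArrayList<>();

    void addListener(ChangeListener<E> l) { listeners.add(l); }
    void removeListener(ChangeListener<E> l) { listeners.remove(l); }

    @Override public E get(int index) { return data.get(index); }
    @Override public int size() { return data.size(); }

    // AbstractList routes add(E) through this method.
    @Override public void add(int index, E element) {
      data.add(index, element);
      modCount++;
      for (ChangeListener<E> l : listeners) {
        l.added(element, index);
      }
    }

    // AbstractList routes remove(Object) and clear() through this method.
    @Override public E remove(int index) {
      E old = data.remove(index);
      modCount++;
      for (ChangeListener<E> l : listeners) {
        l.removed(old, index);
      }
      return old;
    }
  }

  public static void main(String[] args) {
    NotifyingList<String> docs = new NotifyingList<>();
    ChangeListener<String> logger = new ChangeListener<String>() {
      @Override public void added(String e, int i) {
        System.out.println("added " + e + " at " + i);
      }
      @Override public void removed(String e, int i) {
        System.out.println("removed " + e + " at " + i);
      }
    };
    docs.addListener(logger);
    docs.add("first.txt");
    docs.add("second.txt");
    docs.remove("first.txt");
    docs.removeListener(logger);
    docs.add("third.txt"); // no longer reported
  }
}
```

A CopyOnWriteArrayList holds the listeners so that a listener can unregister itself from inside a callback without disturbing the iteration. That is a choice made for this sketch, not a statement about how CorpusImpl stores its listeners.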

duplicate

public Resource duplicate(Factory.DuplicationContext ctx)
                   throws ResourceInstantiationException
Custom duplication for a corpus - duplicate this corpus in the usual way, then duplicate the documents in this corpus and add them to the duplicate.

Specified by:
duplicate in interface CustomDuplication
Parameters:
ctx - the current duplication context. If an implementation of this method needs to duplicate any other resources as part of the custom duplication process it should pass this context back to the two-argument form of Factory.duplicate rather than using the single-argument form.
Returns:
an independent copy of this resource.
Throws:
ResourceInstantiationException

fireDocumentAdded

protected void fireDocumentAdded(CorpusEvent e)

fireDocumentRemoved

protected void fireDocumentRemoved(CorpusEvent e)

setDocumentsList

@Optional
@CreoleParameter(collectionElementType=Document.class,
                 comment="A list of GATE documents")
public void setDocumentsList(List documentsList)

getDocumentsList

public List getDocumentsList()

resourceLoaded

public void resourceLoaded(CreoleEvent e)
Description copied from interface: CreoleListener
Called when a new Resource has been loaded into the system

Specified by:
resourceLoaded in interface CreoleListener

resourceUnloaded

public void resourceUnloaded(CreoleEvent e)
Description copied from interface: CreoleListener
Called when a Resource has been removed from the system

Specified by:
resourceUnloaded in interface CreoleListener

resourceRenamed

public void resourceRenamed(Resource resource,
                            String oldName,
                            String newName)
Description copied from interface: CreoleListener
Called when the creole register has renamed a resource.

Specified by:
resourceRenamed in interface CreoleListener

datastoreOpened

public void datastoreOpened(CreoleEvent e)
Description copied from interface: CreoleListener
Called when a DataStore has been opened

Specified by:
datastoreOpened in interface CreoleListener

datastoreCreated

public void datastoreCreated(CreoleEvent e)
Description copied from interface: CreoleListener
Called when a DataStore has been created

Specified by:
datastoreCreated in interface CreoleListener

datastoreClosed

public void datastoreClosed(CreoleEvent e)
Description copied from interface: CreoleListener
Called when a DataStore has been closed

Specified by:
datastoreClosed in interface CreoleListener