public abstract class PepperImporterImpl extends PepperModuleImpl implements PepperImporter
An importer in Pepper reads data from a format A and maps its data to a Salt
model. An importer must implement the interface PepperImporter
and can
extend this class. We strongly recommend extending this class, since it
provides many helper methods and methods that control the workflow.
See Also: PepperImporter
Modifier and Type | Field and Description |
---|---|
protected CorpusDesc | corpusDesc - TODO make docu |
isMultithreaded, logger, moduleController, resources, saltProject, sCorpusGraph, symbolicName, temproraries
NEGATIVE_FILE_EXTENSION_MARKER
ENDING_ALL_FILES, ENDING_FOLDER, ENDING_LEAF_FOLDER, ENDING_TAB, ENDING_TXT, ENDING_XML
Modifier | Constructor and Description |
---|---|
protected | PepperImporterImpl() - Creates a PepperModule of type MODULE_TYPE.IMPORTER. |
protected | PepperImporterImpl(String name) - Creates a PepperModule of type MODULE_TYPE.IMPORTER and sets its name to the passed one. |
Modifier and Type | Method and Description |
---|---|
FormatDesc | addSupportedFormat(String formatName, String formatVersion, org.eclipse.emf.common.util.URI formatReference) |
CorpusDesc | getCorpusDesc() - TODO docu |
Collection<String> | getCorpusEndings() - Returns a collection of all file endings for an SCorpus object. |
Collection<String> | getDocumentEndings() - Returns a list containing all format endings for files that are importable and can be mapped to SDocument or SDocumentGraph objects by this Pepper module. |
Map<org.corpus_tools.salt.graph.Identifier,org.eclipse.emf.common.util.URI> | getIdentifier2ResourceTable() - Stores Identifier objects corresponding to either an SDocument or an SCorpus object that has been created during the run of PepperImporter.importCorpusStructure(SCorpusGraph). |
Collection<String> | getIgnoreEndings() - Returns a collection of filenames, not to be imported. |
List<FormatDesc> | getSupportedFormats() - Returns a list of formats that are importable by this PepperImporter object. |
void | importCorpusStructure(org.corpus_tools.salt.common.SCorpusGraph corpusGraph) - This method is called by Pepper at the start of a conversion process to create the corpus-structure. |
protected Boolean | importCorpusStructureRec(org.eclipse.emf.common.util.URI currURI, org.corpus_tools.salt.common.SCorpus parent) - Top-down traversal of the given file structure. |
Double | isImportable(org.eclipse.emf.common.util.URI corpusPath) - This method is called by Pepper and returns whether a corpus located at the given URI is importable by this importer. |
protected void | readXMLResource(DefaultHandler2 contentHandler, org.eclipse.emf.common.util.URI documentLocation) - Helper method to read an XML file with a DefaultHandler2 implementation given as contentHandler. |
protected Collection<String> | sampleFileContent(org.eclipse.emf.common.util.URI corpusPath, String... fileEndings) - Returns lines of a sampled set of files having the ending specified by fileEndings recursively from specified corpus path. |
void | setCorpusDesc(CorpusDesc newCorpusDefinition) - TODO docu |
void | setCorpusPathResolver(CorpusPathResolver corpusPathResolver) - Sets a CorpusPathResolver which is used by isImportable(URI). |
org.corpus_tools.salt.SALT_TYPE | setTypeOfResource(org.eclipse.emf.common.util.URI resource) - This method is a callback and can be overridden by derived importers. |
void | start() - Overrides the method PepperModuleImpl.start() to add the following, before PepperModuleImpl.start() is called. |
activate, createPepperMapper, done, done, end, getComponentContext, getCorpusGraph, getDesc, getDocumentId2DC, getFingerprint, getMapperControllers, getMapperThreadGroup, getModuleController, getModuleType, getName, getProgress, getProgress, getProperties, getResources, getSaltProject, getSelfTestDesc, getStartProblems, getSupplierContact, getSupplierHomepage, getSymbolicName, getTemproraries, getVersion, isMultithreaded, isReadyToStart, proposeImportOrder, setCorpusGraph, setDesc, setIsMultithreaded, setMapperThreadGroup, setName, setPepperModuleController_basic, setPepperModuleController, setProperties, setResources, setSaltProject, setSupplierContact, setSupplierHomepage, setSymbolicName, setTemproraries, setVersion, start, toString, uncaughtException
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
createPepperMapper, done, done, end, getComponentContext, getCorpusGraph, getDesc, getFingerprint, getModuleController, getModuleType, getName, getProgress, getProgress, getProperties, getResources, getSaltProject, getSelfTestDesc, getStartProblems, getSupplierContact, getSupplierHomepage, getSymbolicName, getTemproraries, getVersion, isMultithreaded, isReadyToStart, proposeImportOrder, setCorpusGraph, setDesc, setIsMultithreaded, setPepperModuleController_basic, setPepperModuleController, setProperties, setResources, setSaltProject, setSupplierContact, setSupplierHomepage, setSymbolicName, setTemproraries, setVersion, start
protected CorpusDesc corpusDesc

protected PepperImporterImpl()
Creates a PepperModule of type MODULE_TYPE.IMPORTER. The name is set to "MyImporter". Prefer the constructor PepperImporterImpl(String) and pass a proper name.

protected PepperImporterImpl(String name)
Creates a PepperModule of type MODULE_TYPE.IMPORTER and sets its name to the passed one.

public List<FormatDesc> getSupportedFormats()
Returns a list of formats that are importable by this PepperImporter object.
Specified by: getSupportedFormats in interface PepperImporter

public FormatDesc addSupportedFormat(String formatName, String formatVersion, org.eclipse.emf.common.util.URI formatReference)
Specified by: addSupportedFormat in interface PepperImporter

public CorpusDesc getCorpusDesc()
Specified by: getCorpusDesc in interface PepperImporter

public void setCorpusDesc(CorpusDesc newCorpusDefinition)
Specified by: setCorpusDesc in interface PepperImporter

public Map<org.corpus_tools.salt.graph.Identifier,org.eclipse.emf.common.util.URI> getIdentifier2ResourceTable()
Stores Identifier objects corresponding to either an SDocument or an SCorpus object that has been created during the run of PepperImporter.importCorpusStructure(SCorpusGraph). For each Identifier object, this table stores the resource from which the element shall be imported.

Identifier | Resource |
---|---|
corpus_1 | /home/me/corpora/myCorpus |
corpus_2 | /home/me/corpora/myCorpus/subcorpus |
doc_1 | /home/me/corpora/myCorpus/subcorpus/document1.xml |
doc_2 | /home/me/corpora/myCorpus/subcorpus/document2.xml |

Specified by: getIdentifier2ResourceTable in interface PepperImporter
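To make the shape of this table concrete, here is a minimal sketch using only JDK types. `String` keys and `java.net.URI` values are stand-ins (assumptions for illustration) for Salt's `Identifier` and EMF's `URI`, and the class name `ResourceTableSketch` is hypothetical; the entries mirror the example rows above.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

final class ResourceTableSketch {

    /** Builds the example table: element id -> location the element is imported from. */
    static Map<String, URI> exampleTable() {
        // String and java.net.URI are stand-ins for Identifier and the EMF URI.
        Map<String, URI> table = new LinkedHashMap<>();
        table.put("corpus_1", URI.create("file:/home/me/corpora/myCorpus"));
        table.put("corpus_2", URI.create("file:/home/me/corpora/myCorpus/subcorpus"));
        table.put("doc_1", URI.create("file:/home/me/corpora/myCorpus/subcorpus/document1.xml"));
        table.put("doc_2", URI.create("file:/home/me/corpora/myCorpus/subcorpus/document2.xml"));
        return table;
    }

    public static void main(String[] args) {
        // Print every id together with its resource location.
        exampleTable().forEach((id, uri) -> System.out.println(id + " -> " + uri));
    }
}
```

A later mapping step could look up the location of a document by its id, for example `exampleTable().get("doc_1")`.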

public void importCorpusStructure(org.corpus_tools.salt.common.SCorpusGraph corpusGraph) throws PepperModuleException
This method is called by Pepper at the start of a conversion process to create the corpus-structure. A corpus-structure consists of corpora (represented via the Salt element SCorpus), documents (represented via the Salt element SDocument) and links between corpora and between a corpus and a document (represented via the Salt elements SCorpusRelation and SCorpusDocumentRelation). Each corpus can contain 0..* subcorpora and 0..* documents, but a single corpus cannot contain both documents and subcorpora. PepperImporter.setTypeOfResource(URI) is called to set the type of each resource. If the type is SALT_TYPE.SDOCUMENT, an SDocument object is created for the resource; if the type is SALT_TYPE.SCORPUS, an SCorpus object is created; if the type is null, the resource is ignored.
Specified by: importCorpusStructure in interface PepperImporter
Parameters: corpusGraph - an empty graph given by Pepper, which shall contain the corpus structure
Throws: PepperModuleException

protected Boolean importCorpusStructureRec(org.eclipse.emf.common.util.URI currURI, org.corpus_tools.salt.common.SCorpus parent)
Called by importCorpusStructure(SCorpusGraph); creates the corpus-structure via a top-down traversal of the file structure. For each resource found (regular file or folder), the method setTypeOfResource(URI) is called to set the type of the resource. If the type is SALT_TYPE.SDOCUMENT, an SDocument object is created for the resource; if the type is SALT_TYPE.SCORPUS, an SCorpus object is created; if the type is null, the resource is ignored.
Parameters: currURI -
parentsID -
endings -
Throws: IOException
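The top-down traversal described here can be sketched with plain JDK file APIs. This is not Pepper's implementation: the class `CorpusStructureSketch`, the enum `Type`, and the classifier `typeOf` (a stand-in for `setTypeOfResource(URI)` that treats folders as corpora and `*.xml` files as documents) are hypothetical, and not descending into ignored folders is an assumption of the sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

final class CorpusStructureSketch {
    enum Type { CORPUS, DOCUMENT }

    /** Stand-in classifier: folders are corpora, *.xml files are documents, the rest is ignored. */
    static Type typeOf(Path resource) {
        if (Files.isDirectory(resource)) {
            return Type.CORPUS;
        }
        return resource.getFileName().toString().endsWith(".xml") ? Type.DOCUMENT : null;
    }

    /** Classifies the current resource, records it, then descends into corpora only. */
    static void importRec(Path current, Map<Path, Type> table) {
        Type type = typeOf(current);
        if (type == null) {
            return; // ignored resource (assumption: ignored folders are not traversed)
        }
        table.put(current, type);
        if (type == Type.CORPUS) {
            try (Stream<Path> children = Files.list(current)) {
                children.sorted().forEach(child -> importRec(child, table));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Creates a small temporary corpus folder with one subcorpus, two documents and one other file. */
    static Path createDemoTree() {
        try {
            Path root = Files.createTempDirectory("myCorpus");
            Path sub = Files.createDirectory(root.resolve("subcorpus"));
            Files.createFile(sub.resolve("document1.xml"));
            Files.createFile(sub.resolve("document2.xml"));
            Files.createFile(sub.resolve("notes.txt"));
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<Path, Type> table = new LinkedHashMap<>();
        importRec(createDemoTree(), table);
        table.forEach((path, type) -> System.out.println(type + " " + path));
    }
}
```

In the sketch the recorded map plays the role of the identifier-to-resource table: every created corpus or document is remembered together with the location it came from.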

public void start() throws PepperModuleException
Overrides the method PepperModuleImpl.start() to add the following, before PepperModuleImpl.start() is called.
Specified by: start in interface PepperModule
Overrides: start in class PepperModuleImpl
Throws: PepperModuleException
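
public Collection<String> getDocumentEndings()
Returns a list containing all format endings for files that are importable and can be mapped to SDocument or SDocumentGraph objects by this Pepper module.
Specified by: getDocumentEndings in interface PepperImporter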

public Collection<String> getCorpusEndings()
Returns a collection of all file endings for an SCorpus object. See . This list contains per default value . To remove the default value, call Collection.remove(Object) on PepperImporter.getCorpusEndings(). To add endings to the collection, call Collection#add(Ending), and to remove endings from the collection, call Collection#remove(Ending).
Specified by: getCorpusEndings in interface PepperImporter

public org.corpus_tools.salt.SALT_TYPE setTypeOfResource(org.eclipse.emf.common.util.URI resource)
This method is a callback and can be overridden by derived importers. It is called during the import of the corpus-structure (PepperImporter.importCorpusStructure(SCorpusGraph)). During the traversal of the file structure, PepperImporter.importCorpusStructure(SCorpusGraph) calls this method for each resource to determine whether the resource represents an SCorpus object, an SDocument object, or shall be ignored. The type is determined as follows:
- if the ending of the resource is contained in PepperImporter.getDocumentEndings(), SALT_TYPE.SDOCUMENT is returned
- if the ending of the resource is contained in PepperImporter.getCorpusEndings(), SALT_TYPE.SCORPUS is returned
- if PepperImporter.getDocumentEndings() contains PepperModule.ENDING_ALL_FILES, SALT_TYPE.SDOCUMENT is returned for each file (which is not a folder)
- if PepperImporter.getDocumentEndings() contains PepperModule.ENDING_LEAF_FOLDER, SALT_TYPE.SDOCUMENT is returned for each leaf folder
- if PepperImporter.getCorpusEndings() contains PepperModule.ENDING_FOLDER, SALT_TYPE.SCORPUS is returned for each folder
Specified by: setTypeOfResource in interface PepperImporter
Parameters: resource - URI resource to be specified
Returns: SALT_TYPE.SCORPUS if resource represents an SCorpus object, SALT_TYPE.SDOCUMENT if resource represents an SDocument object, or null if it shall be ignored.

public Collection<String> getIgnoreEndings()
Returns a collection of filenames, not to be imported. To add endings to the collection, call Collection#add(Ending), and to remove endings from the collection, call Collection#remove(Ending).
Specified by: getIgnoreEndings in interface PepperImporter
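The ending-based rules that setTypeOfResource(URI) applies can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the class `ResourceTypeSketch`, the enum `Type` (standing in for `SALT_TYPE`), the marker strings (placeholders for the `PepperModule.ENDING_*` constants, whose real values are not given here), and the order in which the rules are checked, which the description does not specify.

```java
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ResourceTypeSketch {
    enum Type { CORPUS, DOCUMENT }

    // Placeholder marker values; the real constants are defined in PepperModule.
    static final String ALL_FILES = "ALL_FILES";
    static final String LEAF_FOLDER = "LEAF_FOLDER";
    static final String FOLDER = "FOLDER";

    /** Returns the text after the last dot of the file name, or "" if there is no dot. */
    static String ending(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1);
    }

    /** A leaf folder is a readable folder without sub-folders. */
    static boolean isLeafFolder(File dir) {
        File[] children = dir.listFiles();
        if (children == null) {
            return false;
        }
        for (File child : children) {
            if (child.isDirectory()) {
                return false;
            }
        }
        return true;
    }

    /** Returns the type of the resource, or null if it shall be ignored. */
    static Type classify(File resource, Set<String> documentEndings, Set<String> corpusEndings) {
        if (resource.isDirectory()) {
            // Assumed precedence: the leaf-folder rule is checked before the folder rule.
            if (documentEndings.contains(LEAF_FOLDER) && isLeafFolder(resource)) {
                return Type.DOCUMENT;
            }
            return corpusEndings.contains(FOLDER) ? Type.CORPUS : null;
        }
        String ending = ending(resource);
        if (documentEndings.contains(ALL_FILES) || documentEndings.contains(ending)) {
            return Type.DOCUMENT;
        }
        return corpusEndings.contains(ending) ? Type.CORPUS : null;
    }

    public static void main(String[] args) {
        Set<String> documentEndings = new HashSet<>(Arrays.asList("xml"));
        Set<String> corpusEndings = new HashSet<>(Arrays.asList(FOLDER));
        System.out.println(classify(new File("document1.xml"), documentEndings, corpusEndings));
        System.out.println(classify(new File("notes.txt"), documentEndings, corpusEndings));
    }
}
```

A derived importer that needs different behaviour would override the callback itself; the sketch only shows how the two ending collections and the markers interact.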

protected void readXMLResource(DefaultHandler2 contentHandler, org.eclipse.emf.common.util.URI documentLocation)
Helper method to read an XML file with a DefaultHandler2 implementation given as contentHandler. It is assumed that the file encoding is UTF-8.
Parameters: contentHandler - DefaultHandler2 implementation
documentLocation - location of the XML file

public Double isImportable(org.eclipse.emf.common.util.URI corpusPath)
This method is called by Pepper and returns whether a corpus located at the given URI is importable by this importer. If yes, 1 must be returned; if no, 0 must be returned. If it is uncertain whether the given corpus is importable by this importer, any value between 0 and 1 can be returned. If this method is not overridden, null is returned.
Specified by: isImportable in interface PepperImporter
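One way to arrive at a value between 0 and 1 is to score sampled lines of the corpus files. The sketch below is only an illustration under stated assumptions: the class `ImportableScoreSketch` and the method `score` are hypothetical, the heuristic (share of lines starting with `<`) is an invented example for an XML-like format, and returning null for an empty sample is an assumed way to express "no statement". In a real importer the lines might come from sampleFileContent(URI, String...).

```java
import java.util.Arrays;
import java.util.Collection;

final class ImportableScoreSketch {

    /**
     * Returns the share of sampled lines that look like markup:
     * 1.0 if all lines match, 0.0 if none match, a value in between otherwise,
     * and null if there is nothing to judge.
     */
    static Double score(Collection<String> sampledLines) {
        if (sampledLines == null || sampledLines.isEmpty()) {
            return null; // assumption: no sample means no statement
        }
        long hits = sampledLines.stream()
                .filter(line -> line.trim().startsWith("<"))
                .count();
        return (double) hits / sampledLines.size();
    }

    public static void main(String[] args) {
        System.out.println(score(Arrays.asList("<text>", "<tok>a</tok>", "</text>")));
        System.out.println(score(Arrays.asList("<text>", "plain line")));
    }
}
```

The value returned by such a helper could be passed through unchanged from an overridden isImportable(URI).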

public void setCorpusPathResolver(CorpusPathResolver corpusPathResolver)
Sets a CorpusPathResolver which is used by isImportable(URI). With a CorpusPathResolver it is possible to share the lines read from files between multiple importers. This saves the time needed to retrieve the content of the corpus path and to read the first x lines of the files.
Parameters: corpusPathResolver -

protected Collection<String> sampleFileContent(org.eclipse.emf.common.util.URI corpusPath, String... fileEndings)
Returns lines of a sampled set of files having the ending specified by fileEndings recursively from specified corpus path. This method only delegates to IsImportableUtil#sampleFileContent(URI, int, int, String...). The class IsImportableUtil also contains further helper methods, in case this method is too imprecise.
Parameters: corpusPath - directory to be searched in
fileEndings - endings to be considered. If no endings are specified, all files are considered
Returns: numberOfLines lines of numberOfSampledFiles files

Copyright © 2009–2019 Humboldt-Universität zu Berlin, INRIA. All rights reserved.