Package de.unistuttgart.ims.uima.io.xml
Class GenericXmlReader<D extends org.apache.uima.jcas.cas.TOP>
java.lang.Object
de.unistuttgart.ims.uima.io.xml.GenericXmlReader<D>
public class GenericXmlReader<D extends org.apache.uima.jcas.cas.TOP>
extends java.lang.Object
This class is used to generate a UIMA document from arbitrary XML. The core
idea is to put all text content of the XML document in the document text of
the JCas, and to create annotations for each XML element, covering the exact
string the element contains. Consider, as an example, the XML fragment
<s><det>the</det> <n>dog</n></s>.
In the JCas, this will be represented as the document text "the dog", with
three annotations of the type XMLElement: one annotation covers the
entire string (and has the tag name s as a feature), one annotation covers
"the" (tag name: det), and one annotation covers "dog" (tag name: n). In
addition, we store a CSS selector for each annotation, which allows finding
the element in the DOM tree. After the initial conversion, rules can be
applied to convert some XML elements to other UIMA annotations. Rules are
expressed in CSS-like syntax.
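The conversion idea described above can be illustrated without UIMA or JSoup. The following is a minimal sketch, assuming only the JDK's built-in DOM parser; the class XmlSpanSketch and its nested Span type are hypothetical names invented for this illustration and are not part of this package. It shows how document text and per-element character ranges can be derived in one depth-first walk.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration class, not part of de.unistuttgart.ims.uima.io.xml.
final class XmlSpanSketch {

    /** Tag name of one element and the [begin, end) range it covers in the text. */
    static final class Span {
        final String tag;
        final int begin;
        final int end;

        Span(String tag, int begin, int end) {
            this.tag = tag;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Depth-first walk: text nodes extend the buffer, elements record their range. */
    private static void walk(Node node, StringBuilder text, List<Span> spans) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            text.append(node.getNodeValue());
        } else if (node.getNodeType() == Node.ELEMENT_NODE) {
            int begin = text.length();
            int slot = spans.size();
            spans.add(null); // reserve a slot so that parents are listed before children
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                walk(c, text, spans);
            }
            spans.set(slot, new Span(((Element) node).getTagName(), begin, text.length()));
        }
    }

    /** Appends the text content to the given buffer and returns one span per element. */
    static List<Span> spansOf(String xml, StringBuilder text) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            List<Span> spans = new ArrayList<>();
            walk(root, text, spans);
            return spans;
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        for (Span s : spansOf("<s><det>the</det> <n>dog</n></s>", text)) {
            System.out.println(s.tag + " covers \"" + text.substring(s.begin, s.end) + "\"");
        }
    }
}
```

For the example fragment, the sketch yields the text "the dog" and the ranges s = [0, 7), det = [0, 3) and n = [4, 7), which mirrors the three annotations described above.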
Text content
There are two modes in which the reader can operate. By default, the entire
text content of all XML elements is considered to be the text. This can be
changed by setting a "root selector", using the method
setTextRootSelector(String). Setting a CSS selector with this method
retrieves only the text within the selected element as the text content of
the UIMA document. If a text root selector has been set, the distinction
between global and regular rules becomes relevant. Global rules are applied
to all XML nodes, while regular rules are only applied to the XML nodes below
the root selector.
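The difference between the two text modes can be illustrated with a minimal, JDK-only sketch. It does not use this reader or JSoup: the class TextRootSketch is a hypothetical name, and a plain tag name stands in for the CSS selector.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical illustration class, not part of de.unistuttgart.ims.uima.io.xml.
final class TextRootSketch {

    /** Parses the XML string; checked exceptions are wrapped to keep the sketch short. */
    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Default mode: the text of every element in the document. */
    static String fullText(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    /** Root mode: only the text below the first element named rootTag (which must exist). */
    static String textBelow(String xml, String rootTag) {
        Element root = (Element) parse(xml).getElementsByTagName(rootTag).item(0);
        return root.getTextContent();
    }

    public static void main(String[] args) {
        String xml = "<doc><head><title>Meta</title></head><body><p>the dog</p></body></doc>";
        System.out.println(fullText(xml));
        System.out.println(textBelow(xml, "body"));
    }
}
```

In this sketch, the text of the title element is part of the full text but not of the text below body. With a text root selector set on the reader, an element such as title would likewise lie outside the document text, which is why it could only be reached by a global rule.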
Rule syntax
The CSS selectors are interpreted by the JSoup library. See Selector
for a detailed description.
Mapping rules
The most common rule type is a mapping rule. Mapping rules map an inline XML
element onto a UIMA annotation type. Specifying, for instance,
reader.addRule("token", Token.class) (i.e., calling addRule(String, Class))
as a rule results in UIMA annotations of the type Token being added on top of
every <token> element in the XML source. In many cases, code should be
executed while mapping. This code can be added in the form of a lambda
expression, using addRule(String, Class, BiConsumer).
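The interplay of a mapping rule and its callback can be sketched as follows. This is a simplified, JDK-only illustration under stated assumptions, not the reader's implementation: MappingRuleSketch and its nested Token class are hypothetical stand-ins, a plain tag name replaces the CSS selector, and the rule is applied during the walk rather than after an initial conversion.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration class, not part of de.unistuttgart.ims.uima.io.xml.
final class MappingRuleSketch {

    /** Hypothetical stand-in for a UIMA annotation type. */
    static final class Token {
        int begin;
        int end;
        String pos;
    }

    private final String tag;                          // plain tag name instead of a CSS selector
    private final BiConsumer<Token, Element> callback; // runs once per created Token
    final StringBuilder text = new StringBuilder();
    final List<Token> tokens = new ArrayList<>();

    MappingRuleSketch(String tag, BiConsumer<Token, Element> callback) {
        this.tag = tag;
        this.callback = callback;
    }

    /** Parses the XML, collects the text and creates one Token per matching element. */
    void read(String xml) {
        try {
            visit(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private void visit(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            text.append(node.getNodeValue());
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        Element elm = (Element) node;
        int begin = text.length();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            visit(c);
        }
        if (elm.getTagName().equals(tag)) {
            Token token = new Token();
            token.begin = begin;
            token.end = text.length();
            callback.accept(token, elm); // the lambda sees both the annotation and the element
            tokens.add(token);
        }
    }

    public static void main(String[] args) {
        MappingRuleSketch reader =
                new MappingRuleSketch("token", (t, e) -> t.pos = e.getAttribute("pos"));
        reader.read("<s><token pos=\"DET\">the</token> <token pos=\"NN\">dog</token></s>");
        for (Token t : reader.tokens) {
            System.out.println(reader.text.substring(t.begin, t.end) + "/" + t.pos);
        }
    }
}
```

The callback receives the newly created object together with the source element, so attribute values such as pos can be copied into features while mapping.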
Whitespace
The converter can operate in two modes that can be switched with the method
setPreserveWhitespace(boolean). If this is set to true, the whitespace is
preserved exactly as in the original XML. This is what you want if the goal
is to re-export XML that is as similar as possible to the input. If that's
not the case, the CAS can be made more readable by setting the option to
false, which is also the default. In this case, block elements (as defined in
Visitor.blockElements) get an extra newline at the end.
Since:
1.0.0
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | GenericXmlReader.Rule<T extends org.apache.uima.jcas.cas.TOP> | This class represents the rules we apply
Field Summary
Fields
Modifier and Type | Field | Description
protected java.util.function.Function<org.jsoup.nodes.Element,java.lang.Boolean> | ignoreFunction
Constructor Summary
Constructors
Constructor | Description
GenericXmlReader(java.lang.Class<D> documentClass)
Method Summary
Modifier and Type | Method | Description
<T extends org.apache.uima.jcas.cas.TOP> void | addGlobalRule(java.lang.String selector, java.lang.Class<T> targetClass, java.util.function.BiConsumer<T,org.jsoup.nodes.Element> callback)
void | addGlobalRule(java.lang.String selector, java.util.function.BiConsumer<D,org.jsoup.nodes.Element> callback)
void | addRule(GenericXmlReader.Rule<?> rule)
<T extends org.apache.uima.jcas.cas.TOP> void | addRule(java.lang.String selector, java.lang.Class<T> targetClass) | This function adds a mapping between elements as expressed in the selector and annotations given by the targetClass.
<T extends org.apache.uima.jcas.cas.TOP> void | addRule(java.lang.String selector, java.lang.Class<T> targetClass, java.util.function.BiConsumer<T,org.jsoup.nodes.Element> callback) | This function adds a mapping between elements as expressed in the selector and annotations given by the targetClass.
protected <T extends org.apache.uima.jcas.cas.TOP> void | applyRule(org.apache.uima.jcas.JCas jcas, org.jsoup.nodes.Element rootElement, java.util.Map<java.lang.String,XMLElement> annoMap, GenericXmlReader.Rule<T> mapping)
boolean | exists(java.lang.String id) | Checks whether an XML id is defined.
java.util.Map.Entry<org.jsoup.nodes.Element,org.apache.uima.cas.FeatureStructure> | getAnnotation(java.lang.String id) | Retrieves an annotation by XML id.
org.jsoup.nodes.Document | getDocument()
protected <T extends org.apache.uima.jcas.cas.TOP> T | getFeatureStructure(org.apache.uima.jcas.JCas jcas, XMLElement hAnno, org.jsoup.nodes.Element elm, GenericXmlReader.Rule<T> mapping)
java.util.function.Function<org.jsoup.nodes.Element,java.lang.Boolean> | getIgnoreFunction() | Returns the set ignore function.
protected static <T extends org.apache.uima.jcas.cas.TOP> T | getOrCreate(org.apache.uima.jcas.JCas jcas, java.lang.Class<T> targetClass)
java.lang.String | getTextRootSelector()
boolean | isPreserveWhitespace()
boolean | isSkipEmptyElements()
org.apache.uima.jcas.JCas | read(java.io.InputStream xmlStream) | Runs the conversion and executes all rules.
org.apache.uima.jcas.JCas | read(org.apache.uima.jcas.JCas jcas, java.io.InputStream xmlStream) | Deprecated.
void | setIgnoreFunction(java.util.function.Function<org.jsoup.nodes.Element,java.lang.Boolean> ignoreFunction) | The specified function is applied on each element.
void | setPreserveWhitespace(boolean preserveWhitespace)
void | setSkipEmptyElements(boolean skipEmptyElements)
void | setTextRootSelector(java.lang.String textRootSelector)
Field Details
ignoreFunction
protected java.util.function.Function<org.jsoup.nodes.Element,java.lang.Boolean> ignoreFunction
Constructor Details
GenericXmlReader
GenericXmlReader(java.lang.Class<D> documentClass)
Method Details
read
public org.apache.uima.jcas.JCas read(java.io.InputStream xmlStream) throws java.io.IOException, org.apache.uima.UIMAException
Runs the conversion and executes all rules. Produces a new JCas.
Returns:
The populated JCas object
Throws:
java.io.IOException - If the input stream errors
org.apache.uima.UIMAException - If there is an issue with creating the JCas.
read
@Deprecated public org.apache.uima.jcas.JCas read(org.apache.uima.jcas.JCas jcas, java.io.InputStream xmlStream) throws java.io.IOException
Deprecated.
Runs the conversion and executes all rules.
Throws:
java.io.IOException - If the input stream errors
addRule
public void addRule(GenericXmlReader.Rule<?> rule)
addRule
public <T extends org.apache.uima.jcas.cas.TOP> void addRule(java.lang.String selector, java.lang.Class<T> targetClass)
This function adds a mapping between elements as expressed in the selector and annotations given by the targetClass.
addRule
public <T extends org.apache.uima.jcas.cas.TOP> void addRule(java.lang.String selector, java.lang.Class<T> targetClass, java.util.function.BiConsumer<T,org.jsoup.nodes.Element> callback)
This function adds a mapping between elements as expressed in the selector and annotations given by the targetClass. In addition, a function can be defined to do something with the annotation and the element.
addGlobalRule
public void addGlobalRule(java.lang.String selector, java.util.function.BiConsumer<D,org.jsoup.nodes.Element> callback)
addGlobalRule
public <T extends org.apache.uima.jcas.cas.TOP> void addGlobalRule(java.lang.String selector, java.lang.Class<T> targetClass, java.util.function.BiConsumer<T,org.jsoup.nodes.Element> callback)
getAnnotation
public java.util.Map.Entry<org.jsoup.nodes.Element,org.apache.uima.cas.FeatureStructure> getAnnotation(java.lang.String id)
Retrieves an annotation by XML id.
Returns:
The feature structure
exists
public boolean exists(java.lang.String id)
Checks whether an XML id is defined.
Returns:
a boolean
getFeatureStructure
protected <T extends org.apache.uima.jcas.cas.TOP> T getFeatureStructure(org.apache.uima.jcas.JCas jcas, XMLElement hAnno, org.jsoup.nodes.Element elm, GenericXmlReader.Rule<T> mapping)
applyRule
protected <T extends org.apache.uima.jcas.cas.TOP> void applyRule(org.apache.uima.jcas.JCas jcas, org.jsoup.nodes.Element rootElement, java.util.Map<java.lang.String,XMLElement> annoMap, GenericXmlReader.Rule<T> mapping)
getTextRootSelector
public java.lang.String getTextRootSelector()
setTextRootSelector
public void setTextRootSelector(java.lang.String textRootSelector)
getDocument
public org.jsoup.nodes.Document getDocument()
isPreserveWhitespace
public boolean isPreserveWhitespace()
setPreserveWhitespace
public void setPreserveWhitespace(boolean preserveWhitespace)
getOrCreate
protected static <T extends org.apache.uima.jcas.cas.TOP> T getOrCreate(org.apache.uima.jcas.JCas jcas, java.lang.Class<T> targetClass)
getIgnoreFunction
public java.util.function.Function<org.jsoup.nodes.Element,java.lang.Boolean> getIgnoreFunction()
Returns the set ignore function. See setIgnoreFunction(Function) for details.
Returns:
The ignore function.
setIgnoreFunction
public void setIgnoreFunction(java.util.function.Function<org.jsoup.nodes.Element,java.lang.Boolean> ignoreFunction)
The specified function is applied on each element. It can be used to skip some XML elements entirely. Skipped elements will not be represented in the JCas at all, and cannot be used in rules. The main reason for using this function is to make processing faster if the XML file contains a large number of fine-grained, but unneeded tags.
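The effect of an ignore function can be sketched with a predicate over elements. This is a minimal, JDK-only illustration: IgnoreFunctionSketch is a hypothetical class, org.w3c.dom.Element stands in for org.jsoup.nodes.Element, and the sketch assumes that children of a skipped element are still visited, which the description above does not specify.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration class, not part of de.unistuttgart.ims.uima.io.xml.
final class IgnoreFunctionSketch {

    /** Returns the tag names of all elements for which the ignore function yields false. */
    static List<String> recordedTags(String xml, Function<Element, Boolean> ignore) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            List<String> tags = new ArrayList<>();
            collect(root, ignore, tags);
            return tags;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static void collect(Node node, Function<Element, Boolean> ignore, List<String> tags) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        Element elm = (Element) node;
        if (!ignore.apply(elm)) {
            tags.add(elm.getTagName()); // only non-ignored elements are recorded
        }
        // Assumption of this sketch: descendants of an ignored element are still visited.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, ignore, tags);
        }
    }

    public static void main(String[] args) {
        // Skip the fine-grained <c> (character) elements.
        System.out.println(recordedTags(
                "<s><w><c>d</c><c>o</c><c>g</c></w></s>",
                e -> e.getTagName().equals("c")));
    }
}
```

With the real reader, a predicate of the same shape would be passed to setIgnoreFunction(Function) before calling read.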
isSkipEmptyElements
public boolean isSkipEmptyElements()
setSkipEmptyElements
public void setSkipEmptyElements(boolean skipEmptyElements)