Class GenericXmlReader<D extends org.apache.uima.jcas.cas.TOP>

java.lang.Object
de.unistuttgart.ims.uima.io.xml.GenericXmlReader<D>

public class GenericXmlReader<D extends org.apache.uima.jcas.cas.TOP>
extends java.lang.Object
This class generates a UIMA document from arbitrary XML. The core idea is to put all text content of the XML document into the document text of the JCas, and to create one annotation per XML element that covers exactly the string the element contains. Consider, as an example, the XML fragment <s><det>the</det> <n>dog</n></s>. In the JCas, this is represented as the document text "the dog", with three annotations of the type XMLElement: one covers the entire string (and has the tag name s as a feature), one covers "the" (tag name: det), and one covers "dog" (tag name: n). In addition, a CSS selector is stored for each annotation, which allows finding the element in the DOM tree. After the initial conversion, rules can be applied to convert some XML elements to other UIMA annotations. Rules are expressed in CSS-like syntax.
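
The text-plus-offsets idea can be pictured with a few lines of plain JDK code. The following is only a minimal sketch under stated assumptions: it uses the JDK's DOM parser rather than jsoup and UIMA, and the names XmlFlattenSketch, Span and Flattened are hypothetical stand-ins, not part of this library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustration only: JDK DOM parser, not the jsoup/UIMA machinery of GenericXmlReader.
final class XmlFlattenSketch {

    /** Hypothetical stand-in for an XMLElement annotation: tag name plus character offsets. */
    static final class Span {
        final String tag;
        final int begin;
        final int end;

        Span(String tag, int begin, int end) {
            this.tag = tag;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return tag + "[" + begin + "," + end + ")";
        }
    }

    /** The collected document text and one span per XML element, parents before children. */
    static final class Flattened {
        final String text;
        final List<Span> spans;

        Flattened(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    static Flattened flatten(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            StringBuilder text = new StringBuilder();
            List<Span> spans = new ArrayList<>();
            visit(root, text, spans);
            return new Flattened(text.toString(), spans);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void visit(Node node, StringBuilder text, List<Span> spans) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            text.append(node.getNodeValue());
        } else if (node.getNodeType() == Node.ELEMENT_NODE) {
            int begin = text.length();
            int slot = spans.size();
            spans.add(null); // reserve the slot so a parent is listed before its children
            for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
                visit(child, text, spans);
            }
            spans.set(slot, new Span(node.getNodeName(), begin, text.length()));
        }
    }

    public static void main(String[] args) {
        Flattened result = flatten("<s><det>the</det> <n>dog</n></s>");
        System.out.println(result.text);  // the dog
        System.out.println(result.spans); // [s[0,7), det[0,3), n[4,7)]
    }
}
```

The offsets are half-open character ranges into the collected text, which mirrors how begin and end offsets of annotations relate to the document text.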

Text content

The reader can operate in two modes. By default, the text content of all XML elements is taken as the document text. This can be changed by setting a "root selector" with the method setTextRootSelector(String): only the text within the element matched by the given CSS selector then becomes the text content of the UIMA document. If a text root selector has been set, the distinction between global and regular rules becomes relevant. Global rules are applied to all XML nodes, while regular rules are applied only to the XML nodes below the root selector.
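
The difference between the two modes, and between global and regular rules, can be sketched with the JDK's DOM API. This is an illustration under simplifying assumptions: a plain tag name stands in for a CSS selector, and TextRootSketch with its methods is hypothetical, not part of this library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Illustration only: tag names stand in for CSS selectors.
final class TextRootSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the first element with the given tag name. */
    static Element select(Document doc, String tag) {
        Element hit = (Element) doc.getElementsByTagName(tag).item(0);
        if (hit == null) {
            throw new IllegalArgumentException("No element <" + tag + ">");
        }
        return hit;
    }

    /** Document text: everything if rootTag is null, otherwise only the text inside the root element. */
    static String documentText(Document doc, String rootTag) {
        Element root = rootTag == null ? doc.getDocumentElement() : select(doc, rootTag);
        return root.getTextContent();
    }

    /** Number of elements a rule for ruleTag would see, depending on its scope. */
    static int matches(Document doc, String rootTag, String ruleTag, boolean global) {
        if (global || rootTag == null) {
            return doc.getElementsByTagName(ruleTag).getLength(); // whole document
        }
        // Regular rule: only descendants of the text root are considered.
        return select(doc, rootTag).getElementsByTagName(ruleTag).getLength();
    }

    public static void main(String[] args) {
        Document doc = parse("<doc><head><title>T</title></head><body><p>Hello</p></body></doc>");
        System.out.println(documentText(doc, null));            // THello
        System.out.println(documentText(doc, "body"));          // Hello
        System.out.println(matches(doc, "body", "title", true));  // 1
        System.out.println(matches(doc, "body", "title", false)); // 0
    }
}
```

In this sketch, a rule for title only finds the header element when it is global, because the header lies outside the text root.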

Rule syntax

The CSS selectors are interpreted by the jsoup library. See the jsoup Selector class (org.jsoup.select.Selector) for a detailed description of the syntax.

Mapping rules

The most common rule type is a mapping rule. Mapping rules map an inline XML element onto a UIMA annotation type. For instance, specifying reader.addRule("token", Token.class) (i.e., calling addRule(String, Class)) results in a UIMA annotation of the type Token being added on top of every <token> element in the XML source. In many cases, code should be executed while mapping. This code can be added in the form of a lambda expression, using addRule(String, Class, BiConsumer).
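
How a mapping rule with a callback behaves can be approximated with JDK types alone. The sketch below is not the library's implementation: the Rule class here takes a Supplier instead of a Class, a tag name stands in for the CSS selector, and Token is a hypothetical plain class rather than a UIMA annotation type.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Supplier;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustration only: mimics the shape of addRule(String, Class, BiConsumer) with JDK types.
final class MappingRuleSketch {

    /** Hypothetical target type; a real UIMA annotation would also carry begin/end offsets. */
    static final class Token {
        String text;
        String pos;
    }

    /** A rule: which elements to match, how to create the target, what to do with both. */
    static final class Rule<T> {
        final String tag;
        final Supplier<T> factory;
        final BiConsumer<T, Element> callback;

        Rule(String tag, Supplier<T> factory, BiConsumer<T, Element> callback) {
            this.tag = tag;
            this.factory = factory;
            this.callback = callback;
        }
    }

    /** Creates one target object per matching element and hands both to the callback. */
    static <T> List<T> apply(Element root, Rule<T> rule) {
        List<T> created = new ArrayList<>();
        NodeList hits = root.getElementsByTagName(rule.tag);
        for (int i = 0; i < hits.getLength(); i++) {
            Element element = (Element) hits.item(i);
            T target = rule.factory.get();
            rule.callback.accept(target, element);
            created.add(target);
        }
        return created;
    }

    static List<Token> demo(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        // The lambda copies information from the XML element into the new object.
        Rule<Token> rule = new Rule<>("token", Token::new, (token, element) -> {
            token.text = element.getTextContent();
            token.pos = element.getAttribute("pos");
        });
        return apply(root, rule);
    }

    public static void main(String[] args) {
        for (Token t : demo("<s><token pos=\"DT\">the</token> <token pos=\"NN\">dog</token></s>")) {
            System.out.println(t.text + "/" + t.pos);
        }
    }
}
```

The callback receives the freshly created object together with the matched element, which is the typical place to copy XML attributes into features.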

Whitespace

The converter can operate in two modes, which are switched with the method setPreserveWhitespace(boolean). If this is set to true, whitespace is preserved exactly as in the original XML. This is what you want if the goal is to re-export XML that is as similar as possible to the input. Otherwise, the document text in the CAS can be made cleaner by setting the option to false, which is also the default. In this case, block elements (as defined in Visitor.blockElements) get an extra newline at the end.
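
The effect of the extra newline after block elements can be sketched as follows. This is a simplified illustration: the set of block elements used here is hypothetical (the real one is defined in Visitor.blockElements), the JDK's DOM parser replaces jsoup, and any further whitespace handling the library may perform is not modelled.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustration only: shows the "extra newline after block elements" behaviour.
final class WhitespaceSketch {

    /** Hypothetical block elements; the real set lives in Visitor.blockElements. */
    private static final Set<String> BLOCK_ELEMENTS =
            new HashSet<>(Arrays.asList("p", "div", "head"));

    static String text(String xml, boolean preserveWhitespace) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            visit(root, preserveWhitespace, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void visit(Node node, boolean preserveWhitespace, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
        } else if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
                visit(child, preserveWhitespace, out);
            }
            // Only in the non-preserving mode: close each block element with a newline.
            if (!preserveWhitespace && BLOCK_ELEMENTS.contains(node.getNodeName())) {
                out.append('\n');
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><p>one</p><p>two</p></doc>";
        System.out.println(text(xml, true));  // onetwo
        System.out.print(text(xml, false));   // one and two on separate lines
    }
}
```

With preservation switched on, the two paragraphs run together because the source XML contains no whitespace between them; with it switched off, each paragraph ends in a newline.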
Since:
1.0.0
  • Nested Class Summary

    Nested Classes 
    Modifier and Type Class Description
    static class  GenericXmlReader.Rule<T extends org.apache.uima.jcas.cas.TOP>
    This class represents the rules we apply
  • Field Summary

    Fields 
    Modifier and Type Field Description
    protected java.util.function.Function<org.jsoup.nodes.Element,​java.lang.Boolean> ignoreFunction  
  • Constructor Summary

    Constructors 
    Constructor Description
    GenericXmlReader​(java.lang.Class<D> documentClass)  
  • Method Summary

    Modifier and Type Method Description
    <T extends org.apache.uima.jcas.cas.TOP>
    void
    addGlobalRule​(java.lang.String selector, java.lang.Class<T> targetClass, java.util.function.BiConsumer<T,​org.jsoup.nodes.Element> callback)  
    void addGlobalRule​(java.lang.String selector, java.util.function.BiConsumer<D,​org.jsoup.nodes.Element> callback)  
    void addRule​(GenericXmlReader.Rule<?> rule)  
    <T extends org.apache.uima.jcas.cas.TOP>
    void
    addRule​(java.lang.String selector, java.lang.Class<T> targetClass)
    This function adds a mapping between elements as expressed in the selector and annotations given by the targetClass.
    <T extends org.apache.uima.jcas.cas.TOP>
    void
    addRule​(java.lang.String selector, java.lang.Class<T> targetClass, java.util.function.BiConsumer<T,​org.jsoup.nodes.Element> callback)
    This function adds a mapping between elements as expressed in the selector and annotations given by the targetClass.
    protected <T extends org.apache.uima.jcas.cas.TOP>
    void
    applyRule​(org.apache.uima.jcas.JCas jcas, org.jsoup.nodes.Element rootElement, java.util.Map<java.lang.String,​XMLElement> annoMap, GenericXmlReader.Rule<T> mapping)  
    boolean exists​(java.lang.String id)
    Checks whether an XML id is defined
    java.util.Map.Entry<org.jsoup.nodes.Element,​org.apache.uima.cas.FeatureStructure> getAnnotation​(java.lang.String id)
    Retrieves an annotation by XML id
    org.jsoup.nodes.Document getDocument()  
    protected <T extends org.apache.uima.jcas.cas.TOP>
    T
    getFeatureStructure​(org.apache.uima.jcas.JCas jcas, XMLElement hAnno, org.jsoup.nodes.Element elm, GenericXmlReader.Rule<T> mapping)  
    java.util.function.Function<org.jsoup.nodes.Element,​java.lang.Boolean> getIgnoreFunction()
    Returns the set ignore function.
    protected static <T extends org.apache.uima.jcas.cas.TOP>
    T
    getOrCreate​(org.apache.uima.jcas.JCas jcas, java.lang.Class<T> targetClass)  
    java.lang.String getTextRootSelector()  
    boolean isPreserveWhitespace()  
    boolean isSkipEmptyElements()  
    org.apache.uima.jcas.JCas read​(java.io.InputStream xmlStream)
    Runs the conversion and executes all rules.
    org.apache.uima.jcas.JCas read​(org.apache.uima.jcas.JCas jcas, java.io.InputStream xmlStream)
    Deprecated. 
    void setIgnoreFunction​(java.util.function.Function<org.jsoup.nodes.Element,​java.lang.Boolean> ignoreFunction)
    The specified function is applied to each element.
    void setPreserveWhitespace​(boolean preserveWhitespace)  
    void setSkipEmptyElements​(boolean skipEmptyElements)  
    void setTextRootSelector​(java.lang.String textRootSelector)  

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • ignoreFunction

      protected java.util.function.Function<org.jsoup.nodes.Element,​java.lang.Boolean> ignoreFunction
  • Constructor Details

  • Method Details

    • read

      public org.apache.uima.jcas.JCas read​(java.io.InputStream xmlStream) throws java.io.IOException, org.apache.uima.UIMAException
      Runs the conversion and executes all rules. Produces a new JCas.
      Returns:
      The populated JCas object
      Throws:
      java.io.IOException - If reading from the input stream fails
      org.apache.uima.UIMAException - If there is an issue with creating the JCas.
    • read

      @Deprecated public org.apache.uima.jcas.JCas read​(org.apache.uima.jcas.JCas jcas, java.io.InputStream xmlStream) throws java.io.IOException
      Deprecated.
      Runs the conversion and executes all rules.
      Returns:
      Throws:
      java.io.IOException - If reading from the input stream fails
    • addRule

      public void addRule​(GenericXmlReader.Rule<?> rule)
    • addRule

      public <T extends org.apache.uima.jcas.cas.TOP> void addRule​(java.lang.String selector, java.lang.Class<T> targetClass)
      This function adds a mapping between elements as expressed in the selector and annotations given by the targetClass.
    • addRule

      public <T extends org.apache.uima.jcas.cas.TOP> void addRule​(java.lang.String selector, java.lang.Class<T> targetClass, java.util.function.BiConsumer<T,​org.jsoup.nodes.Element> callback)
      This function adds a mapping between elements as expressed in the selector and annotations given by the targetClass. In addition, a function can be defined to do something with the annotation and the element.
    • addGlobalRule

      public void addGlobalRule​(java.lang.String selector, java.util.function.BiConsumer<D,​org.jsoup.nodes.Element> callback)
    • addGlobalRule

      public <T extends org.apache.uima.jcas.cas.TOP> void addGlobalRule​(java.lang.String selector, java.lang.Class<T> targetClass, java.util.function.BiConsumer<T,​org.jsoup.nodes.Element> callback)
    • getAnnotation

      public java.util.Map.Entry<org.jsoup.nodes.Element,​org.apache.uima.cas.FeatureStructure> getAnnotation​(java.lang.String id)
      Retrieves an annotation by XML id
      Returns:
      An entry pairing the XML element with its feature structure
    • exists

      public boolean exists​(java.lang.String id)
      Checks whether an XML id is defined
      Returns:
      true if the given XML id is defined, false otherwise
    • getFeatureStructure

      protected <T extends org.apache.uima.jcas.cas.TOP> T getFeatureStructure​(org.apache.uima.jcas.JCas jcas, XMLElement hAnno, org.jsoup.nodes.Element elm, GenericXmlReader.Rule<T> mapping)
    • applyRule

      protected <T extends org.apache.uima.jcas.cas.TOP> void applyRule​(org.apache.uima.jcas.JCas jcas, org.jsoup.nodes.Element rootElement, java.util.Map<java.lang.String,​XMLElement> annoMap, GenericXmlReader.Rule<T> mapping)
    • getTextRootSelector

      public java.lang.String getTextRootSelector()
    • setTextRootSelector

      public void setTextRootSelector​(java.lang.String textRootSelector)
    • getDocument

      public org.jsoup.nodes.Document getDocument()
    • isPreserveWhitespace

      public boolean isPreserveWhitespace()
    • setPreserveWhitespace

      public void setPreserveWhitespace​(boolean preserveWhitespace)
    • getOrCreate

      protected static <T extends org.apache.uima.jcas.cas.TOP> T getOrCreate​(org.apache.uima.jcas.JCas jcas, java.lang.Class<T> targetClass)
    • getIgnoreFunction

      public java.util.function.Function<org.jsoup.nodes.Element,​java.lang.Boolean> getIgnoreFunction()
      Returns the ignore function that has been set. See setIgnoreFunction(Function) for details.
      Returns:
      The ignore function.
    • setIgnoreFunction

      public void setIgnoreFunction​(java.util.function.Function<org.jsoup.nodes.Element,​java.lang.Boolean> ignoreFunction)
      The specified function is applied to each element. It can be used to skip some XML elements entirely. Skipped elements will not be represented in the JCas at all, and cannot be used in rules. The main reason for using this function is to make processing faster if the XML file contains a large number of fine-grained, but unneeded tags.
    • isSkipEmptyElements

      public boolean isSkipEmptyElements()
    • setSkipEmptyElements

      public void setSkipEmptyElements​(boolean skipEmptyElements)
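
The ignore function described under setIgnoreFunction has the shape Function<Element, Boolean>. The following sketch shows how such a function can prune a traversal. It is an illustration under assumptions: it works on the JDK's org.w3c.dom.Element rather than org.jsoup.nodes.Element, the class IgnoreFunctionSketch is hypothetical, and it assumes that the whole subtree of a skipped element is dropped, which the description above does not spell out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustration only: JDK DOM elements instead of jsoup elements.
final class IgnoreFunctionSketch {

    /** Returns the tag names of all elements that survive the ignore function, in document order. */
    static List<String> visitedTags(String xml, Function<Element, Boolean> ignoreFunction) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<String> tags = new ArrayList<>();
            visit(root, ignoreFunction, tags);
            return tags;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void visit(Element element, Function<Element, Boolean> ignoreFunction,
            List<String> tags) {
        // Assumption of this sketch: an ignored element takes its whole subtree with it.
        if (ignoreFunction != null && Boolean.TRUE.equals(ignoreFunction.apply(element))) {
            return;
        }
        tags.add(element.getTagName());
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                visit((Element) child, ignoreFunction, tags);
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<s><w>a</w><note><w>x</w></note><w>b</w></s>";
        // Skip every <note> element; without an ignore function all five elements are visited.
        System.out.println(visitedTags(xml, e -> "note".equals(e.getTagName()))); // [s, w, w]
        System.out.println(visitedTags(xml, null)); // [s, w, note, w, w]
    }
}
```

Because skipped elements never reach the later stages, rules that select them find nothing, which matches the statement that skipped elements cannot be used in rules.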