Class AbstractOOXMLExtractor

java.lang.Object
org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor
All Implemented Interfaces:
OOXMLExtractor
Direct Known Subclasses:
POIXMLTextExtractorDecorator, SXSLFPowerPointExtractorDecorator, SXWPFWordExtractorDecorator, XPSExtractorDecorator, XSLFPowerPointExtractorDecorator, XSSFExcelExtractorDecorator, XWPFWordExtractorDecorator

public abstract class AbstractOOXMLExtractor extends Object implements OOXMLExtractor
Base class for all Tika OOXML extractors.

Tika extractors decorate POI extractors so that the parsed content of documents is returned as a sequence of XHTML SAX events. Subclasses must implement buildXHTML(XHTMLContentHandler), which populates the XHTMLContentHandler object it receives as a parameter.
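
The division of labour described above (the base class drives the parse, the subclass fills in the body) can be illustrated with a minimal, JDK-only sketch. It is not Tika's implementation: SketchExtractor, ParagraphExtractor and ExtractorSketch are hypothetical stand-ins, a plain SAX ContentHandler takes the place of XHTMLContentHandler, and the envelope emitted here (html and body elements only) is an assumption made for illustration.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the abstract base class (not Tika code).
abstract class SketchExtractor {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Template method: emits the document envelope and delegates the body.
    final void getXHTML(ContentHandler handler) throws SAXException {
        handler.startDocument();
        start(handler, "html");
        start(handler, "body");
        buildXHTML(handler); // subclass-specific content goes here
        end(handler, "body");
        end(handler, "html");
        handler.endDocument();
    }

    // Subclasses populate the handler with the document body.
    protected abstract void buildXHTML(ContentHandler xhtml) throws SAXException;

    static void start(ContentHandler h, String name) throws SAXException {
        h.startElement(XHTML_NS, name, name, new AttributesImpl());
    }

    static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML_NS, name, name);
    }
}

// Hypothetical subclass: emits one <p> element per paragraph of text.
class ParagraphExtractor extends SketchExtractor {
    private final String[] paragraphs;

    ParagraphExtractor(String... paragraphs) {
        this.paragraphs = paragraphs;
    }

    @Override
    protected void buildXHTML(ContentHandler xhtml) throws SAXException {
        for (String p : paragraphs) {
            start(xhtml, "p");
            char[] chars = p.toCharArray();
            xhtml.characters(chars, 0, chars.length);
            end(xhtml, "p");
        }
    }
}

class ExtractorSketch {
    // Records the SAX events as a flat string (no escaping, illustration only).
    static String render(String... paragraphs) {
        StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(local).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(local).append('>');
            }

            @Override
            public void characters(char[] ch, int startIdx, int length) {
                out.append(ch, startIdx, length);
            }
        };
        try {
            new ParagraphExtractor(paragraphs).getXHTML(recorder);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("Hello", "World"));
    }
}
```

Keeping the envelope in the base class means each concrete extractor only has to describe its own content, which is the same reason the real class leaves buildXHTML abstract.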

  • Field Details

    • EMBEDDED_RELATIONSHIPS

      protected static final String[] EMBEDDED_RELATIONSHIPS
    • config

      protected OfficeParserConfig config
    • extractor

      protected org.apache.poi.ooxml.extractor.POIXMLTextExtractor extractor
  • Constructor Details

    • AbstractOOXMLExtractor

      public AbstractOOXMLExtractor(org.apache.tika.parser.ParseContext context, org.apache.poi.ooxml.extractor.POIXMLTextExtractor extractor)
  • Method Details

    • getDocument

      public org.apache.poi.ooxml.POIXMLDocument getDocument()
      Description copied from interface: OOXMLExtractor
      Returns the opened document.
      Specified by:
      getDocument in interface OOXMLExtractor
    • getMetadataExtractor

      public MetadataExtractor getMetadataExtractor()
      Description copied from interface: OOXMLExtractor
      POIXMLTextExtractor.getMetadataTextExtractor() is not yet supported for OOXML by POI.
      Specified by:
      getMetadataExtractor in interface OOXMLExtractor
    • getXHTML

      public void getXHTML(ContentHandler handler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext context) throws SAXException, org.apache.xmlbeans.XmlException, IOException, org.apache.tika.exception.TikaException
      Description copied from interface: OOXMLExtractor
      Parses the document into a sequence of XHTML SAX events sent to the given content handler.
      Specified by:
      getXHTML in interface OOXMLExtractor
      Throws:
      SAXException
      org.apache.xmlbeans.XmlException
      IOException
      org.apache.tika.exception.TikaException
    • getEmbeddedPartMetadataMap

      protected Map<String,EmbeddedPartMetadata> getEmbeddedPartMetadataMap()
    • getJustFileName

      protected String getJustFileName(String desc)
    • handleEmbeddedFile

      protected void handleEmbeddedFile(org.apache.poi.openxml4j.opc.PackagePart part, org.apache.tika.sax.XHTMLContentHandler xhtml, String rel, EmbeddedPartMetadata embeddedPartMetadata, org.apache.tika.metadata.TikaCoreProperties.EmbeddedResourceType embeddedResourceType) throws SAXException, IOException
      Handles an embedded file in the document.
      Throws:
      SAXException
      IOException
    • buildXHTML

      protected abstract void buildXHTML(org.apache.tika.sax.XHTMLContentHandler xhtml) throws SAXException, org.apache.xmlbeans.XmlException, IOException
      Populates the XHTMLContentHandler object received as parameter.
      Throws:
      SAXException
      org.apache.xmlbeans.XmlException
      IOException
    • getMainDocumentParts

      protected abstract List<org.apache.poi.openxml4j.opc.PackagePart> getMainDocumentParts() throws org.apache.tika.exception.TikaException
      Returns a list of the main parts of the document, used when searching for embedded resources. This should include every part of the document that can have resources embedded in it.
      Throws:
      org.apache.tika.exception.TikaException
    • loadLinkedRelationships

      protected Map<String,String> loadLinkedRelationships(org.apache.poi.openxml4j.opc.PackagePart bodyPart, boolean includeInternal, org.apache.tika.metadata.Metadata metadata)
      Used by the SAX docx and pptx decorators to load hyperlinks and other linked objects.
      Parameters:
      bodyPart -
      Returns: