Class AbstractOOXMLExtractor

    • Field Detail

      • EMBEDDED_RELATIONSHIPS

        protected static final String[] EMBEDDED_RELATIONSHIPS
      • extractor

        protected org.apache.poi.ooxml.extractor.POIXMLTextExtractor extractor
    • Constructor Detail

      • AbstractOOXMLExtractor

        public AbstractOOXMLExtractor(org.apache.tika.parser.ParseContext context,
                                      org.apache.poi.ooxml.extractor.POIXMLTextExtractor extractor)
    • Method Detail

      • getJustFileName

        protected String getJustFileName(String desc)
      • handleEmbeddedFile

        protected void handleEmbeddedFile(org.apache.poi.openxml4j.opc.PackagePart part,
                                          org.apache.tika.sax.XHTMLContentHandler xhtml,
                                          String rel,
                                          EmbeddedPartMetadata embeddedPartMetadata,
                                          org.apache.tika.metadata.TikaCoreProperties.EmbeddedResourceType embeddedResourceType)
                                   throws SAXException,
                                          IOException
        Handles an embedded file in the document.
        Throws:
        SAXException
        IOException
      • buildXHTML

        protected abstract void buildXHTML(org.apache.tika.sax.XHTMLContentHandler xhtml)
                                    throws SAXException,
                                           org.apache.xmlbeans.XmlException,
                                           IOException
        Populates the XHTMLContentHandler object received as parameter.
        Throws:
        SAXException
        org.apache.xmlbeans.XmlException
        IOException
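        The contract above resembles a template method: a caller supplies the handler and the subclass fills it with content. The following is a minimal sketch of that idea only, assuming a plain JDK SAX ContentHandler in place of XHTMLContentHandler. The names SketchExtractor, ParagraphExtractor, TextCollector, SketchDemo, and render are hypothetical and are not part of Tika.

```java
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for an extractor base class (not the Tika class).
abstract class SketchExtractor {
    // Template method: opens the document, delegates the body, closes it.
    public final void render(ContentHandler handler) throws SAXException {
        handler.startDocument();
        buildXHTML(handler);
        handler.endDocument();
    }

    // Subclasses populate the handler they receive as a parameter.
    protected abstract void buildXHTML(ContentHandler handler) throws SAXException;
}

// Hypothetical subclass that emits one <p> element per paragraph.
class ParagraphExtractor extends SketchExtractor {
    private final List<String> paragraphs;

    ParagraphExtractor(List<String> paragraphs) {
        this.paragraphs = paragraphs;
    }

    @Override
    protected void buildXHTML(ContentHandler handler) throws SAXException {
        for (String text : paragraphs) {
            handler.startElement("", "p", "p", new AttributesImpl());
            char[] chars = text.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
        }
    }
}

// Collects character data, adding a newline at the end of each <p>.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName)) {
            text.append('\n');
        }
    }

    String text() {
        return text.toString();
    }
}

// Convenience wrapper so callers do not have to handle SAXException.
class SketchDemo {
    static String renderToText(List<String> paragraphs) {
        TextCollector collector = new TextCollector();
        try {
            new ParagraphExtractor(paragraphs).render(collector);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text();
    }
}
```

        In this sketch the base class owns the document start and end events, so a subclass only has to emit body content.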
      • getMainDocumentParts

        protected abstract List<org.apache.poi.openxml4j.opc.PackagePart> getMainDocumentParts()
                                                                                        throws org.apache.tika.exception.TikaException
        Returns a list of the main parts of the document, used when searching for embedded resources. This should include every part of the document that can have resources embedded in it.
        Throws:
        org.apache.tika.exception.TikaException
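        One way to picture how such a list might be used: walk each main part, inspect its relationships, and keep the targets whose relationship type marks an embedded resource. The sketch below is an illustration under stated assumptions, not the Tika implementation. SketchPart, SketchRelationship, EmbeddedScan, and the values in EMBEDDED_TYPES are hypothetical; the real contents of EMBEDDED_RELATIONSHIPS are not listed in this document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a relationship between package parts.
final class SketchRelationship {
    final String type;
    final String target;

    SketchRelationship(String type, String target) {
        this.type = type;
        this.target = target;
    }
}

// Hypothetical stand-in for a package part (not the POI PackagePart class).
final class SketchPart {
    final String name;
    final List<SketchRelationship> relationships;

    SketchPart(String name, List<SketchRelationship> relationships) {
        this.name = name;
        this.relationships = relationships;
    }
}

final class EmbeddedScan {
    // Illustrative values only; an assumption made for this sketch.
    static final String[] EMBEDDED_TYPES = {"oleObject", "package"};

    // Returns each embedded target once, in first-seen order.
    static List<String> findEmbeddedTargets(List<SketchPart> mainParts) {
        List<String> wanted = Arrays.asList(EMBEDDED_TYPES);
        Set<String> seen = new LinkedHashSet<>();
        for (SketchPart part : mainParts) {
            for (SketchRelationship rel : part.relationships) {
                if (wanted.contains(rel.type)) {
                    seen.add(rel.target);
                }
            }
        }
        return new ArrayList<>(seen);
    }
}
```

        A part that is missing from the list is never scanned, so anything embedded only in that part would be skipped in this sketch.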
      • loadLinkedRelationships

        protected Map<String,String> loadLinkedRelationships(org.apache.poi.openxml4j.opc.PackagePart bodyPart,
                                                                   boolean includeInternal,
                                                                   org.apache.tika.metadata.Metadata metadata)
        Used by the SAX DOCX and PPTX decorators to load hyperlinks and other linked objects.
        Parameters:
        bodyPart -
        Returns: