Class ExcelExtractor


  • public class ExcelExtractor
    extends Object
    Excel parser implementation which uses POI's Event API to handle the contents of a Workbook.

    The Event API has a much smaller memory footprint than HSSFWorkbook when processing Excel files, at the cost of added complexity.

    With the Event API a listener is registered for specific record types and those records are created, fired off to the listener and then discarded as the stream is being processed.
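
The register, fire, and discard cycle described above can be illustrated with a minimal plain-Java sketch. It does not use POI: `Rec`, `Listener`, `Request`, and the record type ids are hypothetical stand-ins chosen for illustration, with `Listener` playing the role that HSSFListener plays in POI.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the event model; none of these types are POI classes.
class EventApiSketch {

    // A record read from the stream: a numeric type id plus its payload.
    static final class Rec {
        final int sid;
        final String payload;

        Rec(int sid, String payload) {
            this.sid = sid;
            this.payload = payload;
        }
    }

    // Callback interface, playing the role that HSSFListener plays in POI.
    interface Listener {
        void processRecord(Rec rec);
    }

    // Maps record type ids to the listeners registered for them.
    static final class Request {
        private final Map<Integer, List<Listener>> listeners = new HashMap<>();

        void addListener(Listener listener, int sid) {
            listeners.computeIfAbsent(sid, k -> new ArrayList<>()).add(listener);
        }

        // Walks the stream once. Each record is handed to the listeners
        // registered for its type and is not retained afterwards.
        void process(Iterable<Rec> stream) {
            for (Rec rec : stream) {
                List<Listener> targets =
                        listeners.getOrDefault(rec.sid, Collections.emptyList());
                for (Listener listener : targets) {
                    listener.processRecord(rec);
                }
            }
        }
    }

    // Made-up record type ids, for illustration only.
    static final int LABEL = 1;
    static final int NUMBER = 2;
    static final int STYLE = 3;

    // Collects the payload of LABEL and NUMBER records; STYLE records are
    // never delivered because no listener is registered for them.
    static List<String> extractText(List<Rec> stream) {
        List<String> out = new ArrayList<>();
        Listener collector = rec -> out.add(rec.payload);
        Request request = new Request();
        request.addListener(collector, LABEL);
        request.addListener(collector, NUMBER);
        request.process(stream);
        return out;
    }

    public static void main(String[] args) {
        List<Rec> stream = List.of(
                new Rec(STYLE, "bold"),
                new Rec(LABEL, "Total"),
                new Rec(NUMBER, "42"));
        System.out.println(extractText(stream)); // prints [Total, 42]
    }
}
```

Only the collected strings outlive the pass over the stream, which is the source of the smaller memory footprint. The added complexity is that the listener has to rebuild any structure it needs from the individual records.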

    See Also:
    HSSFListener, POI Event API How To
    • Constructor Summary

      Constructors 
      Constructor Description
      ExcelExtractor​(org.apache.tika.parser.ParseContext context, org.apache.tika.metadata.Metadata metadata)  
    • Method Summary

      Modifier and Type Method Description
      protected org.apache.tika.detect.Detector getDetector()  
      protected String getPassword()
      Returns the password to be used for this file, or null if no password (or the default password) should be used.
      protected org.apache.tika.config.TikaConfig getTikaConfig()  
      protected void handleEmbeddedOfficeDoc​(org.apache.poi.poifs.filesystem.DirectoryEntry dir, String resourceName, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
      Handles an office document that is embedded at the POIFS level.
      protected void handleEmbeddedOfficeDoc​(org.apache.poi.poifs.filesystem.DirectoryEntry dir, org.apache.tika.metadata.Metadata metadata, String resourceName, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
      Handles an office document that is embedded at the POIFS level.
      protected void handleEmbeddedOfficeDoc​(org.apache.poi.poifs.filesystem.DirectoryEntry dir, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
      Handles an office document that is embedded at the POIFS level.
      protected void handleEmbeddedResource​(org.apache.tika.io.TikaInputStream resource, String filename, String relationshipID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)  
      protected void handleEmbeddedResource​(org.apache.tika.io.TikaInputStream resource, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)  
      protected void handleEmbeddedResource​(org.apache.tika.io.TikaInputStream resource, org.apache.tika.metadata.Metadata embeddedMetadata, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)  
      boolean isListenForAllRecords()
      Returns true if this parser is configured to listen for all records instead of just the specified few.
      protected void parse​(org.apache.poi.poifs.filesystem.DirectoryNode root, org.apache.tika.sax.XHTMLContentHandler xhtml, Locale locale)  
      protected void parse​(org.apache.poi.poifs.filesystem.POIFSFileSystem filesystem, org.apache.tika.sax.XHTMLContentHandler xhtml, Locale locale)
      Extracts text from an Excel Workbook, writing the extracted content to the specified XHTMLContentHandler.
      void setListenForAllRecords​(boolean listenForAllRecords)
      Specifies whether this parser should listen for all records or just for the specified few.
      static String tryToGetMsgTitle​(org.apache.poi.poifs.filesystem.DirectoryEntry node, String defaultVal)  
    • Field Detail

      • parentMetadata

        protected final org.apache.tika.metadata.Metadata parentMetadata
      • context

        protected final org.apache.tika.parser.ParseContext context
    • Constructor Detail

      • ExcelExtractor

        public ExcelExtractor​(org.apache.tika.parser.ParseContext context,
                              org.apache.tika.metadata.Metadata metadata)
    • Method Detail

      • isListenForAllRecords

        public boolean isListenForAllRecords()
        Returns true if this parser is configured to listen for all records instead of just the specified few.
      • setListenForAllRecords

        public void setListenForAllRecords​(boolean listenForAllRecords)
        Specifies whether this parser should listen for all records or just for the specified few.

        Note: Under normal operation this setting should be false (the default), but you can experiment with this setting for testing and debugging purposes.

        Parameters:
        listenForAllRecords - true if the HSSFListener should be registered to listen for all records or false if the listener should be configured to only receive specified records.
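
The effect of this flag can be sketched in plain Java under stated assumptions: the `WANTED` set and the record type ids below are invented for illustration and are not the ids this class actually registers. In POI itself, the two modes correspond to `HSSFRequest.addListenerForAllRecords` and `HSSFRequest.addListener`.

```java
import java.util.Set;

// Hypothetical illustration of the listenForAllRecords switch; not Tika or POI code.
class ListenFlagSketch {

    // Made-up record type ids the listener asks for in "specified few" mode.
    static final Set<Integer> WANTED = Set.of(1, 2);

    // Returns how many records from the stream would reach the listener.
    static int delivered(boolean listenForAllRecords, int[] streamSids) {
        int count = 0;
        for (int sid : streamSids) {
            // With the flag set, every record is delivered; otherwise only
            // records whose type id was explicitly requested.
            if (listenForAllRecords || WANTED.contains(sid)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] stream = {1, 3, 2, 4, 5};
        System.out.println(delivered(false, stream)); // prints 2
        System.out.println(delivered(true, stream));  // prints 5
    }
}
```

With the flag enabled the listener sees every record in the stream, which helps when checking what a file contains but means more callbacks per workbook.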
      • parse

        protected void parse​(org.apache.poi.poifs.filesystem.POIFSFileSystem filesystem,
                             org.apache.tika.sax.XHTMLContentHandler xhtml,
                             Locale locale)
                      throws IOException,
                             SAXException,
                             org.apache.tika.exception.TikaException
        Extracts text from an Excel Workbook, writing the extracted content to the specified XHTMLContentHandler.
        Parameters:
        filesystem - POI file system
        Throws:
        IOException - if an error occurs processing the workbook or writing the extracted content
        SAXException
        org.apache.tika.exception.TikaException
      • parse

        protected void parse​(org.apache.poi.poifs.filesystem.DirectoryNode root,
                             org.apache.tika.sax.XHTMLContentHandler xhtml,
                             Locale locale)
                      throws IOException,
                             SAXException,
                             org.apache.tika.exception.TikaException
        Throws:
        IOException
        SAXException
        org.apache.tika.exception.TikaException
      • getTikaConfig

        protected org.apache.tika.config.TikaConfig getTikaConfig()
      • getDetector

        protected org.apache.tika.detect.Detector getDetector()
      • getPassword

        protected String getPassword()
        Returns the password to be used for this file, or null if no password (or the default password) should be used.
      • handleEmbeddedResource

        protected void handleEmbeddedResource​(org.apache.tika.io.TikaInputStream resource,
                                              String filename,
                                              String relationshipID,
                                              String mediaType,
                                              org.apache.tika.sax.XHTMLContentHandler xhtml,
                                              boolean outputHtml)
                                       throws IOException,
                                              SAXException,
                                              org.apache.tika.exception.TikaException
        Throws:
        IOException
        SAXException
        org.apache.tika.exception.TikaException
      • handleEmbeddedResource

        protected void handleEmbeddedResource​(org.apache.tika.io.TikaInputStream resource,
                                              String filename,
                                              String relationshipID,
                                              org.apache.poi.hpsf.ClassID storageClassID,
                                              String mediaType,
                                              org.apache.tika.sax.XHTMLContentHandler xhtml,
                                              boolean outputHtml)
                                       throws IOException,
                                              SAXException,
                                              org.apache.tika.exception.TikaException
        Throws:
        IOException
        SAXException
        org.apache.tika.exception.TikaException
      • handleEmbeddedResource

        protected void handleEmbeddedResource​(org.apache.tika.io.TikaInputStream resource,
                                              org.apache.tika.metadata.Metadata embeddedMetadata,
                                              String filename,
                                              String relationshipID,
                                              org.apache.poi.hpsf.ClassID storageClassID,
                                              String mediaType,
                                              org.apache.tika.sax.XHTMLContentHandler xhtml,
                                              boolean outputHtml)
                                       throws IOException,
                                              SAXException,
                                              org.apache.tika.exception.TikaException
        Throws:
        IOException
        SAXException
        org.apache.tika.exception.TikaException
      • handleEmbeddedOfficeDoc

        protected void handleEmbeddedOfficeDoc​(org.apache.poi.poifs.filesystem.DirectoryEntry dir,
                                               org.apache.tika.sax.XHTMLContentHandler xhtml,
                                               boolean outputHtml)
                                        throws IOException,
                                               SAXException,
                                               org.apache.tika.exception.TikaException
        Handles an office document that is embedded at the POIFS level.
        Throws:
        IOException
        SAXException
        org.apache.tika.exception.TikaException
      • handleEmbeddedOfficeDoc

        protected void handleEmbeddedOfficeDoc​(org.apache.poi.poifs.filesystem.DirectoryEntry dir,
                                               String resourceName,
                                               org.apache.tika.sax.XHTMLContentHandler xhtml,
                                               boolean outputHtml)
                                        throws IOException,
                                               SAXException,
                                               org.apache.tika.exception.TikaException
        Handles an office document that is embedded at the POIFS level.
        Throws:
        IOException
        SAXException
        org.apache.tika.exception.TikaException
      • handleEmbeddedOfficeDoc

        protected void handleEmbeddedOfficeDoc​(org.apache.poi.poifs.filesystem.DirectoryEntry dir,
                                               org.apache.tika.metadata.Metadata metadata,
                                               String resourceName,
                                               org.apache.tika.sax.XHTMLContentHandler xhtml,
                                               boolean outputHtml)
                                        throws IOException,
                                               SAXException,
                                               org.apache.tika.exception.TikaException
        Handles an office document that is embedded at the POIFS level.
        Throws:
        IOException
        SAXException
        org.apache.tika.exception.TikaException
      • tryToGetMsgTitle

        public static String tryToGetMsgTitle​(org.apache.poi.poifs.filesystem.DirectoryEntry node,
                                              String defaultVal)