Class ExcelExtractor

java.lang.Object
org.apache.tika.parser.microsoft.ExcelExtractor

public class ExcelExtractor extends Object
Excel parser implementation which uses POI's Event API to handle the contents of a Workbook.

The Event API uses a much smaller memory footprint than HSSFWorkbook when processing Excel files, but at the cost of more complexity.

With the Event API, a listener is registered for specific record types, and those records are created, fired off to the listener, and then discarded as the stream is being processed.
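
The listener model described above can be illustrated with a small, library-free sketch. It is not POI's or Tika's code: `EventStreamSketch`, `Rec`, `Dispatcher`, `listenFor`, and `listenForAll` are hypothetical names invented for this example, and the integer type ids are arbitrary. It only shows the general idea of registering for some record types, receiving each matching record once, and keeping no reference to it afterwards.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Toy model of event-style record processing; none of these names come from POI or Tika. */
final class EventStreamSketch {

    /** Hypothetical record: a numeric type id plus a text payload. */
    static final class Rec {
        final int sid;
        final String payload;

        Rec(int sid, String payload) {
            this.sid = sid;
            this.payload = payload;
        }
    }

    /** Hypothetical dispatcher that delivers only the record types that were registered. */
    static final class Dispatcher {
        private final Set<Integer> wanted = new HashSet<>();
        private final Consumer<Rec> listener;
        private boolean allRecords;

        Dispatcher(Consumer<Rec> listener) {
            this.listener = listener;
        }

        void listenFor(int sid) {
            wanted.add(sid);
        }

        void listenForAll() {
            allRecords = true;
        }

        /** Walks the stream once; the dispatcher keeps no reference to any record. */
        void process(Iterable<Rec> stream) {
            for (Rec r : stream) {
                if (allRecords || wanted.contains(r.sid)) {
                    listener.accept(r);
                }
            }
        }
    }

    /** Collects the payloads of records whose type ids were requested. */
    static List<String> collect(List<Rec> stream, int... sids) {
        List<String> seen = new ArrayList<>();
        Dispatcher d = new Dispatcher(r -> seen.add(r.payload));
        for (int sid : sids) {
            d.listenFor(sid);
        }
        d.process(stream);
        return seen;
    }

    /** Collects the payloads of every record, regardless of type. */
    static List<String> collectAll(List<Rec> stream) {
        List<String> seen = new ArrayList<>();
        Dispatcher d = new Dispatcher(r -> seen.add(r.payload));
        d.listenForAll();
        d.process(stream);
        return seen;
    }

    public static void main(String[] args) {
        List<Rec> stream = List.of(
                new Rec(1, "sheet start"), new Rec(2, "A1=hello"),
                new Rec(3, "formatting"), new Rec(2, "B1=world"));
        System.out.println(collect(stream, 2)); // prints [A1=hello, B1=world]
    }
}
```

In this sketch the listener decides what to retain (here, only the payload strings), which is where the memory saving comes from; the price is that the listener must track any state it needs across records itself.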

See Also:
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected final org.apache.tika.parser.ParseContext
    context
     
    protected final OfficeParserConfig
    officeParserConfig
     
    protected final org.apache.tika.metadata.Metadata
    parentMetadata
     
  • Constructor Summary

    Constructors
    Constructor
    Description
    ExcelExtractor(org.apache.tika.parser.ParseContext context, org.apache.tika.metadata.Metadata metadata)
     
  • Method Summary

    Modifier and Type
    Method
    Description
    protected org.apache.tika.detect.Detector
    getDetector()
     
    protected String
    getPassword()
    Returns the password to be used for this file, or null if no password or the default password should be used
    protected org.apache.tika.config.TikaConfig
    getTikaConfig()
     
    protected void
    handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, String resourceName, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
    Handle an office document that's embedded at the POIFS level
    protected void
    handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
    Handle an office document that's embedded at the POIFS level
    protected void
    handleEmbeddedResource(org.apache.tika.io.TikaInputStream resource, String filename, String relationshipID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
     
    protected void
    handleEmbeddedResource(org.apache.tika.io.TikaInputStream resource, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
     
    protected void
    handleEmbeddedResource(org.apache.tika.io.TikaInputStream resource, org.apache.tika.metadata.Metadata embeddedMetadata, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
     
    boolean
    isListenForAllRecords()
    Returns true if this parser is configured to listen for all records instead of just the specified few.
    protected void
    parse(org.apache.poi.poifs.filesystem.DirectoryNode root, org.apache.tika.sax.XHTMLContentHandler xhtml, Locale locale)
     
    protected void
    parse(org.apache.poi.poifs.filesystem.POIFSFileSystem filesystem, org.apache.tika.sax.XHTMLContentHandler xhtml, Locale locale)
    Extracts text from an Excel workbook, writing the extracted content to the specified XHTMLContentHandler.
    void
    setListenForAllRecords(boolean listenForAllRecords)
    Specifies whether this parser should listen for all records or just for the specified few.
    static String
    tryToGetMsgTitle(org.apache.poi.poifs.filesystem.DirectoryEntry node, String defaultVal)
     

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • parentMetadata

      protected final org.apache.tika.metadata.Metadata parentMetadata
    • officeParserConfig

      protected final OfficeParserConfig officeParserConfig
    • context

      protected final org.apache.tika.parser.ParseContext context
  • Constructor Details

    • ExcelExtractor

      public ExcelExtractor(org.apache.tika.parser.ParseContext context, org.apache.tika.metadata.Metadata metadata)
  • Method Details

    • isListenForAllRecords

      public boolean isListenForAllRecords()
      Returns true if this parser is configured to listen for all records instead of just the specified few.
    • setListenForAllRecords

      public void setListenForAllRecords(boolean listenForAllRecords)
      Specifies whether this parser should listen for all records or just for the specified few.

      Note: Under normal operation this setting should be false (the default), but you can experiment with this setting for testing and debugging purposes.

      Parameters:
      listenForAllRecords - true if the HSSFListener should be registered to listen for all records or false if the listener should be configured to only receive specified records.
    • parse

      protected void parse(org.apache.poi.poifs.filesystem.POIFSFileSystem filesystem, org.apache.tika.sax.XHTMLContentHandler xhtml, Locale locale) throws IOException, SAXException, org.apache.tika.exception.TikaException
      Extracts text from an Excel workbook, writing the extracted content to the specified XHTMLContentHandler.
      Parameters:
      filesystem - POI file system
      Throws:
      IOException - if an error occurs processing the workbook or writing the extracted content
      SAXException
      org.apache.tika.exception.TikaException
    • parse

      protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, org.apache.tika.sax.XHTMLContentHandler xhtml, Locale locale) throws IOException, SAXException, org.apache.tika.exception.TikaException
      Throws:
      IOException
      SAXException
      org.apache.tika.exception.TikaException
    • getTikaConfig

      protected org.apache.tika.config.TikaConfig getTikaConfig()
    • getDetector

      protected org.apache.tika.detect.Detector getDetector()
    • getPassword

      protected String getPassword()
      Returns the password to be used for this file, or null if no password or the default password should be used
    • handleEmbeddedResource

      protected void handleEmbeddedResource(org.apache.tika.io.TikaInputStream resource, String filename, String relationshipID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, org.apache.tika.exception.TikaException
      Throws:
      IOException
      SAXException
      org.apache.tika.exception.TikaException
    • handleEmbeddedResource

      protected void handleEmbeddedResource(org.apache.tika.io.TikaInputStream resource, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, org.apache.tika.exception.TikaException
      Throws:
      IOException
      SAXException
      org.apache.tika.exception.TikaException
    • handleEmbeddedResource

      protected void handleEmbeddedResource(org.apache.tika.io.TikaInputStream resource, org.apache.tika.metadata.Metadata embeddedMetadata, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, org.apache.tika.exception.TikaException
      Throws:
      IOException
      SAXException
      org.apache.tika.exception.TikaException
    • handleEmbeddedOfficeDoc

      protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, org.apache.tika.exception.TikaException
      Handle an office document that's embedded at the POIFS level
      Throws:
      IOException
      SAXException
      org.apache.tika.exception.TikaException
    • handleEmbeddedOfficeDoc

      protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, String resourceName, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, org.apache.tika.exception.TikaException
      Handle an office document that's embedded at the POIFS level
      Throws:
      IOException
      SAXException
      org.apache.tika.exception.TikaException
    • tryToGetMsgTitle

      public static String tryToGetMsgTitle(org.apache.poi.poifs.filesystem.DirectoryEntry node, String defaultVal)