Package org.apache.tika.parser.microsoft
Class ExcelExtractor
- java.lang.Object
-
- org.apache.tika.parser.microsoft.ExcelExtractor
-
public class ExcelExtractor extends Object
Excel parser implementation which uses POI's Event API to handle the contents of a Workbook. The Event API uses a much smaller memory footprint than HSSFWorkbook when processing Excel files, but at the cost of more complexity. With the Event API, a listener is registered for specific record types, and those records are created, fired off to the listener, and then discarded as the stream is being processed.
See Also:
HSSFListener, POI Event API How To
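The event model described above can be sketched without any POI dependency. The following is a simplified, self-contained illustration of the pattern only, not POI's or Tika's actual code: `EventModelSketch`, `Rec`, `Listener`, `Dispatcher`, and the record type ids are all hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EventModelSketch {
    // Hypothetical record type ids; real ids are defined by the file format.
    public static final int LABEL = 1;
    public static final int NUMBER = 2;
    public static final int FONT = 3;

    /** Hypothetical stand-in for a workbook record: a type id plus a payload. */
    public static final class Rec {
        final int sid;
        final String payload;

        Rec(int sid, String payload) {
            this.sid = sid;
            this.payload = payload;
        }
    }

    /** Hypothetical callback interface, playing the role of a record listener. */
    public interface Listener {
        void process(Rec rec);
    }

    /** Routes each record to the listeners registered for its type id. */
    public static final class Dispatcher {
        private final Map<Integer, List<Listener>> listeners = new HashMap<>();

        void addListener(int sid, Listener listener) {
            listeners.computeIfAbsent(sid, k -> new ArrayList<>()).add(listener);
        }

        /** Single pass over the stream; no record is stored after it is dispatched. */
        void process(Iterable<Rec> stream) {
            for (Rec rec : stream) {
                List<Listener> registered = listeners.get(rec.sid);
                if (registered == null) {
                    continue; // nobody asked for this record type, so it is skipped
                }
                for (Listener listener : registered) {
                    listener.process(rec);
                }
            }
        }
    }

    /** Builds a tiny sample stream and collects text from LABEL and NUMBER records only. */
    public static String demo() {
        List<Rec> stream = Arrays.asList(
                new Rec(FONT, "Arial"),
                new Rec(LABEL, "Name"),
                new Rec(NUMBER, "42"));
        StringBuilder out = new StringBuilder();
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.addListener(LABEL, r -> out.append(r.payload).append('\n'));
        dispatcher.addListener(NUMBER, r -> out.append(r.payload).append('\n'));
        dispatcher.process(stream);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

The point of the design is that the dispatcher never builds an in-memory model of the whole workbook. Only the listener's own state (here, the `StringBuilder`) grows with the input, which is where the smaller memory footprint comes from.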
-
-
Field Summary
Fields
protected org.apache.tika.parser.ParseContext context
protected OfficeParserConfig officeParserConfig
protected org.apache.tika.metadata.Metadata parentMetadata
-
Constructor Summary
Constructors
ExcelExtractor(org.apache.tika.parser.ParseContext context, org.apache.tika.metadata.Metadata metadata)
-
Method Summary
Methods
protected org.apache.tika.detect.Detector getDetector()
protected String getPassword()
    Returns the password to be used for this file, or null if no / default password should be used
protected org.apache.tika.config.TikaConfig getTikaConfig()
protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, String resourceName, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
    Handle an office document that's embedded at the POIFS level
protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, org.apache.tika.metadata.Metadata metadata, String resourceName, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
    Handle an office document that's embedded at the POIFS level
protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
    Handle an office document that's embedded at the POIFS level
protected void handleEmbeddedResource(org.apache.tika.io.TikaInputStream resource, String filename, String relationshipID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
protected void handleEmbeddedResource(org.apache.tika.io.TikaInputStream resource, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
protected void handleEmbeddedResource(org.apache.tika.io.TikaInputStream resource, org.apache.tika.metadata.Metadata embeddedMetadata, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml)
boolean isListenForAllRecords()
    Returns true if this parser is configured to listen for all records instead of just the specified few.
protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, org.apache.tika.sax.XHTMLContentHandler xhtml, Locale locale)
protected void parse(org.apache.poi.poifs.filesystem.POIFSFileSystem filesystem, org.apache.tika.sax.XHTMLContentHandler xhtml, Locale locale)
    Extracts text from an Excel Workbook writing the extracted content to the specified Appendable.
void setListenForAllRecords(boolean listenForAllRecords)
    Specifies whether this parser should listen for all records or just for the specified few.
static String tryToGetMsgTitle(org.apache.poi.poifs.filesystem.DirectoryEntry node, String defaultVal)
-
-
-
Field Detail
-
parentMetadata
protected final org.apache.tika.metadata.Metadata parentMetadata
-
officeParserConfig
protected final OfficeParserConfig officeParserConfig
-
context
protected final org.apache.tika.parser.ParseContext context
-
-
Method Detail
-
isListenForAllRecords
public boolean isListenForAllRecords()
Returns true if this parser is configured to listen for all records instead of just the specified few.
-
setListenForAllRecords
public void setListenForAllRecords(boolean listenForAllRecords)
Specifies whether this parser should listen for all records or just for the specified few. Note: Under normal operation this setting should be false (the default), but you can experiment with this setting for testing and debugging purposes.
Parameters:
listenForAllRecords - true if the HSSFListener should be registered to listen for all records, or false if the listener should be configured to only receive specified records.
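The effect of this flag can be illustrated with a small standalone model. This is a hedged sketch of the behavior the description implies, not the actual ExcelExtractor implementation: the class `ListenForAllSketch`, the `delivered` helper, and the record type ids are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ListenForAllSketch {
    // Hypothetical record type ids; real ids are defined by the file format.
    public static final int LABEL = 1;
    public static final int NUMBER = 2;
    public static final int FONT = 3;

    // The "specified few" record types this model listens for by default.
    private static final Set<Integer> SPECIFIED =
            new HashSet<>(Arrays.asList(LABEL, NUMBER));

    // false is the default, matching the documented normal operation.
    private boolean listenForAllRecords = false;

    public boolean isListenForAllRecords() {
        return listenForAllRecords;
    }

    public void setListenForAllRecords(boolean listenForAllRecords) {
        this.listenForAllRecords = listenForAllRecords;
    }

    /** Returns the type ids that would reach the listener, in stream order. */
    public List<Integer> delivered(int... streamSids) {
        List<Integer> seen = new ArrayList<>();
        for (int sid : streamSids) {
            if (listenForAllRecords || SPECIFIED.contains(sid)) {
                seen.add(sid);
            }
        }
        return seen;
    }
}
```

With the default setting, a stream of LABEL, FONT, NUMBER delivers only LABEL and NUMBER. After `setListenForAllRecords(true)`, all three reach the listener, which is useful for inspecting what a file contains while debugging.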
-
parse
protected void parse(org.apache.poi.poifs.filesystem.POIFSFileSystem filesystem, org.apache.tika.sax.XHTMLContentHandler xhtml, Locale locale) throws IOException, SAXException, org.apache.tika.exception.TikaException
Extracts text from an Excel Workbook writing the extracted content to the specified Appendable.
Parameters:
filesystem - POI file system
Throws:
IOException - if an error occurs processing the workbook or writing the extracted content
SAXException
org.apache.tika.exception.TikaException
-
parse
protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, org.apache.tika.sax.XHTMLContentHandler xhtml, Locale locale) throws IOException, SAXException, org.apache.tika.exception.TikaException
Throws:
IOException
SAXException
org.apache.tika.exception.TikaException
-
getTikaConfig
protected org.apache.tika.config.TikaConfig getTikaConfig()
-
getDetector
protected org.apache.tika.detect.Detector getDetector()
-
getPassword
protected String getPassword()
Returns the password to be used for this file, or null if no password or the default password should be used.
-
handleEmbeddedResource
protected void handleEmbeddedResource(org.apache.tika.io.TikaInputStream resource, String filename, String relationshipID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, org.apache.tika.exception.TikaException
Throws:
IOException
SAXException
org.apache.tika.exception.TikaException
-
handleEmbeddedResource
protected void handleEmbeddedResource(org.apache.tika.io.TikaInputStream resource, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, org.apache.tika.exception.TikaException
Throws:
IOException
SAXException
org.apache.tika.exception.TikaException
-
handleEmbeddedResource
protected void handleEmbeddedResource(org.apache.tika.io.TikaInputStream resource, org.apache.tika.metadata.Metadata embeddedMetadata, String filename, String relationshipID, org.apache.poi.hpsf.ClassID storageClassID, String mediaType, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, org.apache.tika.exception.TikaException
Throws:
IOException
SAXException
org.apache.tika.exception.TikaException
-
handleEmbeddedOfficeDoc
protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, org.apache.tika.exception.TikaException
Handle an office document that's embedded at the POIFS level
Throws:
IOException
SAXException
org.apache.tika.exception.TikaException
-
handleEmbeddedOfficeDoc
protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, String resourceName, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, org.apache.tika.exception.TikaException
Handle an office document that's embedded at the POIFS level
Throws:
IOException
SAXException
org.apache.tika.exception.TikaException
-
handleEmbeddedOfficeDoc
protected void handleEmbeddedOfficeDoc(org.apache.poi.poifs.filesystem.DirectoryEntry dir, org.apache.tika.metadata.Metadata metadata, String resourceName, org.apache.tika.sax.XHTMLContentHandler xhtml, boolean outputHtml) throws IOException, SAXException, org.apache.tika.exception.TikaException
Handle an office document that's embedded at the POIFS level
Throws:
IOException
SAXException
org.apache.tika.exception.TikaException
-
-