Package org.apache.tika.parser.microsoft
Class OfficeParser
java.lang.Object
org.apache.tika.parser.microsoft.AbstractOfficeParser
org.apache.tika.parser.microsoft.OfficeParser
- All Implemented Interfaces:
Serializable, org.apache.tika.parser.Parser
Defines a Microsoft document content extractor.
Nested Class Summary
Nested Classes -
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
- static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, org.apache.tika.extractor.EmbeddedDocumentExtractor embeddedDocumentExtractor)
Helper to extract macros from an NPOIFS/vbaProject.bin
- Set<org.apache.tika.mime.MediaType> getSupportedTypes(org.apache.tika.parser.ParseContext context)
- static org.apache.poi.poifs.filesystem.Entry getUCEntry(org.apache.poi.poifs.filesystem.DirectoryEntry root, String ucTarget)
Looks for entry within root (non-recursive) that has an upper-cased name that equals ucTarget
- void parse(InputStream stream, ContentHandler handler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext context)
Extracts properties and text from an MS Document input stream
- protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, org.apache.tika.parser.ParseContext context, org.apache.tika.metadata.Metadata metadata, org.apache.tika.sax.XHTMLContentHandler xhtml)
Methods inherited from class org.apache.tika.parser.microsoft.AbstractOfficeParser
configure, getByteArrayMaxOverride, getDateFormatOverride, isConcatenatePhoneticRuns, isExtractAllAlternativesFromMSG, isExtractMacros, isIncludeDeletedContent, isIncludeHeadersAndFooters, isIncludeMoveFromContent, isIncludeShapeBasedContent, isUseSAXDocxExtractor, isUseSAXPptxExtractor, setByteArrayMaxOverride, setConcatenatePhoneticRuns, setDateFormatOverride, setExtractAllAlternativesFromMSG, setExtractMacros, setIncludeDeletedContent, setIncludeHeadersAndFooters, setIncludeMoveFromContent, setIncludeShapeBasedContent, setUseSAXDocxExtractor, setUseSAXPptxExtractor
-
Constructor Details
-
OfficeParser
public OfficeParser()
-
-
Method Details
-
extractMacros
public static void extractMacros(org.apache.poi.poifs.filesystem.POIFSFileSystem fs, ContentHandler xhtml, org.apache.tika.extractor.EmbeddedDocumentExtractor embeddedDocumentExtractor) throws IOException, SAXException
Helper to extract macros from an NPOIFS/vbaProject.bin. As of POI-3.15-final, there are still some bugs in VBAMacroReader. For now, we are swallowing NPE and other runtime exceptions.
- Parameters:
fs - NPOIFS to extract from
xhtml - SAX writer
embeddedDocumentExtractor - extractor for embedded documents
- Throws:
IOException - if an I/O error occurs during the extraction of the embedded doc
SAXException - if writing to xhtml fails
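The note about swallowing NPE and other runtime exceptions can be illustrated with a small, dependency-free sketch. This is not the Tika implementation: the class name MacroSwallowSketch, the method readMacrosLeniently, and the use of a Supplier in place of VBAMacroReader are all hypothetical, and the sketch covers only the runtime-exception case, not IOException or SAXException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in Tika or POI.
final class MacroSwallowSketch {

    // Returns the macros produced by the reader, or an empty list if the
    // reader fails with any RuntimeException (including NullPointerException).
    static List<String> readMacrosLeniently(Supplier<List<String>> reader) {
        try {
            return new ArrayList<>(reader.get());
        } catch (RuntimeException e) {
            // Swallowed on purpose: a bug in the reader should not abort
            // extraction of the rest of the document.
            return new ArrayList<>();
        }
    }

    public static void main(String[] args) {
        // A well-behaved reader: its macros are returned unchanged.
        System.out.println(readMacrosLeniently(() -> List.of("Module1")));
        // A buggy reader: the exception is swallowed and no macros are reported.
        System.out.println(readMacrosLeniently(() -> {
            throw new NullPointerException("reader bug");
        }));
    }
}
```

The trade-off in this pattern is that a document whose macros cannot be read looks the same to the caller as a document with no macros.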
-
getSupportedTypes
public Set<org.apache.tika.mime.MediaType> getSupportedTypes(org.apache.tika.parser.ParseContext context)
-
parse
public void parse(InputStream stream, ContentHandler handler, org.apache.tika.metadata.Metadata metadata, org.apache.tika.parser.ParseContext context) throws IOException, SAXException, org.apache.tika.exception.TikaException
Extracts properties and text from an MS Document input stream
- Throws:
IOException
SAXException
org.apache.tika.exception.TikaException
-
parse
protected void parse(org.apache.poi.poifs.filesystem.DirectoryNode root, org.apache.tika.parser.ParseContext context, org.apache.tika.metadata.Metadata metadata, org.apache.tika.sax.XHTMLContentHandler xhtml) throws IOException, SAXException, org.apache.tika.exception.TikaException
- Throws:
IOException
SAXException
org.apache.tika.exception.TikaException
-
getUCEntry
public static org.apache.poi.poifs.filesystem.Entry getUCEntry(org.apache.poi.poifs.filesystem.DirectoryEntry root, String ucTarget)
Looks for an entry within root (non-recursive) whose upper-cased name equals ucTarget
- Parameters:
root - the directory entry whose direct children are searched
ucTarget - the upper-cased name to match
- Returns:
the matching entry
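The lookup that getUCEntry describes (direct children only, compared by upper-cased name) can be sketched on plain strings without any POI types. Everything here is an assumption made for illustration: the class UcLookupSketch and the method findUpperCased are hypothetical, the use of Locale.ROOT for upper-casing is a guess, and the Optional return type is a choice of this sketch, since this page does not say what getUCEntry returns when nothing matches.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical illustration only; this models the lookup on plain names,
// not on org.apache.poi.poifs.filesystem.Entry objects.
final class UcLookupSketch {

    // Returns the first direct child name whose upper-cased form equals
    // ucTarget. Nested children are never visited (non-recursive).
    static Optional<String> findUpperCased(List<String> childNames, String ucTarget) {
        for (String name : childNames) {
            // Locale.ROOT is an assumption; it keeps the comparison
            // independent of the JVM's default locale.
            if (name.toUpperCase(Locale.ROOT).equals(ucTarget)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> children = List.of("WordDocument", "Macros", "1Table");
        System.out.println(findUpperCased(children, "MACROS").orElse("not found"));
        System.out.println(findUpperCased(children, "VBA").orElse("not found"));
    }
}
```

Note that the target must already be upper-cased: in this sketch, passing "Macros" rather than "MACROS" finds nothing, because only the child names are converted before the comparison.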