Class AbstractOfficeParser

java.lang.Object
org.apache.tika.parser.AbstractParser
org.apache.tika.parser.microsoft.AbstractOfficeParser
All Implemented Interfaces:
Serializable, Parser
Direct Known Subclasses:
OfficeParser, OOXMLParser, Word2006MLParser

public abstract class AbstractOfficeParser extends AbstractParser
Intermediate layer to set OfficeParserConfig uniformly.
  • Constructor Details

    • AbstractOfficeParser

      public AbstractOfficeParser()
  • Method Details

    • configure

      public void configure(ParseContext parseContext)
      Checks to see if the user has specified an OfficeParserConfig. If so, no changes are made; if not, one is added to the context.
      Parameters:
      parseContext - the context to check for an existing OfficeParserConfig and, if none is present, to add one to
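      The "add only if absent" rule described for configure can be sketched as below. This is a minimal, self-contained illustration, not Tika's source: SketchConfig and SketchContext are hypothetical stand-ins for OfficeParserConfig and ParseContext, and the assumption that the parser keeps one default config which its setters modify is ours.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: these types imitate, but are not, the Tika classes. */
final class ConfigureSketch {

    /** Hypothetical stand-in for OfficeParserConfig, reduced to one flag. */
    static final class SketchConfig {
        boolean extractMacros;

        SketchConfig(boolean extractMacros) {
            this.extractMacros = extractMacros;
        }
    }

    /** Hypothetical stand-in for ParseContext: objects looked up by class. */
    static final class SketchContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key)); // null when nothing was set
        }
    }

    /** Parser-level default config (assumed to be what the setters adjust). */
    private final SketchConfig defaults = new SketchConfig(false);

    void setExtractMacros(boolean extractMacros) {
        defaults.extractMacros = extractMacros;
    }

    /** Leaves a caller-supplied config untouched; otherwise adds the default. */
    void configure(SketchContext context) {
        if (context.get(SketchConfig.class) == null) {
            context.set(SketchConfig.class, defaults);
        }
    }
}
```

      Under this reading, a config that the caller places in the context always takes precedence over values set on the parser itself.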
    • getIncludeDeletedContent

      public boolean getIncludeDeletedContent()
      Returns:
      whether or not to include deleted content
    • getIncludeMoveFromContent

      public boolean getIncludeMoveFromContent()
      Returns:
      whether or not to include moveFrom content
    • getUseSAXDocxExtractor

      public boolean getUseSAXDocxExtractor()
      Returns:
      whether or not to use the SAX docx extractor
    • getExtractMacros

      public boolean getExtractMacros()
      Returns:
      whether or not to extract macros
    • setIncludeDeletedContent

      @Field public void setIncludeDeletedContent(boolean includeDeletedConent)
    • setIncludeMoveFromContent

      @Field public void setIncludeMoveFromContent(boolean includeMoveFromContent)
    • setIncludeShapeBasedContent

      @Field public void setIncludeShapeBasedContent(boolean includeShapeBasedContent)
    • setUseSAXDocxExtractor

      @Field public void setUseSAXDocxExtractor(boolean useSAXDocxExtractor)
    • setUseSAXPptxExtractor

      @Field public void setUseSAXPptxExtractor(boolean useSAXPptxExtractor)
    • setExtractMacros

      @Field public void setExtractMacros(boolean extractMacros)
    • setConcatenatePhoneticRuns

      @Field public void setConcatenatePhoneticRuns(boolean concatenatePhoneticRuns)
    • setExtractAllAlternativesFromMSG

      @Field public void setExtractAllAlternativesFromMSG(boolean extractAllAlternativesFromMSG)
      Some .msg files contain body content in HTML, RTF and/or plain text. By default, only the first non-null body is extracted. To extract every non-null body, which is likely to produce duplicate content, set this value to true.
      Parameters:
      extractAllAlternativesFromMSG - whether or not to extract all alternative parts from msg files
      Since:
      1.17
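      The difference between the default and extractAllAlternativesFromMSG = true can be sketched as follows. This is an illustration under stated assumptions, not the parser's code: selectBodies is a hypothetical helper, and the order in which the alternatives are passed stands in for whatever order the parser actually tries them in.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: models the selection rule, not the real .msg parsing. */
final class MsgBodySketch {

    /**
     * @param extractAll   mirrors the extractAllAlternativesFromMSG flag
     * @param alternatives body candidates in the order they are tried;
     *                     an entry is null when that format is absent
     * @return the bodies that would be extracted
     */
    static List<String> selectBodies(boolean extractAll, String... alternatives) {
        List<String> selected = new ArrayList<>();
        for (String body : alternatives) {
            if (body == null) {
                continue; // this format is not present in the message
            }
            selected.add(body);
            if (!extractAll) {
                break; // default: stop at the first non-null body
            }
        }
        return selected;
    }
}
```

      With candidates (null, "rtf body", "text body"), the default yields only "rtf body", while extractAll = true yields both remaining bodies.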
    • getExtractAllAlternativesFromMSG

      public boolean getExtractAllAlternativesFromMSG()
    • setByteArrayMaxOverride

      @Field public void setByteArrayMaxOverride(int maxOverride)
      WARNING: this sets a static variable in POI, so the change is not limited to this parser instance. It lets users override POI's protection against allocating overly large byte arrays. Use it carefully, and please open issues on POI's Bugzilla to request higher limits for specific records.
      Parameters:
      maxOverride - the maximum byte array length to allow in place of POI's own limits
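      Why a static override deserves the warning can be seen in a small model. The class below is a hypothetical sketch, not POI's implementation: the name ByteArrayLimitSketch, the default cap of 1,000,000 bytes, and the "values <= 0 mean no override" convention are all assumptions made for illustration.

```java
/** Illustration only: a process-wide allocation guard with an override. */
final class ByteArrayLimitSketch {

    /** Hypothetical default cap, chosen only for this example. */
    private static final int DEFAULT_MAX = 1_000_000;

    /** Static, so one call changes the limit for every caller in the JVM. */
    private static int maxOverride = -1;

    static void setMaxOverride(int value) {
        maxOverride = value;
    }

    /** Allocates the array only if the requested length is within the limit. */
    static byte[] allocate(int length) {
        int limit = maxOverride > 0 ? maxOverride : DEFAULT_MAX;
        if (length < 0 || length > limit) {
            throw new IllegalStateException(
                    "Refusing to allocate " + length + " bytes; limit is " + limit);
        }
        return new byte[length];
    }
}
```

      Because the override is static in this model, raising it for one document also raises it for every other parse running in the same process, which is the reason to set it sparingly.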
    • setDateFormatOverride

      @Field public void setDateFormatOverride(String format)