Class POIFSContainerDetector

java.lang.Object
org.apache.tika.detect.microsoft.POIFSContainerDetector
All Implemented Interfaces:
Serializable, org.apache.tika.detect.Detector

public class POIFSContainerDetector extends Object implements org.apache.tika.detect.Detector
A detector that inspects a POIFS OLE2 document to determine exactly what kind of file it is. It should work for all OLE2 documents, whether or not POI supports them.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final org.apache.tika.mime.MediaType
    COMP_OBJ
    Some other kind of embedded document, in a CompObj container within another OLE2 document
    static final org.apache.tika.mime.MediaType
    DGN_8
     
    static final org.apache.tika.mime.MediaType
    DOC
    Microsoft Word
    static final org.apache.tika.mime.MediaType
    DRM_ENCRYPTED
    TIKA-3666 MSOffice or other file encrypted with DRM in an OLE container
    static final org.apache.tika.mime.MediaType
    ESRI_LAYER
     
    static final org.apache.tika.mime.MediaType
    GENERAL_EMBEDDED
    General embedded document type within an OLE2 container
    static final org.apache.tika.mime.MediaType
    MPP
    Microsoft Project
    static final org.apache.tika.mime.MediaType
    MS_EQUATION
    Equation embedded in Office docs
    static final org.apache.tika.mime.MediaType
    MS_GRAPH_CHART
    Graph/Charts embedded in PowerPoint and Excel
    static final org.apache.tika.mime.MediaType
    MSG
    Microsoft Outlook
    static final String
    OCX_NAME
     
    static final org.apache.tika.mime.MediaType
    OLE
    The OLE base file format
    static final org.apache.tika.mime.MediaType
    OLE10_NATIVE
    An OLE10 Native embedded document within another OLE2 document
    static final org.apache.tika.mime.MediaType
    OOXML_PROTECTED
    The protected OOXML base file format
    static final org.apache.tika.mime.MediaType
    PPT
    Microsoft PowerPoint
    static final org.apache.tika.mime.MediaType
    PUB
    Microsoft Publisher
    static final org.apache.tika.mime.MediaType
    SDA
    StarOffice Draw
    static final org.apache.tika.mime.MediaType
    SDC
    StarOffice Calc
    static final org.apache.tika.mime.MediaType
    SDD
    StarOffice Impress
    static final org.apache.tika.mime.MediaType
    SDW
    StarOffice Writer
    static final org.apache.tika.mime.MediaType
    SLDWORKS
    SolidWorks CAD file
    static final org.apache.tika.mime.MediaType
    VSD
    Microsoft Visio
    static final org.apache.tika.mime.MediaType
    WPS
    Microsoft Works
    static final org.apache.tika.mime.MediaType
    XLR
    Microsoft Works Spreadsheet 7.0
    static final org.apache.tika.mime.MediaType
    XLS
    Microsoft Excel
  • Constructor Summary

    Constructors
    Constructor
    Description
    POIFSContainerDetector()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    org.apache.tika.mime.MediaType
    detect(InputStream input, org.apache.tika.metadata.Metadata metadata)
     
    static org.apache.tika.mime.MediaType
    detect(Set<String> anyCaseNames, org.apache.poi.poifs.filesystem.DirectoryEntry root)
    Internal detection of the specific kind of OLE2 document, based on the names of the top-level streams within the file.
    void
    setMarkLimit(int markLimit)
    If a TikaInputStream is passed in to detect(InputStream, Metadata), and there is not an underlying file, this detector will spool up to markLimit to disk.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • OLE

      public static final org.apache.tika.mime.MediaType OLE
      The OLE base file format
    • OOXML_PROTECTED

      public static final org.apache.tika.mime.MediaType OOXML_PROTECTED
      The protected OOXML base file format
    • DRM_ENCRYPTED

      public static final org.apache.tika.mime.MediaType DRM_ENCRYPTED
      TIKA-3666 MSOffice or other file encrypted with DRM in an OLE container
    • GENERAL_EMBEDDED

      public static final org.apache.tika.mime.MediaType GENERAL_EMBEDDED
      General embedded document type within an OLE2 container
    • OLE10_NATIVE

      public static final org.apache.tika.mime.MediaType OLE10_NATIVE
      An OLE10 Native embedded document within another OLE2 document
    • COMP_OBJ

      public static final org.apache.tika.mime.MediaType COMP_OBJ
      Some other kind of embedded document, in a CompObj container within another OLE2 document
    • MS_GRAPH_CHART

      public static final org.apache.tika.mime.MediaType MS_GRAPH_CHART
      Graph/Charts embedded in PowerPoint and Excel
    • MS_EQUATION

      public static final org.apache.tika.mime.MediaType MS_EQUATION
      Equation embedded in Office docs
    • OCX_NAME

      public static final String OCX_NAME
    • XLS

      public static final org.apache.tika.mime.MediaType XLS
      Microsoft Excel
    • DOC

      public static final org.apache.tika.mime.MediaType DOC
      Microsoft Word
    • PPT

      public static final org.apache.tika.mime.MediaType PPT
      Microsoft PowerPoint
    • PUB

      public static final org.apache.tika.mime.MediaType PUB
      Microsoft Publisher
    • VSD

      public static final org.apache.tika.mime.MediaType VSD
      Microsoft Visio
    • WPS

      public static final org.apache.tika.mime.MediaType WPS
      Microsoft Works
    • XLR

      public static final org.apache.tika.mime.MediaType XLR
      Microsoft Works Spreadsheet 7.0
    • MSG

      public static final org.apache.tika.mime.MediaType MSG
      Microsoft Outlook
    • MPP

      public static final org.apache.tika.mime.MediaType MPP
      Microsoft Project
    • SDC

      public static final org.apache.tika.mime.MediaType SDC
      StarOffice Calc
    • SDA

      public static final org.apache.tika.mime.MediaType SDA
      StarOffice Draw
    • SDD

      public static final org.apache.tika.mime.MediaType SDD
      StarOffice Impress
    • SDW

      public static final org.apache.tika.mime.MediaType SDW
      StarOffice Writer
    • SLDWORKS

      public static final org.apache.tika.mime.MediaType SLDWORKS
      SolidWorks CAD file
    • ESRI_LAYER

      public static final org.apache.tika.mime.MediaType ESRI_LAYER
    • DGN_8

      public static final org.apache.tika.mime.MediaType DGN_8
  • Constructor Details

    • POIFSContainerDetector

      public POIFSContainerDetector()
  • Method Details

    • detect

      public static org.apache.tika.mime.MediaType detect(Set<String> anyCaseNames, org.apache.poi.poifs.filesystem.DirectoryEntry root)
      Internal detection of the specific kind of OLE2 document, based on the names of the top-level streams within the file. In some cases the detection may need access to the root DirectoryEntry of that file for best results. The entry can be given as a second, optional argument.

      Following section 2.6.1 of MS-CFB, the detection is performed on case-insensitive entry names.

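      Parameters:
      anyCaseNames - the names of the top-level entries within the file, in any case
      root - the root DirectoryEntry of the file (optional; needed in some cases for best results)
      Returns:
      the detected media type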
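
      The case-insensitive, name-based matching described above can be illustrated with a small standalone sketch. It is not Tika's actual rule set: the class OleNameSketch and its methods are hypothetical, and only three well-known top-level stream names are checked (WordDocument, Workbook or Book, PowerPoint Document). The fallback string is assumed to correspond to the generic OLE type.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: a simplified stand-in, not Tika's real detection logic. */
final class OleNameSketch {

    // Normalise entry names to upper case so that lookups ignore case,
    // in line with the case-insensitive comparison of entry names.
    static Set<String> upperCase(Set<String> anyCaseNames) {
        Set<String> upper = new HashSet<>();
        for (String name : anyCaseNames) {
            upper.add(name.toUpperCase(Locale.ROOT));
        }
        return upper;
    }

    // Map a set of top-level entry names (in any case) to a MIME type string.
    static String classify(Set<String> anyCaseNames) {
        Set<String> names = upperCase(anyCaseNames);
        if (names.contains("WORDDOCUMENT")) {
            return "application/msword";
        }
        if (names.contains("WORKBOOK") || names.contains("BOOK")) {
            return "application/vnd.ms-excel";
        }
        if (names.contains("POWERPOINT DOCUMENT")) {
            return "application/vnd.ms-powerpoint";
        }
        // Assumption: generic OLE2 container type used as the fallback.
        return "application/x-tika-msoffice";
    }

    public static void main(String[] args) {
        // Mixed-case names still match the Word rule.
        System.out.println(classify(Set.of("wordDOCUMENT", "1Table")));
    }
}
```

      Upper-casing with Locale.ROOT keeps the comparison independent of the default locale; the real detector may consult many more entry names and, where needed, the root DirectoryEntry.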
    • setMarkLimit

      public void setMarkLimit(int markLimit)
      If a TikaInputStream is passed in to detect(InputStream, Metadata), and there is not an underlying file, this detector will spool up to markLimit to disk. If the stream was read in its entirety (i.e. the spooled file is not truncated), this detector will open the file with POI and perform detection. If the spooled file is truncated, the detector will return OLE (or MediaType.OCTET_STREAM if there's no OLE header).

      As of Tika 1.21, this detector respects the legacy behavior of not performing detection on a non-TikaInputStream.

      Parameters:
      markLimit - the maximum amount of data to spool to disk
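
      The spool-then-decide behaviour described above can be sketched without any Tika or POI dependency. This is a rough model under stated assumptions, not the detector's implementation: the class SpoolSketch is hypothetical, the data is buffered in memory instead of being spooled to disk, markLimit is assumed to be a byte count, and the string FULL_DETECTION stands in for "open the file with POI and perform detection". The eight signature bytes are the standard OLE2 compound file header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Illustrative only: models the truncation decision, not Tika's code. */
final class SpoolSketch {

    // Signature at the start of every OLE2 compound file.
    static final byte[] OLE_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static boolean hasOleHeader(byte[] data) {
        if (data.length < OLE_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE_MAGIC.length; i++) {
            if (data[i] != OLE_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads at most markLimit bytes, then decides what kind of answer is possible.
    static String decide(InputStream in, int markLimit) {
        try {
            ByteArrayOutputStream spool = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int remaining = markLimit;
            while (remaining > 0) {
                int n = in.read(buf, 0, Math.min(buf.length, remaining));
                if (n < 0) {
                    break; // end of stream before the limit was reached
                }
                spool.write(buf, 0, n);
                remaining -= n;
            }
            // Truncated only if the limit was hit and more data is still available.
            boolean truncated = remaining == 0 && in.read() >= 0;
            if (!hasOleHeader(spool.toByteArray())) {
                return "application/octet-stream";
            }
            // Assumption: the generic OLE type is reported for truncated input.
            return truncated ? "application/x-tika-msoffice" : "FULL_DETECTION";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] ole = new byte[64];
        System.arraycopy(OLE_MAGIC, 0, ole, 0, OLE_MAGIC.length);
        System.out.println(decide(new ByteArrayInputStream(ole), 1024));
        System.out.println(decide(new ByteArrayInputStream(ole), 16));
    }
}
```

      With a 64-byte input that starts with the OLE2 signature, a limit of 1024 yields FULL_DETECTION and a limit of 16 yields the generic OLE type, which mirrors the trade-off behind setMarkLimit: a larger limit allows precise detection on larger streams at the cost of more spooling.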
    • detect

      public org.apache.tika.mime.MediaType detect(InputStream input, org.apache.tika.metadata.Metadata metadata) throws IOException
      Specified by:
      detect in interface org.apache.tika.detect.Detector
      Throws:
      IOException