org.apache.poi.extractor
Class ExtractorFactory

java.lang.Object
  extended by org.apache.poi.extractor.ExtractorFactory

public class ExtractorFactory
extends java.lang.Object

Figures out the correct POITextExtractor for your supplied document, and returns it.

Note 1 - will fail for many file formats if the POI Scratchpad jar is not present on the runtime classpath

Note 2 - rather than using this, for most cases you would be better off switching to Apache Tika instead!


Field Summary
static java.lang.String CORE_DOCUMENT_REL
           
protected static java.lang.String STRICT_DOCUMENT_REL
           
protected static java.lang.String VISIO_DOCUMENT_REL
           
 
Constructor Summary
ExtractorFactory()
           
 
Method Summary
static POITextExtractor createExtractor(DirectoryNode poifsDir)
           
static POITextExtractor createExtractor(java.io.File f)
           
static POITextExtractor createExtractor(java.io.InputStream inp)
           
static POIOLE2TextExtractor createExtractor(NPOIFSFileSystem fs)
           
static POIXMLTextExtractor createExtractor(OPCPackage pkg)
          Tries to determine the actual type of file and produces a matching text-extractor for it.
static POIOLE2TextExtractor createExtractor(OPOIFSFileSystem fs)
           
static POIOLE2TextExtractor createExtractor(POIFSFileSystem fs)
           
static java.lang.Boolean getAllThreadsPreferEventExtractors()
          Should all threads prefer event based over usermodel based extractors? (usermodel extractors tend to be more accurate, but use more memory) Default is to use the thread level setting, which defaults to false.
static POITextExtractor[] getEmbededDocsTextExtractors(POIOLE2TextExtractor ext)
          Returns an array of text extractors, one for each of the embedded documents in the file (if there are any).
static POITextExtractor[] getEmbededDocsTextExtractors(POIXMLTextExtractor ext)
          Returns an array of text extractors, one for each of the embedded documents in the file (if there are any).
protected static boolean getPreferEventExtractor()
          Should this thread use event based extractors if available? Checks the all-threads setting first, then the thread-specific one.
static boolean getThreadPrefersEventExtractors()
          Should this thread prefer event based over usermodel based extractors? (usermodel extractors tend to be more accurate, but use more memory) Default is false.
static void setAllThreadsPreferEventExtractors(java.lang.Boolean preferEventExtractors)
          Should all threads prefer event based over usermodel based extractors? If set, will take precedence over the Thread level setting.
static void setThreadPrefersEventExtractors(boolean preferEventExtractors)
          Should this thread prefer event based over usermodel based extractors? Will only be used if the All Threads setting is null.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CORE_DOCUMENT_REL

public static final java.lang.String CORE_DOCUMENT_REL
See Also:
Constant Field Values

VISIO_DOCUMENT_REL

protected static final java.lang.String VISIO_DOCUMENT_REL
See Also:
Constant Field Values

STRICT_DOCUMENT_REL

protected static final java.lang.String STRICT_DOCUMENT_REL
See Also:
Constant Field Values

Constructor Detail

ExtractorFactory

public ExtractorFactory()

Method Detail

getThreadPrefersEventExtractors

public static boolean getThreadPrefersEventExtractors()
Should this thread prefer event based over usermodel based extractors? (usermodel extractors tend to be more accurate, but use more memory) Default is false.


getAllThreadsPreferEventExtractors

public static java.lang.Boolean getAllThreadsPreferEventExtractors()
Should all threads prefer event based over usermodel based extractors? (usermodel extractors tend to be more accurate, but use more memory) Default is to use the thread level setting, which defaults to false.


setThreadPrefersEventExtractors

public static void setThreadPrefersEventExtractors(boolean preferEventExtractors)
Should this thread prefer event based over usermodel based extractors? Will only be used if the All Threads setting is null.


setAllThreadsPreferEventExtractors

public static void setAllThreadsPreferEventExtractors(java.lang.Boolean preferEventExtractors)
Should all threads prefer event based over usermodel based extractors? If set, will take precedence over the Thread level setting.


getPreferEventExtractor

protected static boolean getPreferEventExtractor()
Should this thread use event based extractors if available? Checks the all-threads setting first, then the thread-specific one.
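
The lookup order described for the four preference methods can be illustrated with a small stand-alone model. This is a minimal sketch and not POI's source code: the class EventPreferenceSketch and its methods setAllThreads, setThread and resolve are hypothetical names chosen for illustration. It assumes only what the descriptions state, namely that a non-null all-threads value wins and that the thread-level value (default false) is used otherwise.

```java
// Hypothetical model of the documented lookup order; not POI's implementation.
final class EventPreferenceSketch {
    // null means "no all-threads override": fall back to the per-thread value.
    private static volatile Boolean allThreadsPreference = null;

    // Each thread starts at false, matching the documented thread-level default.
    private static final ThreadLocal<Boolean> threadPreference =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    static void setAllThreads(Boolean prefer) {
        allThreadsPreference = prefer;
    }

    static void setThread(boolean prefer) {
        threadPreference.set(prefer);
    }

    // The all-threads setting is checked first; the thread setting is used only if it is null.
    static boolean resolve() {
        Boolean all = allThreadsPreference;
        return all != null ? all : threadPreference.get();
    }

    public static void main(String[] args) {
        System.out.println(resolve());   // false: nothing set yet
        setThread(true);
        System.out.println(resolve());   // true: thread-level value is used
        setAllThreads(Boolean.FALSE);
        System.out.println(resolve());   // false: all-threads value overrides the thread
        setAllThreads(null);
        System.out.println(resolve());   // true: override cleared, thread value applies again
    }
}
```

The all-threads setting is a java.lang.Boolean rather than a boolean precisely so that null can stand for "not set", which is why the real setter and getter use the wrapper type.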


createExtractor

public static POITextExtractor createExtractor(java.io.File f)
                                        throws java.io.IOException,
                                               OpenXML4JException,
                                               org.apache.xmlbeans.XmlException
Throws:
java.io.IOException
OpenXML4JException
org.apache.xmlbeans.XmlException

createExtractor

public static POITextExtractor createExtractor(java.io.InputStream inp)
                                        throws java.io.IOException,
                                               OpenXML4JException,
                                               org.apache.xmlbeans.XmlException
Throws:
java.io.IOException
OpenXML4JException
org.apache.xmlbeans.XmlException
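
Choosing an extractor from a raw stream requires looking at the first bytes of the document without consuming them. The following is a minimal sketch of that idea only; how POI actually performs the detection is not stated on this page and may differ. The class ContainerSniffSketch, its Container enum and the classify and sniff methods are hypothetical. The sketch relies on two well-known signatures: the 8-byte OLE2 compound document header and the 4-byte ZIP local file header that OOXML packages begin with. A ZIP signature alone does not prove that a file is OOXML.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical illustration of signature sniffing; not POI's implementation.
final class ContainerSniffSketch {
    enum Container { OLE2, ZIP_OOXML_CANDIDATE, UNKNOWN }

    // OLE2 compound document header signature.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature ("PK\003\004").
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Pure classification of the first len bytes of a document.
    static Container classify(byte[] head, int len) {
        if (startsWith(head, len, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(head, len, ZIP_MAGIC)) return Container.ZIP_OOXML_CANDIDATE;
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] head, int len, byte[] magic) {
        if (len < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) return false;
        }
        return true;
    }

    // Reads up to 8 bytes, pushes them back, and classifies them.
    // The stream must have a pushback buffer of at least 8 bytes.
    static Container sniff(PushbackInputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;          // end of stream before 8 bytes
            n += r;
        }
        if (n > 0) in.unread(head, 0, n);   // leave the stream as it was found
        return classify(head, n);
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeOle2 = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1, 0, 0
        };
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(fakeOle2), 8);
        System.out.println(sniff(in));   // prints OLE2
    }
}
```

Because the bytes are pushed back, the same stream can afterwards be handed to whatever reader matches the detected container.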

createExtractor

public static POIXMLTextExtractor createExtractor(OPCPackage pkg)
                                           throws java.io.IOException,
                                                  OpenXML4JException,
                                                  org.apache.xmlbeans.XmlException
Tries to determine the actual type of file and produces a matching text-extractor for it.

Parameters:
pkg - An OPCPackage.
Returns:
A POIXMLTextExtractor for the given file.
Throws:
java.io.IOException - If an error occurs while reading the file
OpenXML4JException - If an error parsing the OpenXML file format is found.
org.apache.xmlbeans.XmlException - If an XML parsing error occurs.
java.lang.IllegalArgumentException - If no matching file type could be found.

createExtractor

public static POIOLE2TextExtractor createExtractor(POIFSFileSystem fs)
                                            throws java.io.IOException,
                                                   OpenXML4JException,
                                                   org.apache.xmlbeans.XmlException
Throws:
java.io.IOException
OpenXML4JException
org.apache.xmlbeans.XmlException

createExtractor

public static POIOLE2TextExtractor createExtractor(NPOIFSFileSystem fs)
                                            throws java.io.IOException,
                                                   OpenXML4JException,
                                                   org.apache.xmlbeans.XmlException
Throws:
java.io.IOException
OpenXML4JException
org.apache.xmlbeans.XmlException

createExtractor

public static POIOLE2TextExtractor createExtractor(OPOIFSFileSystem fs)
                                            throws java.io.IOException,
                                                   OpenXML4JException,
                                                   org.apache.xmlbeans.XmlException
Throws:
java.io.IOException
OpenXML4JException
org.apache.xmlbeans.XmlException

createExtractor

public static POITextExtractor createExtractor(DirectoryNode poifsDir)
                                        throws java.io.IOException,
                                               OpenXML4JException,
                                               org.apache.xmlbeans.XmlException
Throws:
java.io.IOException
OpenXML4JException
org.apache.xmlbeans.XmlException

getEmbededDocsTextExtractors

public static POITextExtractor[] getEmbededDocsTextExtractors(POIOLE2TextExtractor ext)
                                                       throws java.io.IOException,
                                                              OpenXML4JException,
                                                              org.apache.xmlbeans.XmlException
Returns an array of text extractors, one for each of the embedded documents in the file (if there are any). If there are no embedded documents, you'll get back an empty array. Otherwise, you'll get one open POITextExtractor for each embedded file.

Throws:
java.io.IOException
OpenXML4JException
org.apache.xmlbeans.XmlException

getEmbededDocsTextExtractors

@NotImplemented
public static POITextExtractor[] getEmbededDocsTextExtractors(POIXMLTextExtractor ext)
Returns an array of text extractors, one for each of the embedded documents in the file (if there are any). If there are no embedded documents, you'll get back an empty array. Otherwise, you'll get one open POITextExtractor for each embedded file.