Class ExtractorFactory


  • public final class ExtractorFactory
    extends java.lang.Object
    Figures out the correct POIOLE2TextExtractor for your supplied document, and returns it.

    Note 1 - will fail for many file formats if the POI Scratchpad jar is not present on the runtime classpath

    Note 2 - for text extractor creation across all formats, use POIXMLExtractorFactory contained within the OOXML jar.

    Note 3 - rather than using this, for most cases you would be better off switching to Apache Tika instead!

    • Method Detail

      • getThreadPrefersEventExtractors

        public static boolean getThreadPrefersEventExtractors()
        Should this thread prefer event based over usermodel based extractors? (usermodel extractors tend to be more accurate, but use more memory) Default is false.
        Returns:
        true if event extractors should be preferred in the current thread, false otherwise.
      • getAllThreadsPreferEventExtractors

        public static java.lang.Boolean getAllThreadsPreferEventExtractors()
        Should all threads prefer event based over usermodel based extractors? (usermodel extractors tend to be more accurate, but use more memory) Default is to use the thread level setting, which defaults to false.
        Returns:
        true if event extractors should be preferred in all threads, false otherwise.
      • setThreadPrefersEventExtractors

        public static void setThreadPrefersEventExtractors(boolean preferEventExtractors)
        Should this thread prefer event based over usermodel based extractors? Will only be used if the All Threads setting is null.
        Parameters:
        preferEventExtractors - If this thread should prefer event based extractors.
      • setAllThreadsPreferEventExtractors

        public static void setAllThreadsPreferEventExtractors(java.lang.Boolean preferEventExtractors)
        Should all threads prefer event based over usermodel based extractors? If set, will take preference over the Thread level setting.
        Parameters:
        preferEventExtractors - If all threads should prefer event based extractors.
      • getPreferEventExtractor

        public static boolean getPreferEventExtractor()
        Should this thread use event based extractors if available? Checks the all-threads setting first, then the thread-specific one.
        Returns:
        true if the current thread should use event based extractors, false otherwise.
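The lookup order described above (all-threads setting first, thread-level setting as the fallback) can be illustrated with a small standalone model. This is only a sketch under assumptions: the class `EventPreferenceSketch` and its members are hypothetical names, it does not use or reproduce POI's own code, and it assumes that a null all-threads value means "not set", as the setter descriptions suggest.

```java
// Hypothetical model of the two-level preference lookup; not POI's implementation.
final class EventPreferenceSketch {
    // Thread-level setting, defaulting to false for every thread.
    private static final ThreadLocal<Boolean> threadPreference =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // All-threads setting; null is assumed to mean "defer to the thread-level setting".
    private static volatile Boolean allThreadsPreference = null;

    static void setThreadPrefers(boolean prefer) {
        threadPreference.set(prefer);
    }

    static void setAllThreadsPrefer(Boolean prefer) {
        allThreadsPreference = prefer;
    }

    // Checks the all-threads value first, then falls back to the current thread's value.
    static boolean resolve() {
        Boolean all = allThreadsPreference;
        if (all != null) {
            return all;
        }
        return threadPreference.get();
    }

    public static void main(String[] args) {
        System.out.println(resolve());          // nothing set: thread default applies
        setThreadPrefers(true);
        System.out.println(resolve());          // thread-level value is used
        setAllThreadsPrefer(Boolean.FALSE);
        System.out.println(resolve());          // all-threads value takes precedence
    }
}
```

In this model, passing null to `setAllThreadsPrefer` hands control back to the per-thread value, which matches the statement that the thread setting "will only be used if the All Threads setting is null".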
      • createExtractor

        public static POITextExtractor createExtractor(POIFSFileSystem fs)
                                                throws java.io.IOException
        Create an extractor that can be used to read text from the given file.
        Parameters:
        fs - The file-system which wraps the data of the file.
        Returns:
        A POITextExtractor that can be used to fetch text-content of the file.
        Throws:
        java.io.IOException - If reading the file-data fails
      • createExtractor

        public static POITextExtractor createExtractor(POIFSFileSystem fs,
                                                       java.lang.String password)
                                                throws java.io.IOException
        Create an extractor that can be used to read text from the given file.
        Parameters:
        fs - The file-system which wraps the data of the file.
        password - The password that is necessary to open the file
        Returns:
        A POITextExtractor that can be used to fetch text-content of the file.
        Throws:
        java.io.IOException - If reading the file-data fails
      • createExtractor

        public static POITextExtractor createExtractor(java.io.InputStream input)
                                                throws java.io.IOException
        Create an extractor that can be used to read text from the given file.
        Parameters:
        input - A stream which wraps the data of the file.
        Returns:
        A POITextExtractor that can be used to fetch text-content of the file.
        Throws:
        java.io.IOException - If reading the file-data fails
        EmptyFileException - If the given file is empty
      • createExtractor

        public static POITextExtractor createExtractor(java.io.InputStream input,
                                                       java.lang.String password)
                                                throws java.io.IOException
        Create an extractor that can be used to read text from the given file.
        Parameters:
        input - A stream which wraps the data of the file.
        password - The password that is necessary to open the file
        Returns:
        A POITextExtractor that can be used to fetch text-content of the file.
        Throws:
        java.io.IOException - If reading the file-data fails
        EmptyFileException - If the given file is empty
      • createExtractor

        public static POITextExtractor createExtractor(java.io.File file)
                                                throws java.io.IOException
        Create an extractor that can be used to read text from the given file.
        Parameters:
        file - The file to read
        Returns:
        A POITextExtractor that can be used to fetch text-content of the file.
        Throws:
        java.io.IOException - If reading the file-data fails
        EmptyFileException - If the given file is empty
      • createExtractor

        public static POITextExtractor createExtractor(java.io.File file,
                                                       java.lang.String password)
                                                throws java.io.IOException
        Create an extractor that can be used to read text from the given file.
        Parameters:
        file - The file to read
        password - The password that is necessary to open the file
        Returns:
        A POITextExtractor that can be used to fetch text-content of the file.
        Throws:
        java.io.IOException - If reading the file-data fails
        EmptyFileException - If the given file is empty
      • createExtractor

        public static POITextExtractor createExtractor(DirectoryNode root)
                                                throws java.io.IOException
        Create the extractor, if possible. Generally needs the Scratchpad jar. Note that this does not check for embedded OOXML resources; use POIXMLExtractorFactory for that.
        Parameters:
        root - The DirectoryNode pointing to a document.
        Returns:
        The resulting POITextExtractor. An exception is thrown if no text extractor can be created.
        Throws:
        java.io.IOException - If converting the DirectoryNode into a HSSFWorkbook fails
        OldFileFormatException - If the DirectoryNode points to a format of an unsupported version of Excel.
        java.lang.IllegalArgumentException - If creating the Extractor fails
      • createExtractor

        public static POITextExtractor createExtractor(DirectoryNode root,
                                                       java.lang.String password)
                                                throws java.io.IOException
        Create the extractor, if possible. Generally needs the Scratchpad jar. Note that this does not check for embedded OOXML resources; use POIXMLExtractorFactory for that.
        Parameters:
        root - The DirectoryNode pointing to a document.
        password - The password that is necessary to open the file
        Returns:
        The resulting POITextExtractor. An exception is thrown if no text extractor can be created.
        Throws:
        java.io.IOException - If converting the DirectoryNode into a HSSFWorkbook fails
        OldFileFormatException - If the DirectoryNode points to a format of an unsupported version of Excel.
        java.lang.IllegalArgumentException - If creating the Extractor fails
      • getEmbeddedDocsTextExtractors

        public static POITextExtractor[] getEmbeddedDocsTextExtractors(POIOLE2TextExtractor ext)
                                                                throws java.io.IOException
        Returns an array of text extractors, one for each of the embedded documents in the file (if there are any). If there are no embedded documents, you'll get back an empty array. Otherwise, you'll get one open POITextExtractor for each embedded file.
        Parameters:
        ext - The extractor to look at for embedded documents
        Returns:
        An array of resulting extractors. Empty if no embedded documents are found.
        Throws:
        java.io.IOException - If converting the DirectoryNode into a HSSFWorkbook fails
        OldFileFormatException - If the DirectoryNode points to a format of an unsupported version of Excel.
        java.lang.IllegalArgumentException - If creating the Extractor fails
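Because each element of the returned array is described as an open extractor, the caller presumably needs to close every one of them. The following is a minimal sketch of one way to do that for any array of `java.io.Closeable` resources. The class `CloseAllSketch` and the method `closeAll` are hypothetical helpers, not part of POI, and the fake resources in `main` merely stand in for extractors.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical helper; shown only to illustrate closing an array of open resources.
final class CloseAllSketch {
    // Closes every non-null element. The first IOException is rethrown at the end,
    // with any later failures attached as suppressed exceptions.
    static void closeAll(Closeable[] resources) throws IOException {
        IOException first = null;
        for (Closeable resource : resources) {
            if (resource == null) {
                continue;
            }
            try {
                resource.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    public static void main(String[] args) throws IOException {
        int[] closed = {0};
        // Stand-ins for extractors: each one just counts its own close() call.
        Closeable[] fake = { () -> closed[0]++, () -> closed[0]++ };
        closeAll(fake);
        System.out.println(closed[0]); // prints 2
    }
}
```

Continuing past a failed `close()` is a design choice here: it avoids leaking the remaining resources when one of them cannot be closed cleanly.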
      • removeProvider

        public static void removeProvider(java.lang.Class<? extends ExtractorProvider> provider)