Class WordExtractor

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public final class WordExtractor
    extends POIOLE2TextExtractor
    Extracts the text from a Word document. Prefer getParagraphText() or getText() unless you have a strong reason to use one of the other methods.
    • Constructor Detail

      • WordExtractor

        public WordExtractor(InputStream is)
                      throws IOException
        Create a new WordExtractor for the given input stream.
        Parameters:
        is - InputStream containing the Word file
        Throws:
        IOException
      • WordExtractor

        public WordExtractor(POIFSFileSystem fs)
                      throws IOException
        Create a new WordExtractor for the given POIFS file system.
        Parameters:
        fs - POIFSFileSystem containing the Word file
        Throws:
        IOException
      • WordExtractor

        public WordExtractor(HWPFDocument doc)
        Create a new WordExtractor for an already loaded document.
        Parameters:
        doc - The HWPFDocument to extract from
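        The three constructors differ only in how the document is supplied. A minimal sketch of each, assuming Apache POI (with the HWPF module) is on the classpath; the class name, method names and the "sample.doc" path are placeholders, not part of the API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Hypothetical demo class; only WordExtractor, HWPFDocument and POIFSFileSystem come from POI.
final class WordExtractorConstructors {

    // Constructor 1: hand the extractor a raw InputStream.
    static String fromStream(Path doc) throws IOException {
        try (InputStream in = Files.newInputStream(doc);
             WordExtractor extractor = new WordExtractor(in)) {
            return extractor.getText();
        }
    }

    // Constructor 2: open the OLE2 container first, then wrap it.
    static String fromFileSystem(Path doc) throws IOException {
        try (POIFSFileSystem fs = new POIFSFileSystem(doc.toFile());
             WordExtractor extractor = new WordExtractor(fs)) {
            return extractor.getText();
        }
    }

    // Constructor 3: reuse an HWPFDocument that is already loaded.
    static String fromDocument(Path doc) throws IOException {
        try (InputStream in = Files.newInputStream(doc);
             HWPFDocument document = new HWPFDocument(in);
             WordExtractor extractor = new WordExtractor(document)) {
            return extractor.getText();
        }
    }

    public static void main(String[] args) throws IOException {
        // "sample.doc" is a placeholder; pass a real path as the first argument.
        Path doc = Path.of(args.length > 0 ? args[0] : "sample.doc");
        System.out.println(fromStream(doc));
    }
}
```

        Because the class is Closeable, try-with-resources releases the extractor and the underlying stream or file system even when extraction fails.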
    • Method Detail

      • main

        public static void main(String[] args)
                         throws IOException
        Command-line extractor, so the class can be run directly.
        Throws:
        IOException
      • getParagraphText

        public String[] getParagraphText()
        Get the text from the Word file, as an array with one String per paragraph.
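        Code that consumes the per-paragraph array often wants to tidy it first. The sketch below is a hypothetical helper, not part of POI; the sample input assumes that paragraph strings may end in line-ending characters and that empty paragraphs can occur, neither of which this page states:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of POI.
final class ParagraphCleaner {

    // Strip trailing whitespace (including line endings) and drop empty paragraphs.
    static List<String> clean(String[] paragraphs) {
        List<String> cleaned = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String text = paragraph.stripTrailing();
            if (!text.isEmpty()) {
                cleaned.add(text);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Stand-in for the array returned by getParagraphText() (assumed shape).
        String[] paragraphs = { "Title\r\n", "\r\n", "First paragraph.\r\n" };
        System.out.println(clean(paragraphs));
    }
}
```

        With a real extractor the call would be ParagraphCleaner.clean(extractor.getParagraphText()).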
      • getFootnoteText

        public String[] getFootnoteText()
        Get the text from the footnotes, as an array with one String per paragraph.
      • getMainTextboxText

        public String[] getMainTextboxText()
        Get the text from the text boxes of the main document, as an array with one String per paragraph.
      • getEndnoteText

        public String[] getEndnoteText()
        Get the text from the endnotes, as an array with one String per paragraph.
      • getCommentsText

        public String[] getCommentsText()
        Get the text from the comments, as an array with one String per paragraph.
      • getHeaderText

        @Deprecated
        public String getHeaderText()
        Deprecated.
        As of 3.8 beta 4.
        Grab the text from the headers.
      • getFooterText

        @Deprecated
        public String getFooterText()
        Deprecated.
        As of 3.8 beta 4.
        Grab the text from the footers.
      • getTextFromPieces

        public String getTextFromPieces()
        Grab the text out of the text pieces. The result may include stray non-text content, but this method still works when the mapping from text pieces to paragraphs is broken. It is also fast.
      • getText

        public String getText()
        Grab the text, based on the WordToTextConverter. The result should not include stray non-text content, but this method is slower than getTextFromPieces().
        Specified by:
        getText in class POITextExtractor
        Returns:
        All the text from the document
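        The trade-off between getText() and getTextFromPieces() suggests a fallback pattern: try the cleaner method first and fall back to the more tolerant one. The sketch below is a hypothetical helper, not part of POI, and it assumes that a failed extraction surfaces as an unchecked exception:

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of POI.
final class ExtractionFallback {

    // Try the primary extraction; if it throws an unchecked exception, use the fallback.
    static String extract(Supplier<String> primary, Supplier<String> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for extractor::getText and extractor::getTextFromPieces.
        Supplier<String> cleanText = () -> { throw new IllegalStateException("mapping broken"); };
        Supplier<String> rawText = () -> "raw text";
        System.out.println(extract(cleanText, rawText));
    }
}
```

        With a real extractor the call would be ExtractionFallback.extract(extractor::getText, extractor::getTextFromPieces). Keeping the helper generic means the pattern can be exercised without a Word file.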
      • stripFields

        public static String stripFields(String text)
        Removes any fields (e.g. macros and page markers) from the string.
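        To illustrate what removing fields involves: in Word's binary format a field is delimited in the text by control characters, 0x13 at the start, an optional 0x14 separating the field instruction from its displayed result, and 0x15 at the end. The sketch below is a simplified, hypothetical illustration of that idea, not POI's implementation; it drops the markers and the instruction part and keeps the displayed result, which may differ from what stripFields does in edge cases such as nested fields:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified illustration only; this is not POI's implementation.
final class FieldStripSketch {
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    // Drop field markers and field instructions; keep ordinary text and field results.
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: has its separator been reached yet?
        Deque<Boolean> separatorSeen = new ArrayDeque<>();
        // Number of open fields that are still in their instruction part.
        int inInstruction = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                separatorSeen.push(false);
                inInstruction++;
            } else if (c == FIELD_SEPARATOR) {
                if (!separatorSeen.isEmpty() && !separatorSeen.peek()) {
                    separatorSeen.pop();
                    separatorSeen.push(true);
                    inInstruction--;
                }
            } else if (c == FIELD_END) {
                if (!separatorSeen.isEmpty() && !separatorSeen.pop()) {
                    inInstruction--;
                }
            } else if (inInstruction == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A page-number field: the instruction " PAGE " followed by the result "1".
        System.out.println(strip("Page \u0013 PAGE \u00141\u0015 of 3"));
    }
}
```

        In real code, call WordExtractor.stripFields(text) rather than a hand-written routine like this one.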