Package org.apache.poi.hwpf.extractor
Class WordExtractor
java.lang.Object
  org.apache.poi.extractor.POITextExtractor
    org.apache.poi.extractor.POIOLE2TextExtractor
      org.apache.poi.hwpf.extractor.WordExtractor

All Implemented Interfaces: Closeable, AutoCloseable

public final class WordExtractor extends POIOLE2TextExtractor
Class to extract the text from a Word document. Use either getParagraphText() or getText() unless you have a strong reason to do otherwise.

Constructor Summary

Constructor                          Description
WordExtractor(InputStream is)        Create a new Word Extractor
WordExtractor(HWPFDocument doc)      Create a new Word Extractor
WordExtractor(DirectoryNode dir)
WordExtractor(POIFSFileSystem fs)    Create a new Word Extractor

Method Summary

Modifier and Type   Method                     Description
String[]            getCommentsText()
String[]            getEndnoteText()
String              getFooterText()            Deprecated. 3.8 beta 4
String[]            getFootnoteText()
String              getHeaderText()            Deprecated. 3.8 beta 4
String[]            getMainTextboxText()
String[]            getParagraphText()         Get the text from the word file, as an array with one String per paragraph
String              getText()                  Grab the text, based on the WordToTextConverter.
String              getTextFromPieces()        Grab the text out of the text pieces.
static void         main(String[] args)        Command line extractor, so people will stop moaning that they can't just run this.
static String       stripFields(String text)   Removes any fields (eg macros, page markers etc) from the string.

Methods inherited from class org.apache.poi.extractor.POIOLE2TextExtractor
getDocSummaryInformation, getDocument, getMetadataTextExtractor, getRoot, getSummaryInformation

Methods inherited from class org.apache.poi.extractor.POITextExtractor
close, setFilesystem

Constructor Detail

WordExtractor
public WordExtractor(InputStream is) throws IOException
Create a new Word Extractor
Parameters:
  is - InputStream containing the word file
Throws:
  IOException

WordExtractor
public WordExtractor(POIFSFileSystem fs) throws IOException
Create a new Word Extractor
Parameters:
  fs - POIFSFileSystem containing the word file
Throws:
  IOException

WordExtractor
public WordExtractor(DirectoryNode dir) throws IOException
Throws:
  IOException

WordExtractor
public WordExtractor(HWPFDocument doc)
Create a new Word Extractor
Parameters:
  doc - The HWPFDocument to extract from

Method Detail

main
public static void main(String[] args) throws IOException
Command line extractor, so people will stop moaning that they can't just run this.
Throws:
  IOException

getParagraphText
public String[] getParagraphText()
Get the text from the word file, as an array with one String per paragraph
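
Callers often want the paragraphs without their line terminators. The following is a minimal sketch, assuming the array came from getParagraphText() and that entries may end with "\r" or "\r\n" (an assumption about typical output, not something this page states). ParagraphCleaner and cleanParagraphs are hypothetical names and are not part of POI.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of POI: tidies an array shaped like getParagraphText() output. */
public class ParagraphCleaner {

    /**
     * Strips trailing CR/LF characters from each paragraph and drops
     * entries that are null or empty afterwards.
     */
    public static List<String> cleanParagraphs(String[] paragraphs) {
        List<String> cleaned = new ArrayList<>();
        for (String p : paragraphs) {
            if (p == null) {
                continue;
            }
            int end = p.length();
            // Walk back over any trailing line terminators.
            while (end > 0 && (p.charAt(end - 1) == '\r' || p.charAt(end - 1) == '\n')) {
                end--;
            }
            if (end > 0) {
                cleaned.add(p.substring(0, end));
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Stand-in data; in real use this would be something like:
        //   String[] paragraphs = extractor.getParagraphText();
        String[] sample = { "Title\r\n", "\r\n", "Body text.\r\n" };
        System.out.println(cleanParagraphs(sample)); // prints [Title, Body text.]
    }
}
```

The helper only touches the returned strings, so it works the same whether the array comes from getParagraphText() or from one of the other String[] getters on this class.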

getFootnoteText
public String[] getFootnoteText()

getMainTextboxText
public String[] getMainTextboxText()

getEndnoteText
public String[] getEndnoteText()

getCommentsText
public String[] getCommentsText()

getHeaderText
@Deprecated public String getHeaderText()
Deprecated. 3.8 beta 4
Grab the text from the headers

getFooterText
@Deprecated public String getFooterText()
Deprecated. 3.8 beta 4
Grab the text from the footers

getTextFromPieces
public String getTextFromPieces()
Grab the text out of the text pieces. Might also include various bits of crud, but will work in cases where the text piece -> paragraph mapping is broken. Fast too.

getText
public String getText()
Grab the text, based on the WordToTextConverter. Shouldn't include any crud, but slower than getTextFromPieces().
Specified by:
  getText in class POITextExtractor
Returns:
  All the text from the document
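
The method summary lists stripFields(String text), which removes fields from a string. As background, fields in the Word binary format are delimited by the control characters 0x13 (field begin), 0x14 (separator) and 0x15 (field end), with the field code before the separator and the displayed result after it. The sketch below is a simplified stand-in that shows the idea; it is not POI's implementation, and FieldStripSketch and stripFieldCodes are hypothetical names. It assumes the desired behaviour is to drop the markers and field codes while keeping the displayed result.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration of field stripping; not the POI implementation. */
public class FieldStripSketch {
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    /** Drops field markers and field codes, keeping only displayed result text. */
    public static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while still inside that field's code part.
        Deque<Boolean> inCode = new ArrayDeque<>();
        for (char c : text.toCharArray()) {
            switch (c) {
                case FIELD_BEGIN:
                    inCode.push(Boolean.TRUE);
                    break;
                case FIELD_SEPARATOR:
                    if (!inCode.isEmpty()) {
                        // Innermost field switches from code part to result part.
                        inCode.pop();
                        inCode.push(Boolean.FALSE);
                    }
                    break;
                case FIELD_END:
                    if (!inCode.isEmpty()) {
                        inCode.pop();
                    }
                    break;
                default:
                    // Keep the character only if no enclosing field is in its code part.
                    if (!inCode.contains(Boolean.TRUE)) {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Page \u0013 PAGE \u0014" + "3" + "\u0015 of 10";
        System.out.println(stripFieldCodes(raw)); // prints Page 3 of 10
    }
}
```

With the real class, the equivalent call would be the static WordExtractor.stripFields(text); its exact handling of nested or malformed fields may differ from this sketch.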