Package org.apache.poi.hwpf.extractor
Class WordExtractor
java.lang.Object
    org.apache.poi.hwpf.extractor.WordExtractor
All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, POIOLE2TextExtractor, POITextExtractor
public final class WordExtractor extends java.lang.Object implements POIOLE2TextExtractor
Class to extract the text from a Word Document. You should use either getParagraphText() or getText() unless you have a strong reason otherwise.
-
Constructor Summary
Constructor, Description
WordExtractor(java.io.InputStream is)
Create a new Word Extractor
WordExtractor(HWPFDocument doc)
Create a new Word Extractor
WordExtractor(DirectoryNode dir)
WordExtractor(POIFSFileSystem fs)
Create a new Word Extractor
-
Method Summary
Modifier and Type, Method, Description
java.lang.String[] getCommentsText()
HWPFDocument getDocument()
Return the underlying POIDocument
java.lang.String[] getEndnoteText()
HWPFDocument getFilesystem()
java.lang.String getFooterText()
Deprecated. 3.8 beta 4
java.lang.String[] getFootnoteText()
java.lang.String getHeaderText()
Deprecated. 3.8 beta 4
java.lang.String[] getMainTextboxText()
java.lang.String[] getParagraphText()
Get the text from the word file, as an array with one String per paragraph
java.lang.String getText()
Grab the text, based on the WordToTextConverter.
java.lang.String getTextFromPieces()
Grab the text out of the text pieces.
boolean isCloseFilesystem()
void setCloseFilesystem(boolean doCloseFilesystem)
static java.lang.String stripFields(java.lang.String text)
Removes any fields (e.g. macros, page markers etc.) from the string.
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.poi.extractor.POIOLE2TextExtractor
getDocSummaryInformation, getMetadataTextExtractor, getRoot, getSummaryInformation
-
Methods inherited from interface org.apache.poi.extractor.POITextExtractor
close
-
Constructor Detail
-
WordExtractor
public WordExtractor(java.io.InputStream is) throws java.io.IOException
Create a new Word Extractor
Parameters:
is - InputStream containing the word file
Throws:
java.io.IOException
-
WordExtractor
public WordExtractor(POIFSFileSystem fs) throws java.io.IOException
Create a new Word Extractor
Parameters:
fs - POIFSFileSystem containing the word file
Throws:
java.io.IOException
-
WordExtractor
public WordExtractor(DirectoryNode dir) throws java.io.IOException
Throws:
java.io.IOException
-
WordExtractor
public WordExtractor(HWPFDocument doc)
Create a new Word Extractor
Parameters:
doc - The HWPFDocument to extract from
-
Method Detail
-
getParagraphText
public java.lang.String[] getParagraphText()
Get the text from the word file, as an array with one String per paragraph
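Callers often want to tidy the per-paragraph array before using it. The following is a minimal sketch of one way to do that, assuming (this is not stated above) that each returned paragraph string may end with line-terminator characters and that empty paragraphs are not wanted. ParagraphCleanup and clean are hypothetical names, not part of POI, and the sample strings stand in for a real getParagraphText() result.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Apache POI.
final class ParagraphCleanup {

    /** Strips trailing CR/LF from each paragraph and drops paragraphs that are then empty. */
    static List<String> clean(String[] paragraphs) {
        List<String> out = new ArrayList<>();
        for (String p : paragraphs) {
            int end = p.length();
            // Walk back over trailing line terminators (assumed, not guaranteed, to be present).
            while (end > 0 && (p.charAt(end - 1) == '\r' || p.charAt(end - 1) == '\n')) {
                end--;
            }
            if (end > 0) {
                out.add(p.substring(0, end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for extractor.getParagraphText(); the sample strings are made up.
        String[] sample = { "First paragraph\r\n", "\r\n", "Second paragraph\r\n" };
        System.out.println(clean(sample)); // prints [First paragraph, Second paragraph]
    }
}
```

With a real extractor, the call would look like ParagraphCleanup.clean(extractor.getParagraphText()).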
-
getFootnoteText
public java.lang.String[] getFootnoteText()
-
getMainTextboxText
public java.lang.String[] getMainTextboxText()
-
getEndnoteText
public java.lang.String[] getEndnoteText()
-
getCommentsText
public java.lang.String[] getCommentsText()
-
getHeaderText
@Deprecated public java.lang.String getHeaderText()
Deprecated. 3.8 beta 4
Grab the text from the headers
-
getFooterText
@Deprecated public java.lang.String getFooterText()
Deprecated. 3.8 beta 4
Grab the text from the footers
-
getTextFromPieces
public java.lang.String getTextFromPieces()
Grab the text out of the text pieces. The result may include stray non-text content, but this method works even when the text piece -> paragraph mapping is broken. It is also fast.
-
getText
public java.lang.String getText()
Grab the text, based on the WordToTextConverter. The result should not include stray non-text content, but this method is slower than getTextFromPieces().
Specified by:
getText in interface POITextExtractor
Returns:
All the text from the document
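Since getTextFromPieces() is described as working where the paragraph mapping is broken, a caller might try getText() first and fall back to the piece-based text. The sketch below shows that pattern with plain suppliers so it runs without a document. ExtractionFallback and firstUsable are hypothetical names, and it is an assumption that a failing getText() would surface as a runtime exception or as empty text.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Apache POI.
final class ExtractionFallback {

    /**
     * Returns the primary result if it is non-empty; otherwise, or if the
     * primary supplier throws a RuntimeException, returns the fallback result.
     */
    static String firstUsable(Supplier<String> primary, Supplier<String> fallback) {
        try {
            String text = primary.get();
            if (text != null && !text.isEmpty()) {
                return text;
            }
        } catch (RuntimeException e) {
            // Assumed failure mode; fall through to the fallback supplier.
        }
        return fallback.get();
    }

    public static void main(String[] args) {
        // Simulated suppliers; with a real extractor these would be
        // extractor::getText and extractor::getTextFromPieces.
        String text = firstUsable(
                () -> { throw new IllegalStateException("simulated broken paragraph mapping"); },
                () -> "raw text from pieces");
        System.out.println(text); // prints raw text from pieces
    }
}
```

The fallback text may still contain stray content, so it may be worth recording which path produced the result.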
-
stripFields
public static java.lang.String stripFields(java.lang.String text)
Removes any fields (e.g. macros, page markers etc.) from the string.
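To illustrate what removing fields from a string can involve, here is a standalone sketch. It is not the library's implementation. It assumes the binary Word convention that a field is delimited by the control characters 0x13 (field begin), 0x14 (separator between field code and displayed result) and 0x15 (field end), and it keeps only the displayed result. FieldStripSketch and strip are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only; not the code behind WordExtractor.stripFields.
final class FieldStripSketch {
    // Assumed field delimiter characters in binary Word text.
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    /** Drops field codes and delimiter characters, keeping displayed field results. */
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true while inside its field-code part.
        Deque<Boolean> inCode = new ArrayDeque<>();
        int hidden = 0; // number of open fields currently in their field-code part
        for (char c : text.toCharArray()) {
            switch (c) {
                case FIELD_BEGIN:
                    inCode.push(true);
                    hidden++;
                    break;
                case FIELD_SEPARATOR:
                    if (!inCode.isEmpty() && inCode.peek()) {
                        inCode.pop();
                        inCode.push(false); // now inside the displayed result
                        hidden--;
                    }
                    break;
                case FIELD_END:
                    if (!inCode.isEmpty() && inCode.pop()) {
                        hidden--; // field had no separator, so no displayed result
                    }
                    break;
                default:
                    if (hidden == 0) {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Page \u0013 PAGE \u00141\u0015 of 3";
        System.out.println(strip(raw)); // prints Page 1 of 3
    }
}
```

Unbalanced delimiter characters are simply dropped in this sketch; the library method may treat such input differently.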
-
getDocument
public HWPFDocument getDocument()
Description copied from interface: POIOLE2TextExtractor
Return the underlying POIDocument
Specified by:
getDocument in interface POIOLE2TextExtractor
Specified by:
getDocument in interface POITextExtractor
Returns:
the underlying POIDocument
-
setCloseFilesystem
public void setCloseFilesystem(boolean doCloseFilesystem)
Specified by:
setCloseFilesystem in interface POITextExtractor
Parameters:
doCloseFilesystem - true (default), if underlying resources/filesystem should be closed on POITextExtractor.close()
-
isCloseFilesystem
public boolean isCloseFilesystem()
Specified by:
isCloseFilesystem in interface POITextExtractor
Returns:
true, if resources/filesystem should be closed on POITextExtractor.close()
-
getFilesystem
public HWPFDocument getFilesystem()
Specified by:
getFilesystem in interface POITextExtractor
Returns:
The underlying resources/filesystem
-