public class PDFTextStripperByArea extends PDFTextStripper
Fields inherited from class PDFTextStripper: charactersByArticle, document, LINE_SEPARATOR, output
Constructor and Description |
---|
PDFTextStripperByArea(): Constructor. |
Modifier and Type | Method and Description |
---|---|
void | addRegion(String regionName, Rectangle2D rect): Add a new region to group text by. |
protected float | computeFontHeight(PDFont font): Compute the font height. |
void | extractRegions(PDPage page): Process the page to extract the region text. |
List<String> | getRegions(): Get the list of regions that have been set up. |
String | getTextForRegion(String regionName): Get the text for the region. This should be called after extractRegions(). |
protected void | processTextPosition(TextPosition text): Process a TextPosition object and add the text to the list of characters on a page. |
void | removeRegion(String regionName): Delete a region to group text by. |
void | setShouldSeparateByBeads(boolean aShouldSeparateByBeads): This method does nothing in this derived class, because beads and regions are incompatible. |
protected void | showGlyph(Matrix textRenderingMatrix, PDFont font, int code, Vector displacement): Called when a glyph is to be processed. |
protected void | writePage(): Print the processed page text to the output stream. |
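The summary above implies an order of use: regions are registered by name, a page is processed, and the text is then read back per region. To make the idea of "grouping text by region" concrete, here is a simplified, self-contained model that uses only the JDK. It is not PDFBox code: the class `RegionGroupingSketch` and its `accept` method are hypothetical, and the library's actual logic may differ (for example in how it sorts and separates text).

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of region-based grouping; not part of PDFBox.
public class RegionGroupingSketch {
    // Named regions, kept in insertion order.
    private final Map<String, Rectangle2D> regions = new LinkedHashMap<>();
    // Text collected so far for each region.
    private final Map<String, StringBuilder> textByRegion = new LinkedHashMap<>();

    public void addRegion(String name, Rectangle2D rect) {
        regions.put(name, rect);
        textByRegion.put(name, new StringBuilder());
    }

    public void removeRegion(String name) {
        regions.remove(name);
        textByRegion.remove(name);
    }

    public List<String> getRegions() {
        return new ArrayList<>(regions.keySet());
    }

    /** Appends the text to every region whose rectangle contains the point (x, y). */
    public void accept(String text, double x, double y) {
        for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
            if (e.getValue().contains(x, y)) {
                textByRegion.get(e.getKey()).append(text);
            }
        }
    }

    /** Returns the collected text, or null if the region is unknown (a choice of this sketch). */
    public String getTextForRegion(String name) {
        StringBuilder sb = textByRegion.get(name);
        return sb == null ? null : sb.toString();
    }

    public static void main(String[] args) {
        RegionGroupingSketch sketch = new RegionGroupingSketch();
        sketch.addRegion("header", new Rectangle2D.Double(0, 0, 600, 100));
        sketch.addRegion("body", new Rectangle2D.Double(0, 100, 600, 600));
        sketch.accept("Title", 50, 40);   // falls inside "header"
        sketch.accept("Hello", 50, 300);  // falls inside "body"
        System.out.println(sketch.getTextForRegion("header")); // Title
        System.out.println(sketch.getTextForRegion("body"));   // Hello
    }
}
```

In this model, text that falls outside every rectangle is simply dropped, and overlapping rectangles would each receive the same text.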
Methods inherited from class PDFTextStripper: beginMarkedContentSequence, endArticle, endDocument, endMarkedContentSequence, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIgnoreContentStreamSpaceGlyphs, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIgnoreContentStreamSpaceGlyphs, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageEnd, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeText, writeWordSeparator
Methods inherited from class PDFStreamEngine: addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, isShouldProcessColorOperators, markedContentPoint, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
public PDFTextStripperByArea() throws IOException
Throws:
IOException - If there is an error loading properties.

public final void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
Overrides:
setShouldSeparateByBeads in class PDFTextStripper
Parameters:
aShouldSeparateByBeads - The new grouping of beads.

public void addRegion(String regionName, Rectangle2D rect)
Parameters:
regionName - The name of the region.
rect - The rectangle area to retrieve the text from. The y-coordinates are Java coordinates (y == 0 is top), not PDF coordinates (y == 0 is bottom).

public void removeRegion(String regionName)
Parameters:
regionName - The name of the region to delete.

public List<String> getRegions()
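
Because addRegion expects y measured from the top of the page, a rectangle known in PDF coordinates (y == 0 at the bottom) has to be flipped before use. The sketch below shows one way to do that with only the JDK. The class `RegionCoordinates`, the method `toJavaRect`, and the `pageHeight` parameter are hypothetical helpers, not part of PDFBox, and the sketch assumes an unrotated page whose lower-left corner is at the origin.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper; not part of PDFBox.
public class RegionCoordinates {

    /**
     * Converts a rectangle given in PDF coordinates (y == 0 at the bottom)
     * into Java coordinates (y == 0 at the top).
     * Assumes an unrotated page with its lower-left corner at (0, 0).
     */
    public static Rectangle2D toJavaRect(double x, double lowerY, double width,
                                         double height, double pageHeight) {
        // In PDF space the top edge of the rectangle is at lowerY + height.
        // Measured from the top of the page, that edge is at
        // pageHeight - (lowerY + height).
        double topY = pageHeight - (lowerY + height);
        return new Rectangle2D.Double(x, topY, width, height);
    }

    public static void main(String[] args) {
        // A 200 x 50 box whose lower edge is 700 units above the bottom
        // of a 792-unit-tall page.
        Rectangle2D rect = toJavaRect(72, 700, 200, 50, 792);
        System.out.println(rect.getY()); // 792 - (700 + 50) = 42.0
        // The result could then be passed to addRegion(regionName, rect).
    }
}
```

Pages that are rotated or whose boxes do not start at the origin would need an additional offset or transform, which this sketch leaves out.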

public String getTextForRegion(String regionName)
Parameters:
regionName - The name of the region to get the text from.

public void extractRegions(PDPage page) throws IOException
Parameters:
page - The page to extract the regions from.
Throws:
IOException - If there is an error while extracting text.

protected void processTextPosition(TextPosition text)
Overrides:
processTextPosition in class PDFTextStripper
Parameters:
text - The text to process.

protected void writePage() throws IOException
Overrides:
writePage in class PDFTextStripper
Throws:
IOException - If there is an error writing the text.

protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, Vector displacement) throws IOException
Overrides:
showGlyph in class PDFStreamEngine
Parameters:
textRenderingMatrix - the current text rendering matrix, Trm
font - the current font
code - internal PDF character code for the glyph
displacement - the displacement (i.e. advance) of the glyph in text space
Throws:
IOException - if the glyph cannot be processed

protected float computeFontHeight(PDFont font) throws IOException
Parameters:
font - the font.
Throws:
IOException - if there is an error while getting the font bounding box.

Copyright © 2002–2025 The Apache Software Foundation. All rights reserved.