PDFMarkedContentExtractor (Apache PDFBox 1.7.0 API)

org.apache.pdfbox.util
Class PDFMarkedContentExtractor

java.lang.Object
  org.apache.pdfbox.util.PDFStreamEngine
      org.apache.pdfbox.util.PDFMarkedContentExtractor

public class PDFMarkedContentExtractor
extends PDFStreamEngine

This is a stream engine that extracts the marked content of a PDF.

Version:: $Revision$
Author:: koch

Field Summary
`protected String`	`outputEncoding` encoding that text will be written in (or null).

Constructor Summary
`PDFMarkedContentExtractor()` Instantiate a new PDFMarkedContentExtractor object.
`PDFMarkedContentExtractor(Properties props)` Instantiate a new PDFMarkedContentExtractor object.
`PDFMarkedContentExtractor(String encoding)` Instantiate a new PDFMarkedContentExtractor object.

Method Summary
`void`	`beginMarkedContentSequence(COSName tag, COSDictionary properties)`
`void`	`endMarkedContentSequence()`
`List<PDMarkedContent>`	`getMarkedContents()`
`protected void`	`processTextPosition(TextPosition text)` This will process a TextPosition object and add the text to the list of characters on a page.
`void`	`xobject(PDXObject xobject)`

Methods inherited from class org.apache.pdfbox.util.PDFStreamEngine
`getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, inspectFontEncoding, isForceParsing, processEncodedText, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, resetEngine, setColorSpaces, setFonts, setForceParsing, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

outputEncoding

protected String outputEncoding

encoding that text will be written in (or null).

Constructor Detail

PDFMarkedContentExtractor

public PDFMarkedContentExtractor()
                          throws IOException

Instantiate a new PDFMarkedContentExtractor object. This object will load properties from PDFMarkedContentExtractor.properties and will not apply any encoding-specific conversion to the output text.

Throws:: IOException - If there is an error loading the properties.

PDFMarkedContentExtractor

public PDFMarkedContentExtractor(Properties props)
                          throws IOException

Instantiate a new PDFMarkedContentExtractor object, loading all of the operator mappings from the properties object that is passed in. This object will not apply any encoding-specific conversion to the output text.

Parameters:: props - The properties containing the mapping of operators to PDFOperator classes.
Throws:: IOException - If there is an error reading the properties.
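
The props parameter is described as a mapping of operators to PDFOperator classes. As a rough illustration of what such a mapping could look like, the sketch below builds a java.util.Properties object from text. BMC, BDC and EMC are the PDF marked-content operators; the class names under com.example are hypothetical placeholders, not classes shipped with PDFBox, and the helper name loadMappings is an assumption made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Illustrative helper; not part of PDFBox. */
final class OperatorMappingSketch {

    /** Parses "operator=handler class" lines into a Properties object. */
    static Properties loadMappings(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string is not expected to fail.
            throw new IllegalArgumentException("Could not parse mappings", e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Placeholder class names: replace them with real operator handler classes.
        String text =
              "BMC=com.example.BeginMarkedContentHandler\n"
            + "BDC=com.example.BeginMarkedContentWithPropertiesHandler\n"
            + "EMC=com.example.EndMarkedContentHandler\n";
        Properties props = loadMappings(text);
        for (String operator : props.stringPropertyNames()) {
            System.out.println(operator + " -> " + props.getProperty(operator));
        }
        // With PDFBox on the classpath and real handler class names in place,
        // the object could then be handed to the constructor:
        // new PDFMarkedContentExtractor(props);
    }
}
```

Passing the placeholder names as they stand would not give a working extractor, since the handler classes have to exist on the classpath.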

PDFMarkedContentExtractor

public PDFMarkedContentExtractor(String encoding)
                          throws IOException

Instantiate a new PDFMarkedContentExtractor object. This object will load properties from PDFMarkedContentExtractor.properties and will apply encoding-specific conversions to the output text.

Parameters:: encoding - The encoding that the output will be written in.
Throws:: IOException - If there is an error reading the properties.

Method Detail

beginMarkedContentSequence

public void beginMarkedContentSequence(COSName tag,
                                       COSDictionary properties)

endMarkedContentSequence

public void endMarkedContentSequence()
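
The two callbacks above are paired: one opens a marked-content sequence and the other closes it, and sequences in a PDF content stream may nest. This page does not document how the class tracks that nesting, so the following is only a minimal sketch of one common approach, a stack of open sequences. MarkedNode and MarkedContentNestingSketch are hypothetical stand-ins written for this example; they are not PDFBox classes and should not be read as the library's implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for a marked-content node; not PDMarkedContent. */
final class MarkedNode {
    final String tag;
    // Text strings and nested MarkedNode objects, in the order they were seen.
    final List<Object> contents = new ArrayList<>();

    MarkedNode(String tag) {
        this.tag = tag;
    }
}

/** Illustrative collector with the same begin/text/end callback shape. */
final class MarkedContentNestingSketch {
    private final Deque<MarkedNode> open = new ArrayDeque<>();
    private final List<MarkedNode> roots = new ArrayList<>();

    /** Opens a sequence and attaches it to its parent, or to the root list. */
    void begin(String tag) {
        MarkedNode node = new MarkedNode(tag);
        if (open.isEmpty()) {
            roots.add(node);
        } else {
            open.peek().contents.add(node);
        }
        open.push(node);
    }

    /** Adds text to the innermost open sequence; text outside any sequence is dropped. */
    void text(String s) {
        if (!open.isEmpty()) {
            open.peek().contents.add(s);
        }
    }

    /** Closes the innermost open sequence; an unmatched end is ignored. */
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    List<MarkedNode> roots() {
        return roots;
    }

    public static void main(String[] args) {
        MarkedContentNestingSketch c = new MarkedContentNestingSketch();
        c.begin("P");
        c.text("Hello ");
        c.begin("Span");
        c.text("world");
        c.end();
        c.end();
        MarkedNode p = c.roots().get(0);
        System.out.println(p.tag + " holds " + p.contents.size() + " items");
    }
}
```

In this sketch only top-level sequences land in the root list, and nested ones are reachable through their parent's contents.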

xobject

public void xobject(PDXObject xobject)

processTextPosition

protected void processTextPosition(TextPosition text)

This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.

Overrides:: processTextPosition in class PDFStreamEngine

Parameters:: text - The text to process.
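
The description says that overlapping text is taken care of but does not state how. One plausible reading is that a glyph is dropped when the same text was already seen at nearly the same position, which can happen when a PDF draws a string twice to simulate bold. The sketch below illustrates that idea only. The Glyph record, the accept method and the one-third-of-width tolerance are all assumptions made for this example, not the PDFBox implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative duplicate filter; not part of PDFBox. */
final class OverlapFilterSketch {

    /** Hypothetical glyph record; not the PDFBox TextPosition class. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;
        final float width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Previously accepted glyphs, grouped by their text.
    private final Map<String, List<Glyph>> seen = new HashMap<>();

    /** Returns true to keep the glyph, false if it repeats an earlier glyph nearby. */
    boolean accept(Glyph g) {
        List<Glyph> sameText = seen.get(g.text);
        if (sameText == null) {
            sameText = new ArrayList<>();
            seen.put(g.text, sameText);
        }
        float tolerance = g.width / 3.0f; // assumed threshold
        for (Glyph other : sameText) {
            if (Math.abs(other.x - g.x) < tolerance
                    && Math.abs(other.y - g.y) < tolerance) {
                return false;
            }
        }
        sameText.add(g);
        return true;
    }
}
```

A caller would run each incoming glyph through accept and append only the accepted ones to the page's character list.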

getMarkedContents

public List<PDMarkedContent> getMarkedContents()
