org.apache.pdfbox.util
Class PDFMarkedContentExtractor

java.lang.Object
  extended by org.apache.pdfbox.util.PDFStreamEngine
      extended by org.apache.pdfbox.util.PDFMarkedContentExtractor

public class PDFMarkedContentExtractor
extends PDFStreamEngine

This is a stream engine that extracts the marked content of a PDF.

Version:
$Revision$
Author:
koch

Field Summary
protected  String outputEncoding
          The encoding that text will be written in (or null).
 
Constructor Summary
PDFMarkedContentExtractor()
          Instantiate a new PDFMarkedContentExtractor object.
PDFMarkedContentExtractor(Properties props)
          Instantiate a new PDFMarkedContentExtractor object.
PDFMarkedContentExtractor(String encoding)
          Instantiate a new PDFMarkedContentExtractor object.
 
Method Summary
 void beginMarkedContentSequence(COSName tag, COSDictionary properties)
           
 void endMarkedContentSequence()
           
 List<PDMarkedContent> getMarkedContents()
           
protected  void processTextPosition(TextPosition text)
          This will process a TextPosition object and add the text to the list of characters on a page.
 void xobject(PDXObject xobject)
           
 
Methods inherited from class org.apache.pdfbox.util.PDFStreamEngine
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, inspectFontEncoding, isForceParsing, processEncodedText, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, resetEngine, setColorSpaces, setFonts, setForceParsing, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

outputEncoding

protected String outputEncoding
The encoding that text will be written in (or null).

Constructor Detail

PDFMarkedContentExtractor

public PDFMarkedContentExtractor()
                          throws IOException
Instantiate a new PDFMarkedContentExtractor object. This object will load its operator mappings from PDFMarkedContentExtractor.properties and will not apply any encoding-specific conversion to the output text.

Throws:
IOException - If there is an error loading the properties.

PDFMarkedContentExtractor

public PDFMarkedContentExtractor(Properties props)
                          throws IOException
Instantiate a new PDFMarkedContentExtractor object. This object will load all of its operator mappings from the properties object that is passed in and will not apply any encoding-specific conversion to the output text.

Parameters:
props - The properties containing the mapping of operators to PDFOperator classes.
Throws:
IOException - If there is an error reading the properties.
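
The following is a minimal sketch of what such a properties object could look like, not the contents of the bundled PDFMarkedContentExtractor.properties file. BMC, BDC and EMC are the PDF operators that begin and end marked-content sequences; the class names under com.example.ops are placeholders invented for illustration, and the names OperatorMappingSketch and loadMappings are likewise hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.TreeSet;

class OperatorMappingSketch {

    /** Builds an operator-to-class mapping in java.util.Properties format. */
    static Properties loadMappings() throws IOException {
        // Keys are PDF content stream operators; values are placeholder
        // class names (assumption: real entries name operator processor classes).
        String text =
              "BMC=com.example.ops.BeginMarkedContent\n"
            + "BDC=com.example.ops.BeginMarkedContentWithProperties\n"
            + "EMC=com.example.ops.EndMarkedContent\n";
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = loadMappings();
        // Sort the keys so the listing is stable across runs.
        for (String op : new TreeSet<String>(props.stringPropertyNames())) {
            System.out.println(op + " -> " + props.getProperty(op));
        }
        // With PDFBox on the classpath, the object would then be passed to
        // the constructor documented above:
        // PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor(props);
    }
}
```

Since the placeholder classes do not exist, this mapping only illustrates the file format; a real mapping would have to name operator processor classes that are actually available on the classpath.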

PDFMarkedContentExtractor

public PDFMarkedContentExtractor(String encoding)
                          throws IOException
Instantiate a new PDFTextStripper object. This object will load properties from PDFMarkedContentExtractor.properties and will apply encoding-specific conversions to the output text.

Parameters:
encoding - The encoding that the output will be written in.
Throws:
IOException - If there is an error reading the properties.

Method Detail

beginMarkedContentSequence

public void beginMarkedContentSequence(COSName tag,
                                       COSDictionary properties)

endMarkedContentSequence

public void endMarkedContentSequence()

xobject

public void xobject(PDXObject xobject)

processTextPosition

protected void processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.

Overrides:
processTextPosition in class PDFStreamEngine
Parameters:
text - The text to process.

getMarkedContents

public List<PDMarkedContent> getMarkedContents()
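
The methods above carry little description, so the following self-contained sketch models how begin and end calls could pair up into the nested list that getMarkedContents() returns. It is a simplified model under stated assumptions, not the PDFBox implementation: MarkedNode stands in for PDMarkedContent, plain strings stand in for TextPosition objects, and it assumes that text outside any marked-content sequence is dropped and that an unbalanced end call is ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for PDMarkedContent: a tag plus ordered contents. */
class MarkedNode {
    final String tag;
    // Holds text runs (String) and nested MarkedNode objects, in order.
    final List<Object> contents = new ArrayList<Object>();

    MarkedNode(String tag) {
        this.tag = tag;
    }

    @Override
    public String toString() {
        return tag + contents;
    }
}

/** Simplified model of begin/end nesting using a stack of open sequences. */
class MarkedContentNestingSketch {
    private final List<MarkedNode> roots = new ArrayList<MarkedNode>();
    private final Deque<MarkedNode> open = new ArrayDeque<MarkedNode>();

    /** Models beginMarkedContentSequence (properties omitted for brevity). */
    void begin(String tag) {
        MarkedNode node = new MarkedNode(tag);
        if (open.isEmpty()) {
            roots.add(node);                 // top-level sequence
        } else {
            open.peek().contents.add(node);  // nested inside the innermost open one
        }
        open.push(node);
    }

    /** Models endMarkedContentSequence; an unbalanced call is ignored. */
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    /** Models processTextPosition; text outside any sequence is dropped. */
    void text(String s) {
        if (!open.isEmpty()) {
            open.peek().contents.add(s);
        }
    }

    /** Models getMarkedContents. */
    List<MarkedNode> getRoots() {
        return roots;
    }

    static MarkedContentNestingSketch demo() {
        MarkedContentNestingSketch sketch = new MarkedContentNestingSketch();
        sketch.text("stray");     // no open sequence, so this is dropped
        sketch.begin("P");
        sketch.text("Hello");
        sketch.begin("Span");
        sketch.text("world");
        sketch.end();             // closes Span
        sketch.end();             // closes P
        sketch.begin("Artifact");
        sketch.text("footer");
        sketch.end();             // closes Artifact
        sketch.end();             // unbalanced, ignored
        return sketch;
    }

    public static void main(String[] args) {
        // prints [P[Hello, Span[world]], Artifact[footer]]
        System.out.println(demo().getRoots());
    }
}
```

The stack keeps track of the innermost open sequence, so nested sequences end up as children of their enclosing sequence while only top-level sequences appear in the returned list.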


Copyright © 2002-2012 The Apache Software Foundation. All Rights Reserved.