org.apache.pdfbox.pdfparser
Class ConformingPDFParser

java.lang.Object
  extended by org.apache.pdfbox.pdfparser.BaseParser
      extended by org.apache.pdfbox.pdfparser.ConformingPDFParser

public class ConformingPDFParser
extends BaseParser

Author:
Adam Nichols

Field Summary
protected  RandomAccess inputFile
           
 
Fields inherited from class org.apache.pdfbox.pdfparser.BaseParser
DEF, document, ENDOBJ, ENDSTREAM, FORCE_PARSING, forceParsing, pdfSource
 
Constructor Summary
ConformingPDFParser(File inputFile)
          Constructor.
 
Method Summary
protected  byte consumeWhitespace()
          This will read all bytes until a non-whitespace character is found.
protected  byte consumeWhitespaceBackwards()
          This will read all bytes (backwards) until a non-whitespace character is found.
 COSDocument getDocument()
          This will get the document that was parsed.
 COSBase getObject(long objectNumber, long generation)
           
 PDDocument getPDDocument()
          This will get the PD document that was parsed.
 boolean isRecursivlyRead()
           
 void parse()
          This will parse the stream and populate the COSDocument object.
protected  COSNumber parseNumber(String number)
           
protected  long parseTrailerInformation()
           
protected  COSBase processCosObject(String string)
           
protected  String readBackwardUntilWhitespace()
           
protected  byte readByte()
           
protected  byte readByteBackwards()
           
protected  COSDictionary readDictionaryBackwards()
           
protected  int readInt()
          This will read an integer from the stream.
protected  String readLine()
          This will read a line starting with the byte at offset and going forward until it finds a newline.
protected  String readLineBackwards()
          This will read a line starting with the byte at offset and going backwards until it finds a newline.
protected  long readLongBackwards()
          This will consume any whitespace, read in bytes until whitespace is found again and then parse the characters which have been read as a long.
protected  COSName readNameBackwards()
           
protected  COSNumber readNumber()
          This will read in a number and return the COS version of the number (be it a COSInteger or a COSFloat).
protected  COSBase readObject()
          This actually reads the object data.
 COSBase readObject(long objectNumber, long generation)
          This will read an object from the inputFile at whatever our currentOffset is.
protected  COSBase readObjectBackwards()
           
protected  String readString()
          This will read the next string from the stream.
protected  String readWord()
           
 void setRecursivlyRead(boolean recursivlyRead)
           
 
Methods inherited from class org.apache.pdfbox.pdfparser.BaseParser
isClosing, isClosing, isEndOfName, isEOL, isEOL, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSStream, parseCOSString, parseDirObject, readExpectedString, readString, setDocument, skipSpaces
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

inputFile

protected RandomAccess inputFile
Constructor Detail

ConformingPDFParser

public ConformingPDFParser(File inputFile)
                    throws IOException
Constructor.

Parameters:
inputFile - The file that contains the PDF document.
Throws:
IOException - If there is an error initializing the stream.
Method Detail

parse

public void parse()
           throws IOException
This will parse the stream and populate the COSDocument object. This will close the stream when it is done parsing.

Throws:
IOException - If there is an error reading from the stream or corrupt data is found.

getDocument

public COSDocument getDocument()
                        throws IOException
This will get the document that was parsed. parse() must be called before this is called. When you are done with this document you must call close() on it to release resources.

Returns:
The document that was parsed.
Throws:
IOException - If there is an error getting the document.

getPDDocument

public PDDocument getPDDocument()
                         throws IOException
This will get the PD document that was parsed. When you are done with this document you must call close() on it to release resources.

Returns:
The document at the PD layer.
Throws:
IOException - If there is an error getting the document.

parseTrailerInformation

protected long parseTrailerInformation()
                                throws IOException,
                                       NumberFormatException
Throws:
IOException
NumberFormatException

readByteBackwards

protected byte readByteBackwards()
                          throws IOException
Throws:
IOException

readByte

protected byte readByte()
                 throws IOException
Throws:
IOException

readBackwardUntilWhitespace

protected String readBackwardUntilWhitespace()
                                      throws IOException
Throws:
IOException

consumeWhitespaceBackwards

protected byte consumeWhitespaceBackwards()
                                   throws IOException
This will read all bytes (backwards) until a non-whitespace character is found. To save you an extra read, the non-whitespace character is returned. If the current character is not whitespace, this method will just return the current char.

Returns:
the first non-whitespace character found
Throws:
IOException - if there is an error reading from the file
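
Example (illustrative sketch, not PDFBox code): the backward whitespace scan described above can be pictured on an in-memory byte array. The class and method names below are hypothetical. Unlike consumeWhitespaceBackwards(), which returns the byte itself, this sketch returns the index of the first non-whitespace byte so that the caller can see where the scan stopped. The whitespace set is the one defined by the PDF specification (NUL, TAB, LF, FF, CR, SPACE).

```java
/** Hypothetical helper for illustration only; not part of PDFBox. */
final class BackwardWhitespaceSketch {

    /** True for the six whitespace bytes defined by the PDF specification. */
    static boolean isPdfWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    /**
     * Starting at offset, walks toward the start of the data and returns the
     * index of the first non-whitespace byte, or -1 if only whitespace is found.
     * If the byte at offset is not whitespace, offset itself is returned.
     */
    static int skipWhitespaceBackwards(byte[] data, int offset) {
        int i = offset;
        while (i >= 0 && isPdfWhitespace(data[i])) {
            i--;
        }
        return i;
    }
}
```

A caller would read data[index] to obtain the byte that the documented method returns directly.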

consumeWhitespace

protected byte consumeWhitespace()
                          throws IOException
This will read all bytes until a non-whitespace character is found. To save you an extra read, the non-whitespace character is returned. If the current character is not whitespace, this method will just return the current char.

Returns:
the first non-whitespace character found
Throws:
IOException - if there is an error reading from the file

readLongBackwards

protected long readLongBackwards()
                          throws IOException,
                                 NumberFormatException
This will consume any whitespace, read in bytes until whitespace is found again and then parse the characters which have been read as a long. The current offset will then point at the first whitespace character which precedes the number.

Returns:
the parsed number
Throws:
IOException - if there is an error reading from the file
NumberFormatException - if the bytes read can not be converted to a number
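
Example (illustrative sketch, not PDFBox code): the three steps described above (skip trailing whitespace, collect bytes back to the previous whitespace, parse as a long) might look as follows on an in-memory byte array. All names are hypothetical, and returning the new offset alongside the value is an assumption made so the sketch needs no mutable state.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class ReadLongBackwardsSketch {

    /** True for the six whitespace bytes defined by the PDF specification. */
    static boolean isPdfWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    /**
     * Reads the decimal number that ends at or before offset.
     * Returns { value, newOffset }, where newOffset is the index of the
     * whitespace byte that precedes the number (-1 if the number starts at 0).
     * Throws NumberFormatException if the token is empty or not a number.
     */
    static long[] readLongBackwards(byte[] data, int offset) {
        int end = offset;
        while (end >= 0 && isPdfWhitespace(data[end])) {
            end--; // step 1: consume whitespace
        }
        int start = end;
        while (start >= 0 && !isPdfWhitespace(data[start])) {
            start--; // step 2: walk back to the previous whitespace
        }
        // step 3: bytes (start, end] form the token
        String token = new String(data, start + 1, end - start, StandardCharsets.US_ASCII);
        return new long[] { Long.parseLong(token), start };
    }
}
```

Reading backwards like this is useful near the end of a PDF file, where a byte offset typically sits on the line before the end-of-file marker.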

readInt

protected int readInt()
               throws IOException
Description copied from class: BaseParser
This will read an integer from the stream.

Overrides:
readInt in class BaseParser
Returns:
The integer that was read from the stream.
Throws:
IOException - If there is an error reading from the stream.

readNumber

protected COSNumber readNumber()
                        throws IOException
This will read in a number and return the COS version of the number (be it a COSInteger or a COSFloat).

Returns:
the COSNumber which was read/parsed
Throws:
IOException
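
Example (illustrative sketch, not PDFBox code): the integer-versus-float decision mentioned above can be pictured with plain Java types standing in for COSInteger and COSFloat. The rule used here (a token containing a decimal point is a real number) is an assumption based on PDF number syntax, not a statement about how this class decides. The class name is hypothetical.

```java
/** Hypothetical helper for illustration only; not part of PDFBox. */
final class NumberTokenSketch {

    /**
     * Returns a Long for integer tokens such as "42" or "-7" and a Double for
     * real tokens such as "3.5" or "-.002". Long and Double stand in for
     * COSInteger and COSFloat in this sketch.
     * Throws NumberFormatException if the token is not a number.
     */
    static Number parseNumber(String token) {
        if (token.indexOf('.') >= 0) {
            return Double.valueOf(token);
        }
        return Long.valueOf(token);
    }
}
```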

parseNumber

protected COSNumber parseNumber(String number)
                         throws IOException
Throws:
IOException

processCosObject

protected COSBase processCosObject(String string)
                            throws IOException
Throws:
IOException

readObjectBackwards

protected COSBase readObjectBackwards()
                               throws IOException
Throws:
IOException

readNameBackwards

protected COSName readNameBackwards()
                             throws IOException
Throws:
IOException

getObject

public COSBase getObject(long objectNumber,
                         long generation)
                  throws IOException
Throws:
IOException

readObject

public COSBase readObject(long objectNumber,
                          long generation)
                   throws IOException
This will read an object from the inputFile at whatever our currentOffset is. If the object and generation are not the expected values and this object is set to throw an exception for non-conforming documents, then an exception will be thrown.

Parameters:
objectNumber - the object number you expect to read
generation - the generation you expect this object to be
Returns:
the object which was read
Throws:
IOException
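
Example (illustrative sketch, not PDFBox code): the expected-value check described above can be pictured as validating an indirect-object header such as "12 0 obj". The class name, the strict flag, and the use of an unchecked exception are all assumptions made for illustration; the documented method declares IOException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class ObjectHeaderSketch {

    // "<object number> <generation> obj"; digit counts are capped to fit in a long
    private static final Pattern HEADER =
            Pattern.compile("(\\d{1,18})\\s+(\\d{1,18})\\s+obj");

    /**
     * Compares a header line with the expected object number and generation.
     * With strict set, a mismatch raises an exception; otherwise the method
     * reports the mismatch by returning false.
     */
    static boolean checkHeader(String header, long objectNumber,
                               long generation, boolean strict) {
        Matcher m = HEADER.matcher(header.trim());
        boolean matches = m.matches()
                && Long.parseLong(m.group(1)) == objectNumber
                && Long.parseLong(m.group(2)) == generation;
        if (!matches && strict) {
            throw new IllegalStateException("Expected object " + objectNumber
                    + " " + generation + " but found: " + header.trim());
        }
        return matches;
    }
}
```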

readObject

protected COSBase readObject()
                      throws IOException
This actually reads the object data.

Returns:
the object which is read
Throws:
IOException

readString

protected String readString()
                     throws IOException
This will read the next string from the stream.

Overrides:
readString in class BaseParser
Returns:
The string that was read from the stream.
Throws:
IOException - If there is an error reading from the stream.

readDictionaryBackwards

protected COSDictionary readDictionaryBackwards()
                                         throws IOException
Throws:
IOException

readLineBackwards

protected String readLineBackwards()
                            throws IOException
This will read a line starting with the byte at the current offset and going backwards until it finds a newline. This should only be used if we are certain that the data will only be text, and not binary data.

Returns:
the string which was read
Throws:
IOException - if there was an error reading data from the file
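
Example (illustrative sketch, not PDFBox code): a backward line read over an in-memory byte array. The names are hypothetical, and treating both CR and LF as line terminators is an assumption. The bytes are decoded as ASCII, which is why such a read only makes sense on textual parts of a file.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class ReadLineBackwardsSketch {

    /**
     * Returns the text between the nearest preceding newline (exclusive) and
     * offset (inclusive). Returns an empty string if the byte at offset is
     * itself a newline.
     */
    static String readLineBackwards(byte[] data, int offset) {
        int start = offset;
        while (start >= 0 && data[start] != '\n' && data[start] != '\r') {
            start--; // walk back until a line terminator or the start of data
        }
        return new String(data, start + 1, offset - start, StandardCharsets.US_ASCII);
    }
}
```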

readLine

protected String readLine()
                   throws IOException
This will read a line starting with the byte at the current offset and going forward until it finds a newline. This should only be used if we are certain that the data will only be text, and not binary data.

Overrides:
readLine in class BaseParser
Returns:
the string which was read
Throws:
IOException - if there was an error reading data from the file

readWord

protected String readWord()
                   throws IOException
Throws:
IOException

isRecursivlyRead

public boolean isRecursivlyRead()
Returns:
the recursivlyRead

setRecursivlyRead

public void setRecursivlyRead(boolean recursivlyRead)
Parameters:
recursivlyRead - the recursivlyRead to set


Copyright © 2002-2012 The Apache Software Foundation. All Rights Reserved.