java.lang.Object

org.archive.modules.extractor.PDFParser

All Implemented Interfaces: Closeable, AutoCloseable

public class PDFParser extends Object implements Closeable

Supports PDF parsing operations. For now this primarily means extracting URIs, but the logic in extractURIs() could easily be adapted or extended for a variety of PDF processing tasks.

Author: Parker Thompson
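
The lifecycle implied by this class (construct, extract, read the results, close) can be sketched as follows. So that the snippet compiles without Heritrix and PDFBox on the classpath, it uses a hypothetical stand-in, `StandInParser`, that copies only the documented signatures (`extractURIs()`, `getURIs()`, `close()`); its body is invented for illustration and does no real PDF parsing. In real code the stand-in would presumably be replaced by `org.archive.modules.extractor.PDFParser`.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;

// Hypothetical stand-in that mirrors the documented PDFParser signatures.
// It does NOT parse PDFs; it only lets this sketch compile on its own.
class StandInParser implements Closeable {
    protected byte[] document;
    protected ArrayList<String> foundURIs = new ArrayList<>();

    StandInParser(byte[] doc) throws IOException {
        if (doc == null) {
            throw new IOException("no document supplied");
        }
        this.document = doc;
    }

    // Placeholder for the real extraction step: records one fixed URI.
    public ArrayList<String> extractURIs() throws IOException {
        foundURIs.add("http://example.com/");
        return foundURIs;
    }

    public ArrayList<String> getURIs() {
        return foundURIs;
    }

    @Override
    public void close() throws IOException {
        // The real class would release its document reader here.
        document = null;
    }
}

class LifecycleSketch {
    // Opens a parser, extracts URIs, and lets try-with-resources close it,
    // even if extraction throws.
    static ArrayList<String> extract(byte[] pdfBytes) throws IOException {
        try (StandInParser parser = new StandInParser(pdfBytes)) {
            parser.extractURIs();
            return parser.getURIs();
        }
    }

    public static void main(String[] args) throws IOException {
        for (String uri : extract(new byte[0])) {
            System.out.println(uri);
        }
    }
}
```

Because the class implements Closeable, try-with-resources is a natural fit: close() runs whether or not extraction succeeds.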

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| `protected byte[]` | `document` | |
| `protected org.apache.pdfbox.pdmodel.PDDocument` | `documentReader` | |
| `protected ArrayList<String>` | `foundURIs` | |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| `PDFParser(byte[] doc)` | |
| `PDFParser(String doc)` | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| `void` | `close()` | |
| `ArrayList<String>` | `extractURIs()` | Extract URIs from all objects found in a PDF document's catalog. |
| `protected void` | `getInFromFile(String doc)` | Read a file named 'doc' and store its bytes for later processing. |
| `ArrayList<String>` | `getURIs()` | Get a list of URIs retrieved from the PDF during the extractURIs operation. |
| `protected void` | `initialize()` | Opens the document for reading. |
| `static void` | `main(String[] argv)` | |
| `protected void` | `resetState()` | Reinitialize the object as though a new one were created. |
| `void` | `resetState(byte[] doc)` | Reset the object and initialize it with a new byte array (the document). |
| `void` | `resetState(String doc)` | Reinitialize the object as though a new one were created, complete with a valid pointer to a document that can be read. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- foundURIs
  
  protected ArrayList<String> foundURIs
- documentReader
  
  protected org.apache.pdfbox.pdmodel.PDDocument documentReader
- document
  
  protected byte[] document
Constructor Details
- PDFParser
  
  public PDFParser(String doc) throws IOException
  
  Throws:
  
  IOException
- PDFParser
  
  public PDFParser(byte[] doc) throws IOException
  
  Throws:
  
  IOException
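
Both constructors declare IOException. Going by the getInFromFile(String doc) description, the String form appears to treat doc as a file name and load that file's bytes. A minimal sketch of that loading step, using only `java.nio.file`; the names `FileBytesSketch` and `readDocument` are hypothetical, and the actual implementation may read the file differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class FileBytesSketch {
    // Reads the whole file named by 'doc' into memory.
    // Throws IOException if the file is missing or unreadable.
    static byte[] readDocument(String doc) throws IOException {
        return Files.readAllBytes(Paths.get(doc));
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny placeholder file, then read it back.
        Path tmp = Files.createTempFile("sketch", ".pdf");
        try {
            Files.write(tmp, "%PDF".getBytes(StandardCharsets.US_ASCII));
            byte[] bytes = readDocument(tmp.toString());
            System.out.println(bytes.length + " bytes read");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The resulting byte array is the kind of value the byte[] constructor and resetState(byte[] doc) accept.
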
Method Details
- resetState
  
  protected void resetState()
  
  Reinitialize the object as though a new one were created.
- resetState
  
  public void resetState(byte[] doc) throws IOException
  
  Reset the object and initialize it with a new byte array (the document).
  
  Parameters:
  
  doc - the new document, as a byte array
  
  Throws:
  
  IOException
- resetState
  
  public void resetState(String doc) throws IOException
  
  Reinitialize the object as though a new one were created, complete with a valid pointer to a document that can be read.
  
  Parameters:
  
  doc - name of the file containing the new document
  
  Throws:
  
  IOException
- getInFromFile
  
  protected void getInFromFile(String doc) throws IOException
  
  Read a file named 'doc' and store its bytes for later processing.
  
  Parameters:
  
  doc - name of the file to read
  
  Throws:
  
  IOException
- getURIs
  
  public ArrayList<String> getURIs()
  
  Get a list of URIs retrieved from the PDF during the extractURIs operation.
  
  Returns:
  
  A list of URIs retrieved from the PDF during the extractURIs operation.
- initialize
  
  protected void initialize() throws IOException
  
  Opens the document for reading. The constructor does this implicitly, so this method should only need to be called directly after a reset.
  
  Throws:
  
  IOException
- extractURIs
  
  public ArrayList<String> extractURIs() throws IOException
  
  Extract URIs from all objects found in a PDF document's catalog. Returns an array list of all URIs found in the document catalog tree.
  
  Returns:
  
  URIs from all objects found in a PDF document's catalog.
  
  Throws:
  
  IOException
- close
  
  public void close() throws IOException
  
  Specified by:
  
  close in interface AutoCloseable
  
  Specified by:
  
  close in interface Closeable
  
  Throws:
  
  IOException
- main
  
  public static void main(String[] argv)
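
To give a feel for what extractURIs() has to find, the sketch below scans raw bytes for literal `/URI (...)` entries, the form a URI action takes in an uncompressed PDF object. This is a deliberately naive substitute and not the PDFBox-based catalog traversal the class uses: it misses URIs inside compressed streams and truncates strings that contain an escaped closing parenthesis. The name `NaiveUriScan` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveUriScan {
    // Matches a URI entry of the form: /URI (http://example.com/)
    private static final Pattern URI_ENTRY =
            Pattern.compile("/URI\\s*\\(([^)]*)\\)");

    static ArrayList<String> scan(byte[] document) {
        // ISO-8859-1 maps every byte to exactly one char, so no bytes are lost.
        String text = new String(document, StandardCharsets.ISO_8859_1);
        ArrayList<String> found = new ArrayList<>();
        Matcher m = URI_ENTRY.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // A hand-written fragment resembling a URI action dictionary.
        byte[] fragment = "<< /S /URI /URI (http://example.com/a) >>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(scan(fragment));
    }
}
```

A parser built on a PDF library avoids these limitations because it decodes the object tree before looking for URIs.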
