Class PDFParser

java.lang.Object
org.archive.modules.extractor.PDFParser
All Implemented Interfaces:
Closeable, AutoCloseable

public class PDFParser extends Object implements Closeable
Supports PDF parsing operations. For now this primarily means extracting URIs, but the logic in extractURIs() could easily be adapted or extended for a variety of PDF processing tasks.
Author:
Parker Thompson
  • Field Details

    • foundURIs

      protected ArrayList<String> foundURIs
    • documentReader

      protected org.apache.pdfbox.pdmodel.PDDocument documentReader
    • document

      protected byte[] document
  • Constructor Details

  • Method Details

    • resetState

      protected void resetState()
      Reinitialize the object as though a new one were created.
    • resetState

      public void resetState(byte[] doc) throws IOException
      Reset the object and initialize it with a new byte array (the document).
      Parameters:
      doc - the new document, as a byte array
      Throws:
      IOException
    • resetState

      public void resetState(String doc) throws IOException
      Reinitialize the object as though a new one were created, complete with a valid pointer to a document that can be read.
      Parameters:
      doc - identifies the document to read
      Throws:
      IOException
    • getInFromFile

      protected void getInFromFile(String doc) throws IOException
      Read a file named 'doc' and store its bytes for later processing.
      Parameters:
      doc - the name of the file to read
      Throws:
      IOException
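
The description of getInFromFile above can be illustrated with standard library calls. This is a minimal sketch, not the class's actual implementation, which this page does not show: the class name FileBytesSketch and the method readAll are hypothetical, and it assumes the file is small enough to hold in memory in one byte array.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileBytesSketch {
    /**
     * Reads the whole file named by doc into a byte array.
     * Hypothetical helper; PDFParser's own file handling may differ.
     */
    public static byte[] readAll(String doc) throws IOException {
        Path path = Paths.get(doc);
        return Files.readAllBytes(path);
    }

    public static void main(String[] args) throws IOException {
        // Write a few bytes to a temporary file, then read them back.
        Path tmp = Files.createTempFile("sketch", ".pdf");
        try {
            Files.write(tmp, new byte[] {'%', 'P', 'D', 'F'});
            byte[] bytes = readAll(tmp.toString());
            System.out.println(bytes.length + " bytes read");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

A byte array obtained this way is the kind of value that resetState(byte[] doc) accepts.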
    • getURIs

      public ArrayList<String> getURIs()
      Get a list of URIs retrieved from the PDF during the extractURIs operation.
      Returns:
      A list of URIs retrieved from the PDF during the extractURIs operation.
    • initialize

      protected void initialize() throws IOException
      Opens the document for reading. The constructor does this implicitly, so this method should only need to be called directly after a reset.
      Throws:
      IOException
    • extractURIs

      public ArrayList<String> extractURIs() throws IOException
      Extract URIs from all objects found in a PDF document's catalog. Returns an array list representing all URIs found in the document catalog tree.
      Returns:
      URIs from all objects found in a PDF document's catalog.
      Throws:
      IOException
    • close

      public void close() throws IOException
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Throws:
      IOException
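
Because the class implements Closeable, a parser can be managed with try-with-resources so that close() runs even if extraction throws. The sketch below shows that pattern with a hypothetical stand-in, StubParser, which mirrors only the extractURIs(), getURIs() and close() signatures documented on this page. It does not parse anything, its returned URI is a placeholder, and the real class's constructors are not shown here and so are not used.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;

public class CloseableParserSketch {
    /** Hypothetical stand-in with the same public method signatures as PDFParser. */
    public static class StubParser implements Closeable {
        private final ArrayList<String> foundURIs = new ArrayList<>();
        private boolean closed = false;

        public ArrayList<String> extractURIs() throws IOException {
            // Placeholder result; a real parser would walk the document catalog.
            foundURIs.add("http://example.com/");
            return foundURIs;
        }

        public ArrayList<String> getURIs() {
            return foundURIs;
        }

        public boolean isClosed() {
            return closed;
        }

        @Override
        public void close() throws IOException {
            closed = true;
        }
    }

    /** Extracts URIs and closes the parser, even if extraction fails. */
    public static ArrayList<String> extractAndClose(StubParser parser) throws IOException {
        try (StubParser p = parser) {
            // Copy the list so the caller does not hold the parser's internal state.
            return new ArrayList<>(p.extractURIs());
        }
    }

    public static void main(String[] args) throws IOException {
        StubParser parser = new StubParser();
        ArrayList<String> uris = extractAndClose(parser);
        System.out.println(uris + " closed=" + parser.isClosed());
    }
}
```

Copying the result before the parser is closed is a design choice of this sketch; whether the list returned by the real getURIs() stays usable after close() is not stated on this page.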
    • main

      public static void main(String[] argv)