Package org.archive.modules.extractor
Class PDFParser
java.lang.Object
org.archive.modules.extractor.PDFParser
- All Implemented Interfaces:
Closeable, AutoCloseable
Supports PDF parsing operations. For now this primarily means
extracting URIs, but the logic in extractURIs() could easily be adapted or extended
for a variety of PDF processing tasks.
- Author:
- Parker Thompson
-
Field Summary
Fields
Modifier and Type, Field
protected byte[] document
protected org.apache.pdfbox.pdmodel.PDDocument documentReader
foundURIs
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
void close()
extractURIs(): Extract URIs from all objects found in a PDF document's catalog.
protected void getInFromFile(String doc): Read a file named 'doc' and store its bytes for later processing.
getURIs(): Get a list of URIs retrieved from the PDF during the extractURIs operation.
protected void initialize(): Opens the document for reading.
static void main
protected void resetState(): Reinitialize the object as though a new one were created.
void resetState(byte[] doc): Reset the object and initialize it with a new byte array (the document).
void resetState(String doc): Reinitialize the object as though a new one were created, complete with a valid pointer to a document that can be read.
-
Field Details
-
foundURIs
-
documentReader
protected org.apache.pdfbox.pdmodel.PDDocument documentReader
-
document
protected byte[] document
-
-
Constructor Details
-
PDFParser
- Throws:
IOException
-
PDFParser
- Throws:
IOException
-
-
Method Details
-
resetState
protected void resetState()
Reinitialize the object as though a new one were created.
-
resetState
Reset the object and initialize it with a new byte array (the document).
- Parameters:
doc
- Throws:
IOException
-
resetState
Reinitialize the object as though a new one were created, complete with a valid pointer to a document that can be read.
- Parameters:
doc
- Throws:
IOException
-
getInFromFile
Read a file named 'doc' and store its bytes for later processing.
- Parameters:
doc
- Throws:
IOException
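As an illustration of the "read a file and keep its bytes" step, here is a minimal sketch using only the Java standard library. The class `FileBytesSketch` and its `getDocument()` accessor are hypothetical stand-ins, not the Heritrix source; the actual method may read the file differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in; not the Heritrix PDFParser implementation.
class FileBytesSketch {
    // Raw document bytes, playing the role of the 'document' field.
    private byte[] document;

    // Reads the whole file named by 'doc' into memory for later processing.
    void getInFromFile(String doc) throws IOException {
        document = Files.readAllBytes(Paths.get(doc));
    }

    // Accessor added for the sketch only (an assumption).
    byte[] getDocument() {
        return document;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sketch", ".pdf");
        try {
            Files.write(tmp, "%PDF-1.4".getBytes(StandardCharsets.US_ASCII));
            FileBytesSketch sketch = new FileBytesSketch();
            sketch.getInFromFile(tmp.toString());
            System.out.println(sketch.getDocument().length + " bytes read");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Reading the whole file at once keeps the sketch simple; for very large documents a streaming approach may be preferable.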
-
getURIs
Get a list of URIs retrieved from the PDF during the extractURIs operation.
- Returns:
- A list of URIs retrieved from the PDF during the extractURIs operation.
-
initialize
Opens the document for reading. This is done implicitly by the constructor, so it should only need to be called directly after a reset.
- Throws:
IOException
-
extractURIs
Extract URIs from all objects found in a PDF document's catalog. Returns an array list representing all URIs found in the document catalog tree.
- Returns:
- URIs from all objects found in a PDF document's catalog.
- Throws:
IOException
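To give a rough feel for what URI extraction from a PDF involves, the sketch below swaps in a much simpler technique: a naive text scan of the raw bytes for `/URI (...)` entries, which is how link targets appear in uncompressed action dictionaries. It is not how this class works (the description above says it walks the document catalog through PDFBox), and it will miss URIs inside compressed streams or strings with escaped or nested parentheses. The names `UriScanSketch` and `scanForUris` are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a simplified stand-in for catalog traversal.
class UriScanSketch {
    // Matches "/URI (target)" where the target is a literal string
    // without nested parentheses or backslash escapes.
    private static final Pattern URI_ENTRY =
            Pattern.compile("/URI\\s*\\(([^()\\\\]*)\\)");

    static List<String> scanForUris(byte[] document) {
        // ISO-8859-1 maps every byte to exactly one char, so binary
        // content cannot cause a decoding failure.
        String text = new String(document, StandardCharsets.ISO_8859_1);
        List<String> found = new ArrayList<>();
        Matcher m = URI_ENTRY.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        byte[] sample = "<< /Type /Action /S /URI /URI (http://example.com/a) >>"
                .getBytes(StandardCharsets.ISO_8859_1);
        for (String uri : scanForUris(sample)) {
            System.out.println(uri);
        }
    }
}
```

A full parser is still needed for real-world documents, since most PDFs store objects in compressed streams that a byte scan cannot see.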
-
close
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Throws:
IOException
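Because the class implements Closeable and AutoCloseable, it can be managed with a try-with-resources statement. The sketch below shows that pattern with a hypothetical `StubParser` stand-in, since the constructor signatures are not shown on this page; what close() actually releases in the real class is not documented here.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class CloseSketch {
    // Hypothetical stand-in for a parser that holds an open document.
    static class StubParser implements Closeable {
        private boolean open = true;
        private final List<String> foundURIs = new ArrayList<>();

        List<String> getURIs() {
            if (!open) {
                throw new IllegalStateException("parser is closed");
            }
            return foundURIs;
        }

        boolean isOpen() {
            return open;
        }

        @Override
        public void close() throws IOException {
            // A real parser would release its underlying document here.
            open = false;
        }
    }

    // Uses the parser inside try-with-resources and reports whether
    // close() had been called by the time the block exited.
    static boolean closedAfterUse() throws IOException {
        StubParser parser = new StubParser();
        try (StubParser p = parser) {
            p.getURIs();
        }
        return !parser.isOpen();
    }
}
```

The try-with-resources form guarantees close() runs even if an exception is thrown while the parser is in use.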
-
main
-