Runs OCR on each page of the input that has fewer than 100 characters of text, and outputs a valid, searchable PDF.
This method can throw some exceptions that are expected in normal use:
* PdfEncryptedException: the input PDF needs a password.
* PdfInvalidException: the input PDF contains unrecoverable errors.
* TesseractMissingException: Tesseract cannot be run.
* TesseractLanguageMissingException: Tesseract needs a language file.
It may also throw exceptions you should probably never see:
* FileNotFoundException: the input file or output directory is missing.
* SecurityException: you cannot read the input or write the output.
* TesseractFailedException: Tesseract did not run properly.
* OutOfMemoryException: PDFBox has an evil bug.
If this method returns a failure, or if progress() returns false, out will not be written.
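As an illustration of the two groups of exceptions above, here is a minimal sketch of how a caller might turn a failed result into a message. It assumes the result is delivered as a Java CompletableFuture; the nested exception classes are hypothetical stand-ins that only borrow the names listed above, and OcrFailureDemo and describe are invented for this example, not part of the library.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class OcrFailureDemo {
    // Hypothetical stand-ins: the real exception types live in the library,
    // and their packages and superclasses are not shown in this documentation.
    public static class PdfEncryptedException extends RuntimeException {}
    public static class PdfInvalidException extends RuntimeException {}
    public static class TesseractMissingException extends RuntimeException {}
    public static class TesseractLanguageMissingException extends RuntimeException {}

    /** Maps an expected failure to a user-facing message; anything else is reported as unexpected. */
    public static String describe(Throwable t) {
        // Dependent CompletableFuture stages wrap failures in CompletionException, so unwrap first.
        Throwable cause = (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
        if (cause instanceof PdfEncryptedException) return "The PDF needs a password.";
        if (cause instanceof PdfInvalidException) return "The PDF contains unrecoverable errors.";
        if (cause instanceof TesseractMissingException) return "Tesseract cannot be run.";
        if (cause instanceof TesseractLanguageMissingException) return "Tesseract needs a language file.";
        return "Unexpected error: " + cause.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        // Simulate a failed OCR call; a real caller would get this future from the library.
        CompletableFuture<Void> result = CompletableFuture.failedFuture(new PdfEncryptedException());
        String message = result.handle((ok, err) -> err == null ? "Done." : describe(err)).join();
        System.out.println(message);
    }
}
```

Handling the expected group explicitly and letting everything else fall through to a generic branch mirrors the split made in the lists above.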
Path to input, which must be a valid PDF file.
Path to output, which will be overwritten or deleted.
Languages to use for OCR.
Method to call with (nPagesCompleted, nPagesTotal) after every page. The first call will be (0, nPagesTotal) and the last call will be (nPagesTotal, nPagesTotal). If the method ever returns false, the future will resolve and out will not be written. (This is how callers can cancel a lengthy OCR process.)
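The callback contract described above can be sketched as a small simulation. ProgressDemo and runPages are hypothetical names for this example, and modelling the callback as a BiPredicate is an assumption; the loop only mimics the documented call sequence and does no OCR.

```java
import java.util.function.BiPredicate;

public class ProgressDemo {
    /**
     * Hypothetical stand-in for the OCR loop. Calls progress with
     * (0, nPagesTotal) first, then once after each page. Returns true if the
     * output would be written, false if the callback cancelled the run.
     */
    public static boolean runPages(int nPagesTotal, BiPredicate<Integer, Integer> progress) {
        if (!progress.test(0, nPagesTotal)) return false;
        for (int done = 1; done <= nPagesTotal; done++) {
            // Real page processing would happen here.
            if (!progress.test(done, nPagesTotal)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A callback that reports progress and cancels once two pages are complete.
        boolean written = runPages(5, (done, total) -> {
            System.out.println(done + "/" + total);
            return done < 2;
        });
        System.out.println("output written: " + written);
    }
}
```

Returning false from the callback is the only cancellation signal in this sketch, which matches the description above: no output is written once the callback has declined to continue.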
Utility methods for dealing with PDFs.