PDFText2HTML (Apache PDFBox 1.1.0 API)

org.apache.pdfbox.util
Class PDFText2HTML

java.lang.Object
  org.apache.pdfbox.util.PDFStreamEngine
      org.apache.pdfbox.util.PDFTextStripper
          org.apache.pdfbox.util.PDFText2HTML

public class PDFText2HTML
extends PDFTextStripper

Wrap stripped text in simple HTML, trying to form HTML paragraphs. Paragraphs broken by pages, columns, or figures are not mended.

Version: $Revision: 1.3 $
Author: jjb - http://www.johnjbarton.com

Field Summary

Fields inherited from class org.apache.pdfbox.util.PDFTextStripper
`charactersByArticle, document, lineSeparator, output, outputEncoding`

Constructor Summary
`PDFText2HTML(String encoding)` Creates a PDFText2HTML that uses the given encoding.

Method Summary
`protected void`	`endArticle()` Write out the article separator.
`void`	`endDocument(PDDocument pdf)` This method is available for subclasses of this class.
`protected String`	`getTitle()` This method will attempt to guess the title of the document using either the document properties or the first lines of text.
`protected void`	`startArticle(boolean isltr)` Write out the article separator (div tag) with proper text direction information.
`protected void`	`writeHeader()` Write the header to the output document.
`protected void`	`writePage()` This will print the text of the processed page to "output".
`protected void`	`writeString(String chars)` Write a string to the output stream and escape some HTML characters.

Methods inherited from class org.apache.pdfbox.util.PDFTextStripper
`endPage, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getEndBookmark, getEndPage, getLineSeparator, getOutput, getPageSeparator, getSpacingTolerance, getStartBookmark, getStartPage, getText, getText, getWordSeparator, processPage, processPages, processTextPosition, setAverageCharTolerance, setEndBookmark, setEndPage, setLineSeparator, setPageSeparator, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, shouldSeparateByBeads, shouldSortByPosition, shouldSuppressDuplicateOverlappingText, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageSeperator, writeText, writeText, writeWordSeparator`

Methods inherited from class org.apache.pdfbox.util.PDFStreamEngine
`getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, processEncodedText, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, resetEngine, setColorSpaces, setFonts, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

PDFText2HTML

public PDFText2HTML(String encoding)
             throws IOException

Creates a PDFText2HTML that uses the given encoding.

Parameters: encoding - The encoding to be used.
Throws: IOException - If there is an error during initialization.

Method Detail

writeHeader

protected void writeHeader()
                    throws IOException

Write the header to the output document. The header also includes the tag that declares the character encoding.

Throws: IOException - If there is a problem writing out the header to the document.
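As a rough illustration of what such a header might contain, the sketch below builds an HTML prologue with a title and a character encoding declaration. The class name `HtmlHeaderSketch`, the method `header`, and the exact markup are assumptions made for illustration; they are not taken from the PDFBox source, and the markup that `writeHeader()` actually emits may differ.

```java
// Hypothetical sketch; not the PDFBox implementation of writeHeader().
final class HtmlHeaderSketch {
    private HtmlHeaderSketch() {}

    /**
     * Builds an HTML prologue. The title is assumed to be HTML-escaped
     * already; the encoding declaration is skipped when encoding is null.
     */
    static String header(String title, String encoding) {
        StringBuilder buf = new StringBuilder();
        buf.append("<html><head>");
        buf.append("<title>").append(title).append("</title>\n");
        if (encoding != null) {
            // Assumed form of the tag that declares the character encoding.
            buf.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=")
               .append(encoding)
               .append("\">\n");
        }
        buf.append("</head>\n<body>\n");
        return buf.toString();
    }
}
```

Declaring the encoding inside the document lets a browser decode the output correctly even when the file is opened without HTTP headers.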

writePage

protected void writePage()
                  throws IOException

This will print the text of the processed page to "output". It will estimate, based on the coordinates of the text, where newlines and word spacings should be placed. The text will be sorted only if that feature was enabled.

Overrides: writePage in class PDFTextStripper

Throws: IOException - If there is an error writing the text.

endDocument

public void endDocument(PDDocument pdf)
                 throws IOException

This method is available for subclasses of this class. It will be called after processing of the document finishes.

Overrides: endDocument in class PDFTextStripper

Parameters: pdf - The PDF document that is being processed.
Throws: IOException - If an IO error occurs.

getTitle

protected String getTitle()

This method will attempt to guess the title of the document using either the document properties or the first lines of text.

Returns: the guessed title.
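The fallback order described above (document properties first, then the first lines of text) can be sketched as follows. This is a simplified illustration, not the heuristic PDFBox uses: `TitleGuessSketch` and `guessTitle` are hypothetical names, and the choice to take the first non-blank line of text is an assumption.

```java
// Hypothetical sketch; the real getTitle() heuristic may differ.
final class TitleGuessSketch {
    private TitleGuessSketch() {}

    /**
     * Prefers the title from the document properties; otherwise falls back
     * to the first non-blank line of the extracted text. Returns an empty
     * string when neither source yields a title.
     */
    static String guessTitle(String metadataTitle, String extractedText) {
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        if (extractedText == null) {
            return "";
        }
        for (String line : extractedText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                return trimmed;
            }
        }
        return "";
    }
}
```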

startArticle

protected void startArticle(boolean isltr)
                     throws IOException

Write out the article separator (div tag) with proper text direction information.

Overrides: startArticle in class PDFTextStripper

Parameters: isltr - true if direction of text is left to right
Throws: IOException - If there is an error writing to the stream.

endArticle

protected void endArticle()
                   throws IOException

Write out the article separator.

Overrides: endArticle in class PDFTextStripper

Throws: IOException - If there is an error writing to the stream.
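To make the pairing of `startArticle` and `endArticle` concrete, the sketch below wraps an article body in a div element whose text direction depends on a left-to-right flag. The names `ArticleDivSketch`, `openingTag`, `closingTag`, and `wrap` are hypothetical, and the exact attribute text is an assumption rather than a quote from the PDFBox source.

```java
// Hypothetical sketch of an article separator pair; not the PDFBox code.
final class ArticleDivSketch {
    private ArticleDivSketch() {}

    /** Opening separator: a plain div for left-to-right text, dir="rtl" otherwise. */
    static String openingTag(boolean isLtr) {
        return isLtr ? "<div>" : "<div dir=\"rtl\">";
    }

    /** Closing separator that matches openingTag. */
    static String closingTag() {
        return "</div>";
    }

    /** Wraps an already escaped article body in the separator pair. */
    static String wrap(String escapedBody, boolean isLtr) {
        return openingTag(isLtr) + escapedBody + closingTag();
    }
}
```

Left-to-right is the HTML default, so the sketch only emits a `dir` attribute for right-to-left articles.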

writeString

protected void writeString(String chars)
                    throws IOException

Write a string to the output stream and escape some HTML characters.

Overrides: writeString in class PDFTextStripper

Parameters: chars - String to be written to the stream
Throws: IOException - If there is an error writing to the stream.
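The documentation does not say which characters are escaped. As a minimal sketch of the general technique, the helper below replaces the four characters most commonly escaped in HTML text and attribute values. The class `HtmlEscapeSketch`, the method `escape`, and the chosen character set are assumptions for illustration; `writeString` may escape a different set.

```java
// Hypothetical sketch of HTML escaping; not the PDFBox implementation.
final class HtmlEscapeSketch {
    private HtmlEscapeSketch() {}

    /** Replaces &, <, > and " with their HTML entities; other characters pass through. */
    static String escape(String chars) {
        StringBuilder out = new StringBuilder(chars.length() + 16);
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

Each character is handled once in a single pass, so the ampersands introduced by the entities themselves are never escaped a second time.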