PDFMarkedContent2XHTML (The Adobe Experience Manager SDK 2020.6.3717.20200611T200904Z-200604)

java.lang.Object
- org.apache.pdfbox.contentstream.PDFStreamEngine
  - org.apache.pdfbox.text.PDFTextStripper
    - org.apache.tika.parser.pdf.PDFMarkedContent2XHTML

```
public class PDFMarkedContent2XHTML
extends org.apache.pdfbox.text.PDFTextStripper
```
Added in Tika 1.24 as an alpha version of a text extractor. It builds the text from the marked content tree and includes or normalizes some of the structural tags.

Since:

1.24
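
To make "normalizes some of the structural tags" concrete, here is a minimal sketch of what such a normalization step could look like. It is not Tika's implementation: the class `StructureTagSketch`, the method `normalize`, the mapping table, and the `div` fallback are all assumptions chosen for illustration. Only the structure type names (`P`, `H1`, `L`, `LI`, `Table`, `TR`, `TH`, `TD`) come from the standard structure types defined for tagged PDF.

```java
import java.util.Map;

/** Hypothetical illustration only; the mapping below is assumed, not taken from Tika. */
public class StructureTagSketch {

    // Assumed mapping from PDF standard structure types to XHTML element names.
    private static final Map<String, String> TAGS = Map.of(
            "P", "p",
            "H1", "h1",
            "L", "ul",      // assumption: lists are treated as unordered
            "LI", "li",
            "Table", "table",
            "TR", "tr",
            "TH", "th",
            "TD", "td");

    /** Returns the XHTML element name for a structure type, or "div" if it is unmapped. */
    public static String normalize(String structureType) {
        return TAGS.getOrDefault(structureType, "div");
    }

    public static void main(String[] args) {
        System.out.println(normalize("H1"));
        System.out.println(normalize("Figure"));
    }
}
```

A lookup table with a generic fallback keeps unknown or custom structure types from breaking the output, which is one plausible reason an extractor would normalize rather than pass tags through unchanged.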

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`XMP_DOCUMENT_CATALOG_LOCATION`
`static String`	`XMP_PAGE_LOCATION_PREFIX`

Method Summary

Modifier and Type	Method and Description
`int`	`getCurrentPageNo()` Overridden because this class also overrides `processPages(PDPageTree)`.
`int`	`getStartPage()`
`static void`	`process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config)` Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
`void`	`processPage(org.apache.pdfbox.pdmodel.PDPage page)`
`void`	`setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)`
`void`	`setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)`
`void`	`setStartPage(int startPage)`

Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, writeText

Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, processOperator, registerOperatorProcessor, restoreGraphicsState, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showForm, showTextString, showTextStrings, showTransparencyGroup, transformedPoint

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - XMP_DOCUMENT_CATALOG_LOCATION
```
public static final String XMP_DOCUMENT_CATALOG_LOCATION
```
    See Also:
    
    Constant Field Values
  - XMP_PAGE_LOCATION_PREFIX
```
public static final String XMP_PAGE_LOCATION_PREFIX
```
    See Also:
    
    Constant Field Values
- Method Detail
  - process
```
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument,
                           ContentHandler handler,
                           ParseContext context,
                           Metadata metadata,
                           PDFParserConfig config)
                    throws SAXException,
                           TikaException
```
    Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
    
    Parameters:
    
    pdDocument - PDF document
    
    handler - SAX content handler
    
    context - parse context
    
    metadata - PDF metadata
    
    config - PDF parser configuration
    
    Throws:
    
    SAXException - if the content handler fails to process SAX events
    
    TikaException - if there was an exception outside of per page processing
  - processPage
```
public void processPage(org.apache.pdfbox.pdmodel.PDPage page)
                 throws IOException
```
    Overrides:
    
    processPage in class org.apache.pdfbox.text.PDFTextStripper
    
    Throws:
    
    IOException
  - getCurrentPageNo
```
public int getCurrentPageNo()
```
    Overridden because this class also overrides processPages(PDPageTree).
    
    Overrides:
    
    getCurrentPageNo in class org.apache.pdfbox.text.PDFTextStripper
    
    Returns:
    
    the current page number
  - setStartBookmark
```
public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
```
    Overrides:
    
    setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper
  - setEndBookmark
```
public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
```
    Overrides:
    
    setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper
  - setStartPage
```
public void setStartPage(int startPage)
```
    Overrides:
    
    setStartPage in class org.apache.pdfbox.text.PDFTextStripper
  - getStartPage
```
public int getStartPage()
```
    Overrides:
    
    getStartPage in class org.apache.pdfbox.text.PDFTextStripper
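
Because `process` reports its output as XHTML SAX events, the caller decides what to do with them by supplying a content handler. The sketch below shows one possible handler built only on the JDK's `org.xml.sax` classes. It does not call Tika or PDFBox: the class `ParagraphCollector` and its `demo` method are hypothetical, and the events fed to the handler are hand-written stand-ins for whatever `process` would actually emit for a real document.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects character data and ends each closed p element with a newline. */
public class ParagraphCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName)) {
            text.append('\n');
        }
    }

    public String text() {
        return text.toString();
    }

    /** Feeds a hand-written event sequence to the handler; a real run would receive events from process(...). */
    public static String demo() {
        ParagraphCollector handler = new ParagraphCollector();
        String xhtml = "http://www.w3.org/1999/xhtml";
        Attributes none = new AttributesImpl();
        try {
            handler.startDocument();
            for (String para : new String[] {"First paragraph", "Second paragraph"}) {
                handler.startElement(xhtml, "p", "p", none);
                handler.characters(para.toCharArray(), 0, para.length());
                handler.endElement(xhtml, "p", "p");
            }
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

In real use, an instance of such a handler would be passed as the `handler` argument of `process`, together with the `PDDocument`, `ParseContext`, `Metadata`, and `PDFParserConfig` listed in the signature above.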
