Class PDFMarkedContent2XHTML


  • public class PDFMarkedContent2XHTML
    extends org.apache.pdfbox.text.PDFTextStripper

    Added in Tika 1.24 as an alpha version of a text extractor. It builds the text from the marked text tree and includes (and normalizes) some of the structural tags.

    Since:
    1.24
    • Method Summary

      Modifier and Type Method Description
      int getCurrentPageNo()
      Overridden because this class also overrides processPages(PDPageTree).
      int getStartPage()  
      static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config)
      Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
      void processPage(org.apache.pdfbox.pdmodel.PDPage page)  
      void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)  
      void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)  
      void setStartPage(int startPage)  
      • Methods inherited from class org.apache.pdfbox.text.PDFTextStripper

        getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, writeText
      • Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine

        addOperator, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, processOperator, registerOperatorProcessor, restoreGraphicsState, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showForm, showTextString, showTextStrings, showTransparencyGroup, transformedPoint
    • Method Detail

      • process

        public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument,
                                   ContentHandler handler,
                                   ParseContext context,
                                   Metadata metadata,
                                   PDFParserConfig config)
                            throws SAXException,
                                   TikaException
        Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
        Parameters:
        pdDocument - PDF document
        handler - SAX content handler
        context - parse context
        metadata - PDF metadata
        config - PDF parser configuration
        Throws:
        SAXException - if the content handler fails to process SAX events
        TikaException - if there was an exception outside of per page processing
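        Because process reports its output as SAX events rather than returning a string, the caller supplies the ContentHandler that decides what to keep. The following is a minimal sketch of such a handler, using only the JDK's org.xml.sax classes. The class name StructureRecordingHandler and the element names used in main are hypothetical; which elements the extractor actually emits for a given PDF is not specified here.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler (not part of Tika): records the local names of the
 * elements it sees and accumulates all character data.
 */
public class StructureRecordingHandler extends DefaultHandler {

    private final List<String> elements = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Keep the structural tag so the caller can inspect it later.
        elements.add(localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in several chunks; append each one.
        text.append(ch, start, length);
    }

    public List<String> getElements() {
        return elements;
    }

    public String getText() {
        return text.toString();
    }

    public static void main(String[] args) {
        // Simulated events standing in for what an extractor might send.
        // The element names below are assumptions chosen for illustration.
        String xhtml = "http://www.w3.org/1999/xhtml";
        StructureRecordingHandler handler = new StructureRecordingHandler();

        handler.startElement(xhtml, "h1", "h1", new AttributesImpl());
        char[] title = "Title".toCharArray();
        handler.characters(title, 0, title.length);

        handler.startElement(xhtml, "p", "p", new AttributesImpl());
        char[] body = " Body text".toCharArray();
        handler.characters(body, 0, body.length);

        System.out.println(handler.getElements());
        System.out.println(handler.getText());
    }
}
```

        An instance of a handler like this would be passed as the handler argument of process, alongside the PDDocument, ParseContext, Metadata and PDFParserConfig shown in the signature above. The handler only records start tags and text; a fuller implementation would also track endElement to recover nesting.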
      • processPage

        public void processPage(org.apache.pdfbox.pdmodel.PDPage page)
                         throws IOException
        Overrides:
        processPage in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        IOException
      • getCurrentPageNo

        public int getCurrentPageNo()
        Overridden because this class also overrides processPages(PDPageTree).
        Returns:
        the current page number
      • setStartBookmark

        public void setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
        Overrides:
        setStartBookmark in class org.apache.pdfbox.text.PDFTextStripper
      • setEndBookmark

        public void setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem)
        Overrides:
        setEndBookmark in class org.apache.pdfbox.text.PDFTextStripper
      • setStartPage

        public void setStartPage(int startPage)
        Overrides:
        setStartPage in class org.apache.pdfbox.text.PDFTextStripper
      • getStartPage

        public int getStartPage()
        Overrides:
        getStartPage in class org.apache.pdfbox.text.PDFTextStripper