Class TextExtractor


  • public class TextExtractor
    extends Object
    Filters the text extracted from HTML documents; used by JSoupParserBolt. It is configured with optional inclusion patterns, expressed as JSoup selectors, and a list of tags to exclude.

    Replaces ContentFilter.

    The first matching inclusion pattern is used. The whole document is used if no patterns are configured or if none of them matches.

    The TextExtractor can be configured as follows:

    
     textextractor.include.pattern:
      - DIV[id="maincontent"]
      - DIV[itemprop="articleBody"]
      - ARTICLE
    
     textextractor.exclude.tags:
      - STYLE
      - SCRIPT
    
     
    Since:
    1.13
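The selection rule described above (first matching inclusion pattern, otherwise the whole document) can be sketched as follows. This is a minimal illustration and not the actual StormCrawler code: the class `InclusionFallbackSketch`, the method `selectRoot` and the `selectFirst` lookup function are hypothetical names, and a plain `Map` stands in for a parsed page so that the example runs without jsoup.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper, not part of StormCrawler: illustrates the
// "first matching inclusion pattern, else whole document" rule.
class InclusionFallbackSketch {

    /**
     * Tries each pattern in its configured order and returns the first match.
     * Falls back to the whole document when the list is empty or nothing matches.
     */
    static <T> T selectRoot(List<String> patterns,
                            Function<String, Optional<T>> selectFirst,
                            T wholeDocument) {
        for (String pattern : patterns) {
            Optional<T> match = selectFirst.apply(pattern);
            if (match.isPresent()) {
                return match.get();
            }
        }
        return wholeDocument;
    }

    public static void main(String[] args) {
        // Stand-in for a parsed page: selector -> text of the first matching element.
        Map<String, String> page = Map.of("ARTICLE", "article text");
        List<String> patterns = List.of(
                "DIV[id=\"maincontent\"]",
                "DIV[itemprop=\"articleBody\"]",
                "ARTICLE");
        String root = selectRoot(patterns,
                p -> Optional.ofNullable(page.get(p)),
                "whole document");
        System.out.println(root);
    }
}
```

With jsoup on the classpath, the lookup function could plausibly be `p -> Optional.ofNullable(element.selectFirst(p))`, since `Element.selectFirst` returns `null` when nothing matches. Whether TextExtractor does exactly this internally is an assumption.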
    • Constructor Detail

    • Method Detail

      • text

        public String text(org.jsoup.nodes.Element element)
      • traverse

        public static void traverse(org.jsoup.select.NodeVisitor visitor,
                                    org.jsoup.nodes.Node root,
                                    int maxSize,
                                    StringBuilder builder)
        Starts a depth-first traversal of the root node and all of its descendants.
        Parameters:
        visitor - Node visitor.
        root - the root node at which the traversal starts.
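A depth-first traversal with head and tail callbacks and a size cap can be sketched as below. This is only an illustration under stated assumptions: `Node` and `Visitor` are hypothetical stand-ins for `org.jsoup.nodes.Node` and `org.jsoup.select.NodeVisitor`, the sketch uses recursion for brevity, and it assumes that `maxSize` caps the number of characters accumulated in `builder` (with values of 0 or less meaning no limit), which the documentation above does not state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the jsoup types; not the real TextExtractor code.
class TraverseSketch {

    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text) { this.text = text; }

        Node add(Node child) { children.add(child); return this; }
    }

    interface Visitor {
        void head(Node node, int depth); // called when a node is first reached
        void tail(Node node, int depth); // called after all of its children
    }

    /**
     * Depth-first walk. Returns false as soon as the builder has reached maxSize
     * (assumed meaning: a cap on accumulated characters, ignored when <= 0).
     */
    static boolean traverse(Visitor visitor, Node node, int depth,
                            int maxSize, StringBuilder builder) {
        if (maxSize > 0 && builder.length() >= maxSize) {
            return false;
        }
        visitor.head(node, depth);
        for (Node child : node.children) {
            if (!traverse(visitor, child, depth + 1, maxSize, builder)) {
                return false;
            }
        }
        visitor.tail(node, depth);
        return true;
    }

    /** Collects node text in document order, stopping once maxSize is reached. */
    static String collect(Node root, int maxSize) {
        StringBuilder sb = new StringBuilder();
        Visitor v = new Visitor() {
            public void head(Node n, int d) { sb.append(n.text); }
            public void tail(Node n, int d) { /* nothing to do on exit */ }
        };
        traverse(v, root, 0, maxSize, sb);
        return sb.toString();
    }
}
```

Passing the builder alongside the visitor lets the walk check the accumulated length before visiting each node, so a very large document can be abandoned early instead of being traversed in full.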