Class TextExtractor
- java.lang.Object
-
- com.digitalpebble.stormcrawler.parse.TextExtractor
-
public class TextExtractor extends Object
Filters the text extracted from HTML documents; used by JSoupParserBolt. It is configured with optional inclusion patterns based on JSoup selectors, as well as a list of tags to be excluded. Replaces ContentFilter.
The first matching inclusion pattern is used; the whole document is used if no patterns are configured or if none of them matches.
The TextExtractor can be configured as follows:

  textextractor.include.pattern:
   - DIV[id="maincontent"]
   - DIV[itemprop="articleBody"]
   - ARTICLE

  textextractor.exclude.tags:
   - STYLE
   - SCRIPT
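The selection rule described above (first matching inclusion pattern wins, otherwise fall back to the whole document) can be sketched in plain Java. This is only an illustration of the rule, not the class's actual code: `FirstMatchSketch`, `select`, and the `matches` map are hypothetical names, and the map stands in for evaluating each JSoup selector against a page.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FirstMatchSketch {

    /**
     * Returns the text selected by the first pattern that matches, or the
     * whole document when the pattern list is empty or nothing matches.
     * ASSUMPTION: 'matches' is a stand-in for running each selector against
     * the page; it maps a pattern to the text that pattern selected.
     */
    static String select(List<String> patterns,
                         Map<String, String> matches,
                         String wholeDocument) {
        for (String pattern : patterns) {
            String selected = matches.get(pattern);
            if (selected != null) {
                return selected; // first match wins; later patterns are ignored
            }
        }
        return wholeDocument; // no patterns configured, or none matched
    }

    public static void main(String[] args) {
        // Same order as the configuration example above.
        List<String> patterns = Arrays.asList(
                "DIV[id=\"maincontent\"]",
                "DIV[itemprop=\"articleBody\"]",
                "ARTICLE");

        // Pretend the page has an articleBody DIV and an ARTICLE, but no maincontent DIV.
        Map<String, String> matches = new HashMap<>();
        matches.put("DIV[itemprop=\"articleBody\"]", "body text");
        matches.put("ARTICLE", "article text");

        System.out.println(select(patterns, matches, "whole page"));
        System.out.println(select(patterns, Collections.<String, String>emptyMap(), "whole page"));
    }
}
```

Because the list is ordered, placing the most specific selector first keeps a broad one such as ARTICLE from taking precedence over it.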
- Since:
- 1.13
-
Field Summary
Fields
Modifier and Type    Field
static String        EXCLUDE_PARAM_NAME
static String        INCLUDE_PARAM_NAME
static String        NO_TEXT_PARAM_NAME
static String        TEXT_MAX_TEXT_PARAM_NAME
-
Constructor Summary
Constructors
TextExtractor(Map<String,Object> stormConf)
-
Method Summary
Methods
String text(org.jsoup.nodes.Element element)
static void traverse(org.jsoup.select.NodeVisitor visitor, org.jsoup.nodes.Node root, int maxSize, StringBuilder builder)
    Start a depth-first traversal of the root and all of its descendants.
-
Field Detail
-
INCLUDE_PARAM_NAME
public static final String INCLUDE_PARAM_NAME
- See Also:
- Constant Field Values
-
EXCLUDE_PARAM_NAME
public static final String EXCLUDE_PARAM_NAME
- See Also:
- Constant Field Values
-
NO_TEXT_PARAM_NAME
public static final String NO_TEXT_PARAM_NAME
- See Also:
- Constant Field Values
-
TEXT_MAX_TEXT_PARAM_NAME
public static final String TEXT_MAX_TEXT_PARAM_NAME
- See Also:
- Constant Field Values
-
Method Detail
-
text
public String text(org.jsoup.nodes.Element element)
-
traverse
public static void traverse(org.jsoup.select.NodeVisitor visitor, org.jsoup.nodes.Node root, int maxSize, StringBuilder builder)
Start a depth-first traversal of the root and all of its descendants.
- Parameters:
- visitor - Node visitor.
- root - the root node from which to start the traversal.
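To make the traversal concrete, here is a minimal, self-contained sketch of a depth-first walk with a size cap. It is not the library's implementation: `TraverseSketch`, its `Node` and `Visitor` types, and `collect` are hypothetical stand-ins for the jsoup types, and the handling of `maxSize` (stop visiting once the builder holds at least `maxSize` characters, with a non-positive value meaning no limit) is an assumption, since the documentation above does not describe that parameter.

```java
import java.util.ArrayList;
import java.util.List;

class TraverseSketch {

    /** Hypothetical minimal tree node; the real method works on org.jsoup.nodes.Node. */
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text) { this.text = text; }

        Node add(Node child) { children.add(child); return this; }
    }

    /** Hypothetical callback pair, modelled on the head/tail shape of jsoup's NodeVisitor. */
    interface Visitor {
        void head(Node node, int depth); // called when a node is first reached
        void tail(Node node, int depth); // called after all of its descendants
    }

    /**
     * Pre-order depth-first walk. ASSUMPTION: once the builder already holds
     * maxSize characters or more, no further nodes are visited; a maxSize of
     * zero or less disables the limit. The output is not truncated, so it may
     * exceed maxSize by the length of the last node's text.
     */
    static void traverse(Visitor visitor, Node node, int depth,
                         int maxSize, StringBuilder builder) {
        if (maxSize > 0 && builder.length() >= maxSize) {
            return; // enough text has been produced
        }
        visitor.head(node, depth);
        for (Node child : node.children) {
            traverse(visitor, child, depth + 1, maxSize, builder);
        }
        visitor.tail(node, depth);
    }

    /** Collects node text in visiting order, using a visitor that appends in head(). */
    static String collect(Node root, int maxSize) {
        final StringBuilder builder = new StringBuilder();
        Visitor appender = new Visitor() {
            @Override public void head(Node node, int depth) { builder.append(node.text); }
            @Override public void tail(Node node, int depth) { /* nothing to close */ }
        };
        traverse(appender, root, 0, maxSize, builder);
        return builder.toString();
    }

    public static void main(String[] args) {
        // Tree: a -> (b -> c), d
        Node root = new Node("a")
                .add(new Node("b").add(new Node("c")))
                .add(new Node("d"));
        System.out.println(collect(root, -1)); // prints "abcd"
        System.out.println(collect(root, 2));  // prints "ab"
    }
}
```

Passing the builder into the traversal lets the walk itself check how much text has accumulated, which is why the cap can be enforced without the visitor knowing about it.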
-
-