java.lang.Object
- com.digitalpebble.stormcrawler.parse.TextExtractor

```
public class TextExtractor
extends Object
```
Filters the text extracted from HTML documents; used by JSoupParserBolt. It is configured with optional inclusion patterns, written as JSoup selectors, and a list of tags to exclude.
Replaces ContentFilter.
The first matching inclusion pattern is used. The whole document is used if no patterns are configured or if none of them matches.
The TextExtractor can be configured as follows:
```
 textextractor.include.pattern:
  - DIV[id="maincontent"]
  - DIV[itemprop="articleBody"]
  - ARTICLE

 textextractor.exclude.tags:
  - STYLE
  - SCRIPT

 
```
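The selection rule described above (first matching inclusion pattern, otherwise the whole document) can be sketched in plain Java. This is only an illustration of the rule, not the class's actual code: `InclusionSketch`, `firstMatchOrFallback` and the `selector` function are hypothetical names, and the selector stands in for a JSoup selector lookup.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class InclusionSketch {

    /**
     * Returns the result of the first pattern that matches. Falls back to the
     * given value (standing in for the whole document) when the pattern list
     * is empty or when no pattern matches.
     */
    public static <T> T firstMatchOrFallback(List<String> patterns,
                                             Function<String, Optional<T>> selector,
                                             T fallback) {
        for (String pattern : patterns) {
            Optional<T> match = selector.apply(pattern);
            if (match.isPresent()) {
                return match.get(); // later patterns are not evaluated
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        List<String> patterns = List.of("DIV[id=\"maincontent\"]", "ARTICLE");
        // Hypothetical selector: pretends the page only contains an ARTICLE element.
        Function<String, Optional<String>> selector =
                p -> p.equals("ARTICLE") ? Optional.of("article text") : Optional.empty();
        System.out.println(firstMatchOrFallback(patterns, selector, "whole document"));
    }
}
```

Because the patterns are tried in order, the most specific selector would normally be listed first.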
Since:

1.13

Field Summary

Fields
Modifier and Type	Field	Description
`static String`	`EXCLUDE_PARAM_NAME`
`static String`	`INCLUDE_PARAM_NAME`
`static String`	`NO_TEXT_PARAM_NAME`
`static String`	`TEXT_MAX_TEXT_PARAM_NAME`

Constructor Summary

Constructors
Constructor Description

TextExtractor(Map<String,Object> stormConf)

Method Summary

Modifier and Type	Method	Description
`String`	`text(org.jsoup.nodes.Element element)`
`static void`	`traverse(org.jsoup.select.NodeVisitor visitor, org.jsoup.nodes.Node root, int maxSize, StringBuilder builder)`	Start a depth-first traverse of the root and all of its descendants.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail
- INCLUDE_PARAM_NAME
```
public static final String INCLUDE_PARAM_NAME
```
  See Also:
  
  Constant Field Values
- EXCLUDE_PARAM_NAME
```
public static final String EXCLUDE_PARAM_NAME
```
  See Also:
  
  Constant Field Values
- NO_TEXT_PARAM_NAME
```
public static final String NO_TEXT_PARAM_NAME
```
  See Also:
  
  Constant Field Values
- TEXT_MAX_TEXT_PARAM_NAME
```
public static final String TEXT_MAX_TEXT_PARAM_NAME
```
  See Also:
  
  Constant Field Values

Constructor Detail

TextExtractor

public TextExtractor(Map<String,Object> stormConf)
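As a rough sketch, a configuration map equivalent to the YAML example in the class description could be built programmatically as shown here. The key names are taken from that YAML. Representing the YAML lists as `List<String>` values is an assumption, and `TextExtractorConfSketch` and `buildConf` are hypothetical names. The constructor call appears only in a comment because it requires the StormCrawler library on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TextExtractorConfSketch {

    /** Builds a map mirroring the YAML example (list values are an assumption). */
    public static Map<String, Object> buildConf() {
        Map<String, Object> conf = new LinkedHashMap<>();
        conf.put("textextractor.include.pattern",
                List.of("DIV[id=\"maincontent\"]",
                        "DIV[itemprop=\"articleBody\"]",
                        "ARTICLE"));
        conf.put("textextractor.exclude.tags", List.of("STYLE", "SCRIPT"));
        return conf;
    }

    public static void main(String[] args) {
        Map<String, Object> conf = buildConf();
        // With StormCrawler available, the map would be passed to the constructor:
        //   TextExtractor extractor = new TextExtractor(conf);
        System.out.println(conf.keySet());
    }
}
```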

Method Detail

text

public String text(org.jsoup.nodes.Element element)

traverse

public static void traverse(org.jsoup.select.NodeVisitor visitor,
                            org.jsoup.nodes.Node root,
                            int maxSize,
                            StringBuilder builder)

Start a depth-first traverse of the root and all of its descendants.

Parameters:
visitor - Node visitor.
root - the root node to traverse.
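The idea of a size-bounded depth-first traversal can be illustrated with a small self-contained sketch. It does not use JSoup: `SimpleNode` is a hypothetical stand-in for `org.jsoup.nodes.Node`, the visitor is replaced by a direct append to the builder, and the meaning of `maxSize` (stop once the builder holds at least that many characters, with non-positive values meaning no limit) is an assumption that the documentation above does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

public class TraverseSketch {

    /** Hypothetical minimal tree node, standing in for org.jsoup.nodes.Node. */
    public static class SimpleNode {
        final String text;
        final List<SimpleNode> children = new ArrayList<>();

        public SimpleNode(String text) {
            this.text = text;
        }

        public SimpleNode add(SimpleNode child) {
            children.add(child);
            return this; // returns the parent so calls can be chained
        }
    }

    /** Visits the root, then each child subtree in order (pre-order, depth-first). */
    public static void traverse(SimpleNode root, int maxSize, StringBuilder builder) {
        // Assumed semantics: stop as soon as enough text has been collected.
        if (maxSize > 0 && builder.length() >= maxSize) {
            return;
        }
        builder.append(root.text);
        for (SimpleNode child : root.children) {
            traverse(child, maxSize, builder);
        }
    }

    public static void main(String[] args) {
        SimpleNode root = new SimpleNode("a")
                .add(new SimpleNode("b").add(new SimpleNode("c")))
                .add(new SimpleNode("d"));

        StringBuilder unlimited = new StringBuilder();
        traverse(root, -1, unlimited);
        System.out.println(unlimited); // prints "abcd"

        StringBuilder capped = new StringBuilder();
        traverse(root, 2, capped);
        System.out.println(capped); // prints "ab"
    }
}
```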
