Class ParseFilter
- java.lang.Object
  - com.digitalpebble.stormcrawler.util.AbstractConfigurable
    - com.digitalpebble.stormcrawler.parse.ParseFilter
- All Implemented Interfaces: Configurable
- Direct Known Subclasses: CollectionTagger, CommaSeparatedToMultivaluedMetadata, DebugParseFilter, DomainParseFilter, LDJsonParseFilter, MD5SignatureParseFilter, MimeTypeNormalization, ParseFilters, XPathFilter
public abstract class ParseFilter extends AbstractConfigurable
Implementations of ParseFilter are responsible for extracting custom data from the crawled content. They are used by parsing bolts such as JSoupParserBolt or SiteMapParserBolt.
Constructor Summary
- ParseFilter()
Method Summary
- abstract void filter(String URL, byte[] content, DocumentFragment doc, ParseResult parse): Called when parsing a specific page.
- boolean needsDOM(): Specifies whether this filter requires a DOM representation of the document.
Methods inherited from class com.digitalpebble.stormcrawler.util.AbstractConfigurable
configure, getName
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.digitalpebble.stormcrawler.util.Configurable
configure
Method Detail
filter
public abstract void filter(String URL, byte[] content, DocumentFragment doc, ParseResult parse)
Called when parsing a specific page.
Parameters:
URL - the URL of the page being parsed
content - the content being parsed
doc - the DOM tree resulting from parsing the content, or null if needsDOM() returns false
parse - the metadata to be updated with the result of the parsing
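The contract of filter can be illustrated with a minimal sketch. To keep the example self-contained, it does not use the StormCrawler classes: SketchParseFilter and SketchParseResult are hypothetical stand-ins that only mirror the shape of the method documented above, and the metadata key content.length is invented for illustration. The filter works on the raw bytes only, so it tolerates a null doc.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.w3c.dom.DocumentFragment;

// Hypothetical stand-in for ParseResult: metadata values keyed by URL, then by field name.
class SketchParseResult {
    private final Map<String, Map<String, List<String>>> metadataByUrl = new HashMap<>();

    void addValue(String url, String key, String value) {
        metadataByUrl
                .computeIfAbsent(url, u -> new HashMap<>())
                .computeIfAbsent(key, k -> new ArrayList<>())
                .add(value);
    }

    List<String> getValues(String url, String key) {
        Map<String, List<String>> forUrl = metadataByUrl.get(url);
        if (forUrl == null || !forUrl.containsKey(key)) {
            return Collections.emptyList();
        }
        return forUrl.get(key);
    }
}

// Hypothetical stand-in mirroring the two methods documented on this page.
abstract class SketchParseFilter {
    abstract void filter(String URL, byte[] content, DocumentFragment doc, SketchParseResult parse);

    // Assumption for this sketch: a filter needs no DOM unless it overrides this method.
    boolean needsDOM() {
        return false;
    }
}

// Example filter that reads only the raw bytes and never touches doc.
class ContentLengthFilter extends SketchParseFilter {
    @Override
    void filter(String URL, byte[] content, DocumentFragment doc, SketchParseResult parse) {
        int length = (content == null) ? 0 : content.length;
        parse.addValue(URL, "content.length", Integer.toString(length));
    }
}
```

A real implementation would extend com.digitalpebble.stormcrawler.parse.ParseFilter and write to the ParseResult it is handed; the stand-ins above only show where the URL, content, and doc arguments fit.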
needsDOM
public boolean needsDOM()
Specifies whether this filter requires a DOM representation of the document.
Returns:
true if this filter needs a DOM representation of the document, false otherwise.
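One way a caller might honour needsDOM is sketched below: the DOM is built only when at least one filter in a chain asks for it, and null is passed otherwise. This is an illustration of the documented contract, not the code of any StormCrawler bolt. DomAware and DomDecision are hypothetical names, and the fragment returned here is empty, whereas real code would populate it from the parsed page.

```java
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;

// Hypothetical stand-in for the needsDOM() part of the ParseFilter contract.
interface DomAware {
    boolean needsDOM();
}

class DomDecision {
    // True as soon as one filter in the chain asks for a DOM.
    static boolean anyNeedsDOM(List<? extends DomAware> filters) {
        for (DomAware filter : filters) {
            if (filter.needsDOM()) {
                return true;
            }
        }
        return false;
    }

    // Returns an empty fragment when some filter needs a DOM, or null when none does.
    static DocumentFragment buildIfNeeded(List<? extends DomAware> filters) {
        if (!anyNeedsDOM(filters)) {
            return null; // filters then receive doc == null, as documented for filter(...)
        }
        try {
            Document document =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            return document.createDocumentFragment();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DOM builder", e);
        }
    }
}
```

Skipping the DOM when no filter needs it avoids the cost of building a tree that nothing will read, which is presumably why the flag exists.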