Class RegexURLFilterBase
- java.lang.Object
  - com.digitalpebble.stormcrawler.util.AbstractConfigurable
    - com.digitalpebble.stormcrawler.filtering.URLFilter
      - com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
- All Implemented Interfaces: Configurable
- Direct Known Subclasses: RegexURLFilter
public abstract class RegexURLFilterBase extends URLFilter
An abstract class for implementing regex-based URL filtering. Adapted from Apache Nutch 1.9.
Constructor Summary
Constructors
- RegexURLFilterBase()
Method Summary
Modifier and Type, Method, Description
- void configure(@NotNull Map<String,Object> stormConf, @NotNull com.fasterxml.jackson.databind.JsonNode paramNode)
  Called when this filter is being initialized.
- protected abstract RegexRule createRule(boolean sign, String regex)
  Creates a new RegexRule.
- @Nullable String filter(@Nullable URL pageUrl, @Nullable Metadata sourceMetadata, @NotNull String url)
  Returns null if the URL is to be removed, or a normalised representation which can correspond to the input URL.
Methods inherited from class com.digitalpebble.stormcrawler.util.AbstractConfigurable
configure, getName
Method Detail
configure
public void configure(@NotNull Map<String,Object> stormConf, @NotNull com.fasterxml.jackson.databind.JsonNode paramNode)
Description copied from interface: Configurable
Called when this filter is being initialized.
Parameters:
- stormConf - the Storm configuration used for the configurable
- paramNode - the filter-specific configuration. Never null.
createRule
protected abstract RegexRule createRule(boolean sign, String regex)
Creates a new RegexRule.
Parameters:
- sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
- regex - the regular expression associated with this rule.
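The sign/regex pairing described above can be illustrated with a minimal, self-contained sketch. `SignedRuleSketch` and its `accept` and `match` methods are hypothetical stand-ins written for this example, not the library's `RegexRule` class, and the use of `java.util.regex.Pattern` with partial matching (`find`) is an assumption about how a concrete rule might be built.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for a signed regex rule; NOT the StormCrawler RegexRule class. */
public class SignedRuleSketch {
    // true: URLs matching the pattern are included; false: they are excluded
    private final boolean sign;
    private final Pattern pattern;

    public SignedRuleSketch(boolean sign, String regex) {
        this.sign = sign;
        this.pattern = Pattern.compile(regex);
    }

    /** Whether a URL matching this rule should be kept. */
    public boolean accept() {
        return sign;
    }

    /** Whether the pattern occurs anywhere in the URL (assumed partial match). */
    public boolean match(String url) {
        return pattern.matcher(url).find();
    }

    public static void main(String[] args) {
        // An exclusion rule for common image suffixes
        SignedRuleSketch noImages = new SignedRuleSketch(false, "\\.(gif|jpg|png)$");
        String url = "http://example.com/logo.png";
        if (noImages.match(url)) {
            System.out.println(url + " -> " + (noImages.accept() ? "include" : "exclude"));
        }
    }
}
```

A subclass's `createRule(sign, regex)` would presumably return an object playing this role, with the regex engine left to the subclass.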
filter
@Nullable public String filter(@Nullable URL pageUrl, @Nullable Metadata sourceMetadata, @NotNull String url)
Description copied from class: URLFilter
Returns null if the URL is to be removed, or a normalised representation which can correspond to the input URL.
Specified by: filter in class URLFilter
Parameters:
- pageUrl - the URL of the page where the URL was found. Can be null.
- sourceMetadata - the metadata collected for the page
- url - the URL to be filtered
Returns:
- null if the URL is to be removed, or a normalised representation which can correspond to the input URL
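The null-or-URL return contract can be sketched as follows. This is a simplified illustration rather than the class's actual code: `FilterContractSketch`, its nested `Rule` type, and the `add` helper are hypothetical, the `pageUrl` and `sourceMetadata` arguments are omitted, and two behaviours are assumptions not stated on this page: the first matching rule decides the outcome, and a URL that matches no rule is removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of a rule-based filter; NOT the StormCrawler implementation. */
public class FilterContractSketch {

    /** Hypothetical signed rule: sign == true keeps matching URLs, false drops them. */
    static final class Rule {
        final boolean sign;
        final Pattern pattern;

        Rule(boolean sign, String regex) {
            this.sign = sign;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Appends a rule; rules are evaluated in insertion order. */
    public FilterContractSketch add(boolean sign, String regex) {
        rules.add(new Rule(sign, regex));
        return this;
    }

    /** Returns the URL if it is kept, or null if it is to be removed. */
    public String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                // Assumption: the first matching rule decides
                return rule.sign ? url : null;
            }
        }
        // Assumption: URLs matching no rule are removed
        return null;
    }

    public static void main(String[] args) {
        FilterContractSketch filter = new FilterContractSketch()
                .add(false, "\\.(gif|jpg|png)$")
                .add(true, "^https?://");

        System.out.println(filter.filter("https://example.com/page.html"));
        System.out.println(filter.filter("https://example.com/logo.png") == null);
    }
}
```

Callers of a filter with this contract need to check the result for null before using it, since null is the signal that the URL was rejected.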