Class FastURLFilter
- java.lang.Object
- com.digitalpebble.stormcrawler.util.AbstractConfigurable
- com.digitalpebble.stormcrawler.filtering.URLFilter
- com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
All Implemented Interfaces: JSONResource, Configurable
public class FastURLFilter extends URLFilter implements JSONResource
URL filter based on regex patterns and organised by scope: [host | domain | metadata | global]. For a given URL, the scopes are tried in the order given above, and the URL is kept or removed according to the first matching rule. The default policy is to accept a URL if no rule matches. The resource file is JSON, in the following format:
{
  "rules" : [
    {
      "scope" : "GLOBAL",
      "patterns" : [ "DenyPathQuery \\.jpg" ]
    },
    {
      "scope" : "domain:stormcrawler.net",
      "patterns" : [ "AllowPath /digitalpebble/", "DenyPath .+" ]
    },
    {
      "scope" : "metadata:key=value",
      "patterns" : [ "DenyPath .+" ]
    }
  ]
}
Partly inspired by https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java
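The first-match behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the StormCrawler source: the class FirstMatchSketch, its nested Rule type, and the rules and filter helpers are hypothetical names introduced for illustration. The sketch assumes that a pattern only needs to occur somewhere in the path (or path plus query) rather than match it entirely, and that the caller has already selected the scopes that apply to the URL and supplies them in the documented order.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of first-match rule evaluation; not the StormCrawler implementation. */
class FirstMatchSketch {

    /** One pattern line such as "DenyPathQuery \\.jpg", split into rule type and regex. */
    static final class Rule {
        final boolean allow;     // Allow* rules keep the URL, Deny* rules remove it
        final boolean withQuery; // *PathQuery rules test the path plus the query string
        final Pattern pattern;

        Rule(String line) {
            String[] parts = line.trim().split("\\s+", 2);
            this.allow = parts[0].startsWith("Allow");
            this.withQuery = parts[0].endsWith("PathQuery");
            this.pattern = Pattern.compile(parts[1]);
        }

        boolean matches(URL url) {
            // Assumption: the regex may occur anywhere in the target (find, not matches).
            String target = withQuery ? url.getFile() : url.getPath();
            return pattern.matcher(target).find();
        }
    }

    /** Builds the rule list of one scope from its pattern lines. */
    static List<Rule> rules(String... lines) {
        List<Rule> out = new ArrayList<>();
        for (String line : lines) {
            out.add(new Rule(line));
        }
        return out;
    }

    /**
     * Tries the applicable scopes in order (for example domain before global).
     * Returns the URL if it is kept, or null if it is removed.
     */
    static String filter(List<List<Rule>> scopesInOrder, String urlToFilter) {
        URL url;
        try {
            url = URI.create(urlToFilter).toURL();
        } catch (MalformedURLException | IllegalArgumentException e) {
            return null; // assumption of this sketch: unparsable URLs are removed
        }
        for (List<Rule> scope : scopesInOrder) {
            for (Rule rule : scope) {
                if (rule.matches(url)) {
                    return rule.allow ? urlToFilter : null; // first matching rule decides
                }
            }
        }
        return urlToFilter; // default policy: accept when nothing matches
    }

    public static void main(String[] args) {
        List<List<Rule>> scopes = List.of(
                rules("AllowPath /digitalpebble/", "DenyPath .+"), // domain scope
                rules("DenyPathQuery \\.jpg"));                    // global scope
        System.out.println(filter(scopes, "https://stormcrawler.net/digitalpebble/logo.jpg"));
        System.out.println(filter(scopes, "https://stormcrawler.net/other/page.html"));
    }
}
```

With the rules from the JSON example, the first URL is kept because the domain-scoped AllowPath rule matches before the global DenyPathQuery rule is ever consulted, while the second URL is removed by the domain-scoped DenyPath rule.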
-
Field Summary
static org.slf4j.Logger LOG
-
Constructor Summary
FastURLFilter()
-
Method Summary
void configure(@NotNull Map<String,Object> stormConf, @NotNull com.fasterxml.jackson.databind.JsonNode filterParams)
Called when this filter is being initialized.
@Nullable String filter(@Nullable URL sourceUrl, @Nullable Metadata sourceMetadata, @NotNull String urlToFilter)
Returns null if the URL is to be removed, or a normalised representation which can correspond to the input URL.
String getResourceFile()
void loadJSONResources(InputStream inputStream)
Load the resources from an input stream.
-
Methods inherited from class com.digitalpebble.stormcrawler.util.AbstractConfigurable
configure, getName
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface com.digitalpebble.stormcrawler.JSONResource
loadJSONResources
-
Method Detail
-
configure
public void configure(@NotNull Map<String,Object> stormConf, @NotNull com.fasterxml.jackson.databind.JsonNode filterParams)
Description copied from interface: Configurable
Called when this filter is being initialized.
- Specified by: configure in interface Configurable
- Parameters:
stormConf - the Storm configuration used for the configurable
filterParams - the filter-specific configuration. Never null.
-
getResourceFile
public String getResourceFile()
- Specified by: getResourceFile in interface JSONResource
- Returns: filename of the JSON resource
-
loadJSONResources
public void loadJSONResources(InputStream inputStream) throws com.fasterxml.jackson.core.JsonParseException, com.fasterxml.jackson.databind.JsonMappingException, IOException
Description copied from interface: JSONResource
Load the resources from an input stream.
- Specified by: loadJSONResources in interface JSONResource
- Throws:
com.fasterxml.jackson.core.JsonParseException
com.fasterxml.jackson.databind.JsonMappingException
IOException
-
filter
@Nullable public String filter(@Nullable URL sourceUrl, @Nullable Metadata sourceMetadata, @NotNull String urlToFilter)
Description copied from class: URLFilter
Returns null if the URL is to be removed, or a normalised representation which can correspond to the input URL.
- Specified by: filter in class URLFilter
- Parameters:
sourceUrl - the URL of the page where the URL was found. Can be null.
sourceMetadata - the metadata collected for the page
urlToFilter - the URL to be filtered
- Returns: null if the URL is to be removed, or a normalised representation which can correspond to the input URL
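Because filter returns either null or a possibly normalised URL, a caller should check for null and otherwise continue with the returned string rather than the original input. The following is a minimal sketch of that contract, assuming filters are modelled as plain UnaryOperator<String> functions; the class FilterChainSketch and the two example filters are hypothetical stand-ins and do not use the StormCrawler types.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of a caller honouring the null-or-normalised return contract. */
class FilterChainSketch {

    /** Applies each filter in turn and stops as soon as one of them returns null. */
    static String applyAll(List<UnaryOperator<String>> filters, String urlToFilter) {
        String current = urlToFilter;
        for (UnaryOperator<String> filter : filters) {
            current = filter.apply(current);
            if (current == null) {
                return null; // removed: later filters never see the URL
            }
        }
        return current; // possibly normalised, so use this instead of the input
    }

    public static void main(String[] args) {
        // Stand-in normaliser: strips the fragment part of the URL.
        UnaryOperator<String> dropFragment = u -> {
            int i = u.indexOf('#');
            return i < 0 ? u : u.substring(0, i);
        };
        // Stand-in filter: removes URLs ending in .jpg.
        UnaryOperator<String> denyJpg = u -> u.endsWith(".jpg") ? null : u;

        List<UnaryOperator<String>> chain = List.of(dropFragment, denyJpg);
        System.out.println(applyAll(chain, "https://example.com/page.html#top"));
        System.out.println(applyAll(chain, "https://example.com/a.jpg#x"));
    }
}
```

The first call prints https://example.com/page.html (normalised), and the second prints null because the URL is removed once its fragment has been stripped.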