Class FastURLFilter

  • All Implemented Interfaces:
    JSONResource, Configurable

    public class FastURLFilter
    extends URLFilter
    implements JSONResource
    URL filter based on regex patterns, with rules organised by scope: host, domain, metadata, or global. For a given URL, the scopes are tried in that order, and the URL is kept or removed according to the first matching rule. If no rule matches, the URL is accepted by default.

    The resource file is written in JSON and uses the following format.

     {
       "rules" : [ {
         "scope" : "GLOBAL",
         "patterns" : [ "DenyPathQuery \\.jpg" ]
       }, {
         "scope" : "domain:stormcrawler.net",
         "patterns" : [ "AllowPath /digitalpebble/", "DenyPath .+" ]
       }, {
         "scope" : "metadata:key=value",
         "patterns" : [ "DenyPath .+" ]
       } ]
     }
     
    Partly inspired by https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java
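
    The scope order and first-match behaviour described above can be illustrated with a minimal, self-contained sketch. This is not the StormCrawler implementation: the class ScopedFilterSketch, its methods, the "host:" scope prefix, the naive two-label domain extraction, and the use of a partial regex match (find) are all assumptions made for illustration. Only the rule types that appear in the sample resource file are handled.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Illustrative sketch only; names and matching details are assumptions. */
public class ScopedFilterSketch {

    /** One parsed pattern line, e.g. "DenyPath .+". */
    private static final class Rule {
        final boolean allow;      // true for Allow* rules, false for Deny* rules
        final boolean withQuery;  // true for *PathQuery rules
        final Pattern pattern;

        Rule(String line) {
            String[] parts = line.trim().split("\\s+", 2);
            allow = parts[0].startsWith("Allow");
            withQuery = parts[0].endsWith("PathQuery");
            pattern = Pattern.compile(parts[1]);
        }

        boolean matches(URI uri) {
            String target = uri.getRawPath() == null ? "" : uri.getRawPath();
            if (withQuery && uri.getRawQuery() != null) {
                target += "?" + uri.getRawQuery();
            }
            // Assumption: a rule matches if the regex is found anywhere in the target.
            return pattern.matcher(target).find();
        }
    }

    private final Map<String, List<Rule>> rulesByScope = new HashMap<>();

    /** Registers pattern lines under a scope such as "GLOBAL" or "domain:example.com". */
    public void addRules(String scope, String... patternLines) {
        List<Rule> rules = rulesByScope.computeIfAbsent(scope, k -> new ArrayList<>());
        for (String line : patternLines) {
            rules.add(new Rule(line));
        }
    }

    /** Returns the URL if it is kept, or null if it is removed. */
    public String filter(String url, Map<String, String> metadata) {
        URI uri = URI.create(url);
        // Build the candidate scopes in the documented order:
        // host, domain, metadata, global.
        List<String> scopes = new ArrayList<>();
        String host = uri.getHost();
        if (host != null) {
            scopes.add("host:" + host); // "host:" prefix is an assumption
            scopes.add("domain:" + naiveDomain(host));
        }
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            scopes.add("metadata:" + e.getKey() + "=" + e.getValue());
        }
        scopes.add("GLOBAL");

        for (String scope : scopes) {
            for (Rule rule : rulesByScope.getOrDefault(scope, List.of())) {
                if (rule.matches(uri)) {
                    return rule.allow ? url : null; // first matching rule decides
                }
            }
        }
        return url; // default policy: accept when nothing matches
    }

    /** Naive stand-in for domain extraction: keeps the last two host labels. */
    private static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n < 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    public static void main(String[] args) {
        ScopedFilterSketch f = new ScopedFilterSketch();
        f.addRules("GLOBAL", "DenyPathQuery \\.jpg");
        f.addRules("domain:stormcrawler.net", "AllowPath /digitalpebble/", "DenyPath .+");
        f.addRules("metadata:key=value", "DenyPath .+");
        System.out.println(f.filter("https://www.stormcrawler.net/digitalpebble/pic.jpg", Map.of()));
        System.out.println(f.filter("https://www.stormcrawler.net/about", Map.of()));
        System.out.println(f.filter("https://example.com/pic.jpg", Map.of()));
    }
}
```

    In this sketch the first URL is kept even though it ends in .jpg, because the domain scope is consulted before the global scope and its AllowPath rule matches first. The other two URLs are removed, by the domain DenyPath rule and the global DenyPathQuery rule respectively.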
    • Field Detail

      • LOG

        public static final org.slf4j.Logger LOG
    • Constructor Detail

      • FastURLFilter

        public FastURLFilter()
    • Method Detail

      • configure

        public void configure(@NotNull Map<String,Object> stormConf,
                              @NotNull com.fasterxml.jackson.databind.JsonNode filterParams)
        Description copied from interface: Configurable
        Called when this filter is being initialized.
        Specified by:
        configure in interface Configurable
        Parameters:
        stormConf - The Storm configuration used for the configurable
        filterParams - the filter specific configuration. Never null
      • loadJSONResources

        public void loadJSONResources(InputStream inputStream)
                               throws com.fasterxml.jackson.core.JsonParseException,
                                      com.fasterxml.jackson.databind.JsonMappingException,
                                      IOException
        Description copied from interface: JSONResource
        Load the resources from an input stream
        Specified by:
        loadJSONResources in interface JSONResource
        Throws:
        com.fasterxml.jackson.core.JsonParseException
        com.fasterxml.jackson.databind.JsonMappingException
        IOException
      • filter

        @Nullable
        public String filter(@Nullable URL sourceUrl,
                             @Nullable Metadata sourceMetadata,
                             @NotNull String urlToFilter)
        Description copied from class: URLFilter
        Returns null if the URL is to be removed; otherwise returns a normalised representation of the URL, which may be identical to the input URL.
        Specified by:
        filter in class URLFilter
        Parameters:
        sourceUrl - the URL of the page where the URL was found. Can be null.
        sourceMetadata - the metadata collected for the page
        urlToFilter - the URL to be filtered
        Returns:
        null if the URL is to be removed; otherwise a normalised representation of the URL, which may be identical to the input URL