Class RegexURLFilterBase

  • All Implemented Interfaces:
    Configurable
    Direct Known Subclasses:
    RegexURLFilter

    public abstract class RegexURLFilterBase
    extends URLFilter
    An abstract base class for regular-expression URL filtering. Adapted from Apache Nutch 1.9.
    • Constructor Detail

      • RegexURLFilterBase

        public RegexURLFilterBase()
    • Method Detail

      • configure

        public void configure(@NotNull Map<String,Object> stormConf,
                              @NotNull com.fasterxml.jackson.databind.JsonNode paramNode)
        Description copied from interface: Configurable
        Called when this filter is being initialized.
        Parameters:
        stormConf - the Storm configuration passed to this configurable
        paramNode - the filter-specific configuration; never null
      • createRule

        protected abstract RegexRule createRule(boolean sign,
                                                String regex)
        Creates a new RegexRule.
        Parameters:
        sign - the sign of the regular expression. true means that any URL matching this rule is included; false means that any URL matching this rule is excluded.
        regex - the regular expression associated with this rule.
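        The pairing of a sign with a regular expression can be illustrated with a small standalone sketch. It does not use the library's RegexRule class: SketchRule, accept and match are hypothetical names chosen for illustration, and the use of java.util.regex with find() semantics (a match anywhere in the URL) is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a rule; NOT the library's RegexRule class.
final class SketchRule {
    private final boolean sign;     // true = include matching URLs, false = exclude them
    private final Pattern pattern;  // compiled once, reused for every URL

    private SketchRule(boolean sign, String regex) {
        this.sign = sign;
        this.pattern = Pattern.compile(regex);
    }

    // Hypothetical factory mirroring the shape of createRule(boolean, String).
    static SketchRule createRule(boolean sign, String regex) {
        return new SketchRule(sign, regex);
    }

    // Whether a URL matching this rule should be kept.
    boolean accept() {
        return sign;
    }

    // Assumption: the regex may match anywhere in the URL (find, not matches).
    boolean match(String url) {
        return pattern.matcher(url).find();
    }
}
```

        Under these assumptions, SketchRule.createRule(false, "\\.(gif|jpg)$") describes a rule that excludes any URL ending in .gif or .jpg.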
      • filter

        @Nullable
        public String filter(@Nullable URL pageUrl,
                             @Nullable Metadata sourceMetadata,
                             @NotNull String url)
        Description copied from class: URLFilter
        Returns null if the URL is to be removed, or a normalised representation, which may be identical to the input URL.
        Specified by:
        filter in class URLFilter
        Parameters:
        pageUrl - the URL of the page where the URL was found. Can be null.
        sourceMetadata - the metadata collected for the page
        url - the URL to be filtered
        Returns:
        null if the URL is to be removed, or a normalised representation, which may be identical to the input URL
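        The null-or-URL contract can be sketched with a standalone class. This is not the library's implementation: FirstMatchFilterSketch, Rule and add are hypothetical names, and the evaluation strategy (rules checked in order, the first matching rule decides, a URL that matches no rule is removed) is an assumption that this page does not state. The pageUrl and sourceMetadata parameters are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a sign-based regex filter; NOT the library's code.
final class FirstMatchFilterSketch {

    // Hypothetical rule: a sign plus a compiled pattern.
    private static final class Rule {
        final boolean sign;
        final Pattern pattern;

        Rule(boolean sign, String regex) {
            this.sign = sign;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Appends a rule; the order of insertion is the order of evaluation.
    FirstMatchFilterSketch add(boolean sign, String regex) {
        rules.add(new Rule(sign, regex));
        return this;
    }

    // Returns the URL if it is kept, or null if it is removed.
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                // Assumption: the first matching rule decides the outcome.
                return rule.sign ? url : null;
            }
        }
        // Assumption: a URL that matches no rule is removed.
        return null;
    }
}
```

        With an exclusion rule for "\\.(gif|jpg|png)$" added before an inclusion rule for "^https?://", this sketch returns null for http://example.com/logo.png and returns http://example.com/index.html unchanged.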