Class RegexURLNormalizer

  • All Implemented Interfaces:
    Configurable

    public class RegexURLNormalizer
    extends URLFilter
    A URL filter that normalizes URLs by matching them against regular expressions and replacing each match with a substitution string.

    Adapted from Apache Nutch 1.9.

    • Constructor Detail

      • RegexURLNormalizer

        public RegexURLNormalizer()
    • Method Detail

      • configure

        public void configure(@NotNull Map<String,Object> stormConf,
                              @NotNull com.fasterxml.jackson.databind.JsonNode paramNode)
        Description copied from interface: Configurable
        Called when this filter is being initialized.
        Parameters:
        stormConf - the Storm configuration used for the configurable
        paramNode - the filter-specific configuration; never null
      • filter

        @Nullable
        public String filter(@Nullable URL sourceUrl,
                             @Nullable Metadata sourceMetadata,
                             @NotNull String urlString)
        Applies every configured regex pattern in turn to the input URL string, substituting each match, and returns the rewritten string. Returns null if the normalized URL is an empty string.
        Specified by:
        filter in class URLFilter
        Parameters:
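        sourceUrl - the URL of the page on which urlString was found; may be null
        sourceMetadata - the metadata of the source page; may be null
        urlString - the URL to normalize; never null
        Returns:
        the normalized URL, or null if the normalized URL is an empty string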
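        The behaviour described for filter can be illustrated with a minimal, self-contained sketch. It is not the class's actual implementation: the Rule type, the normalize method, and the two example rules are hypothetical names introduced here only to show the "apply each pattern in order, return null on an empty result" logic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class NormalizeSketch {

    // Hypothetical rule type: a compiled pattern plus its substitution string.
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    // Example rules (assumptions, not taken from any shipped rules file).
    static final List<Rule> EXAMPLE_RULES = Arrays.asList(
            new Rule("#.*$", ""),          // drop the fragment
            new Rule(":80(?=/|$)", ""));   // drop an explicit default HTTP port

    // Applies every rule in order; an empty result is reported as null.
    static String normalize(String urlString, List<Rule> rules) {
        String result = urlString;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result.isEmpty() ? null : result;
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com:80/a#frag", EXAMPLE_RULES));
        System.out.println(normalize("#only", EXAMPLE_RULES));
    }
}
```

        Because each rule operates on the output of the previous one, the order of the rules can change the result.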
      • main

        public static void main​(String[] args)
                         throws FileNotFoundException
        Utility method to test rules against an input. The first argument is the absolute path of the rules file; the second is the URL to be normalized.
        Throws:
        FileNotFoundException
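        The kind of check that main performs can be sketched as follows. This page does not show the rules file format, so the sketch assumes an XML layout modelled on Apache Nutch's regex-normalize.xml (regex elements that each contain a pattern and a substitution). The class name, the applyRules method, and the sample rules are hypothetical, and the rules are read from a string rather than a file to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RulesFileSketch {

    // Assumed rules layout, modelled on Nutch's regex-normalize.xml.
    static final String SAMPLE_RULES =
            "<regex-normalize>"
          + "<regex><pattern>#.*$</pattern><substitution></substitution></regex>"
          + "<regex><pattern>/\\./</pattern><substitution>/</substitution></regex>"
          + "</regex-normalize>";

    // Parses the rules XML and applies each pattern/substitution pair in document order.
    static String applyRules(String rulesXml, String url) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(rulesXml.getBytes(StandardCharsets.UTF_8)));
        NodeList regexes = doc.getElementsByTagName("regex");
        String result = url;
        for (int i = 0; i < regexes.getLength(); i++) {
            Element regex = (Element) regexes.item(i);
            String pattern = regex.getElementsByTagName("pattern").item(0).getTextContent();
            String substitution = regex.getElementsByTagName("substitution").item(0).getTextContent();
            result = Pattern.compile(pattern).matcher(result).replaceAll(substitution);
        }
        // Mirror the documented filter contract: empty result becomes null.
        return result.isEmpty() ? null : result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(applyRules(SAMPLE_RULES, "http://example.com/a/./b#top"));
    }
}
```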