Class RegexURLNormalizer
java.lang.Object
  com.digitalpebble.stormcrawler.util.AbstractConfigurable
    com.digitalpebble.stormcrawler.filtering.URLFilter
      com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer

All Implemented Interfaces:
Configurable
public class RegexURLNormalizer extends URLFilter
The RegexURLNormalizer is a URL filter that normalizes URLs by matching regular expressions against them and substituting a replacement string for each match. Adapted from Apache Nutch 1.9.
Constructor Summary
Constructor             Description
RegexURLNormalizer()
Method Summary
Modifier and Type, Method, Description

void configure(@NotNull Map<String,Object> stormConf, @NotNull com.fasterxml.jackson.databind.JsonNode paramNode)
    Called when this filter is being initialized.

@Nullable String filter(@Nullable URL sourceUrl, @Nullable Metadata sourceMetadata, @NotNull String urlString)
    Does the replacements by iterating through all the regex patterns.

static void main(String[] args)
    Utility method to test rules against an input.

Methods inherited from class com.digitalpebble.stormcrawler.util.AbstractConfigurable
configure, getName
Method Detail
configure
public void configure(@NotNull Map<String,Object> stormConf, @NotNull com.fasterxml.jackson.databind.JsonNode paramNode)
Description copied from interface: Configurable
Called when this filter is being initialized.
Parameters:
stormConf - the Storm configuration used for the configurable
paramNode - the filter-specific configuration. Never null.
filter
@Nullable public String filter(@Nullable URL sourceUrl, @Nullable Metadata sourceMetadata, @NotNull String urlString)
Applies the replacements by iterating through all the regex patterns. It takes the URL as a string and returns the altered string. If the normalized URL is an empty string, the method returns null.
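The behaviour described for filter can be illustrated with a minimal, standalone sketch. It is not the StormCrawler implementation: the class RegexNormalizerSketch, its Rule type, and the addRule and normalize methods are hypothetical names introduced here, and the use of Matcher.replaceAll for each rule is an assumption. Only the overall shape (apply every pattern in turn, return null when the result is empty) comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of sequential regex normalization; not the library class. */
public class RegexNormalizerSketch {

    /** One rule: a compiled pattern and the string substituted for each match. */
    private static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    // Rules are applied in insertion order, so earlier rules feed later ones.
    private final List<Rule> rules = new ArrayList<>();

    public RegexNormalizerSketch addRule(String regex, String substitution) {
        rules.add(new Rule(regex, substitution));
        return this;
    }

    /** Applies every rule in turn; returns null if nothing is left of the URL. */
    public String normalize(String urlString) {
        String result = urlString;
        for (Rule rule : rules) {
            // Assumption: every match of the pattern is replaced, not just the first.
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result.isEmpty() ? null : result;
    }

    public static void main(String[] args) {
        RegexNormalizerSketch normalizer = new RegexNormalizerSketch()
                .addRule("#.*$", "")   // drop the fragment
                .addRule("\\?$", "");  // then drop a trailing empty query

        System.out.println(normalizer.normalize("http://example.com/page?#top"));
        System.out.println(normalizer.normalize("#top"));
    }
}
```

Because the rules run in order, the second rule only sees the trailing "?" once the first has removed the fragment. A caller should treat a null result as "no usable URL" rather than as an error.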
main
public static void main(String[] args) throws FileNotFoundException
Utility method to test rules against an input. The first argument is the absolute path of the rules file; the second is the URL to be normalized.
Throws:
FileNotFoundException