Class CollectionTagger
java.lang.Object
  com.digitalpebble.stormcrawler.util.AbstractConfigurable
    com.digitalpebble.stormcrawler.parse.ParseFilter
      com.digitalpebble.stormcrawler.parse.filter.CollectionTagger
- All Implemented Interfaces:
JSONResource, Configurable
public class CollectionTagger extends ParseFilter implements JSONResource
Assigns one or more tags to the metadata of a document when its URL matches patterns defined in a JSON resource file. The resource file must specify regular expressions for inclusions and may also specify them for exclusions, e.g.
{
  "collections": [
    {
      "name": "stormcrawler",
      "includePatterns": ["http://stormcrawler.net/.+"]
    },
    {
      "name": "crawler",
      "includePatterns": [".+crawler.+", ".+nutch.+"],
      "excludePatterns": [".+baby.+", ".+spider.+"]
    }
  ]
}
- See Also:
collections in Google Search Appliance
This resource was kindly donated by the Government of the Northwest Territories in Canada (http://www.gov.nt.ca/).
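The include/exclude behaviour described above can be sketched in plain Java. This is a minimal illustration and not the StormCrawler implementation: the class names CollectionRule and CollectionTaggingSketch are hypothetical, and it assumes that a pattern must match the whole URL and that any matching exclusion overrides the inclusions. The rules mirror the sample resource file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for one entry of the "collections" array. */
final class CollectionRule {
    final String name;
    final List<Pattern> includes = new ArrayList<>();
    final List<Pattern> excludes = new ArrayList<>();

    CollectionRule(String name, List<String> includePatterns, List<String> excludePatterns) {
        this.name = name;
        for (String p : includePatterns) includes.add(Pattern.compile(p));
        for (String p : excludePatterns) excludes.add(Pattern.compile(p));
    }

    /** Assumption: exclusions win, and a pattern must match the entire URL. */
    boolean matches(String url) {
        for (Pattern p : excludes) {
            if (p.matcher(url).matches()) return false;
        }
        for (Pattern p : includes) {
            if (p.matcher(url).matches()) return true;
        }
        return false;
    }
}

final class CollectionTaggingSketch {
    // Same rules as the sample JSON resource, hard-coded to stay self-contained.
    static final List<CollectionRule> RULES = List.of(
            new CollectionRule("stormcrawler",
                    List.of("http://stormcrawler.net/.+"), List.of()),
            new CollectionRule("crawler",
                    List.of(".+crawler.+", ".+nutch.+"),
                    List.of(".+baby.+", ".+spider.+")));

    /** Returns the names of all collections whose rules accept the URL. */
    static List<String> tagsFor(String url) {
        List<String> tags = new ArrayList<>();
        for (CollectionRule rule : RULES) {
            if (rule.matches(url)) tags.add(rule.name);
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tagsFor("http://stormcrawler.net/docs/"));
        System.out.println(tagsFor("http://nutch.apache.org/"));
        System.out.println(tagsFor("http://example.com/spider/crawler/page"));
    }
}
```

Under these assumptions a URL can receive several tags at once, and a URL containing "spider" is never tagged "crawler" even though it matches an include pattern.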
-
Constructor Summary
Constructors
Constructor          Description
CollectionTagger()
-
Method Summary
All Methods, Instance Methods, Concrete Methods
Modifier and Type    Method and Description
void      configure(@NotNull Map<String,Object> stormConf, @NotNull com.fasterxml.jackson.databind.JsonNode filterParams)
          Called when this filter is being initialized
void      filter(String URL, byte[] content, DocumentFragment doc, ParseResult parse)
          Called when parsing a specific page
String    getResourceFile()
void      loadJSONResources(InputStream inputStream)
          Load the resources from an input stream
-
Methods inherited from class com.digitalpebble.stormcrawler.parse.ParseFilter
needsDOM
-
Methods inherited from class com.digitalpebble.stormcrawler.util.AbstractConfigurable
configure, getName
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface com.digitalpebble.stormcrawler.JSONResource
loadJSONResources
-
Method Detail
-
configure
public void configure(@NotNull Map<String,Object> stormConf, @NotNull com.fasterxml.jackson.databind.JsonNode filterParams)
Description copied from interface: Configurable
Called when this filter is being initialized
- Specified by:
configure in interface Configurable
- Parameters:
stormConf - The Storm configuration used for the configurable
filterParams - the filter specific configuration. Never null
-
getResourceFile
public String getResourceFile()
- Specified by:
getResourceFile in interface JSONResource
- Returns:
filename of the JSON resource
-
loadJSONResources
public void loadJSONResources(InputStream inputStream) throws com.fasterxml.jackson.core.JsonParseException, com.fasterxml.jackson.databind.JsonMappingException, IOException
Description copied from interface: JSONResource
Load the resources from an input stream
- Specified by:
loadJSONResources in interface JSONResource
- Throws:
com.fasterxml.jackson.core.JsonParseException
com.fasterxml.jackson.databind.JsonMappingException
IOException
-
filter
public void filter(String URL, byte[] content, DocumentFragment doc, ParseResult parse)
Description copied from class: ParseFilter
Called when parsing a specific page
- Specified by:
filter in class ParseFilter
- Parameters:
URL - the URL of the page being parsed
content - the content being parsed
doc - the DOM tree resulting from parsing the content, or null if ParseFilter.needsDOM() returns false
parse - the metadata to be updated with the results of the parsing
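Because the tagging depends only on the URL, a filter of this kind can ignore content and doc entirely, which is why a null doc is harmless here. The sketch below mirrors the shape of filter(URL, content, doc, parse) using stand-in types: a Map replaces the StormCrawler metadata, a Function replaces the loaded collection rules, and the key name "collections" is an assumption rather than something this page states.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class UrlOnlyFilterSketch {
    // Hypothetical metadata key; the key used by the real filter is not given here.
    static final String KEY = "collections";

    /** Stand-in for filter(URL, content, doc, parse); only the URL is consulted. */
    static void filter(String url, byte[] content, Object doc,
                       Map<String, List<String>> metadata,
                       Function<String, List<String>> tagger) {
        // content and doc are never dereferenced, so both may be null.
        List<String> tags = tagger.apply(url);
        if (tags.isEmpty()) {
            return; // leave the metadata untouched when nothing matches
        }
        metadata.computeIfAbsent(KEY, k -> new ArrayList<>()).addAll(tags);
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        // Toy tagger: a single substring rule instead of the JSON-defined patterns.
        Function<String, List<String>> tagger =
                u -> u.contains("crawler") ? List.of("crawler") : List.of();
        filter("http://stormcrawler.net/", null, null, metadata, tagger);
        System.out.println(metadata);
    }
}
```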
-
-