Class CollectionTagger

  • All Implemented Interfaces:
    JSONResource, Configurable

    public class CollectionTagger
    extends ParseFilter
    implements JSONResource
    Assigns one or more tags to the metadata of a document when its URL matches patterns defined in a JSON resource file.

    The resource file must specify regular expressions for inclusions, and optionally for exclusions, e.g.

     {
       "collections": [{
                "name": "stormcrawler",
                "includePatterns": ["http://stormcrawler.net/.+"]
            },
            {
                "name": "crawler",
                "includePatterns": [".+crawler.+", ".+nutch.+"],
                "excludePatterns": [".+baby.+", ".+spider.+"]
            }
        ]
     }
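
    To illustrate how include and exclude patterns could combine, here is a minimal sketch built on the two collections from the example above. It is not the StormCrawler source: the class CollectionMatchSketch, its nested Collection type, and the tagsFor and exampleCollections helpers are hypothetical names. It also assumes that a pattern must match the whole URL and that an exclusion overrides any inclusion; the actual implementation may differ on both points.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; names and matching rules are assumptions.
class CollectionMatchSketch {

    /** One entry of the "collections" array: a name plus compiled patterns. */
    static final class Collection {
        final String name;
        final List<Pattern> includes = new ArrayList<>();
        final List<Pattern> excludes = new ArrayList<>();

        Collection(String name, List<String> includePatterns, List<String> excludePatterns) {
            this.name = name;
            for (String p : includePatterns) {
                includes.add(Pattern.compile(p));
            }
            for (String p : excludePatterns) {
                excludes.add(Pattern.compile(p));
            }
        }

        /** Assumption: full-URL match, and exclusions win over inclusions. */
        boolean matches(String url) {
            for (Pattern p : excludes) {
                if (p.matcher(url).matches()) {
                    return false;
                }
            }
            for (Pattern p : includes) {
                if (p.matcher(url).matches()) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Returns the names of all collections the URL belongs to, in declaration order. */
    static List<String> tagsFor(String url, List<Collection> collections) {
        List<String> tags = new ArrayList<>();
        for (Collection c : collections) {
            if (c.matches(url)) {
                tags.add(c.name);
            }
        }
        return tags;
    }

    /** The two collections from the JSON example, built by hand instead of parsed. */
    static List<Collection> exampleCollections() {
        return List.of(
                new Collection("stormcrawler",
                        List.of("http://stormcrawler.net/.+"),
                        List.of()),
                new Collection("crawler",
                        List.of(".+crawler.+", ".+nutch.+"),
                        List.of(".+baby.+", ".+spider.+")));
    }

    public static void main(String[] args) {
        List<Collection> collections = exampleCollections();
        for (String url : List.of(
                "http://stormcrawler.net/docs",
                "http://nutch.apache.org/",
                "http://example.com/spider-crawler/")) {
            System.out.println(url + " -> " + tagsFor(url, collections));
        }
    }
}
```

    Under these assumptions a URL can receive several tags at once, which fits the "one or more tags" wording in the class description, and a URL that hits any exclude pattern of a collection is never tagged with that collection.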
     
    See Also:
    collections in Google Search Appliance

    This resource was kindly donated by the Government of the Northwest Territories in Canada (http://www.gov.nt.ca/).

    • Constructor Detail

      • CollectionTagger

        public CollectionTagger()
    • Method Detail

      • configure

        public void configure(@NotNull Map<String,Object> stormConf,
                              @NotNull com.fasterxml.jackson.databind.JsonNode filterParams)
        Description copied from interface: Configurable
        Called when this filter is being initialized
        Specified by:
        configure in interface Configurable
        Parameters:
        stormConf - The Storm configuration used for the configurable
        filterParams - the filter specific configuration. Never null
      • loadJSONResources

        public void loadJSONResources(InputStream inputStream)
                               throws com.fasterxml.jackson.core.JsonParseException,
                                      com.fasterxml.jackson.databind.JsonMappingException,
                                      IOException
        Description copied from interface: JSONResource
        Load the resources from an input stream
        Specified by:
        loadJSONResources in interface JSONResource
        Throws:
        com.fasterxml.jackson.core.JsonParseException
        com.fasterxml.jackson.databind.JsonMappingException
        IOException
      • filter

        public void filter(String URL,
                           byte[] content,
                           DocumentFragment doc,
                           ParseResult parse)
        Description copied from class: ParseFilter
        Called when parsing a specific page
        Specified by:
        filter in class ParseFilter
        Parameters:
        URL - the URL of the page being parsed
        content - the content being parsed
        doc - the DOM tree resulting from parsing the content, or null if ParseFilter.needsDOM() returns false
        parse - the metadata to be updated with the result of the parsing