Class HashCrawlMapper

All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle

public class HashCrawlMapper extends CrawlMapper
Maps URIs to one of N crawler names by applying a hash to the URI's (possibly-transformed) classKey.
Author:
gojomo
  • Field Details

    • frontier

      protected Frontier frontier
    • crawlerCount

      protected long crawlerCount
      Number of crawlers (N) across which the URIs are split. Crawler names are assumed to be the integers 0 through N-1.
  • Constructor Details

    • HashCrawlMapper

      public HashCrawlMapper()
      Constructor.
  • Method Details

    • getFrontier

      public Frontier getFrontier()
    • setFrontier

      @Autowired public void setFrontier(Frontier frontier)
    • getCrawlerCount

      public long getCrawlerCount()
    • setCrawlerCount

      public void setCrawlerCount(long count)
    • getUsePublicSuffixesRegex

      public boolean getUsePublicSuffixesRegex()
    • setUsePublicSuffixesRegex

      public void setUsePublicSuffixesRegex(boolean usePublicSuffixes)
      Whether to use the PublicSuffixes-supplied reduce regex.
    • getReducePrefixRegex

      public String getReducePrefixRegex()
    • setReducePrefixRegex

      public void setReducePrefixRegex(String regex)
      A regex pattern applied to the classKey; the first match is used as the mapping key. If the pattern is empty (the default), the full classKey is used.
    • map

      protected String map(CrawlURI cauri)
      Look up the crawler node name to which the given CrawlURI should be mapped.
      Specified by:
      map in class CrawlMapper
      Parameters:
      cauri - CrawlURI to consider
      Returns:
      name of the node that should handle the URI
    • getReduceRegex

      protected String getReduceRegex(CrawlURI cauri)
    • mapString

      public static String mapString(String key, String reducePattern, long bucketCount)
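
The mapping described above (reduce the classKey with a regex, hash the result, pick one of N crawler names) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Heritrix implementation: the class name HashMapperSketch, the helper reduce, the use of CRC32 as the hash, the fallback to the full key when the pattern does not match, and the sample keys are all hypothetical. Only the mapString parameter list mirrors the signature documented here.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.CRC32;

/** Illustrative sketch only; names and hash choice are assumptions. */
class HashMapperSketch {

    /**
     * Returns the first match of reducePattern in key.
     * An empty or null pattern leaves the key unchanged.
     * Falling back to the full key on a non-match is an assumption.
     */
    static String reduce(String key, String reducePattern) {
        if (reducePattern == null || reducePattern.isEmpty()) {
            return key;
        }
        Matcher m = Pattern.compile(reducePattern).matcher(key);
        return m.find() ? m.group() : key;
    }

    /**
     * Maps key to a crawler name in "0" .. "bucketCount-1".
     * CRC32 stands in for whatever hash the real class uses (assumption).
     * bucketCount must be positive.
     */
    static String mapString(String key, String reducePattern, long bucketCount) {
        String reduced = reduce(key, reducePattern);
        CRC32 crc = new CRC32();
        crc.update(reduced.getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is never negative, so the remainder is in range.
        return Long.toString(crc.getValue() % bucketCount);
    }

    public static void main(String[] args) {
        // Hypothetical keys: both reduce to "example.com", so they share a crawler.
        String pattern = "[^.]+\\.com$";
        System.out.println(mapString("www.example.com", pattern, 4));
        System.out.println(mapString("example.com", pattern, 4));
    }
}
```

In this sketch, any two keys that reduce to the same string are always assigned to the same crawler, which is the point of applying a reduce pattern before hashing.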