Package org.archive.crawler.processor
Class HashCrawlMapper
java.lang.Object
org.archive.modules.Processor
org.archive.crawler.processor.CrawlMapper
org.archive.crawler.processor.HashCrawlMapper
- All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
Maps URIs to one of N crawler names by applying a hash to the
URI's (possibly-transformed) classKey.
- Version:
- $Date$, $Revision$
- Author:
- gojomo
-
Field Summary
Modifier and Type     Field           Description
protected long        crawlerCount    Number of crawlers among which to split up the URIs.
protected Frontier    frontier
Fields inherited from class org.archive.crawler.processor.CrawlMapper
cache, checkOutlinks, checkUri, diversionDir, diversionLogs, localName, logGeneration, outlinkRule, rotationDigits
-
Constructor Summary
Constructor          Description
HashCrawlMapper()    Constructor.
-
Method Summary
Modifier and Type    Method                                                  Description
long                 getCrawlerCount()
Frontier             getFrontier()
String               getReducePrefixRegex()
protected String     getReduceRegex(CrawlURI cauri)
boolean              getUsePublicSuffixesRegex()
protected String     map(CrawlURI cauri)                                     Look up the crawler node name to which the given CrawlURI should be mapped.
static String        mapString
void                 setCrawlerCount(long count)
void                 setFrontier(Frontier frontier)
void                 setReducePrefixRegex(String regex)                      A regex pattern to apply to the classKey, using the first match as the mapping key.
void                 setUsePublicSuffixesRegex(boolean usePublicSuffixes)    Whether to use the PublicSuffixes-supplied reduce regex.
Methods inherited from class org.archive.crawler.processor.CrawlMapper
decideToMapOutlink, divertLog, getCheckOutlinks, getCheckUri, getDiversionDir, getDiversionLog, getLocalName, getOutlinkRule, getRotationDigits, innerProcess, innerProcessResult, isRunning, setCheckOutlinks, setCheckUri, setDiversionDir, setLocalName, setOutlinkRule, setRotationDigits, shouldProcess, start, stop, updateGeneration
Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, fromCheckpointJson, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerRejectProcess, isSuccess, process, report, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, startCheckpoint, toCheckpointJson
-
Field Details
-
frontier
protected Frontier frontier
-
crawlerCount
protected long crawlerCount
Number of crawlers among which to split up the URIs. Their names are assumed to be 0..N-1.
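The naming scheme 0..N-1 can be pictured as a hash value reduced modulo the crawler count. The following is a minimal sketch under stated assumptions, not the class's actual code: `BucketNaming` and `nodeName` are hypothetical names, and CRC32 is only a stand-in, since this page does not say which hash function HashCrawlMapper applies.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical helper (not part of Heritrix): maps a key to a node name "0".."N-1". */
final class BucketNaming {
    private BucketNaming() {}

    static String nodeName(String key, long crawlerCount) {
        if (crawlerCount < 1) {
            throw new IllegalArgumentException("crawlerCount must be >= 1");
        }
        // Assumption: CRC32 stands in for whatever hash the real class uses.
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // floorMod yields a value in 0..N-1 for any long input. CRC32 values
        // are already non-negative, but another hash function might not be.
        long bucket = Math.floorMod(crc.getValue(), crawlerCount);
        return Long.toString(bucket);
    }
}
```

With `crawlerCount` set to 1, every key maps to node "0". Changing the count generally reassigns most keys, so under a scheme like this all cooperating crawlers would need to agree on the same value.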
-
-
Constructor Details
-
HashCrawlMapper
public HashCrawlMapper()
Constructor.
-
-
Method Details
-
getFrontier
public Frontier getFrontier()
-
setFrontier
public void setFrontier(Frontier frontier)
-
getCrawlerCount
public long getCrawlerCount()
-
setCrawlerCount
public void setCrawlerCount(long count)
-
getUsePublicSuffixesRegex
public boolean getUsePublicSuffixesRegex()
-
setUsePublicSuffixesRegex
public void setUsePublicSuffixesRegex(boolean usePublicSuffixes)
Whether to use the PublicSuffixes-supplied reduce regex.
-
getReducePrefixRegex
public String getReducePrefixRegex()
-
setReducePrefixRegex
public void setReducePrefixRegex(String regex)
A regex pattern to apply to the classKey, using the first match as the mapping key. If empty (the default), use the full classKey.
-
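The reduce step described for setReducePrefixRegex can be sketched with java.util.regex. This is an illustration only: `KeyReducer` and `reduce` are hypothetical names, the hostname-style key and the sample pattern are made up for the example, and falling back to the full key when the pattern does not match is an assumption, since this page only covers the empty-pattern case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Heritrix): reduces a classKey to its mapping key. */
final class KeyReducer {
    private KeyReducer() {}

    static String reduce(String classKey, String reduceRegex) {
        // Empty pattern (the documented default): use the full classKey.
        if (reduceRegex == null || reduceRegex.isEmpty()) {
            return classKey;
        }
        Matcher matcher = Pattern.compile(reduceRegex).matcher(classKey);
        // Use the first match as the mapping key.
        // Assumption: when nothing matches, keep the full classKey.
        return matcher.find() ? matcher.group() : classKey;
    }
}
```

For example, `KeyReducer.reduce("www.example.com", "[^.]+\\.[^.]+$")` and `KeyReducer.reduce("mail.example.com", "[^.]+\\.[^.]+$")` both give `example.com`, so both keys would hash to the same crawler name.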
map
protected String map(CrawlURI cauri)
Look up the crawler node name to which the given CrawlURI should be mapped.
- Specified by:
- map in class CrawlMapper
- Parameters:
- cauri - CrawlURI to consider
- Returns:
- String node name which should handle URI
-
getReduceRegex
protected String getReduceRegex(CrawlURI cauri)
-
mapString
-