Class DispositionProcessor
java.lang.Object
org.archive.modules.Processor
org.archive.crawler.postprocessor.DispositionProcessor
- All Implemented Interfaces:
org.archive.checkpointing.Checkpointable
org.archive.spring.HasKeyedProperties
org.springframework.beans.factory.Aware
org.springframework.beans.factory.BeanNameAware
org.springframework.context.Lifecycle
A step, late in the processing of a CrawlURI, for marking up the
CrawlURI with values that affect frontier disposition, and for updating
information that may have been affected by the fetch, including
robots info and other stats.
(Formerly called CrawlStateUpdater, when it did less.)
- Version:
- $Date$, $Revision$
- Author:
- gojomo
-
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| float | getDelayFactor() | |
| boolean | getForceRetire() | |
| int | getMaxDelayMs() | |
| int | getMaxPerHostBandwidthUsageKbSec() | |
| int | getMinDelayMs() | |
| int | getRespectCrawlDelayUpToSeconds() | |
| protected void | innerProcess(CrawlURI curi) | |
| protected long | politenessDelayFor(CrawlURI curi) | Update any scheduling structures with the new information in this CrawlURI. |
| void | setDelayFactor(float factor) | How many multiples of the last fetch's elapsed time to wait before recontacting the same server. |
| void | setForceRetire(boolean force) | Whether to set a CrawlURI's force-retired directive, retiring its queue when it finishes. |
| void | setMaxDelayMs(int maxDelay) | Never wait more than this long, regardless of the delay-factor multiple. |
| void | setMaxPerHostBandwidthUsageKbSec(int max) | Maximum per-host bandwidth usage. |
| void | setMetadata(CrawlMetadata provider) | Auto-discovered module providing the configured (or overridden) User-Agent value and RobotsHonoringPolicy. |
| void | setMinDelayMs(int minDelay) | Always wait this long after one completion before recontacting the same server, regardless of the delay-factor multiple. |
| void | setRespectCrawlDelayUpToSeconds(int respect) | Whether to respect a 'Crawl-Delay' (in seconds) given in a site's robots.txt. |
| void | setServerCache(ServerCache serverCache) | |
| protected boolean | shouldProcess(CrawlURI puri) | |

Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, fromCheckpointJson, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, report, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop, toCheckpointJson
-
Field Details
-
serverCache
-
metadata
-
-
Constructor Details
-
DispositionProcessor
public DispositionProcessor()
-
-
Method Details
-
getServerCache
-
setServerCache
-
getDelayFactor
public float getDelayFactor()
-
setDelayFactor
public void setDelayFactor(float factor)
How many multiples of the last fetch's elapsed time to wait before recontacting the same server.
-
getMinDelayMs
public int getMinDelayMs()
-
setMinDelayMs
public void setMinDelayMs(int minDelay)
Always wait this long (in milliseconds) after one completion before recontacting the same server, regardless of the delay-factor multiple.
-
getRespectCrawlDelayUpToSeconds
public int getRespectCrawlDelayUpToSeconds()
-
setRespectCrawlDelayUpToSeconds
public void setRespectCrawlDelayUpToSeconds(int respect)
Whether to respect a 'Crawl-Delay' (in seconds) given in a site's robots.txt.
-
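The property name suggests that a robots.txt 'Crawl-Delay' is honored only up to the configured number of seconds. The sketch below illustrates that reading; it is an assumption drawn from the name, not the Heritrix implementation, and the class `CrawlDelayCapSketch` and method `applyCrawlDelay` are hypothetical.

```java
/** Illustrative only: a hypothetical helper, not Heritrix source code. */
final class CrawlDelayCapSketch {

    /**
     * Returns the politeness delay in milliseconds after taking a robots.txt
     * Crawl-Delay into account. The Crawl-Delay is capped at respectUpToSeconds
     * (assumed semantics), and it can only lengthen the current delay.
     */
    static long applyCrawlDelay(long currentDelayMs, double crawlDelaySeconds,
                                int respectUpToSeconds) {
        long capMs = respectUpToSeconds * 1000L;
        long crawlDelayMs = (long) (crawlDelaySeconds * 1000);
        // Honor the site's request, but never beyond the configured cap.
        long honoredMs = Math.min(crawlDelayMs, capMs);
        return Math.max(currentDelayMs, honoredMs);
    }

    public static void main(String[] args) {
        // Example values are arbitrary.
        System.out.println(applyCrawlDelay(3000, 10.0, 300));  // prints 10000
        System.out.println(applyCrawlDelay(3000, 600.0, 300)); // prints 300000
    }
}
```

Under this reading, a cap of 0 means a Crawl-Delay never lengthens the wait.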
getMaxDelayMs
public int getMaxDelayMs()
-
setMaxDelayMs
public void setMaxDelayMs(int maxDelay)
Never wait more than this long (in milliseconds), regardless of the delay-factor multiple.
-
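To make the interaction of the delay factor, minimum delay, and maximum delay concrete, here is a minimal sketch that follows the setter descriptions literally: scale the last fetch's elapsed time by the factor, then clamp the result between the minimum and the maximum. The class `PolitenessDelaySketch`, the method `delayMs`, and the example values are hypothetical; the real `politenessDelayFor` may take further inputs into account.

```java
/** Illustrative only: a hypothetical helper, not Heritrix source code. */
final class PolitenessDelaySketch {

    /**
     * Scales the last fetch duration by delayFactor, then clamps the result
     * to the range [minDelayMs, maxDelayMs]. The maximum is applied last,
     * so it wins if the two bounds conflict.
     */
    static long delayMs(long lastFetchDurationMs, float delayFactor,
                        int minDelayMs, int maxDelayMs) {
        long delay = (long) (delayFactor * lastFetchDurationMs);
        if (delay < minDelayMs) {
            delay = minDelayMs; // wait at least the minimum
        }
        if (delay > maxDelayMs) {
            delay = maxDelayMs; // wait no more than the maximum
        }
        return delay;
    }

    public static void main(String[] args) {
        // Example values are arbitrary.
        System.out.println(delayMs(200, 5.0f, 3000, 30000));   // prints 3000
        System.out.println(delayMs(2000, 5.0f, 3000, 30000));  // prints 10000
        System.out.println(delayMs(10000, 5.0f, 3000, 30000)); // prints 30000
    }
}
```

A fast server (200 ms fetch) is still given the full minimum pause, while a slow one (10 s fetch) is not penalized beyond the maximum.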
getMaxPerHostBandwidthUsageKbSec
public int getMaxPerHostBandwidthUsageKbSec()
-
setMaxPerHostBandwidthUsageKbSec
public void setMaxPerHostBandwidthUsageKbSec(int max)
Maximum per-host bandwidth usage.
-
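One way a per-host bandwidth cap can translate into a wait is to ask how long a fetch of a given size must be spread over to stay under the cap. The sketch below shows that arithmetic under stated assumptions: the unit is read as kilobytes per second with 1 KB = 1024 bytes, and a non-positive cap is treated as "no limit". Neither assumption is confirmed by this page, and `BandwidthDelaySketch` and `minDelayMsForBandwidth` are hypothetical names.

```java
/** Illustrative only: a hypothetical helper, not Heritrix source code. */
final class BandwidthDelaySketch {

    /**
     * Returns the minimum number of milliseconds that bytesFetched must be
     * spread over so the average rate stays at or below maxKbPerSec.
     */
    static long minDelayMsForBandwidth(long bytesFetched, int maxKbPerSec) {
        if (maxKbPerSec <= 0) {
            return 0L; // assumption: a non-positive cap means no limit
        }
        long bytesPerSecond = maxKbPerSec * 1024L; // assumption: 1 KB = 1024 bytes
        // Integer ceiling of bytesFetched * 1000 / bytesPerSecond.
        return (bytesFetched * 1000L + bytesPerSecond - 1) / bytesPerSecond;
    }

    public static void main(String[] args) {
        // 100 KB fetched with a 50 KB/sec cap needs two seconds.
        System.out.println(minDelayMsForBandwidth(102_400, 50)); // prints 2000
    }
}
```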
getForceRetire
public boolean getForceRetire()
-
setForceRetire
public void setForceRetire(boolean force)
Whether to set a CrawlURI's force-retired directive, retiring its queue when it finishes. Mainly intended for URI-specific overlay settings; setting it to true globally will retire every queue after it offers one URI, rapidly ending the crawl.
-
getMetadata
-
setMetadata
Auto-discovered module providing the configured (or overridden) User-Agent value and RobotsHonoringPolicy.
-
shouldProcess
- Specified by:
shouldProcess
in class Processor
-
innerProcess
- Specified by:
innerProcess
in class Processor
-
politenessDelayFor
Update any scheduling structures with the new information in this CrawlURI. Chiefly, this means making the necessary arrangements so that no other URIs at the same host are visited within the appropriate politeness window.
- Parameters:
curi - The CrawlURI
- Returns:
- millisecond politeness delay
-