Class DispositionProcessor

java.lang.Object
org.archive.modules.Processor
org.archive.crawler.postprocessor.DispositionProcessor
All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle

public class DispositionProcessor extends Processor
A step, late in the processing of a CrawlURI, that marks up the CrawlURI with values affecting frontier disposition and updates information that may have been affected by the fetch, including robots info and other stats. (Formerly called CrawlStateUpdater, when it did less.)
Author:
gojomo
  • Constructor Details

    • DispositionProcessor

      public DispositionProcessor()
  • Method Details

    • getServerCache

      public ServerCache getServerCache()
    • setServerCache

      @Autowired public void setServerCache(ServerCache serverCache)
    • getDelayFactor

      public float getDelayFactor()
    • setDelayFactor

      public void setDelayFactor(float factor)
      Multiple of the last fetch's elapsed time to wait before recontacting the same server.
    • getMinDelayMs

      public int getMinDelayMs()
    • setMinDelayMs

      public void setMinDelayMs(int minDelay)
      Minimum time, in milliseconds, to wait after one fetch completes before recontacting the same server, regardless of the delay-factor multiple.
    • getRespectCrawlDelayUpToSeconds

      public int getRespectCrawlDelayUpToSeconds()
    • setRespectCrawlDelayUpToSeconds

      public void setRespectCrawlDelayUpToSeconds(int respect)
      Upper bound, in seconds, up to which a 'Crawl-Delay' given in a site's robots.txt is respected.
    • getMaxDelayMs

      public int getMaxDelayMs()
    • setMaxDelayMs

      public void setMaxDelayMs(int maxDelay)
      Maximum time, in milliseconds, to wait before recontacting the same server, regardless of the delay-factor multiple.
    • getMaxPerHostBandwidthUsageKbSec

      public int getMaxPerHostBandwidthUsageKbSec()
    • setMaxPerHostBandwidthUsageKbSec

      public void setMaxPerHostBandwidthUsageKbSec(int max)
      Maximum per-host bandwidth usage (per the property name, in Kb/sec).
    • getForceRetire

      public boolean getForceRetire()
    • setForceRetire

      public void setForceRetire(boolean force)
      Whether to set a CrawlURI's force-retired directive, retiring its queue when it finishes. Mainly intended for URI-specific overlay settings; setting it to true globally will retire every queue after it offers one URI, rapidly ending the crawl.
    • getMetadata

      public CrawlMetadata getMetadata()
    • setMetadata

      @Autowired public void setMetadata(CrawlMetadata provider)
      Auto-discovered module providing configured (or overridden) User-Agent value and RobotsHonoringPolicy
    • shouldProcess

      protected boolean shouldProcess(CrawlURI puri)
      Specified by:
      shouldProcess in class Processor
    • innerProcess

      protected void innerProcess(CrawlURI curi)
      Specified by:
      innerProcess in class Processor
    • politenessDelayFor

      protected long politenessDelayFor(CrawlURI curi)
      Update any scheduling structures with the new information in this CrawlURI. Chiefly, this means arranging for no other URIs at the same host to be visited within the appropriate politeness window.
      Parameters:
      curi - The CrawlURI
      Returns:
      the politeness delay, in milliseconds
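
The delay settings documented above (delayFactor, minDelayMs, maxDelayMs, respectCrawlDelayUpToSeconds) can be read as one calculation. The following is a minimal sketch of how they might combine, not the actual DispositionProcessor code: the class PolitenessDelaySketch, its method, and its parameters are hypothetical, and the ordering of the steps (scale, clamp, then apply a capped Crawl-Delay) is an assumption based on the property descriptions.

```java
public class PolitenessDelaySketch {

    /**
     * Hypothetical stand-in for a politeness-delay calculation.
     * All parameters mirror the bean properties documented above;
     * robotsCrawlDelayMs is an assumed input holding the robots.txt
     * Crawl-Delay already converted to milliseconds (0 if absent).
     */
    static long politenessDelayMs(long fetchDurationMs,
                                  float delayFactor,
                                  int minDelayMs,
                                  int maxDelayMs,
                                  long robotsCrawlDelayMs,
                                  int respectCrawlDelayUpToSeconds) {
        // Scale the last fetch's elapsed time by the delay factor.
        long delay = (long) (fetchDurationMs * delayFactor);

        // Clamp the result into [minDelayMs, maxDelayMs].
        delay = Math.max(delay, minDelayMs);
        delay = Math.min(delay, maxDelayMs);

        // Assumption: a robots.txt Crawl-Delay is honored only up to the
        // configured cap, and can lengthen the wait but never shorten it.
        long capMs = respectCrawlDelayUpToSeconds * 1000L;
        long honoredCrawlDelayMs = Math.min(robotsCrawlDelayMs, capMs);
        return Math.max(delay, honoredCrawlDelayMs);
    }

    public static void main(String[] args) {
        // A 400 ms fetch with factor 5.0 gives 2000 ms, which the
        // 3000 ms minimum raises to 3000 ms.
        System.out.println(politenessDelayMs(400, 5.0f, 3000, 30000, 0, 300));
    }
}
```

Under these assumptions, a Crawl-Delay larger than maxDelayMs still takes effect as long as it stays within the respectCrawlDelayUpToSeconds cap, so the two limits act independently.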