Class FetchHistoryProcessor

java.lang.Object
org.archive.modules.Processor
org.archive.modules.recrawl.FetchHistoryProcessor
All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle

public class FetchHistoryProcessor
extends Processor
Maintains a history of fetch information inside the CrawlURI's attributes.
Version:
$Date: 2006-09-25 20:19:54 +0000 (Mon, 25 Sep 2006) $, $Revision: 4654 $
Author:
gojomo
  • Field Details

    • historyLength

      protected int historyLength
      Desired history array length.
  • Constructor Details

    • FetchHistoryProcessor

      public FetchHistoryProcessor()
  • Method Details

    • getHistoryLength

      public int getHistoryLength()
      Returns the desired history array length.
    • setHistoryLength

      public void setHistoryLength(int length)
      Sets the desired history array length.
    • innerProcess

      protected void innerProcess(CrawlURI puri) throws InterruptedException
      Description copied from class: Processor
      Actually performs the process. By the time this method is invoked, it is known that the given URI passes the Processor.getEnabled(), the Processor.getShouldProcessRule() and the Processor.shouldProcess(CrawlURI) tests.
      Specified by:
      innerProcess in class Processor
      Parameters:
      puri - the URI to process
      Throws:
      InterruptedException - if the thread is interrupted
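      The page does not describe how the history is updated on each fetch. The following is a minimal sketch of one plausible scheme, assuming the most recent fetch is kept at index 0 and older entries shift toward the end, with the oldest falling off. The class name, the helper methods, and the "status" key are hypothetical and are not part of this API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not the Heritrix implementation.
final class HistoryShiftSketch {

    /** Creates an empty fixed-length history array (all slots null). */
    @SuppressWarnings("unchecked")
    static Map<String, Object>[] newHistory(int length) {
        return new Map[length];
    }

    /** Builds one history entry; the "status" key is an assumption. */
    static Map<String, Object> entry(int status) {
        Map<String, Object> m = new HashMap<>();
        m.put("status", status);
        return m;
    }

    /**
     * Shifts every entry one slot toward the end, dropping the oldest,
     * then stores the latest fetch at index 0.
     */
    static void recordFetch(Map<String, Object>[] history, Map<String, Object> latest) {
        if (history.length == 0) {
            return; // nothing can be stored in a zero-length history
        }
        for (int i = history.length - 1; i > 0; i--) {
            history[i] = history[i - 1];
        }
        history[0] = latest;
    }
}
```

      With a history of length 2, recording three fetches in a row leaves only the two most recent entries, newest first.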
    • hasIdenticalDigest

      public static boolean hasIdenticalDigest(CrawlURI curi)
      Utility method for testing if a CrawlURI's last two history entries (one being the most recent fetch) have identical content-digest information.
      Parameters:
      curi - CrawlURI to test
      Returns:
      true if last two history entries have identical digests, otherwise false
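      The comparison described above can be sketched without any Heritrix types. This is an illustration only: it assumes the history is an array of maps with the most recent fetch at index 0, and that each entry stores its digest under a key named "content-digest". The class name, the key name, and the helper are assumptions, not taken from this page.

```java
import java.util.Map;

// Hypothetical stand-alone sketch; not the Heritrix implementation.
final class DigestComparisonSketch {

    /** Assumed key under which an entry stores its content digest. */
    static final String DIGEST_KEY = "content-digest";

    /** Creates an empty fixed-length history array (all slots null). */
    @SuppressWarnings("unchecked")
    static Map<String, Object>[] newHistory(int length) {
        return new Map[length];
    }

    /**
     * Returns true only when both of the two most recent entries exist
     * and carry equal, non-null digests.
     */
    static boolean hasIdenticalDigest(Map<String, Object>[] history) {
        if (history == null || history.length < 2
                || history[0] == null || history[1] == null) {
            return false; // fewer than two recorded fetches
        }
        Object latest = history[0].get(DIGEST_KEY);
        Object previous = history[1].get(DIGEST_KEY);
        return latest != null && latest.equals(previous);
    }
}
```

      Treating a missing entry or a missing digest as "not identical" is a conservative choice in this sketch: content is reported as unchanged only when there is positive evidence of it.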
    • historyRealloc

      protected HashMap<String,Object>[] historyRealloc(CrawlURI curi)
      Gets the CrawlURI's history array, or creates one, sized to the desired history length.
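      A "get or create" resize of this kind might look like the sketch below. It operates on a plain array rather than a CrawlURI and assumes that, when the length changes, leading entries are preserved and the array is truncated or padded with nulls. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;

// Hypothetical stand-alone sketch; not the Heritrix implementation.
final class HistoryReallocSketch {

    /**
     * Returns an array of exactly targetLength slots: a fresh one if none
     * exists, the same one if it is already the right size, otherwise a
     * copy that keeps the leading entries.
     */
    @SuppressWarnings("unchecked")
    static HashMap<String, Object>[] realloc(HashMap<String, Object>[] existing,
                                             int targetLength) {
        if (existing == null) {
            return new HashMap[targetLength]; // create: all slots null
        }
        if (existing.length == targetLength) {
            return existing; // already proper-sized
        }
        // Arrays.copyOf truncates or pads with null as needed.
        return Arrays.copyOf(existing, targetLength);
    }
}
```

      Returning the existing array unchanged when it is already the right size avoids an allocation on every processed URI.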
    • saveHeader

      protected void saveHeader(CrawlURI curi, Map<String,Object> map, String key)
      Save a header from the given HTTP operation into the Map.
    • shouldProcess

      protected boolean shouldProcess(CrawlURI curi)
      Description copied from class: Processor
      Determines whether the given URI should be processed by this processor. For instance, a processor that only works on HTML content might reject the URI if its content type is not "text/html", if its content length is zero, and so on.
      Specified by:
      shouldProcess in class Processor
      Parameters:
      curi - the URI to test
      Returns:
      true if this processor should process that URI; false if not