Class Processor

java.lang.Object
org.archive.modules.Processor
All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
Direct Known Subclasses:
AbstractPersistProcessor, ContentDigestHistoryLoader, ContentDigestHistoryStorer, Extractor, FetchDNS, FetchFTP, FetchHistoryProcessor, FetchHTTP, FetchSFTP, FetchWhois, FormLoginProcessor, HTTPContentDigest, Kw3WriterProcessor, MirrorWriterProcessor, ScriptedProcessor, WriterPoolProcessor

public abstract class Processor
extends Object
implements org.archive.spring.HasKeyedProperties, org.springframework.context.Lifecycle, org.springframework.beans.factory.BeanNameAware, org.archive.checkpointing.Checkpointable
A processor of URIs. The URI provides the context for the process; settings can be altered based on the URI.
Author:
pjack
  • Field Details

    • kp

      protected org.archive.spring.KeyedProperties kp
    • beanName

      protected String beanName
    • uriCount

      protected AtomicLong uriCount
      The number of URIs processed by this processor.
    • isRunning

      protected boolean isRunning
    • recoveryCheckpoint

      protected org.archive.checkpointing.Checkpoint recoveryCheckpoint
  • Constructor Details

    • Processor

      public Processor()
  • Method Details

    • getKeyedProperties

      public org.archive.spring.KeyedProperties getKeyedProperties()
      Specified by:
      getKeyedProperties in interface org.archive.spring.HasKeyedProperties
    • getBeanName

      public String getBeanName()
    • setBeanName

      public void setBeanName(String name)
      Specified by:
      setBeanName in interface org.springframework.beans.factory.BeanNameAware
    • getEnabled

      public boolean getEnabled()
    • setEnabled

      public void setEnabled(boolean enabled)
      Whether or not this processor will execute for a particular URI. If this is false for a URI, then the URI isn't processed, regardless of what the DecideRules say.
    • getShouldProcessRule

      public DecideRule getShouldProcessRule()
    • setShouldProcessRule

      public void setShouldProcessRule(DecideRule rule)
      Decide rule(s) (also particular to a URI) that determine whether or not a particular URI is processed here. If the rule(s) answer REJECT, processing is skipped; an answer of ACCEPT or PASS allows processing to continue.
    • process

      public ProcessResult process(CrawlURI uri) throws InterruptedException
      Processes the given URI. First checks getEnabled() and getShouldProcessRule(). If getEnabled() returns false, then nothing happens. If the shouldProcessRule indicates REJECT, then the innerRejectProcess(CrawlURI) method is invoked, and the process method returns.

      Next, the shouldProcess(CrawlURI) method is consulted to see if this Processor knows how to handle the given URI. If it returns false, then nothing further occurs.

      FIXME: Should innerRejectProcess be called when ENABLED is false, or when shouldProcess returns false? The previous Processor implementation didn't handle it that way.

      Otherwise, the URI is considered valid. This processor's count of handled URIs is incremented, and the innerProcess(CrawlURI) method is invoked to actually perform the process.

      Parameters:
      uri - The URI to process
      Throws:
      InterruptedException - if the thread is interrupted
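
      The order of checks described above can be sketched as follows. This is a simplified model, not the Heritrix implementation: URIs are plain strings, SketchDecision and GatedProcessor are hypothetical stand-ins for the real rule and processor types, the boolean return value replaces ProcessResult, and the InterruptedException declarations are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical stand-in for the three outcomes a decide rule can give.
enum SketchDecision { ACCEPT, REJECT, PASS }

// Simplified model of the gating order in process(); URIs are plain strings here.
abstract class GatedProcessor {
    private final AtomicLong uriCount = new AtomicLong();
    boolean enabled = true;
    // Stand-in for the configured rule; answers PASS for everything by default.
    Function<String, SketchDecision> shouldProcessRule = uri -> SketchDecision.PASS;

    // Returns true only if innerProcess ran (assumption: the real method returns a ProcessResult).
    final boolean process(String uri) {
        if (!enabled) {
            return false;                    // disabled: nothing happens
        }
        if (shouldProcessRule.apply(uri) == SketchDecision.REJECT) {
            innerRejectProcess(uri);         // rule rejected: call the hook and stop
            return false;
        }
        if (!shouldProcess(uri)) {
            return false;                    // this processor cannot handle the URI
        }
        uriCount.incrementAndGet();          // count only URIs that reach innerProcess
        innerProcess(uri);
        return true;
    }

    long getURICount() { return uriCount.get(); }

    protected abstract boolean shouldProcess(String uri);
    protected abstract void innerProcess(String uri);
    protected void innerRejectProcess(String uri) { }   // no-op by default
}

// Example subclass: handles http(s) URIs and records what it saw.
class RecordingProcessor extends GatedProcessor {
    final List<String> processed = new ArrayList<>();
    final List<String> rejected = new ArrayList<>();

    @Override protected boolean shouldProcess(String uri) { return uri.startsWith("http"); }
    @Override protected void innerProcess(String uri) { processed.add(uri); }
    @Override protected void innerRejectProcess(String uri) { rejected.add(uri); }
}
```

      In this sketch the counter advances only on the final branch, which matches the documented behavior that URIs turned away by any of the three checks are not counted.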
    • getURICount

      public long getURICount()
      Returns the number of URIs this processor has handled. The returned number does not include URIs that were rejected by the getEnabled() flag, by the getShouldProcessRule(), or by the shouldProcess(CrawlURI) method.
      Returns:
      the number of URIs this processor has handled
    • shouldProcess

      protected abstract boolean shouldProcess(CrawlURI uri)
      Determines whether the given uri should be processed by this processor. For instance, a processor that only works on HTML content might reject the URI if its content type is not "text/html", if its content length is zero, and so on.
      Parameters:
      uri - the URI to test
      Returns:
      true if this processor should process that uri; false if not
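
      A check of the kind described above (HTML only, non-empty content) might look like the sketch below. FetchedContent and HtmlOnlyFilter are hypothetical names used for illustration; a real subclass would read the content type and length from the CrawlURI instead.

```java
import java.util.Locale;

// Hypothetical stand-in for the fetched-content details a CrawlURI would expose.
final class FetchedContent {
    final String contentType;   // may be null if no content type was recorded
    final long contentLength;

    FetchedContent(String contentType, long contentLength) {
        this.contentType = contentType;
        this.contentLength = contentLength;
    }
}

final class HtmlOnlyFilter {
    // Accepts only non-empty content whose media type is text/html.
    static boolean shouldProcess(FetchedContent content) {
        if (content.contentLength <= 0 || content.contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String mediaType = content.contentType.split(";", 2)[0]
                .trim()
                .toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html");
    }
}
```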
    • innerProcessResult

      protected ProcessResult innerProcessResult(CrawlURI uri) throws InterruptedException
      Throws:
      InterruptedException
    • innerProcess

      protected abstract void innerProcess(CrawlURI uri) throws InterruptedException
      Actually performs the process. By the time this method is invoked, it is known that the given URI passes the getEnabled(), the getShouldProcessRule() and the shouldProcess(CrawlURI) tests.
      Parameters:
      uri - the URI to process
      Throws:
      InterruptedException - if the thread is interrupted
    • innerRejectProcess

      protected void innerRejectProcess(CrawlURI uri) throws InterruptedException
      Invoked after a URI has been rejected. The default implementation does nothing; subclasses may override it, for example to log rejected URIs.
      Parameters:
      uri - the URI that was rejected
      Throws:
      InterruptedException - if the thread is interrupted
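
      One possible override is sketched below: it counts rejected URIs and logs them. RejectHookBase is a hypothetical stand-in for Processor with only the hook in place, URIs are plain strings, and the InterruptedException declaration is left out to keep the example short.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

// Minimal stand-in for the base class: the reject hook does nothing by default.
class RejectHookBase {
    protected void innerRejectProcess(String uri) { }
}

// Overrides the hook to keep a tally and write a fine-level log entry.
class RejectLoggingProcessor extends RejectHookBase {
    private static final Logger LOGGER =
            Logger.getLogger(RejectLoggingProcessor.class.getName());

    final AtomicLong rejectCount = new AtomicLong();

    @Override
    protected void innerRejectProcess(String uri) {
        rejectCount.incrementAndGet();
        LOGGER.fine("rejected URI: " + uri);
    }
}
```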
    • flattenVia

      public static String flattenVia(CrawlURI puri)
    • isSuccess

      public static boolean isSuccess(CrawlURI puri)
    • getRecordedSize

      public static long getRecordedSize(CrawlURI puri)
    • hasHttpAuthenticationCredential

      public static boolean hasHttpAuthenticationCredential(CrawlURI puri)
      Returns:
      true if the given URI has an HttpAuthentication (RFC 2617) payload
    • report

      public String report()
    • isRunning

      public boolean isRunning()
      Specified by:
      isRunning in interface org.springframework.context.Lifecycle
    • start

      public void start()
      Specified by:
      start in interface org.springframework.context.Lifecycle
    • stop

      public void stop()
      Specified by:
      stop in interface org.springframework.context.Lifecycle
    • startCheckpoint

      public void startCheckpoint(org.archive.checkpointing.Checkpoint checkpointInProgress)
      Specified by:
      startCheckpoint in interface org.archive.checkpointing.Checkpointable
    • doCheckpoint

      public void doCheckpoint(org.archive.checkpointing.Checkpoint checkpointInProgress) throws IOException
      Specified by:
      doCheckpoint in interface org.archive.checkpointing.Checkpointable
      Throws:
      IOException
    • toCheckpointJson

      protected org.json.JSONObject toCheckpointJson() throws org.json.JSONException
      Returns a JSONObject of current state that can be consulted on recovery to restore necessary values.
      Returns:
      JSONObject
      Throws:
      org.json.JSONException
    • fromCheckpointJson

      protected void fromCheckpointJson(org.json.JSONObject json) throws org.json.JSONException
      Restores internal state from a JSONObject stored at an earlier checkpoint.
      Parameters:
      json - JSONObject
      Throws:
      org.json.JSONException
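
      The save and restore pair can be illustrated with the round trip below. This is a sketch under stated assumptions: a java.util.Map stands in for org.json.JSONObject so the example needs no external library, the method names toCheckpointState and fromCheckpointState are hypothetical, and the only state saved is the URI count, which may not match what the real class stores.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a checkpoint round trip; a Map stands in for org.json.JSONObject.
class CheckpointedCounter {
    final AtomicLong uriCount = new AtomicLong();

    // Capture the state worth restoring (assumed here to be just the URI count).
    Map<String, Object> toCheckpointState() {
        Map<String, Object> state = new HashMap<>();
        state.put("uriCount", uriCount.get());
        return state;
    }

    // Restore from an earlier snapshot; a missing key leaves the counter untouched.
    void fromCheckpointState(Map<String, Object> state) {
        Object saved = state.get("uriCount");
        if (saved instanceof Number) {
            uriCount.set(((Number) saved).longValue());
        }
    }
}
```

      Subclasses with additional counters would typically extend both methods together, so that every value written at checkpoint time has a matching read on recovery.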
    • finishCheckpoint

      public void finishCheckpoint(org.archive.checkpointing.Checkpoint checkpointInProgress)
      Specified by:
      finishCheckpoint in interface org.archive.checkpointing.Checkpointable
    • setRecoveryCheckpoint

      @Autowired(required=false) public void setRecoveryCheckpoint(org.archive.checkpointing.Checkpoint checkpoint)
      Specified by:
      setRecoveryCheckpoint in interface org.archive.checkpointing.Checkpointable