Class ScriptedProcessor

java.lang.Object
org.archive.modules.Processor
org.archive.modules.ScriptedProcessor
All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.beans.factory.InitializingBean, org.springframework.context.ApplicationContextAware, org.springframework.context.Lifecycle

public class ScriptedProcessor
extends Processor
implements org.springframework.context.ApplicationContextAware, org.springframework.beans.factory.InitializingBean
A processor that runs a JSR-223 script on each CrawlURI. The script source may be supplied as a file local to the crawler or as an inline configuration string. The source must define a function "run()" that takes one argument; each processed CrawlURI is passed to this function. The script can also use the variables 'self' (this ScriptedProcessor instance) and 'context' (the crawl's ApplicationContext instance, from which all named beans are reachable). TODO: provide a way to trigger a reload of the script mid-crawl, perhaps by watching for a certain ApplicationEvent.
Author:
gojomo
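The script contract described above can be tried outside a crawl with plain JSR-223 calls. The following is a minimal sketch, not the processor's own code: the class name RunFunctionSketch, the helper runScript, and the use of Invocable.invokeFunction are assumptions made for illustration, and a plain String stands in for the CrawlURI. It returns an empty Optional when the named engine is not installed (recent JDKs ship without a JavaScript engine).

```java
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import java.util.Optional;

// Hypothetical helper class; not part of Heritrix.
class RunFunctionSketch {

    /**
     * Evaluates the script source in the named engine, then calls its
     * one-argument "run" function. Returns empty when the engine is missing,
     * cannot invoke functions by name, or the function returns null.
     */
    static Optional<Object> runScript(String engineName, String source, Object arg) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
        if (!(engine instanceof Invocable)) {
            return Optional.empty(); // also covers engine == null
        }
        try {
            engine.eval(source); // defines run() inside the engine
            return Optional.ofNullable(((Invocable) engine).invokeFunction("run", arg));
        } catch (ScriptException | NoSuchMethodException e) {
            throw new IllegalStateException("script failed", e);
        }
    }

    public static void main(String[] args) {
        // The String argument is a stand-in for a CrawlURI.
        String source = "function run(curi) { return 'saw ' + curi; }";
        Optional<Object> result = runScript("javascript", source, "http://example.com/");
        System.out.println(result.map(Object::toString).orElse("engine not installed"));
    }
}
```

The same source text would be supplied to the processor through its scriptSource property; the function body is the only part that changes between engines.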
  • Field Details

    • engineName

      protected String engineName
      Name of the JSR-223 script engine to use; the default is "beanshell".
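Which engine names are valid depends on the script engines present on the crawler's classpath. As a rough sketch (the class EngineNamesSketch and its method are illustrative names, not part of Heritrix), the standard ScriptEngineManager can list every name that a lookup would accept:

```java
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of Heritrix.
class EngineNamesSketch {

    /** Collects every name under which an installed JSR-223 engine is registered. */
    static List<String> availableEngineNames() {
        List<String> names = new ArrayList<>();
        for (ScriptEngineFactory factory : new ScriptEngineManager().getEngineFactories()) {
            names.addAll(factory.getNames());
        }
        return names;
    }

    public static void main(String[] args) {
        // May print an empty list when no script engine is on the classpath.
        System.out.println(availableEngineNames());
    }
}
```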
    • scriptSource

      protected org.archive.io.ReadSource scriptSource
      Source of the script: either a file local to the crawler or an inline configuration string.
    • isolateThreads

      protected boolean isolateThreads
      Whether each ToeThread should get its own independent script engine, or they should share synchronized access to one engine. Default is true, meaning each thread gets its own isolated engine.
    • appCtx

      protected org.springframework.context.ApplicationContext appCtx
    • threadEngine

      protected transient ThreadLocal<ScriptEngine> threadEngine
    • sharedEngine

      protected ScriptEngine sharedEngine
  • Constructor Details

    • ScriptedProcessor

      public ScriptedProcessor()
      Creates a ScriptedProcessor with the default settings (engine name "beanshell", isolateThreads true).
  • Method Details

    • getEngineName

      public String getEngineName()
    • setEngineName

      public void setEngineName(String name)
    • getScriptSource

      public org.archive.io.ReadSource getScriptSource()
    • setScriptSource

      public void setScriptSource(org.archive.io.ReadSource source)
    • getIsolateThreads

      public boolean getIsolateThreads()
    • setIsolateThreads

      public void setIsolateThreads(boolean isolateThreads)
    • setApplicationContext

      public void setApplicationContext(org.springframework.context.ApplicationContext applicationContext) throws org.springframework.beans.BeansException
      Specified by:
      setApplicationContext in interface org.springframework.context.ApplicationContextAware
      Throws:
      org.springframework.beans.BeansException
    • afterPropertiesSet

      public void afterPropertiesSet() throws Exception
      Specified by:
      afterPropertiesSet in interface org.springframework.beans.factory.InitializingBean
      Throws:
      Exception
    • shouldProcess

      protected boolean shouldProcess(CrawlURI curi)
      Description copied from class: Processor
      Determines whether the given uri should be processed by this processor. For instance, a processor that only works on HTML content might reject the URI if its content type is not "text/html", if its content length is zero, and so on.
      Specified by:
      shouldProcess in class Processor
      Parameters:
      curi - the URI to test
      Returns:
      true if this processor should process that uri; false if not
    • innerProcess

      protected void innerProcess(CrawlURI curi)
      Description copied from class: Processor
      Actually performs the process. By the time this method is invoked, it is known that the given URI passes the Processor.getEnabled(), the Processor.getShouldProcessRule() and the Processor.shouldProcess(CrawlURI) tests.
      Specified by:
      innerProcess in class Processor
      Parameters:
      curi - the URI to process
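The gating order described for shouldProcess and innerProcess follows the template-method pattern. The sketch below is a simplified illustration only: MiniProcessor and HtmlOnlyProcessor are hypothetical classes, a String stands in for CrawlURI, and the getShouldProcessRule() check is omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the Processor base class.
abstract class MiniProcessor {
    private boolean enabled = true;

    boolean getEnabled() { return enabled; }
    void setEnabled(boolean enabled) { this.enabled = enabled; }

    /** Subclass decides whether the URI is of interest. */
    protected abstract boolean shouldProcess(String uri);

    /** Subclass does the real work; only called once all checks have passed. */
    protected abstract void innerProcess(String uri);

    /** Runs the checks in order and reports whether innerProcess was invoked. */
    final boolean process(String uri) {
        if (!getEnabled() || !shouldProcess(uri)) {
            return false;
        }
        innerProcess(uri);
        return true;
    }
}

// Hypothetical subclass that accepts only URIs ending in ".html".
class HtmlOnlyProcessor extends MiniProcessor {
    final List<String> seen = new ArrayList<>();

    @Override
    protected boolean shouldProcess(String uri) { return uri.endsWith(".html"); }

    @Override
    protected void innerProcess(String uri) { seen.add(uri); }

    public static void main(String[] args) {
        HtmlOnlyProcessor p = new HtmlOnlyProcessor();
        p.process("http://example.com/a.html");
        p.process("http://example.com/logo.png");
        System.out.println(p.seen);
    }
}
```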
    • getEngine

      protected ScriptEngine getEngine()
      Gets the appropriate ScriptEngine instance, either the shared one or the one local to the current thread.
      Returns:
      ScriptEngine to use
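The choice between a per-thread engine and a shared one can be pictured with a ThreadLocal. This is a sketch under assumptions, not the actual implementation: EngineHolder and its members are hypothetical, and a generic type parameter stands in for ScriptEngine so the example runs without any script engine installed.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical holder illustrating the isolateThreads choice.
class EngineHolder<E> {
    private final boolean isolateThreads;
    private final Supplier<E> factory;
    private final ThreadLocal<E> threadEngine;
    private E sharedEngine;

    EngineHolder(boolean isolateThreads, Supplier<E> factory) {
        this.isolateThreads = isolateThreads;
        this.factory = factory;
        // Each thread lazily creates its own engine on first access.
        this.threadEngine = ThreadLocal.withInitial(factory);
    }

    /** Returns the calling thread's own engine, or the single shared one. */
    E getEngine() {
        if (isolateThreads) {
            return threadEngine.get();
        }
        synchronized (this) {
            if (sharedEngine == null) {
                sharedEngine = factory.get();
            }
            // Callers should synchronize on the shared engine while using it.
            return sharedEngine;
        }
    }

    /** Demo helper: fetches the engine as seen from a freshly started thread. */
    static <E> E engineSeenByNewThread(EngineHolder<E> holder) {
        AtomicReference<E> seen = new AtomicReference<>();
        Thread worker = new Thread(() -> seen.set(holder.getEngine()));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    public static void main(String[] args) {
        EngineHolder<Object> isolated = new EngineHolder<>(true, Object::new);
        System.out.println(isolated.getEngine() == engineSeenByNewThread(isolated));
    }
}
```

Isolated engines avoid lock contention between threads at the cost of one engine (and one copy of script state) per thread; a shared engine lets script state be visible to all threads but serializes access.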
    • newEngine

      protected ScriptEngine newEngine()
      Create a new ScriptEngine instance, preloaded with any supplied source file and the variables 'self' (this ScriptedProcessor) and 'context' (the ApplicationContext).
      Returns:
      the new ScriptEngine instance
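Preloading an engine with the 'self' and 'context' variables and the script source can be sketched with standard JSR-223 calls. The class NewEngineSketch and both of its methods are hypothetical names, plain Strings stand in for the processor and the ApplicationContext, and the method returns null when the named engine is not installed; the real method may differ.

```java
import javax.script.Bindings;
import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleBindings;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch; not the Heritrix implementation.
class NewEngineSketch {

    /** Builds the variables that the script should see. */
    static Bindings preloadedBindings(Object self, Object context) {
        Bindings bindings = new SimpleBindings();
        bindings.put("self", self);
        bindings.put("context", context);
        return bindings;
    }

    /** Creates an engine, installs the variables, then evaluates the source. */
    static ScriptEngine newEngine(String engineName, Reader source, Object self, Object context) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
        if (engine == null) {
            return null; // no engine registered under that name
        }
        engine.getBindings(ScriptContext.ENGINE_SCOPE).putAll(preloadedBindings(self, context));
        try {
            engine.eval(source); // defines the script's functions
        } catch (ScriptException e) {
            throw new IllegalStateException("script source failed to load", e);
        }
        return engine;
    }

    public static void main(String[] args) {
        Reader source = new StringReader("function run(curi) { return self + ' saw ' + curi; }");
        ScriptEngine engine = newEngine("javascript", source, "stand-in-self", "stand-in-context");
        System.out.println(engine == null ? "engine not installed" : "engine ready");
    }
}
```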