Package org.archive.modules
Class ScriptedProcessor
java.lang.Object
org.archive.modules.Processor
org.archive.modules.ScriptedProcessor
- All Implemented Interfaces:
org.archive.checkpointing.Checkpointable,
org.archive.spring.HasKeyedProperties,
org.springframework.beans.factory.Aware,
org.springframework.beans.factory.BeanNameAware,
org.springframework.beans.factory.InitializingBean,
org.springframework.context.ApplicationContextAware,
org.springframework.context.Lifecycle
public class ScriptedProcessor extends Processor implements org.springframework.context.ApplicationContextAware, org.springframework.beans.factory.InitializingBean
A processor which runs a JSR-223 script on the CrawlURI.
Script source may be provided via a file local to the crawler or
an inline configuration string.
The source must include a function "run()" taking one argument.
Each processed CrawlURI is passed to this script function.
Other variables available to the script include 'self' (this
ScriptedProcessor instance) and 'context' (the crawl's
ApplicationContext instance, from which all named beans are
reachable).
TODO: provide way to trigger reload of script mid-crawl; perhaps
by watching for a certain applicationEvent?
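As a rough illustration of the description above, the following sketch uses only the standard javax.script API to bind the two extra variables and call a one-argument run() function. It is not taken from ScriptedProcessor: the class name ScriptBindingsSketch, the helper methods, the BeanShell-style script text, and the "run(curi)" call style are all assumptions made for illustration, and plain strings stand in for the processor, the ApplicationContext, and the CrawlURI. The "beanshell" engine is only found if a BeanShell JSR-223 implementation is on the classpath, so the sketch skips evaluation when it is missing.

```java
import javax.script.Bindings;
import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleBindings;

class ScriptBindingsSketch {

    /** Hypothetical BeanShell-style source defining run() with one argument. */
    static final String SOURCE =
        "run(curi) { return \"saw \" + curi + \" via \" + self; }";

    /** Collects the two extra variables named in the class description. */
    static Bindings buildBindings(Object self, Object context) {
        Bindings vars = new SimpleBindings();
        vars.put("self", self);
        vars.put("context", context);
        return vars;
    }

    /** Returns the named JSR-223 engine, or null if none is registered. */
    static ScriptEngine findEngine(String name) {
        return new ScriptEngineManager().getEngineByName(name);
    }

    /** Loads the source, then calls run() on one item (assumed call style). */
    static Object callRun(ScriptEngine engine, Bindings vars, Object curi)
            throws ScriptException {
        engine.getBindings(ScriptContext.ENGINE_SCOPE).putAll(vars);
        engine.eval(SOURCE);
        engine.put("curi", curi);
        return engine.eval("run(curi)");
    }

    public static void main(String[] args) throws ScriptException {
        // Stand-in values; a real crawl would supply the processor and context.
        Bindings vars = buildBindings("processor-stand-in", "context-stand-in");
        ScriptEngine engine = findEngine("beanshell");
        if (engine == null) {
            System.out.println("No 'beanshell' engine on the classpath; skipping eval.");
            return;
        }
        System.out.println(callRun(engine, vars, "http://example.com/"));
    }
}
```

In a crawl configuration the script text would come from the scriptSource property rather than a string constant.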
- Author:
- gojomo
-
Field Summary
Fields (Modifier and Type, Field, Description):
protected org.springframework.context.ApplicationContext appCtx
protected String engineName
    engine name; default "beanshell"
protected boolean isolateThreads
    Whether each ToeThread should get its own independent script engine, or they should share synchronized access to one engine.
protected org.archive.io.ReadSource scriptSource
protected ScriptEngine sharedEngine
protected ThreadLocal<ScriptEngine> threadEngine
-
Constructor Summary
Constructors (Constructor, Description):
ScriptedProcessor()
    Constructor.
-
Method Summary
Methods (Modifier and Type, Method, Description):
void afterPropertiesSet()
protected ScriptEngine getEngine()
    Get the proper ScriptEngine instance -- either shared or local to this thread.
String getEngineName()
boolean getIsolateThreads()
org.archive.io.ReadSource getScriptSource()
protected void innerProcess(CrawlURI curi)
    Actually performs the process.
protected ScriptEngine newEngine()
    Create a new ScriptEngine instance, preloaded with any supplied source file and the variables 'self' (this ScriptedProcessor) and 'context' (the ApplicationContext).
void setApplicationContext(org.springframework.context.ApplicationContext applicationContext)
void setEngineName(String name)
void setIsolateThreads(boolean isolateThreads)
void setScriptSource(org.archive.io.ReadSource source)
protected boolean shouldProcess(CrawlURI curi)
    Determines whether the given uri should be processed by this processor.
Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, fromCheckpointJson, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, report, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop, toCheckpointJson
-
Field Details
-
engineName
protected String engineName
engine name; default "beanshell" -
scriptSource
protected org.archive.io.ReadSource scriptSource -
isolateThreads
protected boolean isolateThreads
Whether each ToeThread should get its own independent script engine, or they should share synchronized access to one engine. Default is true, meaning each thread gets its own isolated engine. -
appCtx
protected org.springframework.context.ApplicationContext appCtx -
threadEngine
protected ThreadLocal<ScriptEngine> threadEngine
-
-
Constructor Details
-
ScriptedProcessor
public ScriptedProcessor()
Constructor.
-
-
Method Details
-
getEngineName
public String getEngineName() -
setEngineName
public void setEngineName(String name) -
getScriptSource
public org.archive.io.ReadSource getScriptSource() -
setScriptSource
public void setScriptSource(org.archive.io.ReadSource source) -
getIsolateThreads
public boolean getIsolateThreads() -
setIsolateThreads
public void setIsolateThreads(boolean isolateThreads) -
setApplicationContext
public void setApplicationContext(org.springframework.context.ApplicationContext applicationContext) throws org.springframework.beans.BeansException
- Specified by:
setApplicationContext in interface org.springframework.context.ApplicationContextAware
- Throws:
org.springframework.beans.BeansException
-
afterPropertiesSet
public void afterPropertiesSet() throws Exception
- Specified by:
afterPropertiesSet in interface org.springframework.beans.factory.InitializingBean
- Throws:
Exception
-
shouldProcess
protected boolean shouldProcess(CrawlURI curi)
Description copied from class: Processor
Determines whether the given uri should be processed by this processor. For instance, a processor that only works on HTML content might reject the URI if its content type is not "text/html", if its content length is zero, and so on.
- Specified by:
shouldProcess in class Processor
- Parameters:
curi - the URI to test
- Returns:
true if this processor should process that uri; false if not
-
innerProcess
protected void innerProcess(CrawlURI curi)
Description copied from class: Processor
Actually performs the process. By the time this method is invoked, it is known that the given URI passes the Processor.getEnabled(), the Processor.getShouldProcessRule() and the Processor.shouldProcess(CrawlURI) tests.
- Specified by:
innerProcess in class Processor
- Parameters:
curi - the URI to process
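The three tests named above act as gates in front of innerProcess. The sketch below shows one way such a gate sequence could look; it is not the Processor implementation. The class names GatedProcessorSketch and HtmlOnlySketch are hypothetical, a String stands in for CrawlURI, a java.util.function.Predicate stands in for the should-process rule, and both the order of the checks and the boolean return value are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

abstract class GatedProcessorSketch {
    // Stand-ins for getEnabled() and getShouldProcessRule() (assumed shapes).
    boolean enabled = true;
    Predicate<String> shouldProcessRule = uri -> true;

    /** Per-processor test, e.g. "only handle HTML". */
    protected abstract boolean shouldProcess(String uri);

    /** The actual work; reached only when all three gates pass. */
    protected abstract void innerProcess(String uri);

    /** Runs the gates in an assumed order, then delegates to innerProcess. */
    final boolean process(String uri) {
        if (!enabled) {
            return false;
        }
        if (!shouldProcessRule.test(uri)) {
            return false;
        }
        if (!shouldProcess(uri)) {
            return false;
        }
        innerProcess(uri);
        return true;
    }
}

/** Hypothetical subclass that only accepts URIs ending in ".html". */
class HtmlOnlySketch extends GatedProcessorSketch {
    final List<String> seen = new ArrayList<>();

    @Override
    protected boolean shouldProcess(String uri) {
        return uri.endsWith(".html");
    }

    @Override
    protected void innerProcess(String uri) {
        seen.add(uri);
    }
}
```

With this shape, a subclass such as ScriptedProcessor only needs to supply the two protected methods and can assume the gates have already been applied when innerProcess runs.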
-
getEngine
protected ScriptEngine getEngine()
Get the proper ScriptEngine instance -- either shared or local to this thread.
- Returns:
ScriptEngine to use
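The choice between a shared engine and a per-thread engine can be pictured with a ThreadLocal, as in the sketch below. This is an illustration only, not the ScriptedProcessor source: EngineSelectorSketch, its generic parameter E, and the Supplier that plays the role of newEngine() are assumptions, and the sketch covers only which instance is handed out. In the shared case, callers would still need to synchronize on the engine while using it.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class EngineSelectorSketch<E> {
    private final boolean isolateThreads;
    private final Supplier<E> factory;   // stands in for newEngine()
    private final ThreadLocal<E> threadEngine = new ThreadLocal<>();
    private E sharedEngine;

    EngineSelectorSketch(boolean isolateThreads, Supplier<E> factory) {
        this.isolateThreads = isolateThreads;
        this.factory = factory;
    }

    /** Returns this thread's own engine, or the single shared one. */
    E getEngine() {
        if (isolateThreads) {
            E engine = threadEngine.get();
            if (engine == null) {
                engine = factory.get();
                threadEngine.set(engine);
            }
            return engine;
        }
        synchronized (this) {
            if (sharedEngine == null) {
                sharedEngine = factory.get();
            }
            return sharedEngine;
        }
    }

    /** Fetches the engine as seen from a freshly started thread. */
    E getEngineOnNewThread() {
        AtomicReference<E> holder = new AtomicReference<>();
        Thread t = new Thread(() -> holder.set(getEngine()));
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return holder.get();
    }

    public static void main(String[] args) {
        EngineSelectorSketch<Object> isolated = new EngineSelectorSketch<>(true, Object::new);
        EngineSelectorSketch<Object> shared = new EngineSelectorSketch<>(false, Object::new);
        // Isolated: each thread builds its own instance, so this prints false.
        System.out.println(isolated.getEngine() == isolated.getEngineOnNewThread());
        // Shared: every thread sees the same instance, so this prints true.
        System.out.println(shared.getEngine() == shared.getEngineOnNewThread());
    }
}
```

Per-thread engines avoid lock contention between threads at the cost of one engine per thread; a shared engine uses less memory but serializes script execution.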
-
newEngine
protected ScriptEngine newEngine()
Create a new ScriptEngine instance, preloaded with any supplied source file and the variables 'self' (this ScriptedProcessor) and 'context' (the ApplicationContext).
- Returns:
the new ScriptEngine instance
-