Package org.archive.modules
Class Processor
java.lang.Object
org.archive.modules.Processor
- All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
- Direct Known Subclasses:
AbstractPersistProcessor, ContentDigestHistoryLoader, ContentDigestHistoryStorer, Extractor, FetchDNS, FetchFTP, FetchHistoryProcessor, FetchHTTP, FetchSFTP, FetchWhois, FormLoginProcessor, HTTPContentDigest, Kw3WriterProcessor, MirrorWriterProcessor, ScriptedProcessor, WriterPoolProcessor
public abstract class Processor extends Object implements org.archive.spring.HasKeyedProperties, org.springframework.context.Lifecycle, org.springframework.beans.factory.BeanNameAware, org.archive.checkpointing.Checkpointable
A processor of URIs. The URI provides the context for the process;
settings can be altered based on the URI.
- Author:
pjack
-
Field Summary
Modifier and Type  Field  Description
protected String beanName
protected boolean isRunning
protected org.archive.spring.KeyedProperties kp
protected org.archive.checkpointing.Checkpoint recoveryCheckpoint
protected AtomicLong uriCount
    The number of URIs processed by this processor.
-
Constructor Summary
Constructor  Description
Processor()
-
Method Summary
Modifier and Type  Method  Description
void doCheckpoint(org.archive.checkpointing.Checkpoint checkpointInProgress)
void finishCheckpoint(org.archive.checkpointing.Checkpoint checkpointInProgress)
static String flattenVia(CrawlURI puri)
protected void fromCheckpointJson(org.json.JSONObject json)
    Restore internal state from a JSONObject stored at an earlier checkpoint.
String getBeanName()
boolean getEnabled()
org.archive.spring.KeyedProperties getKeyedProperties()
static long getRecordedSize(CrawlURI puri)
DecideRule getShouldProcessRule()
long getURICount()
    Returns the number of URIs this processor has handled.
static boolean hasHttpAuthenticationCredential(CrawlURI puri)
protected abstract void innerProcess(CrawlURI uri)
    Actually performs the process.
protected ProcessResult innerProcessResult(CrawlURI uri)
protected void innerRejectProcess(CrawlURI uri)
    Invoked after a URI has been rejected.
boolean isRunning()
static boolean isSuccess(CrawlURI puri)
ProcessResult process(CrawlURI uri)
    Processes the given URI.
String report()
void setBeanName(String name)
void setEnabled(boolean enabled)
    Whether or not this processor will execute for a particular URI.
void setRecoveryCheckpoint(org.archive.checkpointing.Checkpoint checkpoint)
void setShouldProcessRule(DecideRule rule)
    Decide rule(s) (also particular to a URI) that determine whether or not a particular URI is processed here.
protected abstract boolean shouldProcess(CrawlURI uri)
    Determines whether the given URI should be processed by this processor.
void start()
void startCheckpoint(org.archive.checkpointing.Checkpoint checkpointInProgress)
void stop()
protected org.json.JSONObject toCheckpointJson()
    Return a JSONObject of current state that can be consulted on recovery to restore necessary values.
-
Field Details
-
kp
protected org.archive.spring.KeyedProperties kp
-
beanName
protected String beanName
-
uriCount
protected AtomicLong uriCount
The number of URIs processed by this processor.
-
isRunning
protected boolean isRunning
-
recoveryCheckpoint
protected org.archive.checkpointing.Checkpoint recoveryCheckpoint
-
-
Constructor Details
-
Processor
public Processor()
-
-
Method Details
-
getKeyedProperties
public org.archive.spring.KeyedProperties getKeyedProperties()
- Specified by:
getKeyedProperties in interface org.archive.spring.HasKeyedProperties
-
getBeanName
-
setBeanName
- Specified by:
setBeanName in interface org.springframework.beans.factory.BeanNameAware
-
getEnabled
public boolean getEnabled()
-
setEnabled
public void setEnabled(boolean enabled)
Whether or not this processor will execute for a particular URI. If this is false for a URI, then the URI isn't processed, regardless of what the DecideRules say.
-
getShouldProcessRule
-
setShouldProcessRule
Decide rule(s) (also particular to a URI) that determine whether or not a particular URI is processed here. If the rule(s) answer REJECT, processing is skipped; an answer of ACCEPT or PASS allows processing to continue.
-
process
Processes the given URI. First checks getEnabled() and getShouldProcessRule(). If getEnabled() returns false, then nothing happens. If the shouldProcessRule indicates REJECT, then the innerRejectProcess(CrawlURI) method is invoked, and the process method returns.
Next, the shouldProcess(CrawlURI) method is consulted to see if this Processor knows how to handle the given URI. If it returns false, then nothing further occurs.
FIXME: Should innerRejectProcess be called when ENABLED is false, or when shouldProcess returns false? The previous Processor implementation didn't handle it that way.
Otherwise, the URI is considered valid. This processor's count of handled URIs is incremented, and the innerProcess(CrawlURI) method is invoked to actually perform the process.
- Parameters:
uri - the URI to process
- Throws:
InterruptedException - if the thread is interrupted
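The order of checks described for process can be sketched as a small, self-contained model. This is not the Heritrix source: SketchProcessor, SketchDecision, HttpOnlySketchProcessor and the decide method are hypothetical names, a String stands in for CrawlURI, and a boolean stands in for ProcessResult.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the answer of a decide rule.
enum SketchDecision { ACCEPT, REJECT, PASS }

// Hypothetical, simplified model of the documented control flow.
// A String stands in for CrawlURI and a boolean for ProcessResult.
abstract class SketchProcessor {
    private final AtomicLong uriCount = new AtomicLong();
    private boolean enabled = true;

    void setEnabled(boolean enabled) { this.enabled = enabled; }
    long getURICount() { return uriCount.get(); }

    // Stand-in for evaluating the shouldProcessRule; answers PASS by default.
    protected SketchDecision decide(String uri) { return SketchDecision.PASS; }

    protected abstract boolean shouldProcess(String uri);
    protected abstract void innerProcess(String uri);
    protected void innerRejectProcess(String uri) { /* default: no-op */ }

    // Returns true only if innerProcess was invoked.
    final boolean process(String uri) {
        if (!enabled) {
            return false;                 // disabled: nothing happens
        }
        if (decide(uri) == SketchDecision.REJECT) {
            innerRejectProcess(uri);      // the rule rejected the URI
            return false;
        }
        if (!shouldProcess(uri)) {
            return false;                 // this processor cannot handle the URI
        }
        uriCount.incrementAndGet();       // count only URIs actually handled
        innerProcess(uri);
        return true;
    }
}

// Example subclass: handles http(s) URIs; its rule rejects URIs containing "blocked".
class HttpOnlySketchProcessor extends SketchProcessor {
    final List<String> processed = new ArrayList<>();
    final List<String> rejected = new ArrayList<>();

    @Override protected SketchDecision decide(String uri) {
        return uri.contains("blocked") ? SketchDecision.REJECT : SketchDecision.PASS;
    }
    @Override protected boolean shouldProcess(String uri) {
        return uri.startsWith("http://") || uri.startsWith("https://");
    }
    @Override protected void innerProcess(String uri) { processed.add(uri); }
    @Override protected void innerRejectProcess(String uri) { rejected.add(uri); }
}
```

In this model only a URI that passes all three checks increments the counter, which matches the documented behaviour of getURICount(); the reject hook runs only for the REJECT case, as the FIXME above notes.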
-
getURICount
public long getURICount()
Returns the number of URIs this processor has handled. The returned number does not include URIs that were rejected by the getEnabled() flag, by the getShouldProcessRule(), or by the shouldProcess(CrawlURI) method.
- Returns:
the number of URIs this processor has handled
-
shouldProcess
Determines whether the given URI should be processed by this processor. For instance, a processor that only works on HTML content might reject the URI if its content type is not "text/html", if its content length is zero, and so on.
- Parameters:
uri - the URI to test
- Returns:
true if this processor should process that URI; false if not
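The HTML-only example mentioned above might look roughly like the following check. FetchedUriSketch and HtmlOnlyCheck are hypothetical stand-ins, not Heritrix types; the real method receives a CrawlURI, whose accessors are not shown in this page.

```java
// Hypothetical stand-in for the parts of a fetched URI this check needs.
final class FetchedUriSketch {
    final String contentType;   // e.g. "text/html; charset=UTF-8", may be null
    final long contentLength;   // number of content bytes fetched

    FetchedUriSketch(String contentType, long contentLength) {
        this.contentType = contentType;
        this.contentLength = contentLength;
    }
}

final class HtmlOnlyCheck {
    private HtmlOnlyCheck() { }

    // Accepts only non-empty content whose media type is text/html.
    static boolean shouldProcess(FetchedUriSketch uri) {
        if (uri.contentLength <= 0) {
            return false;                       // nothing to work on
        }
        String type = uri.contentType;
        if (type == null) {
            return false;                       // unknown content type
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semi = type.indexOf(';');
        String mediaType = (semi >= 0 ? type.substring(0, semi) : type).trim();
        return mediaType.equalsIgnoreCase("text/html");
    }
}
```

Keeping the test cheap matters because, per the description of process, it runs for every URI that reaches this processor before any real work is done.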
-
innerProcessResult
- Throws:
InterruptedException
-
innerProcess
Actually performs the process. By the time this method is invoked, it is known that the given URI passes the getEnabled(), the getShouldProcessRule() and the shouldProcess(CrawlURI) tests.
- Parameters:
uri - the URI to process
- Throws:
InterruptedException - if the thread is interrupted
-
innerRejectProcess
Invoked after a URI has been rejected. The default implementation does nothing; subclasses may override it, for example to log rejections.
- Parameters:
uri - the URI that was rejected
- Throws:
InterruptedException - if the thread is interrupted
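A subclass that records rejections could be sketched as follows. RejectHookSketch and RejectRecordingSketch are hypothetical classes, a String stands in for CrawlURI, and the checked InterruptedException declared by the real method is left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical base class modelling only the reject hook.
class RejectHookSketch {
    // Default implementation does nothing, as documented.
    protected void innerRejectProcess(String uri) { }

    // Stand-in for the branch of process() taken when the rule answers REJECT.
    final void reject(String uri) {
        innerRejectProcess(uri);
    }
}

// Overrides the hook to keep a record of every rejected URI.
class RejectRecordingSketch extends RejectHookSketch {
    private final List<String> rejectedUris = new ArrayList<>();

    @Override
    protected void innerRejectProcess(String uri) {
        rejectedUris.add(uri);
    }

    // Read-only view so callers cannot alter the record.
    List<String> getRejectedUris() {
        return Collections.unmodifiableList(rejectedUris);
    }
}
```

A production override would more likely write to the crawl's logging facility than to an in-memory list; the list is used here only so the behaviour is easy to observe.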
-
flattenVia
-
isSuccess
-
getRecordedSize
-
hasHttpAuthenticationCredential
- Returns:
true if we have an HttpAuthentication (RFC 2617) payload
-
report
-
isRunning
public boolean isRunning()
- Specified by:
isRunning in interface org.springframework.context.Lifecycle
-
start
public void start()
- Specified by:
start in interface org.springframework.context.Lifecycle
-
stop
public void stop()
- Specified by:
stop in interface org.springframework.context.Lifecycle
-
startCheckpoint
public void startCheckpoint(org.archive.checkpointing.Checkpoint checkpointInProgress)
- Specified by:
startCheckpoint in interface org.archive.checkpointing.Checkpointable
-
doCheckpoint
public void doCheckpoint(org.archive.checkpointing.Checkpoint checkpointInProgress) throws IOException
- Specified by:
doCheckpoint in interface org.archive.checkpointing.Checkpointable
- Throws:
IOException
-
toCheckpointJson
protected org.json.JSONObject toCheckpointJson() throws org.json.JSONException
Return a JSONObject of current state that can be consulted on recovery to restore necessary values.
- Returns:
JSONObject
- Throws:
org.json.JSONException
-
fromCheckpointJson
protected void fromCheckpointJson(org.json.JSONObject json) throws org.json.JSONException
Restore internal state from a JSONObject stored at an earlier checkpoint.
- Parameters:
json - JSONObject
- Throws:
org.json.JSONException
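The save and restore pair formed by toCheckpointJson and fromCheckpointJson can be illustrated with a round trip. This sketch makes two assumptions: a plain Map stands in for org.json.JSONObject so that it runs without the JSON library, and the "uriCount" key is illustrative, since this page does not say which values the real methods store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of checkpointing a processor's counter.
class CheckpointStateSketch {
    private final AtomicLong uriCount = new AtomicLong();

    void recordHandledUri() { uriCount.incrementAndGet(); }
    long getURICount() { return uriCount.get(); }

    // Analogue of toCheckpointJson(): capture the state needed on recovery.
    Map<String, Object> toCheckpointState() {
        Map<String, Object> state = new HashMap<>();
        state.put("uriCount", uriCount.get());   // assumed key name
        return state;
    }

    // Analogue of fromCheckpointJson(): restore state captured earlier.
    void fromCheckpointState(Map<String, Object> state) {
        Object saved = state.get("uriCount");
        if (saved instanceof Number) {
            uriCount.set(((Number) saved).longValue());
        }
        // A missing or malformed value leaves the counter unchanged.
    }
}
```

A subclass with extra state would presumably extend both methods symmetrically, adding to the saved object in one and reading the same keys back in the other.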
-
finishCheckpoint
public void finishCheckpoint(org.archive.checkpointing.Checkpoint checkpointInProgress)
- Specified by:
finishCheckpoint in interface org.archive.checkpointing.Checkpointable
-
setRecoveryCheckpoint
@Autowired(required=false)
public void setRecoveryCheckpoint(org.archive.checkpointing.Checkpoint checkpoint)
- Specified by:
setRecoveryCheckpoint in interface org.archive.checkpointing.Checkpointable
-