Package org.archive.modules.extractor
Class ExtractorImpliedURI
java.lang.Object
org.archive.modules.Processor
org.archive.modules.extractor.Extractor
org.archive.modules.extractor.ExtractorImpliedURI
- All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
public class ExtractorImpliedURI extends Extractor
An extractor for finding 'implied' URIs inside other URIs. If the
'trigger' regex is matched, a new URI will be constructed from the
'build' replacement pattern.
Unlike most other extractors, this works on URIs discovered by
previous extractors. Thus it should appear near the end of any
set of extractors.
Initially, only finds absolute HTTP(S) URIs in query-string or its
parameters.
TODO: extend to find URIs in path-info
- Author:
Gordon Mohr
Field Summary
Fields inherited from class org.archive.modules.extractor.Extractor
DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
Constructor Summary
Constructors
Constructor              Description
ExtractorImpliedURI()    Constructor.
Method Summary
Modifier and Type          Method
                           Description
void                       extract(CrawlURI curi)
                           Perform usual extraction on a CrawlURI.
protected static String    extractImplied(CharSequence uri, Pattern trigger, String build)
                           Utility method for extracting 'implied' URI given a source uri, trigger pattern, and build pattern.
String                     getFormat()
Pattern                    getRegex()
boolean                    getRemoveTriggerUris()
void                       setFormat(String format)
                           Replacement pattern to build 'implied' URI, using captured groups of trigger expression.
void                       setRegex(Pattern regex)
                           Triggering regular expression.
void                       setRemoveTriggerUris(boolean remove)
                           If true, all URIs that match trigger regular expression are removed from the list of extracted URIs.
protected boolean          shouldProcess(CrawlURI uri)
                           Determines whether the given uri should be processed by this processor.
Methods inherited from class org.archive.modules.extractor.Extractor
add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson
Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
Constructor Details
ExtractorImpliedURI
public ExtractorImpliedURI()
Constructor.
Method Details
getRegex
setRegex
Triggering regular expression. When a discovered URI matches this pattern, the 'implied' URI will be built. The capturing groups of this expression are available for the build replacement pattern.
getFormat
setFormat
Replacement pattern to build 'implied' URI, using captured groups of trigger expression.
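As an illustration of how a replacement pattern can refer to the trigger's captured groups, the sketch below uses plain java.util.regex. The class BuildPatternSketch, the method apply, and the example.com URIs are hypothetical names chosen for this example and are not part of Heritrix; the sketch assumes the format follows the usual java.util.regex replacement syntax, where $1 and $2 stand for the first and second capturing groups.

```java
import java.util.regex.Pattern;

final class BuildPatternSketch {

    // Hypothetical helper: substitutes the groups captured by 'trigger'
    // into 'format'. If the trigger does not match, replaceFirst returns
    // the input unchanged.
    static String apply(Pattern trigger, String format, String uri) {
        return trigger.matcher(uri).replaceFirst(format);
    }

    public static void main(String[] args) {
        // Two capturing groups: the site prefix and the numeric image id.
        Pattern trigger = Pattern.compile("^(https?://[^/]+)/thumb/(\\d+)\\.jpg$");
        // Groups can be combined with literal text in the format.
        System.out.println(apply(trigger, "$1/full/$2.jpg",
                "http://example.com/thumb/42.jpg"));
    }
}
```

In this sketch a thumbnail URI is rewritten into a full-size image URI by reusing both groups. A literal dollar sign or backslash in a java.util.regex replacement string has to be escaped with a backslash.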
getRemoveTriggerUris
public boolean getRemoveTriggerUris()
setRemoveTriggerUris
public void setRemoveTriggerUris(boolean remove)
If true, all URIs that match trigger regular expression are removed from the list of extracted URIs. Default is false.
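The effect of this flag can be sketched on a plain list of URI strings. This is only an approximation under stated assumptions: the real extractor works on the links recorded in a CrawlURI, while the class RemoveTriggerSketch, the method process, and the example URIs below are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RemoveTriggerSketch {

    // Hypothetical stand-in for a pass over already-discovered links.
    // Every URI that matches the trigger contributes an implied URI;
    // the matching URI itself is kept only if removeTriggerUris is false.
    static List<String> process(List<String> discovered, Pattern trigger,
                                String format, boolean removeTriggerUris) {
        List<String> kept = new ArrayList<>();
        List<String> implied = new ArrayList<>();
        for (String uri : discovered) {
            Matcher m = trigger.matcher(uri);
            boolean triggered = m.matches();
            if (triggered) {
                implied.add(m.replaceFirst(format));
            }
            if (!(triggered && removeTriggerUris)) {
                kept.add(uri);
            }
        }
        kept.addAll(implied);
        return kept;
    }

    public static void main(String[] args) {
        Pattern trigger = Pattern.compile("^.*[?&]to=(https?://.*)$");
        List<String> links = List.of(
                "http://example.com/a",
                "http://example.com/r?to=http://example.org/b");
        System.out.println(process(links, trigger, "$1", false));
        System.out.println(process(links, trigger, "$1", true));
    }
}
```

With the flag off, the redirect-style link and the URI implied by it are both retained; with the flag on, only the implied URI survives alongside the links that did not match.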
shouldProcess
Description copied from class: Processor
Determines whether the given uri should be processed by this processor. For instance, a processor that only works on HTML content might reject the URI if its content type is not "text/html", if its content length is zero, and so on.
- Specified by:
shouldProcess in class Processor
- Parameters:
uri - the URI to test
- Returns:
true if this processor should process that uri; false if not
extract
Perform usual extraction on a CrawlURI.
extractImplied
Utility method for extracting 'implied' URI given a source uri, trigger pattern, and build pattern.
- Parameters:
uri - source to check for implied URI
trigger - regex pattern which if matched implies another URI
build - replacement pattern to build the implied URI
- Returns:
implied URI, or null if none
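The contract described above (a built URI when the trigger matches, null otherwise) can be sketched with java.util.regex as follows. This is a minimal illustration, not the Heritrix source: the class ExtractImpliedSketch is hypothetical, and the sketch assumes that the trigger has to match the whole source URI and that a null trigger yields null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExtractImpliedSketch {

    // Returns the URI built from 'build' when 'trigger' matches the whole
    // source URI, or null when there is no trigger or no match.
    // Assumption: full-string matching via Matcher.matches().
    static String extractImplied(CharSequence uri, Pattern trigger, String build) {
        if (trigger == null) {
            return null;
        }
        Matcher m = trigger.matcher(uri);
        if (m.matches()) {
            // replaceFirst resets the matcher and substitutes $n groups.
            return m.replaceFirst(build);
        }
        return null;
    }

    public static void main(String[] args) {
        // Absolute HTTP(S) URI carried in a query-string parameter.
        Pattern trigger = Pattern.compile("^.*[?&]to=(https?://.*)$");
        System.out.println(extractImplied(
                "http://example.com/r?to=http://example.org/b", trigger, "$1"));
        System.out.println(extractImplied(
                "http://example.com/plain", trigger, "$1"));
    }
}
```

Anchoring the trigger with ^ and $ keeps the full-string check and the subsequent replacement operating on the same match.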