Package org.archive.modules.extractor
Class ExtractorSWF
java.lang.Object
org.archive.modules.Processor
org.archive.modules.extractor.Extractor
org.archive.modules.extractor.ContentExtractor
org.archive.modules.extractor.ExtractorSWF
- All Implemented Interfaces:
- Checkpointable, HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle
Extracts URIs from SWF (Flash/Shockwave) files.
To test, use an SWF file that has links embedded in it, such as http://www.hitspring.com/index.swf.
- Author:
- Igor Ranitovic
-
Nested Class Summary
Nested Classes
class: SWF action that handles discovered URIs.
protected class: TagParser customized to ignore SWFTags that will never contain extractable URIs.
-
Field Summary
Fields
protected ExtractorJS extractorJS: JavaScript extractor to use to process inline JavaScript.
protected static final String JSSTRING
Fields inherited from class org.archive.modules.extractor.Extractor
DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted
-
Constructor Summary
Constructors
ExtractorSWF()
-
Method Summary
protected boolean innerExtract(CrawlURI curi): Actually extracts links.
void setExtractorJS(ExtractorJS extractorJS)
protected boolean shouldExtract(CrawlURI uri): Determines if otherwise valid URIs should have links extracted or not.
Methods inherited from class org.archive.modules.extractor.ContentExtractor
extract, shouldProcess
Methods inherited from class org.archive.modules.extractor.Extractor
add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson
Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop
-
Field Details
-
JSSTRING
- See Also:
-
extractorJS
JavaScript extractor used to process inline JavaScript. Autowired if available. If null, links are not extracted from inline JavaScript.
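The null-handling of this field can be pictured with a small, self-contained sketch. It does not use the Heritrix classes: `OptionalJsExtractorSketch` and its `Function`-typed collaborator are hypothetical stand-ins for ExtractorSWF and ExtractorJS, and the real extraction logic is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an extractor with an optional JavaScript collaborator.
final class OptionalJsExtractorSketch {
    // Stand-in for the extractorJS field: maps a script to the URIs found in it.
    // May be null, mirroring "Autowired if available".
    private Function<String, List<String>> jsExtractor;

    void setJsExtractor(Function<String, List<String>> jsExtractor) {
        this.jsExtractor = jsExtractor;
    }

    // Returns the links found in an inline script, or an empty list when no
    // JavaScript extractor has been supplied.
    List<String> linksFromInlineScript(String script) {
        if (jsExtractor == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(jsExtractor.apply(script));
    }

    public static void main(String[] args) {
        OptionalJsExtractorSketch sketch = new OptionalJsExtractorSketch();
        // No collaborator configured: inline scripts yield no links.
        System.out.println(sketch.linksFromInlineScript("getURL('http://example.com/')"));

        // Assumed collaborator that always reports one fixed link.
        sketch.setJsExtractor(script -> Collections.singletonList("http://example.com/"));
        System.out.println(sketch.linksFromInlineScript("getURL('http://example.com/')"));
    }
}
```

In an actual crawl configuration the collaborator would be supplied through setExtractorJS or by autowiring rather than by a lambda.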
-
-
Constructor Details
-
ExtractorSWF
public ExtractorSWF()
-
-
Method Details
-
getExtractorJS
-
setExtractorJS
-
shouldExtract
Description copied from class: ContentExtractor
Determines if otherwise valid URIs should have links extracted or not. The given URI will have content length greater than zero. Subclasses should implement this method to perform additional checks. For instance, the ExtractorHTML implementation checks that the content-type of the given URI is text/html.
- Specified by:
- shouldExtract in class ContentExtractor
- Parameters:
- uri - the URI to check
- Returns:
- true if links should be extracted from that URI, false otherwise
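As an illustration of the kind of additional check a subclass might perform, here is a minimal sketch of a content-type gate for SWF content. It is not taken from the Heritrix source: the class name `SwfShouldExtractSketch`, the plain `String` parameters (standing in for a CrawlURI), and the choice to also accept a `.swf` suffix are all assumptions.

```java
import java.util.Locale;

// Hypothetical, library-free sketch of a shouldExtract-style check.
final class SwfShouldExtractSketch {
    // Returns true when the response looks like SWF content, judged either by
    // its declared content type or by the URI suffix (both rules are assumed).
    static boolean shouldExtract(String contentType, String uri) {
        if (contentType == null || uri == null) {
            return false;
        }
        String type = contentType.toLowerCase(Locale.ROOT);
        String target = uri.toLowerCase(Locale.ROOT);
        return type.contains("x-shockwave-flash") || target.endsWith(".swf");
    }

    public static void main(String[] args) {
        System.out.println(shouldExtract("application/x-shockwave-flash", "http://example.com/movie"));
        System.out.println(shouldExtract("text/html", "http://example.com/index.html"));
    }
}
```

A check like this only decides whether extraction is attempted; finding the links is left to innerExtract.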
-
innerExtract
Description copied from class: ContentExtractor
Actually extracts links. The given URI will have passed the three checks described in ContentExtractor.shouldProcess(CrawlURI). Subclasses should implement this method to discover outlinks in the URI's content stream. For instance, ExtractorHTML extracts links from Anchor tags and so on.
This method should only return true if extraction completed successfully. If not (for instance, if an IO error occurred), then this method should return false. Returning false indicates to the pipeline that downstream extractors should attempt to extract links themselves. Returning true indicates that downstream extractors should be skipped.
- Specified by:
- innerExtract in class ContentExtractor
- Parameters:
- curi - the URI whose links to extract
- Returns:
- true if link extraction finished; false if downstream extractors should attempt to extract links
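The return-value contract can be sketched with a simplified chain of extractors. This is an assumption-laden illustration, not the Heritrix pipeline: `ExtractorChainSketch`, `SketchExtractor`, and `runChain` are hypothetical names, and how Heritrix actually passes the "finished" signal between processors is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of "true skips downstream extractors, false lets them try".
final class ExtractorChainSketch {
    // Stand-in for an extractor: returns true only if extraction completed.
    interface SketchExtractor {
        boolean innerExtract(String content, List<String> outlinks);
    }

    // Runs extractors in order and stops after the first one that reports
    // success. Returns how many extractors were attempted.
    static int runChain(List<SketchExtractor> chain, String content, List<String> outlinks) {
        int attempted = 0;
        for (SketchExtractor extractor : chain) {
            attempted++;
            if (extractor.innerExtract(content, outlinks)) {
                break; // extraction finished, downstream extractors are skipped
            }
        }
        return attempted;
    }

    public static void main(String[] args) {
        // Simulates an extractor that hit an IO error and gave up.
        SketchExtractor failing = (content, outlinks) -> false;
        // Simulates an extractor that found one link and completed.
        SketchExtractor working = (content, outlinks) -> {
            outlinks.add("http://example.com/");
            return true;
        };

        List<String> outlinks = new ArrayList<>();
        int attempted = runChain(Arrays.asList(failing, working, failing), "data", outlinks);
        // prints "2 extractors ran, outlinks=[http://example.com/]"
        System.out.println(attempted + " extractors ran, outlinks=" + outlinks);
    }
}
```

In the sketch the third extractor never runs, because the second one returned true.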
-