org.archive.modules.extractor.ExtractorSWF

All Implemented Interfaces: Checkpointable, HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle

public class ExtractorSWF extends ContentExtractor

Extracts URIs from SWF (Flash/Shockwave) files. For testing, the following SWF file has links embedded in it: http://www.hitspring.com/index.swf.

Author: Igor Ranitovic

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | ExtractorSWF.CrawlUriSWFAction | SWF action that handles discovered URIs. |
| protected class | ExtractorSWF.ExtractorTagParser | TagParser customized to ignore SWFTags that will never contain extractable URIs. |

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| protected ExtractorJS | extractorJS | JavaScript extractor used to process inline JavaScript. |
| protected static final String | JSSTRING | |

Fields inherited from class org.archive.modules.extractor.Extractor
DEFAULT_PARAMETERS, extractorParameters, loggerModule, numberOfLinksExtracted

Fields inherited from class org.archive.modules.Processor
beanName, isRunning, kp, recoveryCheckpoint, uriCount

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| ExtractorSWF() | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| ExtractorJS | getExtractorJS() | |
| protected boolean | innerExtract(CrawlURI curi) | Actually extracts links. |
| void | setExtractorJS(ExtractorJS extractorJS) | |
| protected boolean | shouldExtract(CrawlURI uri) | Determines if otherwise valid URIs should have links extracted or not. |

Methods inherited from class org.archive.modules.extractor.ContentExtractor
extract, shouldProcess

Methods inherited from class org.archive.modules.extractor.Extractor
add, addOutlink, addOutlink, addRelativeToBase, addRelativeToVia, fromCheckpointJson, getExtractorParameters, getLoggerModule, innerProcess, logUriError, report, setExtractorParameters, setLoggerModule, toCheckpointJson

Methods inherited from class org.archive.modules.Processor
doCheckpoint, finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, innerProcessResult, innerRejectProcess, isRunning, isSuccess, process, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, start, startCheckpoint, stop

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- JSSTRING
  
  protected static final String JSSTRING
  See Also: Constant Field Values
- extractorJS
  
  protected transient ExtractorJS extractorJS
  
  JavaScript extractor used to process inline JavaScript. Autowired if available. If null, links are not extracted from inline JavaScript.
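The null handling described for extractorJS can be illustrated with a minimal, self-contained sketch. It does not use Heritrix classes: JsLinkFinder is a hypothetical stand-in for ExtractorJS, and setJsLinkFinder, linksFromInlineScript, and the quoted-URL pattern are assumptions made purely for illustration. In a real crawl the setter would be invoked by Spring autowiring rather than by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalJsExtractorSketch {

    /** Hypothetical stand-in for ExtractorJS: finds quoted http(s) URLs in a script. */
    static class JsLinkFinder {
        private static final Pattern QUOTED_URL =
                Pattern.compile("[\"'](https?://[^\"']+)[\"']");

        List<String> findLinks(String script) {
            List<String> links = new ArrayList<>();
            Matcher m = QUOTED_URL.matcher(script);
            while (m.find()) {
                links.add(m.group(1));
            }
            return links;
        }
    }

    // Optional collaborator; stays null when nothing is injected.
    private JsLinkFinder jsLinkFinder;

    public void setJsLinkFinder(JsLinkFinder jsLinkFinder) {
        this.jsLinkFinder = jsLinkFinder;
    }

    /** Returns links found in an inline script, or an empty list when no finder is set. */
    public List<String> linksFromInlineScript(String script) {
        if (jsLinkFinder == null) {
            // Mirrors the documented behaviour: no JS extractor, no JS links.
            return new ArrayList<>();
        }
        return jsLinkFinder.findLinks(script);
    }

    public static void main(String[] args) {
        String script = "getURL('http://example.com/a.html');";

        OptionalJsExtractorSketch withoutJs = new OptionalJsExtractorSketch();
        System.out.println(withoutJs.linksFromInlineScript(script));

        OptionalJsExtractorSketch withJs = new OptionalJsExtractorSketch();
        withJs.setJsLinkFinder(new JsLinkFinder());
        System.out.println(withJs.linksFromInlineScript(script));
    }
}
```

Treating the collaborator as optional keeps the extractor usable in configurations that define no JavaScript extractor bean, at the cost of silently missing links that appear only inside inline scripts.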

Constructor Details
- ExtractorSWF
  
  public ExtractorSWF()

Method Details
- getExtractorJS
  
  public ExtractorJS getExtractorJS()
- setExtractorJS
  
  @Autowired public void setExtractorJS(ExtractorJS extractorJS)
- shouldExtract
  
  protected boolean shouldExtract(CrawlURI uri)
  
  Description copied from class: ContentExtractor
  
  Determines if otherwise valid URIs should have links extracted or not. The given URI will have content length greater than zero. Subclasses should implement this method to perform additional checks. For instance, the ExtractorHTML implementation checks that the content-type of the given URI is text/html.
  
  Specified by: shouldExtract in class ContentExtractor
  
  Parameters: uri - the URI to check
  
  Returns: true if links should be extracted from that URI, false otherwise
- innerExtract
  
  protected boolean innerExtract(CrawlURI curi)
  
  Description copied from class: ContentExtractor
  
  Actually extracts links. The given URI will have passed the three checks described in ContentExtractor.shouldProcess(CrawlURI). Subclasses should implement this method to discover outlinks in the URI's content stream. For instance, ExtractorHTML extracts links from anchor tags, among other elements.
  
  This method should return true only if extraction completed successfully. If it did not (for instance, because an I/O error occurred), this method should return false. Returning false tells the pipeline that downstream extractors should attempt to extract links themselves. Returning true tells it that downstream extractors should be skipped.
  
  Specified by: innerExtract in class ContentExtractor
  
  Parameters: curi - the URI whose links to extract
  
  Returns: true if link extraction finished; false if downstream extractors should attempt to extract links
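The contract between shouldExtract and innerExtract described above can be sketched without Heritrix on the classpath. Everything in the block is a hypothetical stand-in: Doc plays the role of CrawlURI, SketchExtractor the role of ContentExtractor's two subclass hooks, and runChain the role of the processing pipeline. The content-type test and the whitespace-token URL scan are assumptions for illustration and are not taken from ExtractorSWF's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExtractorContractSketch {

    /** Hypothetical stand-in for CrawlURI: a content type, a decoded body, and outlinks. */
    static class Doc {
        final String contentType;
        final String body;
        final List<String> outlinks = new ArrayList<>();

        Doc(String contentType, String body) {
            this.contentType = contentType;
            this.body = body;
        }
    }

    /** Hypothetical stand-in for the two hooks a ContentExtractor subclass implements. */
    abstract static class SketchExtractor {
        abstract boolean shouldExtract(Doc doc);
        abstract boolean innerExtract(Doc doc);
    }

    /** Accepts only Flash content (assumed check, analogous to ExtractorHTML and text/html). */
    static class SwfLikeExtractor extends SketchExtractor {
        @Override
        boolean shouldExtract(Doc doc) {
            return "application/x-shockwave-flash".equals(doc.contentType);
        }

        @Override
        boolean innerExtract(Doc doc) {
            if (doc.body == null) {
                return false; // extraction failed: let downstream extractors try
            }
            for (String token : doc.body.split("\\s+")) {
                if (token.startsWith("http://") || token.startsWith("https://")) {
                    doc.outlinks.add(token);
                }
            }
            return true; // extraction finished: downstream extractors are skipped
        }
    }

    /** Accepts every document and always reports success, without adding links. */
    static class FallbackExtractor extends SketchExtractor {
        @Override
        boolean shouldExtract(Doc doc) {
            return true;
        }

        @Override
        boolean innerExtract(Doc doc) {
            return true;
        }
    }

    /** Runs extractors in order and returns the names of those whose innerExtract was called. */
    static List<String> runChain(Doc doc, List<SketchExtractor> chain) {
        List<String> ran = new ArrayList<>();
        for (SketchExtractor extractor : chain) {
            if (!extractor.shouldExtract(doc)) {
                continue; // this extractor declines the document
            }
            ran.add(extractor.getClass().getSimpleName());
            if (extractor.innerExtract(doc)) {
                break; // true means "finished", so later extractors are skipped
            }
        }
        return ran;
    }

    public static void main(String[] args) {
        List<SketchExtractor> chain =
                Arrays.asList(new SwfLikeExtractor(), new FallbackExtractor());

        Doc good = new Doc("application/x-shockwave-flash", "getURL http://example.com/next.swf");
        System.out.println(runChain(good, chain) + " " + good.outlinks);

        Doc broken = new Doc("application/x-shockwave-flash", null);
        System.out.println(runChain(broken, chain));
    }
}
```

In the first run only the SWF-like extractor executes, because it returns true. In the second it returns false on the unreadable body, so the fallback extractor also gets a chance, which is the behaviour the Returns clause describes.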
