Class ExtractorSWF

All Implemented Interfaces:
Checkpointable, HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle

public class ExtractorSWF extends ContentExtractor
Extracts URIs from SWF (Flash/Shockwave) files. For testing, the following SWF file has links embedded in it: http://www.hitspring.com/index.swf.
Author:
Igor Ranitovic
  • Field Details

    • JSSTRING

      protected static final String JSSTRING
    • extractorJS

      protected transient ExtractorJS extractorJS
      JavaScript extractor used to process inline JavaScript. Autowired if available. If null, links are not extracted from inline JavaScript.
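
      The null-tolerant behavior described for extractorJS can be illustrated with a minimal sketch. It is not the ExtractorSWF source: the JsExtractor interface and the linksFromInlineScript method are hypothetical stand-ins, assumed here only to show an optional collaborator that is skipped when it has not been wired in.

```java
import java.util.ArrayList;
import java.util.List;

public class OptionalJsExtractorSketch {

    /** Hypothetical stand-in for a JavaScript link extractor. */
    public interface JsExtractor {
        List<String> extractLinks(String script);
    }

    // May stay null when no extractor bean is available for wiring.
    private JsExtractor extractorJS;

    public JsExtractor getExtractorJS() {
        return extractorJS;
    }

    public void setExtractorJS(JsExtractor extractorJS) {
        this.extractorJS = extractorJS;
    }

    /** Returns links found in inline script, or an empty list if no extractor is set. */
    public List<String> linksFromInlineScript(String script) {
        if (extractorJS == null) {
            // No extractor configured: inline JavaScript is simply not processed.
            return new ArrayList<>();
        }
        return extractorJS.extractLinks(script);
    }
}
```

      With this shape, callers do not need to know whether an extractor was configured; a missing one yields no links rather than an error.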
  • Constructor Details

    • ExtractorSWF

      public ExtractorSWF()
  • Method Details

    • getExtractorJS

      public ExtractorJS getExtractorJS()
    • setExtractorJS

      @Autowired public void setExtractorJS(ExtractorJS extractorJS)
    • shouldExtract

      protected boolean shouldExtract(CrawlURI uri)
      Description copied from class: ContentExtractor
      Determines whether links should be extracted from an otherwise valid URI. The given URI will have a content length greater than zero. Subclasses should implement this method to perform additional checks. For instance, the ExtractorHTML implementation checks that the content type of the given URI is text/html.
      Specified by:
      shouldExtract in class ContentExtractor
      Parameters:
      uri - the URI to check
      Returns:
      true if links should be extracted from that URI, false otherwise
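
      As an illustration of the kind of additional check a subclass might perform, the sketch below tests whether a resource looks like an SWF file. It is an assumption, not the documented ExtractorSWF logic: the looksLikeSwf helper, its plain String parameters (in place of CrawlURI), and the matching rules on the content type and the .swf suffix are all hypothetical.

```java
import java.util.Locale;

public class SwfCheckSketch {

    /**
     * Hypothetical check: true if the content type mentions x-shockwave-flash
     * or the URI ends in ".swf". Null inputs are treated as empty strings.
     */
    public static boolean looksLikeSwf(String contentType, String uri) {
        String ct = contentType == null ? "" : contentType.toLowerCase(Locale.ROOT);
        String u = uri == null ? "" : uri.toLowerCase(Locale.ROOT);
        return ct.contains("x-shockwave-flash") || u.endsWith(".swf");
    }
}
```

      Checking both the content type and the file suffix is a design choice in this sketch, made because servers sometimes report a generic or incorrect content type.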
    • innerExtract

      protected boolean innerExtract(CrawlURI curi)
      Description copied from class: ContentExtractor
      Actually extracts links. The given URI will have passed the three checks described in ContentExtractor.shouldProcess(CrawlURI). Subclasses should implement this method to discover outlinks in the URI's content stream. For instance, ExtractorHTML extracts links from anchor tags and other link-bearing elements.

      This method should only return true if extraction completed successfully. If not (for instance, if an IO error occurred), then this method should return false. Returning false indicates to the pipeline that downstream extractors should attempt to extract links themselves. Returning true indicates that downstream extractors should be skipped.

      Specified by:
      innerExtract in class ContentExtractor
      Parameters:
      curi - the URI whose links to extract
      Returns:
      true if link extraction finished; false if downstream extractors should attempt to extract links
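
      The return-value contract above can be sketched as follows. This is a simplified illustration under stated assumptions, not the ExtractorSWF implementation: it takes a plain InputStream instead of a CrawlURI, collects results in a hypothetical outlinks list, and uses a naive byte scan for http(s) strings as a placeholder for real SWF parsing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InnerExtractContractSketch {

    // Placeholder pattern: runs of printable ASCII starting with http:// or https://.
    private static final Pattern URL = Pattern.compile("https?://[\\x21-\\x7E]+");

    /** Hypothetical result holder; a real extractor would record outlinks on the CrawlURI. */
    public final List<String> outlinks = new ArrayList<>();

    /**
     * Returns true if extraction completed, so downstream extractors can be skipped;
     * returns false on an I/O error, so downstream extractors may still try.
     */
    public boolean innerExtract(InputStream content) {
        try {
            byte[] bytes = content.readAllBytes(); // requires Java 9 or later
            // ISO-8859-1 maps every byte to one char, so binary data survives decoding.
            String text = new String(bytes, StandardCharsets.ISO_8859_1);
            Matcher m = URL.matcher(text);
            while (m.find()) {
                outlinks.add(m.group());
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

      Note that a successful run that finds no links still returns true: the return value reports whether extraction finished, not whether any outlinks were discovered.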