Class ExtractorImpliedURI

All Implemented Interfaces:
org.archive.checkpointing.Checkpointable, org.archive.spring.HasKeyedProperties, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.Lifecycle

public class ExtractorImpliedURI
extends Extractor
An extractor that finds 'implied' URIs inside other URIs. When a URI matches the 'trigger' regex (see setRegex), a new URI is constructed from the 'build' replacement pattern (see setFormat). Unlike most other extractors, this one works on URIs discovered by previous extractors, so it should appear near the end of any set of extractors. In its initial form, it only finds absolute HTTP(S) URIs in the query string or its parameters. TODO: extend to find URIs in path-info.
Author:
Gordon Mohr
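The behaviour described above can be pictured with a minimal sketch in plain Java. This is not Heritrix code: the class ImpliedLinkPassSketch, the method applyToOutlinks, the example URIs, and the ordering of the resulting list are all assumptions made for illustration. It also assumes the trigger must match the whole URI, and that the removeTriggerUris property simply drops the matching link.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImpliedLinkPassSketch {

    /**
     * Hypothetical helper (not part of Heritrix): walks links found by
     * earlier extractors and adds the implied URI for each link that
     * matches the trigger.
     */
    static List<String> applyToOutlinks(List<String> outlinks, Pattern trigger,
                                        String build, boolean removeTriggerUris) {
        List<String> result = new ArrayList<>();
        for (String link : outlinks) {
            Matcher m = trigger.matcher(link);
            if (!m.matches()) {
                // Not a trigger URI: keep it untouched.
                result.add(link);
                continue;
            }
            if (!removeTriggerUris) {
                // Assumed default behaviour: the trigger URI stays in the list.
                result.add(link);
            }
            // Build the implied URI from the captured groups.
            result.add(m.replaceFirst(build));
        }
        return result;
    }
}
```

With the trigger `^.*\?to=(https?://.*)$` and the build pattern `$1`, the link `http://example.com/go?to=https://example.org/b` would contribute `https://example.org/b`; the original link is kept or dropped depending on the flag.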
  • Constructor Details

    • ExtractorImpliedURI

      public ExtractorImpliedURI()
      Constructor.
  • Method Details

    • getRegex

      public Pattern getRegex()
    • setRegex

      public void setRegex(Pattern regex)
      Triggering regular expression. When a discovered URI matches this pattern, the 'implied' URI will be built. The capturing groups of this expression are available for the build replacement pattern.
    • getFormat

      public String getFormat()
    • setFormat

      public void setFormat(String format)
      Replacement pattern used to build the 'implied' URI from the captured groups of the trigger expression.
    • getRemoveTriggerUris

      public boolean getRemoveTriggerUris()
    • setRemoveTriggerUris

      public void setRemoveTriggerUris(boolean remove)
      If true, all URIs that match the trigger regular expression are removed from the list of extracted URIs. Default is false.
    • shouldProcess

      protected boolean shouldProcess(CrawlURI uri)
      Description copied from class: Processor
      Determines whether the given uri should be processed by this processor. For instance, a processor that only works on HTML content might reject the URI if its content type is not "text/html", if its content length is zero, and so on.
      Specified by:
      shouldProcess in class Processor
      Parameters:
      uri - the URI to test
      Returns:
      true if this processor should process that uri; false if not
    • extract

      public void extract(CrawlURI curi)
      Performs the usual extraction on a CrawlURI.
      Specified by:
      extract in class Extractor
      Parameters:
      curi - Crawl URI to process.
    • extractImplied

      protected static String extractImplied(CharSequence uri, Pattern trigger, String build)
      Utility method for extracting an 'implied' URI given a source URI, a trigger pattern, and a build pattern.
      Parameters:
      uri - source to check for implied URI
      trigger - regex pattern that, if matched, implies another URI
      build - replacement pattern to build the implied URI
      Returns:
      implied URI, or null if none
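
The contract of extractImplied can be approximated with the standard java.util.regex classes. The sketch below is not the Heritrix source: the class ImpliedUriSketch and the method impliedUri are hypothetical names, and it assumes that the trigger must match the entire URI, that the build pattern uses the usual `$1`-style group references, and that a null trigger yields null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImpliedUriSketch {

    /**
     * Hypothetical stand-in for the documented utility: returns the URI
     * built from the trigger's captured groups, or null if there is no match.
     */
    static String impliedUri(CharSequence uri, Pattern trigger, String build) {
        if (trigger == null) {
            // Assumption: without a trigger there is nothing to imply.
            return null;
        }
        Matcher m = trigger.matcher(uri);
        if (m.matches()) {
            // Substitute captured groups ($1, $2, ...) into the build pattern.
            return m.replaceFirst(build);
        }
        return null;
    }
}
```

For example, with the trigger `^.*\?url=(https?://.*)$` and the build pattern `$1`, the source URI `http://example.com/redirect?url=http://example.org/page` would produce `http://example.org/page`, while a URI without a `url` parameter would produce null.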