Class AdaptiveScheduler


  • public class AdaptiveScheduler
    extends DefaultScheduler
    Adaptive fetch scheduler which checks by signature comparison whether a re-fetched page has changed:
    • if yes, shrink the fetch interval down to a minimum fetch interval
    • if not, increase the fetch interval up to a maximum fetch interval

    The rates at which the fetch interval is incremented or decremented are configurable.
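The adjustment rule described above can be sketched as follows. This is a simplified illustration, not the scheduler's actual code: the class `IntervalSketch` and the method `nextInterval` are hypothetical names, and the sketch merely clamps the result to the configured bounds, whereas the real scheduler additionally slows down the decrement near the minimum interval.

```java
/** Hypothetical helper for illustration; not part of the scheduler's API. */
final class IntervalSketch {

    /**
     * Returns the next fetch interval in minutes.
     *
     * @param current  current fetch interval in minutes
     * @param changed  whether the page signature has changed since the last fetch
     * @param min      minimum fetch interval in minutes
     * @param max      maximum fetch interval in minutes
     * @param incrRate increment rate (0.0 < rate <= 1.0)
     * @param decrRate decrement rate (0.0 < rate <= 1.0)
     */
    static int nextInterval(int current, boolean changed,
                            int min, int max, float incrRate, float decrRate) {
        float next = changed
                ? current * (1.0f - decrRate)   // page changed: refetch sooner
                : current * (1.0f + incrRate);  // page unchanged: refetch later
        // Plain clamping to [min, max] is an assumption of this sketch.
        return Math.max(min, Math.min(max, Math.round(next)));
    }

    public static void main(String[] args) {
        // Bounds and rates as in the configuration example on this page.
        System.out.println(nextInterval(1440, false, 60, 20160, 0.5f, 0.5f)); // 2160
        System.out.println(nextInterval(1440, true, 60, 20160, 0.5f, 0.5f));  // 720
    }
}
```

With both rates at 0.5, an unchanged page fetched daily (1440 minutes) would next be scheduled after 2160 minutes, while a changed page would be scheduled after 720 minutes.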

    Note that this scheduler requires the following metadata:

    signature
    page signature, filled by MD5SignatureParseFilter
    signatureOld
    (temporary) copy of the previous signature, optionally copied by MD5SignatureParseFilter
    fetch.statusCode
    HTTP response status code, required to handle "HTTP 304 Not Modified" responses
    and writes the following metadata fields:
    fetchInterval
    current fetch interval
    signatureChangeDate
    date when the signature has changed (ISO-8601 date time format)
    last-modified
    last-modified time used to send If-Modified-Since HTTP requests, only written if scheduler.adaptive.setLastModified is true. It holds the same date string as set in "signatureChangeDate". Note that the metadata field `last-modified` is assumed to be written only by the scheduler. In particular, the property `protocol.md.prefix` should not be empty, otherwise `last-modified` could be filled with an incorrect or ill-formed date from the HTTP header.
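To illustrate the relationship between the ISO-8601 date stored in metadata and an If-Modified-Since request header, the sketch below converts one into the other using only `java.time`. It assumes the stored value is a UTC instant such as `2020-01-15T10:30:00Z`; the class and method names are hypothetical, and the crawler's protocol implementation may perform this conversion differently.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for illustration; not part of the scheduler's API. */
final class LastModifiedSketch {

    /** Converts an ISO-8601 UTC instant into an RFC 1123 date string. */
    static String toHttpDate(String iso8601) {
        // Assumption: the value parses as an instant with a trailing "Z".
        Instant instant = Instant.parse(iso8601);
        return DateTimeFormatter.RFC_1123_DATE_TIME.format(instant.atOffset(ZoneOffset.UTC));
    }

    public static void main(String[] args) {
        // Prints: If-Modified-Since: Wed, 15 Jan 2020 10:30:00 GMT
        System.out.println("If-Modified-Since: " + toHttpDate("2020-01-15T10:30:00Z"));
    }
}
```

Note that `RFC_1123_DATE_TIME` does not zero-pad single-digit days of the month, so a production implementation may prefer a custom pattern for strict HTTP date formatting.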

    Configuration

    The following lines show how to configure the adaptive scheduler in the configuration file (crawler-conf.yaml):

     scheduler.class: "com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler"
     # set last-modified time ("last-modified") used in HTTP If-Modified-Since request header field
     scheduler.adaptive.setLastModified: true
     # min. interval in minutes (default: 1h)
     scheduler.adaptive.fetchInterval.min: 60
     # max. interval in minutes (default: 2 weeks)
     scheduler.adaptive.fetchInterval.max: 20160
     # increment and decrement rates (0.0 < rate <= 1.0)
     scheduler.adaptive.fetchInterval.rate.incr: .5
     scheduler.adaptive.fetchInterval.rate.decr: .5
    
     # required persisted metadata (in addition to other persisted metadata):
     metadata.persist:
      - ...
      - signature
      - fetch.statusCode
      - fetchInterval
      - last-modified
     # - signatureOld
     # - signatureChangeDate
     # Note: "signatureOld" and "signatureChangeDate" are optional; the adaptive
     # scheduler also works if both are only passed along temporarily and not persisted.
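The constraints stated in the configuration comments (rates in the range 0.0 < rate <= 1.0) can be checked before a configuration is handed to the scheduler. The following is a minimal sketch with hypothetical class and method names. It assumes all four interval and rate properties are present, and the requirement that the minimum is positive and not greater than the maximum is an assumption of the sketch rather than documented behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the scheduler's API. */
final class AdaptiveConfSketch {

    static final String MIN = "scheduler.adaptive.fetchInterval.min";
    static final String MAX = "scheduler.adaptive.fetchInterval.max";
    static final String INCR = "scheduler.adaptive.fetchInterval.rate.incr";
    static final String DECR = "scheduler.adaptive.fetchInterval.rate.decr";

    /** Builds a map holding the same values as the YAML example. */
    static Map<String, Object> exampleConf() {
        Map<String, Object> conf = new LinkedHashMap<>();
        conf.put("scheduler.adaptive.setLastModified", true);
        conf.put(MIN, 60);
        conf.put(MAX, 20160);
        conf.put(INCR, 0.5f);
        conf.put(DECR, 0.5f);
        return conf;
    }

    /** Returns true if the intervals and rates satisfy the checks of this sketch. */
    static boolean isValid(Map<String, Object> conf) {
        int min = ((Number) conf.get(MIN)).intValue();
        int max = ((Number) conf.get(MAX)).intValue();
        float incr = ((Number) conf.get(INCR)).floatValue();
        float decr = ((Number) conf.get(DECR)).floatValue();
        // Assumption of this sketch: 0 < min <= max.
        boolean intervalsOk = min > 0 && min <= max;
        // Documented range for both rates: 0.0 < rate <= 1.0.
        boolean ratesOk = incr > 0.0f && incr <= 1.0f && decr > 0.0f && decr <= 1.0f;
        return intervalsOk && ratesOk;
    }

    public static void main(String[] args) {
        System.out.println(isValid(exampleConf()));
    }
}
```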
     

    To generate the signature and keep a copy of the last signature, the parse filters should be configured accordingly:

     "com.digitalpebble.stormcrawler.parse.ParseFilters": [
       ...,
       {
         "class": "com.digitalpebble.stormcrawler.parse.filter.MD5SignatureParseFilter",
         "name": "MD5Digest",
         "params": {
           "useText": "false",
           "keyName": "signature",
           "keyNameCopy": "signatureOld"
         }
       }
     ]
     
    The order is mandatory: first copy the old signature, then generate the current one.
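Why the order matters can be seen in a small sketch: if the new signature were written first, the previous value would be lost before it could be copied. The code below uses a plain `Map` in place of the crawler's metadata class and hashes raw bytes with MD5. The class and method names are hypothetical, and reporting "no change" when no previous signature exists is an assumption of this sketch.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not the actual parse filter. */
final class SignatureSketch {

    /** Returns the MD5 digest of the content as a lowercase hex string. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Copies the previous signature to "signatureOld", then stores the new one.
     * Returns true only if a previous signature exists and differs.
     */
    static boolean updateSignature(Map<String, String> metadata, byte[] content) {
        String previous = metadata.get("signature");
        if (previous != null) {
            metadata.put("signatureOld", previous); // step 1: keep the old value
        }
        String current = md5Hex(content);
        metadata.put("signature", current);         // step 2: overwrite with the new value
        return previous != null && !previous.equals(current);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        System.out.println(updateSignature(metadata, new byte[] {1, 2, 3})); // false (first fetch)
        System.out.println(updateSignature(metadata, new byte[] {1, 2, 4})); // true (content differs)
    }
}
```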
    • Field Detail

      • SET_LAST_MODIFIED

        public static final String SET_LAST_MODIFIED
        Configuration property (boolean) whether or not to set the "last-modified" metadata field when a page change was detected by signature comparison.
        See Also:
        Constant Field Values
      • INTERVAL_MIN

        public static final String INTERVAL_MIN
        Configuration property (int) to set the minimum fetch interval in minutes.
        See Also:
        Constant Field Values
      • INTERVAL_MAX

        public static final String INTERVAL_MAX
        Configuration property (int) to set the maximum fetch interval in minutes.
        See Also:
        Constant Field Values
      • INTERVAL_INC_RATE

        public static final String INTERVAL_INC_RATE
        Configuration property (float) to set the increment rate. If a page hasn't changed when refetched, the fetch interval is multiplied by (1.0 + incr_rate) until the max. fetch interval is reached.
        See Also:
        Constant Field Values
      • INTERVAL_DEC_RATE

        public static final String INTERVAL_DEC_RATE
        Configuration property (float) to set the decrement rate. If a page has changed when refetched, the fetch interval is multiplied by (1.0 - decr_rate). As the fetch interval approaches the minimum interval, the decrement is slowed down.
        See Also:
        Constant Field Values
      • SIGNATURE_KEY

        public static final String SIGNATURE_KEY
        Name of the signature key in metadata, must be defined as "keyName" in the configuration of MD5SignatureParseFilter. This key must be listed in "metadata.persist".
        See Also:
        Constant Field Values
      • SIGNATURE_OLD_KEY

        public static final String SIGNATURE_OLD_KEY
        Name of the key holding the previous signature: a copy which is not overwritten by MD5SignatureParseFilter. This key is a temporary copy and is not necessarily persisted in metadata.
        See Also:
        Constant Field Values
      • FETCH_INTERVAL_KEY

        public static final String FETCH_INTERVAL_KEY
        Key to store the current fetch interval value, must be listed in "metadata.persist".
        See Also:
        Constant Field Values
      • SIGNATURE_MODIFIED_KEY

        public static final String SIGNATURE_MODIFIED_KEY
        Key to store the date when the signature has been changed, must be listed in "metadata.persist".
        See Also:
        Constant Field Values
      • defaultfetchInterval

        protected int defaultfetchInterval
      • minFetchInterval

        protected int minFetchInterval
      • maxFetchInterval

        protected int maxFetchInterval
      • fetchIntervalDecRate

        protected float fetchIntervalDecRate
      • fetchIntervalIncRate

        protected float fetchIntervalIncRate
      • setLastModified

        protected boolean setLastModified
      • overwriteLastModified

        protected boolean overwriteLastModified
    • Constructor Detail

      • AdaptiveScheduler

        public AdaptiveScheduler()
    • Method Detail

      • init

        public void init​(Map<String,​Object> stormConf)
        Description copied from class: Scheduler
        Configuration of the scheduler based on the config. Should be called by Scheduler.getInstance().
        Overrides:
        init in class DefaultScheduler
      • schedule

        public Optional<Date> schedule​(Status status,
                                       Metadata metadata)
        Description copied from class: Scheduler
        Returns an optional Date indicating when the document should be refetched next, based on its status. It is empty if the URL should never be refetched.
        Overrides:
        schedule in class DefaultScheduler
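The return contract of `schedule` (a date for the next fetch, or an empty `Optional` if the URL should never be refetched) can be illustrated with a small sketch. It is not the scheduler's implementation: the class and method names are hypothetical, and using a negative interval to signal "never refetch" is an assumption made for this example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Date;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of the scheduler's API. */
final class NextFetchSketch {

    /**
     * Computes the next fetch date from a fetch interval in minutes.
     * Assumption of this sketch: a negative interval means "never refetch".
     */
    static Optional<Date> nextFetchDate(Instant now, int fetchIntervalMinutes) {
        if (fetchIntervalMinutes < 0) {
            return Optional.empty();
        }
        return Optional.of(Date.from(now.plus(Duration.ofMinutes(fetchIntervalMinutes))));
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2020-01-15T10:30:00Z");
        // One hour later: 2020-01-15T11:30:00Z
        System.out.println(nextFetchDate(now, 60).map(Date::toInstant));
        // Optional.empty
        System.out.println(nextFetchDate(now, -1));
    }
}
```

Callers would typically check `isPresent()` on the result before persisting a next fetch date.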