Class AdaptiveScheduler
- java.lang.Object
  - com.digitalpebble.stormcrawler.persistence.Scheduler
    - com.digitalpebble.stormcrawler.persistence.DefaultScheduler
      - com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
public class AdaptiveScheduler extends DefaultScheduler
Adaptive fetch scheduler. It checks by signature comparison whether a re-fetched page has changed:
- if yes, the fetch interval is shrunk, down to a minimum fetch interval
- if not, the fetch interval is increased, up to a maximum fetch interval
The rates at which the fetch interval is incremented or decremented are configurable.
Note that this scheduler requires the following metadata:
- signature: page signature, filled by MD5SignatureParseFilter
- signatureOld: (temporary) copy of the previous signature, optionally copied by MD5SignatureParseFilter
- fetch.statusCode: HTTP response status code, required to handle "HTTP 304 Not Modified" responses
- fetchInterval: current fetch interval
- signatureChangeDate: date when the signature last changed (ISO-8601 date time format)
- last-modified: last-modified time used to send If-Modified-Since HTTP requests, only written if scheduler.adaptive.setLastModified is true. Same date string as set in "signatureChangeDate".
The metadata field `last-modified` is assumed to be written only by the scheduler. In particular, the property `protocol.md.prefix` should not be empty, so that `last-modified` is not filled with an incorrect or ill-formed date from the HTTP header.
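The metadata fields listed above can be combined to decide whether a page has changed. The following is a minimal sketch, not the scheduler's actual code: it uses a plain `Map<String, String>` instead of StormCrawler's `Metadata` class, and the class `ChangeDetectionSketch`, the enum `Outcome`, and the rule that a missing signature yields `UNKNOWN` are assumptions made for illustration.

```java
import java.util.Map;

/** Hypothetical helper, not part of StormCrawler. */
final class ChangeDetectionSketch {

    enum Outcome { CHANGED, UNCHANGED, UNKNOWN }

    /** Decides from the metadata whether the re-fetched page has changed. */
    static Outcome detect(Map<String, String> metadata) {
        // HTTP 304 Not Modified: no content was fetched, so there is
        // no new signature to compare and the page counts as unchanged.
        if ("304".equals(metadata.get("fetch.statusCode"))) {
            return Outcome.UNCHANGED;
        }
        String current = metadata.get("signature");
        String previous = metadata.get("signatureOld");
        // Assumption: without both signatures no decision is possible.
        if (current == null || previous == null) {
            return Outcome.UNKNOWN;
        }
        return current.equals(previous) ? Outcome.UNCHANGED : Outcome.CHANGED;
    }
}
```

In this sketch a caller would shrink the fetch interval on `CHANGED`, grow it on `UNCHANGED`, and could fall back to a default schedule on `UNKNOWN`.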
Configuration
The following lines show how to configure the adaptive scheduler in the configuration file (crawler-conf.yaml):
    scheduler.class: "com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler"
    # set last-modified time ("last-modified") used in HTTP If-Modified-Since request header field
    scheduler.adaptive.setLastModified: true
    # min. interval in minutes (default: 1h)
    scheduler.adaptive.fetchInterval.min: 60
    # max. interval in minutes (default: 2 weeks)
    scheduler.adaptive.fetchInterval.max: 20160
    # increment and decrement rates (0.0 < rate <= 1.0)
    scheduler.adaptive.fetchInterval.rate.incr: .5
    scheduler.adaptive.fetchInterval.rate.decr: .5
    # required persisted metadata (in addition to other persisted metadata):
    metadata.persist:
     - ...
     - signature
     - fetch.statusCode
     - fetchInterval
     - last-modified
    # - signatureOld
    # - signatureChangeDate
    # Note: "signatureOld" and "signatureChangeDate" are optional, the adaptive
    # scheduler will also work if both are temporarily passed and not persisted.
To generate the signature and keep a copy of the last signature, the parse filters should be configured accordingly:
    "com.digitalpebble.stormcrawler.parse.ParseFilters": [
      ...,
      {
        "class": "com.digitalpebble.stormcrawler.parse.filter.MD5SignatureParseFilter",
        "name": "MD5Digest",
        "params": {
          "useText": "false",
          "keyName": "signature",
          "keyNameCopy": "signatureOld"
        }
      }
The order is mandatory: first copy the old signature, then generate the current one.
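The scheduler settings from crawler-conf.yaml reach the scheduler as a `Map<String,Object>` (see `init`). The sketch below shows one way such a map could be read and validated. The configuration keys and the defaults of 60 and 20160 minutes come from the sample configuration above. The class `AdaptiveConfigSketch`, its helper methods, the default rate of 0.5, and the default of `false` for `setLastModified` are assumptions made for illustration and are not StormCrawler API.

```java
import java.util.Map;

/** Hypothetical holder for the adaptive scheduler settings. */
final class AdaptiveConfigSketch {
    final boolean setLastModified;
    final int minFetchInterval; // minutes
    final int maxFetchInterval; // minutes
    final float incRate;
    final float decRate;

    AdaptiveConfigSketch(Map<String, Object> conf) {
        // Assumption: treated as false when the key is absent.
        Object flag = conf.get("scheduler.adaptive.setLastModified");
        setLastModified = Boolean.parseBoolean(String.valueOf(flag));
        minFetchInterval = intValue(conf, "scheduler.adaptive.fetchInterval.min", 60);
        maxFetchInterval = intValue(conf, "scheduler.adaptive.fetchInterval.max", 20160);
        incRate = rate(conf, "scheduler.adaptive.fetchInterval.rate.incr");
        decRate = rate(conf, "scheduler.adaptive.fetchInterval.rate.decr");
    }

    /** Reads a numeric value, falling back to the given default. */
    private static int intValue(Map<String, Object> conf, String key, int def) {
        Object v = conf.get(key);
        return v instanceof Number ? ((Number) v).intValue() : def;
    }

    /** Reads a rate and enforces the documented range 0.0 < rate <= 1.0. */
    private static float rate(Map<String, Object> conf, String key) {
        Object v = conf.get(key);
        // Assumption: 0.5 is used when the key is absent.
        float r = v instanceof Number ? ((Number) v).floatValue() : 0.5f;
        if (!(r > 0.0f && r <= 1.0f)) {
            throw new IllegalArgumentException(key + " must be in (0.0, 1.0], got " + r);
        }
        return r;
    }
}
```

Rejecting out-of-range rates at start-up is a design choice of this sketch; it surfaces a misconfiguration before any URL is scheduled.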
Field Summary
| Modifier and Type | Field | Description |
|---|---|---|
| protected int | defaultfetchInterval | |
| static String | FETCH_INTERVAL_KEY | Key to store the current fetch interval value, must be listed in "metadata.persist". |
| protected float | fetchIntervalDecRate | |
| protected float | fetchIntervalIncRate | |
| static String | INTERVAL_DEC_RATE | Configuration property (float) to set the decrement rate. |
| static String | INTERVAL_INC_RATE | Configuration property (float) to set the increment rate. |
| static String | INTERVAL_MAX | Configuration property (int) to set the maximum fetch interval in minutes. |
| static String | INTERVAL_MIN | Configuration property (int) to set the minimum fetch interval in minutes. |
| protected int | maxFetchInterval | |
| protected int | minFetchInterval | |
| protected boolean | overwriteLastModified | |
| static String | SET_LAST_MODIFIED | Configuration property (boolean) whether or not to set the "last-modified" metadata field when a page change was detected by signature comparison. |
| protected boolean | setLastModified | |
| static String | SIGNATURE_KEY | Name of the signature key in metadata, must be defined as "keyName" in the configuration of MD5SignatureParseFilter. |
| static String | SIGNATURE_MODIFIED_KEY | Key to store the date when the signature has been changed, must be listed in "metadata.persist". |
| static String | SIGNATURE_OLD_KEY | Name of key to hold previous signature: a copy, not overwritten by MD5SignatureParseFilter. |

Fields inherited from class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
DELAY_METADATA
Fields inherited from class com.digitalpebble.stormcrawler.persistence.Scheduler
schedulerClassParamName
Constructor Summary
| Constructor | Description |
|---|---|
| AdaptiveScheduler() | |
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| void | init(Map<String,Object> stormConf) | Configuration of the scheduler based on the config. |
| Optional<Date> | schedule(Status status, Metadata metadata) | Returns an optional Date indicating when the document should be refetched next, based on its status. |

Methods inherited from class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
checkCustomInterval
Methods inherited from class com.digitalpebble.stormcrawler.persistence.Scheduler
getInstance
Field Detail
SET_LAST_MODIFIED
public static final String SET_LAST_MODIFIED
Configuration property (boolean) whether or not to set the "last-modified" metadata field when a page change was detected by signature comparison.
See Also: Constant Field Values

INTERVAL_MIN
public static final String INTERVAL_MIN
Configuration property (int) to set the minimum fetch interval in minutes.
See Also: Constant Field Values

INTERVAL_MAX
public static final String INTERVAL_MAX
Configuration property (int) to set the maximum fetch interval in minutes.
See Also: Constant Field Values

INTERVAL_INC_RATE
public static final String INTERVAL_INC_RATE
Configuration property (float) to set the increment rate. If a page hasn't changed when refetched, the fetch interval is multiplied by (1.0 + incr_rate) until the max. fetch interval is reached.
See Also: Constant Field Values
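The increment rule for INTERVAL_INC_RATE can be written out as a short sketch. The multiplication by (1.0 + incr_rate) and the cap at the maximum interval follow the description of that property. The class `IntervalIncreaseSketch`, the method name, and the truncation of fractional minutes to an `int` are assumptions made for illustration.

```java
/** Hypothetical helper illustrating the increment rule; not StormCrawler code. */
final class IntervalIncreaseSketch {

    /**
     * Returns the next fetch interval (minutes) for an unchanged page:
     * the current interval times (1.0 + incRate), capped at maxInterval.
     */
    static int increase(int interval, float incRate, int maxInterval) {
        // Assumption: fractional minutes are truncated.
        int next = (int) (interval * (1.0f + incRate));
        return Math.min(next, maxInterval);
    }
}
```

With an increment rate of 0.5 and a maximum of 20160 minutes, an interval of 60 minutes grows to 90 and then to 135, while an interval of 20000 minutes is capped at 20160.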

INTERVAL_DEC_RATE
public static final String INTERVAL_DEC_RATE
Configuration property (float) to set the decrement rate. If a page has changed when refetched, the fetch interval is multiplied by (1.0 - decr_rate). If the fetch interval comes closer to the minimum interval, the decrementing is slowed down.
See Also: Constant Field Values
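The description of INTERVAL_DEC_RATE does not state how the decrementing is slowed down near the minimum. The sketch below shows one possible damping, chosen here as an assumption: the new interval is (1.0 - decr_rate) times the current interval plus decr_rate times the minimum interval. Far above the minimum this is close to a plain multiplication by (1.0 - decr_rate); at the minimum it leaves the interval unchanged. The class and method names are hypothetical.

```java
/** Hypothetical helper illustrating a damped decrement; not StormCrawler code. */
final class IntervalDecreaseSketch {

    /**
     * Returns the next fetch interval (minutes) for a changed page.
     * Assumed formula: (1 - decRate) * interval + decRate * minInterval.
     */
    static int decrease(int interval, float decRate, int minInterval) {
        float next = (1.0f - decRate) * interval + decRate * minInterval;
        // Guard against rounding below the configured minimum.
        return Math.max((int) next, minInterval);
    }
}
```

With a decrement rate of 0.5 and a minimum of 60 minutes, an interval of 1000 minutes shrinks to 530, and an interval already at 60 minutes stays at 60.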

SIGNATURE_KEY
public static final String SIGNATURE_KEY
Name of the signature key in metadata, must be defined as "keyName" in the configuration of MD5SignatureParseFilter. This key must be listed in "metadata.persist".
See Also: Constant Field Values

SIGNATURE_OLD_KEY
public static final String SIGNATURE_OLD_KEY
Name of key to hold previous signature: a copy, not overwritten by MD5SignatureParseFilter. This key is a temporary copy, not necessarily persisted in metadata.
See Also: Constant Field Values

FETCH_INTERVAL_KEY
public static final String FETCH_INTERVAL_KEY
Key to store the current fetch interval value, must be listed in "metadata.persist".
See Also: Constant Field Values

SIGNATURE_MODIFIED_KEY
public static final String SIGNATURE_MODIFIED_KEY
Key to store the date when the signature has been changed, must be listed in "metadata.persist".
See Also: Constant Field Values

defaultfetchInterval
protected int defaultfetchInterval

minFetchInterval
protected int minFetchInterval

maxFetchInterval
protected int maxFetchInterval

fetchIntervalDecRate
protected float fetchIntervalDecRate

fetchIntervalIncRate
protected float fetchIntervalIncRate

setLastModified
protected boolean setLastModified

overwriteLastModified
protected boolean overwriteLastModified
Method Detail
init
public void init(Map<String,Object> stormConf)
Description copied from class: Scheduler
Configuration of the scheduler based on the config. Should be called by Scheduler.getInstance().
Overrides: init in class DefaultScheduler

schedule
public Optional<Date> schedule(Status status, Metadata metadata)
Description copied from class: Scheduler
Returns an optional Date indicating when the document should be refetched next, based on its status. It is empty if the URL should never be refetched.
Overrides: schedule in class DefaultScheduler
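The return contract of `schedule` (a date for the next fetch, or empty if the URL should never be refetched) can be illustrated with a small sketch that turns a fetch interval in minutes into an `Optional<Date>`. The class `NextFetchDateSketch`, the method `nextFetch`, and the convention that a negative interval means "never refetch" are assumptions made for illustration.

```java
import java.util.Date;
import java.util.Optional;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper showing the Optional<Date> contract; not StormCrawler code. */
final class NextFetchDateSketch {

    /**
     * Returns the next fetch date, intervalMinutes after now.
     * Assumption: a negative interval means the URL is never refetched.
     */
    static Optional<Date> nextFetch(Date now, int intervalMinutes) {
        if (intervalMinutes < 0) {
            return Optional.empty();
        }
        long offsetMillis = TimeUnit.MINUTES.toMillis(intervalMinutes);
        return Optional.of(new Date(now.getTime() + offsetMillis));
    }
}
```

A caller would typically check `isPresent()` before storing the next fetch date and treat an empty result as "do not schedule again".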