Class HttpRobotRulesParser
- java.lang.Object
  - com.digitalpebble.stormcrawler.protocol.RobotRulesParser
    - com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
public class HttpRobotRulesParser extends RobotRulesParser
This class is used for parsing robots.txt files for URLs fetched over the HTTP protocol. It extends the generic RobotRulesParser
class and contains the HTTP-specific implementation for obtaining the robots.txt file.
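A minimal sketch of how the parser might be used outside a topology. The helper method and its name are illustrative, and a concrete Protocol implementation (normally supplied by the topology's protocol factory) is assumed to be available:

    import java.net.URL;

    import org.apache.storm.Config;

    import com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser;
    import com.digitalpebble.stormcrawler.protocol.Protocol;

    import crawlercommons.robots.BaseRobotRules;

    public class RobotsCheck {
        // Hypothetical helper: decides whether a page may be fetched.
        static boolean canFetch(Protocol http, Config conf, String pageUrl) throws Exception {
            HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
            URL url = new URL(pageUrl);
            // Fetches, parses and caches robots.txt on the first call
            // for a given host/protocol/port combination.
            BaseRobotRules rules = parser.getRobotRulesSet(http, url);
            return rules.isAllowed(pageUrl);
        }
    }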
-
Field Summary
Fields
Modifier and Type    Field            Description
protected boolean    allowForbidden
protected Metadata   fetchRobotsMd
-
Fields inherited from class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
agentNames, CACHE, cacheConfigParamName, EMPTY_RULES, ERRORCACHE, errorcacheConfigParamName, FORBID_ALL_RULES, LOG
-
-
Constructor Summary
Constructors
Constructor                                          Description
HttpRobotRulesParser(org.apache.storm.Config conf)
-
Method Summary
Modifier and Type                      Method                                     Description
protected static String               getCacheKey(URL url)                       Compose a unique key to store and access robot rules in the cache for the given URL
crawlercommons.robots.BaseRobotRules  getRobotRulesSet(Protocol http, URL url)   Get the rules from robots.txt which apply to the given url.
crawlercommons.robots.BaseRobotRules  getRobotRulesSetFromCache(URL url)         Returns the robots rules from the cache, or empty rules if not found
void                                  setConf(org.apache.storm.Config conf)      Set the Configuration object
-
Methods inherited from class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
getRobotRulesSet, parseRules
-
Field Detail
-
allowForbidden
protected boolean allowForbidden
-
fetchRobotsMd
protected Metadata fetchRobotsMd
-
-
Method Detail
-
setConf
public void setConf(org.apache.storm.Config conf)
Description copied from class: RobotRulesParser
Set the Configuration object
- Overrides:
setConf in class RobotRulesParser
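For illustration, a sketch of passing configuration to the parser. http.agent.name is a standard StormCrawler key; http.robots.403.allow (borrowed from Nutch's convention) is assumed here to drive the allowForbidden field above:

    Config conf = new Config();
    conf.put("http.agent.name", "MyCrawler");
    // Assumed key: whether a 403 on robots.txt is treated as allow-all.
    conf.put("http.robots.403.allow", true);

    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
    // setConf can also be called directly to (re)apply configuration.
    parser.setConf(conf);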
-
getCacheKey
protected static String getCacheKey(URL url)
Compose a unique key to store and access robot rules in the cache for the given URL
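The exact key format is an implementation detail; per the getRobotRulesSet description below, it only needs to be unique per protocol/host/port combination. A purely illustrative sketch:

    // Illustrative only: any format unique per protocol/host/port works.
    static String cacheKeySketch(URL url) {
        int port = url.getPort() < 0 ? url.getDefaultPort() : url.getPort();
        return url.getProtocol().toLowerCase(java.util.Locale.ROOT) + ":"
                + url.getHost().toLowerCase(java.util.Locale.ROOT) + ":" + port;
    }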
-
getRobotRulesSetFromCache
public crawlercommons.robots.BaseRobotRules getRobotRulesSetFromCache(URL url)
Returns the robots rules from the cache, or empty rules if not found
- See Also:
RobotsFilter
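A cache-only lookup never triggers a network fetch, which makes it suitable for components such as the RobotsFilter referenced above. The sketch below assumes, per the wording here, that a cache miss yields empty rules rather than null:

    // Cache-only: no HTTP request is made. On a miss, empty rules
    // are returned rather than null.
    BaseRobotRules cached = parser.getRobotRulesSetFromCache(url);
    boolean allowedSoFar = cached.isAllowed(url.toString());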
-
getRobotRulesSet
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url)
Get the rules from robots.txt which apply to the given url. Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch protocol://host:port/robots.txt. The robots.txt file is then parsed, and the rules are cached to avoid re-fetching and re-parsing it.
- Specified by:
getRobotRulesSet in class RobotRulesParser
- Parameters:
http - the Protocol object
url - the URL the robots.txt applies to
- Returns:
BaseRobotRules holding the rules from robots.txt
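A sketch of the caching behaviour just described, reusing the parser and http objects assumed in the earlier example:

    // First call for this host/protocol/port: robots.txt is fetched,
    // parsed, and the resulting rules are cached.
    BaseRobotRules rules =
            parser.getRobotRulesSet(http, new URL("https://example.com/a.html"));

    // Same host/protocol/port: answered from the cache, no request.
    BaseRobotRules cachedRules =
            parser.getRobotRulesSet(http, new URL("https://example.com/b.html"));

    // crawler-commons rules expose per-URL permission and crawl delay.
    boolean ok = rules.isAllowed("https://example.com/a.html");
    long delayMs = rules.getCrawlDelay();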