Class RobotRulesParser
java.lang.Object
  com.digitalpebble.stormcrawler.protocol.RobotRulesParser

Direct Known Subclasses:
  HttpRobotRulesParser

public abstract class RobotRulesParser extends Object
This class uses crawler-commons for handling the parsing of robots.txt files. It emits SimpleRobotRules objects, which describe the download permissions as described in SimpleRobotRulesParser.
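To illustrate the crawler-commons machinery this class builds on, here is a minimal sketch of parsing a robots.txt body with SimpleRobotRulesParser and querying the resulting rules. The URL, agent name and robots.txt content are made up for the example; it assumes crawler-commons is on the classpath.

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotRulesExample {
    public static void main(String[] args) {
        // A minimal robots.txt: disallow /private/ for every agent
        byte[] content = ("User-agent: *\n"
                + "Disallow: /private/\n")
                .getBytes(java.nio.charset.StandardCharsets.UTF_8);

        // Parse the content into a BaseRobotRules object, as this class
        // does internally via crawler-commons
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "http://example.com/robots.txt", // URL the rules apply to
                content,
                "text/plain",
                "mybot"); // agent name matched against User-agent lines

        System.out.println(rules.isAllowed("http://example.com/private/page")); // disallowed
        System.out.println(rules.isAllowed("http://example.com/index.html"));   // allowed
    }
}
```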
-
Field Summary
Fields

protected String agentNames

protected static com.github.benmanes.caffeine.cache.Cache<String,RobotRules> CACHE

static String cacheConfigParamName
    Parameter name to configure the cache for robots. See http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html. Default value is "maximumSize=10000,expireAfterWrite=6h".

static crawlercommons.robots.BaseRobotRules EMPTY_RULES
    A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.

protected static com.github.benmanes.caffeine.cache.Cache<String,RobotRules> ERRORCACHE

static String errorcacheConfigParamName
    Parameter name to configure the cache for robots errors. See http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html. Default value is "maximumSize=10000,expireAfterWrite=1h".

static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
    A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.

static org.slf4j.Logger LOG
-
Constructor Summary
RobotRulesParser()
-
Method Summary
crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, String url)

abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url)

crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
    Parses the robots content using the SimpleRobotRulesParser from crawler-commons.

void setConf(org.apache.storm.Config conf)
    Set the Configuration object.
-
Field Detail
-
LOG
public static final org.slf4j.Logger LOG
-
CACHE
protected static com.github.benmanes.caffeine.cache.Cache<String,RobotRules> CACHE
-
ERRORCACHE
protected static com.github.benmanes.caffeine.cache.Cache<String,RobotRules> ERRORCACHE
-
cacheConfigParamName
public static final String cacheConfigParamName
Parameter name to configure the cache for robots. See http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html. Default value is "maximumSize=10000,expireAfterWrite=6h".
See Also:
- Constant Field Values
-
errorcacheConfigParamName
public static final String errorcacheConfigParamName
Parameter name to configure the cache for robots errors. See http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html. Default value is "maximumSize=10000,expireAfterWrite=1h".
See Also:
- Constant Field Values
-
EMPTY_RULES
public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
-
FORBID_ALL_RULES
public static final crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
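The allow-all and forbid-all semantics of these two constants can be sketched with the rule modes crawler-commons provides on SimpleRobotRules; the exact construction of EMPTY_RULES and FORBID_ALL_RULES inside this class is not shown here, so this is an illustrative equivalent, not the class's actual initialisation code.

```java
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class RuleModesExample {
    public static void main(String[] args) {
        // Allow-all rules, as used when robots.txt is empty or missing
        SimpleRobotRules allowAll = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // Allow-none rules, as used after a 403/Forbidden response
        SimpleRobotRules allowNone = new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);

        System.out.println(allowAll.isAllowed("http://example.com/anything"));  // true
        System.out.println(allowNone.isAllowed("http://example.com/anything")); // false
    }
}
```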
-
agentNames
protected String agentNames
-
-
Method Detail
-
setConf
public void setConf(org.apache.storm.Config conf)
Set the Configuration object.
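A configuration fragment sketching how the two cache parameters could be tuned before the configuration is handed to setConf(). The spec strings follow the CacheBuilderSpec syntax of the documented defaults ("maximumSize=10000,expireAfterWrite=6h"); the sizes and durations here are illustrative, not recommendations.

```java
import org.apache.storm.Config;
import com.digitalpebble.stormcrawler.protocol.RobotRulesParser;

public class RobotsCacheConfig {
    static Config robotsConf() {
        Config conf = new Config();
        // Larger cache and longer lifetime for successfully fetched rules
        conf.put(RobotRulesParser.cacheConfigParamName,
                "maximumSize=20000,expireAfterWrite=12h");
        // Shorter lifetime for error entries so failed fetches are retried sooner
        conf.put(RobotRulesParser.errorcacheConfigParamName,
                "maximumSize=20000,expireAfterWrite=2h");
        return conf;
    }
}
```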
-
parseRules
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
Parses the robots content using the SimpleRobotRulesParser from crawler-commons.
Parameters:
url - A string containing the URL
content - Contents of the robots file in a byte array
contentType - The content type of the robots file
robotName - A string containing the agent name(s)
Returns:
BaseRobotRules object
-
getRobotRulesSet
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, String url)