Class RobotRulesParser

  • Direct Known Subclasses:
    HttpRobotRulesParser

    public abstract class RobotRulesParser
    extends Object
    This class uses crawler-commons for handling the parsing of robots.txt files. It emits SimpleRobotRules objects, which describe the download permissions, as documented in SimpleRobotRulesParser.
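    A minimal usage sketch is shown below: a bolt or fetcher obtains a rule set for a URL and checks whether the URL may be fetched. The package names, the no-argument HttpRobotRulesParser constructor and the helper methods are illustrative assumptions, not part of this API.

        // Illustrative sketch only: package names and constructors are assumptions.
        import org.apache.storm.Config;

        import com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser;
        import com.digitalpebble.stormcrawler.protocol.Protocol;
        import com.digitalpebble.stormcrawler.protocol.RobotRulesParser;

        import crawlercommons.robots.BaseRobotRules;

        public class RobotsCheckExample {

            // Build a parser and hand it the topology configuration.
            static RobotRulesParser newParser(Config conf) {
                RobotRulesParser parser = new HttpRobotRulesParser(); // no-arg constructor assumed
                parser.setConf(conf);
                return parser;
            }

            // Decide whether a URL may be fetched according to its robots.txt.
            static boolean mayFetch(RobotRulesParser parser, Protocol protocol, String url) {
                BaseRobotRules rules = parser.getRobotRulesSet(protocol, url);
                return rules.isAllowed(url);
            }
        }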
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected String agentNames  
      protected static com.github.benmanes.caffeine.cache.Cache&lt;String,RobotRules&gt; CACHE  
      static String cacheConfigParamName
      Parameter name to configure the cache for robots. See CacheBuilderSpec: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html. Default value is "maximumSize=10000,expireAfterWrite=6h".
      static crawlercommons.robots.BaseRobotRules EMPTY_RULES
      A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
      protected static com.github.benmanes.caffeine.cache.Cache&lt;String,RobotRules&gt; ERRORCACHE  
      static String errorcacheConfigParamName
      Parameter name to configure the cache for robots errors. See CacheBuilderSpec: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html. Default value is "maximumSize=10000,expireAfterWrite=1h".
      static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
      A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
      static org.slf4j.Logger LOG  
    • Field Detail

      • LOG

        public static final org.slf4j.Logger LOG
      • CACHE

        protected static com.github.benmanes.caffeine.cache.Cache&lt;String,RobotRules&gt; CACHE
      • ERRORCACHE

        protected static com.github.benmanes.caffeine.cache.Cache&lt;String,RobotRules&gt; ERRORCACHE
      • cacheConfigParamName

        public static final String cacheConfigParamName
        Parameter name to configure the cache for robots. See CacheBuilderSpec: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html. Default value is "maximumSize=10000,expireAfterWrite=6h".
        See Also:
        Constant Field Values
      • errorcacheConfigParamName

        public static final String errorcacheConfigParamName
        Parameter name to configure the cache for robots errors. See CacheBuilderSpec: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html. Default value is "maximumSize=10000,expireAfterWrite=1h".
        See Also:
        Constant Field Values
      • EMPTY_RULES

        public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
        A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
      • FORBID_ALL_RULES

        public static final crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
        A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
      • agentNames

        protected String agentNames
    • Constructor Detail

      • RobotRulesParser

        public RobotRulesParser()
    • Method Detail

      • setConf

        public void setConf​(org.apache.storm.Config conf)
        Set the Configuration object
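        As a hedged illustration, the robots caches can be tuned by setting the two parameter names documented above (cacheConfigParamName and errorcacheConfigParamName) in the configuration passed to this method; the values follow the CacheBuilderSpec syntax. The package of RobotRulesParser in the import is an assumption.

            import org.apache.storm.Config;

            import com.digitalpebble.stormcrawler.protocol.RobotRulesParser; // package assumed

            class RobotsCacheConfigExample {
                static void configure(RobotRulesParser parser) {
                    Config conf = new Config();
                    // Cache specs use the CacheBuilderSpec syntax referenced in the field documentation.
                    conf.put(RobotRulesParser.cacheConfigParamName, "maximumSize=20000,expireAfterWrite=12h");
                    conf.put(RobotRulesParser.errorcacheConfigParamName, "maximumSize=20000,expireAfterWrite=2h");
                    parser.setConf(conf);
                }
            }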
      • parseRules

        public crawlercommons.robots.BaseRobotRules parseRules​(String url,
                                                               byte[] content,
                                                               String contentType,
                                                               String robotName)
        Parses the robots.txt content using the SimpleRobotRulesParser from crawler-commons
        Parameters:
        url - A string containing the URL of the robots.txt file
        content - Contents of the robots.txt file in a byte array
        contentType - The content type of the fetched robots.txt file
        robotName - A string containing the agent name(s) to match against the robots.txt rules
        Returns:
        BaseRobotRules object parsed from the robots.txt content
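        A hedged example of calling this method directly on raw robots.txt bytes; the robotsParser instance is assumed to be a configured concrete subclass, and the package of RobotRulesParser in the import is an assumption.

            import java.nio.charset.StandardCharsets;

            import com.digitalpebble.stormcrawler.protocol.RobotRulesParser; // package assumed

            import crawlercommons.robots.BaseRobotRules;

            class ParseRulesExample {
                static BaseRobotRules parse(RobotRulesParser robotsParser) {
                    byte[] content = "User-agent: *\nDisallow: /private/\n".getBytes(StandardCharsets.UTF_8);
                    return robotsParser.parseRules(
                            "https://example.com/robots.txt", // URL of the robots.txt file
                            content,                          // raw robots.txt bytes
                            "text/plain",                     // content type of the fetch
                            "mycrawler");                     // agent name to match
                }
            }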
      • getRobotRulesSet

        public crawlercommons.robots.BaseRobotRules getRobotRulesSet​(Protocol protocol,
                                                                     String url)
      • getRobotRulesSet

        public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet​(Protocol protocol,
                                                                              URL url)