Class HttpRobotRulesParser


  • public class HttpRobotRulesParser
    extends RobotRulesParser
    This class is used for parsing robots.txt rules for URLs belonging to the HTTP protocol. It extends the generic RobotRulesParser class and contains the HTTP-protocol-specific implementation for obtaining the robots.txt file.
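    For context, the sketch below shows what a caller can do with the crawlercommons.robots.BaseRobotRules object produced by this parser. It skips the HTTP fetch and parses an inline robots.txt directly with crawler-commons; the sample robots.txt content and the agent name "mycrawler" are illustrative assumptions, not part of this class.

        import java.nio.charset.StandardCharsets;

        import crawlercommons.robots.BaseRobotRules;
        import crawlercommons.robots.SimpleRobotRulesParser;

        public class RobotRulesUsageSketch {
            public static void main(String[] args) {
                // Illustrative robots.txt content; HttpRobotRulesParser would
                // normally fetch this over HTTP from the URL's host.
                String robotsTxt = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n";

                BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                        "http://www.example.com/robots.txt",
                        robotsTxt.getBytes(StandardCharsets.UTF_8),
                        "text/plain",
                        "mycrawler"); // assumed agent name

                System.out.println(rules.isAllowed("http://www.example.com/private/page.html")); // false
                System.out.println(rules.isAllowed("http://www.example.com/index.html"));        // true
                System.out.println(rules.getCrawlDelay()); // crawl delay in milliseconds
            }
        }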
    • Field Detail

      • allowForbidden

        protected boolean allowForbidden
      • fetchRobotsMd

        protected Metadata fetchRobotsMd
    • Constructor Detail

      • HttpRobotRulesParser

        public HttpRobotRulesParser​(org.apache.storm.Config conf)
    • Method Detail

      • getCacheKey

        protected static String getCacheKey​(URL url)
        Compose a unique key to store and access robot rules in the cache for the given URL.
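        A minimal sketch of how such a key can be composed from the protocol, host, and port of a URL is shown below; it illustrates the idea and is not necessarily the exact key format used by this class.

            import java.net.URL;
            import java.util.Locale;

            public class CacheKeySketch {
                // Illustrative key: lower-cased protocol and host plus the effective port.
                static String cacheKey(URL url) {
                    int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
                    return url.getProtocol().toLowerCase(Locale.ROOT) + ":"
                            + url.getHost().toLowerCase(Locale.ROOT) + ":" + port;
                }

                public static void main(String[] args) throws Exception {
                    // Prints "https:www.example.com:443"
                    System.out.println(cacheKey(new URL("https://www.example.com/a/b.html")));
                }
            }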
      • getRobotRulesSetFromCache

        public crawlercommons.robots.BaseRobotRules getRobotRulesSetFromCache​(URL url)
        Returns the robot rules from the cache, or empty rules if none are found.
        See Also:
        RobotsFilter
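        The lookup described above can be illustrated with a plain map standing in for the cache maintained by the parent RobotRulesParser; the key format and the "allow all" semantics of the empty-rules fallback are assumptions made for this sketch.

            import java.net.URL;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.ConcurrentMap;

            import crawlercommons.robots.BaseRobotRules;
            import crawlercommons.robots.SimpleRobotRules;
            import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

            public class CacheLookupSketch {
                // Stand-in for the rules cache held by the parent RobotRulesParser.
                static final ConcurrentMap<String, BaseRobotRules> CACHE = new ConcurrentHashMap<>();
                // Stand-in for the "empty rules" fallback (ALLOW_ALL semantics assumed).
                static final BaseRobotRules EMPTY_RULES = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);

                static BaseRobotRules fromCache(URL url) {
                    String key = url.getProtocol() + ":" + url.getHost() + ":" + url.getPort();
                    return CACHE.getOrDefault(key, EMPTY_RULES);
                }
            }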
      • getRobotRulesSet

        public crawlercommons.robots.BaseRobotRules getRobotRulesSet​(Protocol http,
                                                                     URL url)
        Get the rules from robots.txt which apply to the given URL. Robot rules are cached for a unique combination of protocol, host, and port. If no rules are found in the cache, an HTTP request is sent to fetch protocol://host:port/robots.txt. The robots.txt file is then parsed and the rules are cached to avoid re-fetching and re-parsing it.
        Specified by:
        getRobotRulesSet in class RobotRulesParser
        Parameters:
        http - The Protocol object
        url - URL robots.txt applies to
        Returns:
        BaseRobotRules holding the rules from robots.txt
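        The overall behaviour described above can be approximated outside of Storm as in the sketch below. It uses java.net.http.HttpClient and crawler-commons directly; the cache map, the agent name "mycrawler", and the error handling are illustrative assumptions and deliberately ignore details such as redirects, the allowForbidden setting, and metadata passed when fetching robots.txt.

            import java.net.URI;
            import java.net.URL;
            import java.net.http.HttpClient;
            import java.net.http.HttpRequest;
            import java.net.http.HttpResponse;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.ConcurrentMap;

            import crawlercommons.robots.BaseRobotRules;
            import crawlercommons.robots.SimpleRobotRulesParser;

            public class RobotRulesFlowSketch {
                static final ConcurrentMap<String, BaseRobotRules> CACHE = new ConcurrentHashMap<>();
                static final SimpleRobotRulesParser PARSER = new SimpleRobotRulesParser();
                static final HttpClient CLIENT = HttpClient.newHttpClient();

                static BaseRobotRules getRobotRulesSet(URL url) {
                    // 1. Cache key: unique per protocol, host and port.
                    int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
                    String key = url.getProtocol() + ":" + url.getHost() + ":" + port;
                    BaseRobotRules cached = CACHE.get(key);
                    if (cached != null) {
                        return cached;
                    }
                    // 2. Not cached: fetch protocol://host:port/robots.txt.
                    String robotsUrl = url.getProtocol() + "://" + url.getHost() + ":" + port + "/robots.txt";
                    BaseRobotRules rules;
                    try {
                        HttpResponse<byte[]> resp = CLIENT.send(
                                HttpRequest.newBuilder(URI.create(robotsUrl)).GET().build(),
                                HttpResponse.BodyHandlers.ofByteArray());
                        if (resp.statusCode() == 200) {
                            // 3. Parse the rules ("mycrawler" is an assumed agent name).
                            rules = PARSER.parseContent(robotsUrl, resp.body(), "text/plain", "mycrawler");
                        } else {
                            // Delegate non-200 responses to crawler-commons' default handling.
                            rules = PARSER.failedFetch(resp.statusCode());
                        }
                    } catch (Exception e) {
                        rules = PARSER.failedFetch(500);
                    }
                    // 4. Cache the parsed rules to avoid re-fetching and re-parsing.
                    CACHE.put(key, rules);
                    return rules;
                }
            }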