Class RobotRulesParser

  • Direct Known Subclasses:
    HttpRobotRulesParser

    public abstract class RobotRulesParser
    extends Object
    This class uses crawler-commons for handling the parsing of robots.txt files. It emits SimpleRobotRules objects, which describe the download permissions, as documented in SimpleRobotRulesParser.
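    A minimal usage sketch is shown below: a bolt or fetcher obtains a rule set for a URL and checks whether the URL may be fetched. The package names, the no-argument HttpRobotRulesParser constructor and the helper methods are illustrative assumptions, not part of this API.

        // Illustrative sketch only: package names and constructors are assumptions.
        import org.apache.storm.Config;

        import com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser;
        import com.digitalpebble.stormcrawler.protocol.Protocol;
        import com.digitalpebble.stormcrawler.protocol.RobotRulesParser;

        import crawlercommons.robots.BaseRobotRules;

        public class RobotsCheckExample {

            // Build a parser and hand it the topology configuration.
            static RobotRulesParser newParser(Config conf) {
                RobotRulesParser parser = new HttpRobotRulesParser(); // no-arg constructor assumed
                parser.setConf(conf);
                return parser;
            }

            // Decide whether a URL may be fetched according to its robots.txt.
            static boolean mayFetch(RobotRulesParser parser, Protocol protocol, String url) {
                BaseRobotRules rules = parser.getRobotRulesSet(protocol, url);
                return rules.isAllowed(url);
            }
        }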
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected String agentNames  
      protected static com.github.benmanes.caffeine.cache.Cache&lt;String,RobotRules&gt; CACHE  
      static String cacheConfigParamName
      Parameter name to configure the cache for robots. See CacheBuilderSpec: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html. Default value is "maximumSize=10000,expireAfterWrite=6h".
      static crawlercommons.robots.BaseRobotRules EMPTY_RULES
      A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
      protected static com.github.benmanes.caffeine.cache.Cache&lt;String,RobotRules&gt; ERRORCACHE  
      static String errorcacheConfigParamName
      Parameter name to configure the cache for robots errors. See CacheBuilderSpec: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html. Default value is "maximumSize=10000,expireAfterWrite=1h".
      static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
      A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
      static org.slf4j.Logger LOG  
    • Field Detail

      • LOG

        public static final org.slf4j.Logger LOG
      • CACHE

        protected static com.github.benmanes.caffeine.cache.Cache&lt;String,RobotRules&gt; CACHE
      • ERRORCACHE

        protected static com.github.benmanes.caffeine.cache.Cache&lt;String,RobotRules&gt; ERRORCACHE
      • cacheConfigParamName

        public static final String cacheConfigParamName
        Parameter name to configure the cache for robots. See CacheBuilderSpec: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html. Default value is "maximumSize=10000,expireAfterWrite=6h".
        See Also:
        Constant Field Values
      • errorcacheConfigParamName

        public static final String errorcacheConfigParamName
        Parameter name to configure the cache for robots errors. See CacheBuilderSpec: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html. Default value is "maximumSize=10000,expireAfterWrite=1h".
        See Also:
        Constant Field Values
      • EMPTY_RULES

        public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
        A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
      • FORBID_ALL_RULES

        public static final crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
        A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
      • agentNames

        protected String agentNames
    • Constructor Detail

      • RobotRulesParser

        public RobotRulesParser()
    • Method Detail

      • setConf

        public void setConf​(org.apache.storm.Config conf)
        Set the Configuration object
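        As a hedged illustration, the robots caches can be tuned by setting the two parameter names documented above (cacheConfigParamName and errorcacheConfigParamName) in the configuration passed to this method; the values follow the CacheBuilderSpec syntax. The package of RobotRulesParser in the import is an assumption.

            import org.apache.storm.Config;

            import com.digitalpebble.stormcrawler.protocol.RobotRulesParser; // package assumed

            class RobotsCacheConfigExample {
                static void configure(RobotRulesParser parser) {
                    Config conf = new Config();
                    // Cache specs use the CacheBuilderSpec syntax referenced in the field documentation.
                    conf.put(RobotRulesParser.cacheConfigParamName, "maximumSize=20000,expireAfterWrite=12h");
                    conf.put(RobotRulesParser.errorcacheConfigParamName, "maximumSize=20000,expireAfterWrite=2h");
                    parser.setConf(conf);
                }
            }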
      • parseRules

        public crawlercommons.robots.BaseRobotRules parseRules​(String url,
                                                               byte[] content,
                                                               String contentType,
                                                               String robotName)
        Parses the robots.txt content using the SimpleRobotRulesParser from crawler-commons
        Parameters:
        url - A string containing the URL of the robots.txt file
        content - Contents of the robots.txt file in a byte array
        contentType - The content type of the fetched robots.txt file
        robotName - A string containing the agent name(s) to match against the robots.txt rules
        Returns:
        BaseRobotRules object parsed from the robots.txt content
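        A hedged example of calling this method directly on raw robots.txt bytes; the robotsParser instance is assumed to be a configured concrete subclass, and the package of RobotRulesParser in the import is an assumption.

            import java.nio.charset.StandardCharsets;

            import com.digitalpebble.stormcrawler.protocol.RobotRulesParser; // package assumed

            import crawlercommons.robots.BaseRobotRules;

            class ParseRulesExample {
                static BaseRobotRules parse(RobotRulesParser robotsParser) {
                    byte[] content = "User-agent: *\nDisallow: /private/\n".getBytes(StandardCharsets.UTF_8);
                    return robotsParser.parseRules(
                            "https://example.com/robots.txt", // URL of the robots.txt file
                            content,                          // raw robots.txt bytes
                            "text/plain",                     // content type of the fetch
                            "mycrawler");                     // agent name to match
                }
            }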
      • getRobotRulesSet

        public crawlercommons.robots.BaseRobotRules getRobotRulesSet​(Protocol protocol,
                                                                     String url)
      • getRobotRulesSet

        public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet​(Protocol protocol,
                                                                              URL url)