Class HttpRobotRulesParser


  • public class HttpRobotRulesParser
    extends RobotRulesParser
    This class is used for parsing robots.txt rules for URLs belonging to the HTTP protocol. It extends the generic RobotRulesParser class and contains the HTTP-protocol-specific implementation for obtaining the robots.txt file.
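    For context, the sketch below shows what a caller can do with the crawlercommons.robots.BaseRobotRules object produced by this parser. It skips the HTTP fetch and parses an inline robots.txt directly with crawler-commons; the sample robots.txt content and the agent name "mycrawler" are illustrative assumptions, not part of this class.

        import java.nio.charset.StandardCharsets;

        import crawlercommons.robots.BaseRobotRules;
        import crawlercommons.robots.SimpleRobotRulesParser;

        public class RobotRulesUsageSketch {
            public static void main(String[] args) {
                // Illustrative robots.txt content; HttpRobotRulesParser would
                // normally fetch this over HTTP from the URL's host.
                String robotsTxt = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n";

                BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
                        "http://www.example.com/robots.txt",
                        robotsTxt.getBytes(StandardCharsets.UTF_8),
                        "text/plain",
                        "mycrawler"); // assumed agent name

                System.out.println(rules.isAllowed("http://www.example.com/private/page.html")); // false
                System.out.println(rules.isAllowed("http://www.example.com/index.html"));        // true
                System.out.println(rules.getCrawlDelay()); // crawl delay in milliseconds
            }
        }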
    • Field Detail

      • allowForbidden

        protected boolean allowForbidden
      • fetchRobotsMd

        protected Metadata fetchRobotsMd
    • Constructor Detail

      • HttpRobotRulesParser

        public HttpRobotRulesParser​(org.apache.storm.Config conf)
    • Method Detail

      • getCacheKey

        protected static String getCacheKey​(URL url)
        Compose a unique key to store and access robot rules in the cache for the given URL.
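        A minimal sketch of how such a key can be composed from the protocol, host, and port of a URL is shown below; it illustrates the idea and is not necessarily the exact key format used by this class.

            import java.net.URL;
            import java.util.Locale;

            public class CacheKeySketch {
                // Illustrative key: lower-cased protocol and host plus the effective port.
                static String cacheKey(URL url) {
                    int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
                    return url.getProtocol().toLowerCase(Locale.ROOT) + ":"
                            + url.getHost().toLowerCase(Locale.ROOT) + ":" + port;
                }

                public static void main(String[] args) throws Exception {
                    // Prints "https:www.example.com:443"
                    System.out.println(cacheKey(new URL("https://www.example.com/a/b.html")));
                }
            }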
      • getRobotRulesSetFromCache

        public crawlercommons.robots.BaseRobotRules getRobotRulesSetFromCache​(URL url)
        Returns the robot rules from the cache, or empty rules if none are found.
        See Also:
        RobotsFilter
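        The lookup described above can be illustrated with a plain map standing in for the cache maintained by the parent RobotRulesParser; the key format and the "allow all" semantics of the empty-rules fallback are assumptions made for this sketch.

            import java.net.URL;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.ConcurrentMap;

            import crawlercommons.robots.BaseRobotRules;
            import crawlercommons.robots.SimpleRobotRules;
            import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

            public class CacheLookupSketch {
                // Stand-in for the rules cache held by the parent RobotRulesParser.
                static final ConcurrentMap<String, BaseRobotRules> CACHE = new ConcurrentHashMap<>();
                // Stand-in for the "empty rules" fallback (ALLOW_ALL semantics assumed).
                static final BaseRobotRules EMPTY_RULES = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);

                static BaseRobotRules fromCache(URL url) {
                    String key = url.getProtocol() + ":" + url.getHost() + ":" + url.getPort();
                    return CACHE.getOrDefault(key, EMPTY_RULES);
                }
            }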
      • getRobotRulesSet

        public crawlercommons.robots.BaseRobotRules getRobotRulesSet​(Protocol http,
                                                                     URL url)
        Get the rules from robots.txt which apply to the given URL. Robot rules are cached for a unique combination of protocol, host, and port. If no rules are found in the cache, an HTTP request is sent to fetch protocol://host:port/robots.txt. The robots.txt file is then parsed and the rules are cached to avoid re-fetching and re-parsing it.
        Specified by:
        getRobotRulesSet in class RobotRulesParser
        Parameters:
        http - The Protocol object
        url - URL robots.txt applies to
        Returns:
        BaseRobotRules holding the rules from robots.txt
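        The overall behaviour described above can be approximated outside of Storm as in the sketch below. It uses java.net.http.HttpClient and crawler-commons directly; the cache map, the agent name "mycrawler", and the error handling are illustrative assumptions and deliberately ignore details such as redirects, the allowForbidden setting, and metadata passed when fetching robots.txt.

            import java.net.URI;
            import java.net.URL;
            import java.net.http.HttpClient;
            import java.net.http.HttpRequest;
            import java.net.http.HttpResponse;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.ConcurrentMap;

            import crawlercommons.robots.BaseRobotRules;
            import crawlercommons.robots.SimpleRobotRulesParser;

            public class RobotRulesFlowSketch {
                static final ConcurrentMap<String, BaseRobotRules> CACHE = new ConcurrentHashMap<>();
                static final SimpleRobotRulesParser PARSER = new SimpleRobotRulesParser();
                static final HttpClient CLIENT = HttpClient.newHttpClient();

                static BaseRobotRules getRobotRulesSet(URL url) {
                    // 1. Cache key: unique per protocol, host and port.
                    int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
                    String key = url.getProtocol() + ":" + url.getHost() + ":" + port;
                    BaseRobotRules cached = CACHE.get(key);
                    if (cached != null) {
                        return cached;
                    }
                    // 2. Not cached: fetch protocol://host:port/robots.txt.
                    String robotsUrl = url.getProtocol() + "://" + url.getHost() + ":" + port + "/robots.txt";
                    BaseRobotRules rules;
                    try {
                        HttpResponse<byte[]> resp = CLIENT.send(
                                HttpRequest.newBuilder(URI.create(robotsUrl)).GET().build(),
                                HttpResponse.BodyHandlers.ofByteArray());
                        if (resp.statusCode() == 200) {
                            // 3. Parse the rules ("mycrawler" is an assumed agent name).
                            rules = PARSER.parseContent(robotsUrl, resp.body(), "text/plain", "mycrawler");
                        } else {
                            // Delegate non-200 responses to crawler-commons' default handling.
                            rules = PARSER.failedFetch(resp.statusCode());
                        }
                    } catch (Exception e) {
                        rules = PARSER.failedFetch(500);
                    }
                    // 4. Cache the parsed rules to avoid re-fetching and re-parsing.
                    CACHE.put(key, rules);
                    return rules;
                }
            }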