Class CrawlServer

java.lang.Object
org.archive.modules.net.CrawlServer
All Implemented Interfaces:
Serializable, FetchStats.HasFetchStats, org.archive.util.IdentityCacheable

public class CrawlServer
extends Object
implements Serializable, FetchStats.HasFetchStats, org.archive.util.IdentityCacheable
Represents a single remote "server". A server is a service on a host; there may be more than one service on a host, differentiated by port number.
Author:
gojomo
See Also:
Serialized Form
  • Field Details

    • ROBOTS_NOT_FETCHED

      public static final long ROBOTS_NOT_FETCHED
      See Also:
      Constant Field Values
    • MIN_ROBOTS_RETRIES

      public static final long MIN_ROBOTS_RETRIES
      Only check whether a robots.txt fetch may be superfluous after this many tries.
      See Also:
      Constant Field Values
    • robotstxt

      protected Robotstxt robotstxt
    • robotsFetched

      protected long robotsFetched
    • validRobots

      protected boolean validRobots
    • substats

      protected FetchStats substats
    • consecutiveConnectionErrors

      protected int consecutiveConnectionErrors
  • Constructor Details

    • CrawlServer

      public CrawlServer(String h)
      Creates a new CrawlServer object.
      Parameters:
      h - the host string for the server.
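      A minimal usage sketch (not from the source): two CrawlServer instances for the same host, distinguished by port. The host strings are illustrative, and the assumption that a ":8080" suffix is reflected by getPort() follows getName()'s note that the server string may include a port number.

        import org.archive.modules.net.CrawlServer;

        public class CrawlServerExample {
            public static void main(String[] args) {
                // Two services on one host, differentiated by port (hosts illustrative).
                CrawlServer web = new CrawlServer("example.com");      // no explicit port
                CrawlServer alt = new CrawlServer("example.com:8080"); // explicit port

                System.out.println(web.getName() + " port=" + web.getPort()); // -1 expected
                System.out.println(alt.getName() + " port=" + alt.getPort()); // 8080 assumed
            }
        }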
  • Method Details

    • toString

      public String toString()
      Overrides:
      toString in class Object
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object
    • equals

      public boolean equals(Object obj)
      Overrides:
      equals in class Object
    • getRobotstxt

      public Robotstxt getRobotstxt()
    • updateRobots

      public void updateRobots(CrawlURI curi)
      Update the server's robotstxt.

      Heritrix's policy on robots.txt HTTP responses:

      • 2xx: conditional allow (parse robots.txt)
      • 3xx: full allow
      • 4xx: full allow
      • 5xx: full allow
      • Unsuccessful requests or incomplete data: full allow

      For comparison, Google's policy as of Oct 2017:

      • 2xx: conditional allow (parse robots.txt)
      • 3xx: conditional allow (attempt to follow redirect and parse robots.txt)
      • 4xx: full allow
      • 5xx: full disallow
      • "Unsuccessful requests or incomplete data: Handling of a robots.txt file which cannot be fetched due to DNS or networking issues such as timeouts, invalid responses, reset / hung up connections, HTTP chunking errors, etc. is undefined."
      https://developers.google.com/search/reference/robots_txt#handling-http-result-codes
      Parameters:
      curi - the crawl URI containing the fetched robots.txt
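      A hedged usage sketch follows: the caller hands updateRobots the CrawlURI that carried the robots.txt fetch attempt and then consults isValidRobots(). Producing that CrawlURI is outside this class, so the wrapper method and class name below are illustrative.

        import org.archive.modules.CrawlURI;
        import org.archive.modules.net.CrawlServer;

        class RobotsRefreshSketch {
            // robotsCuri is assumed to be the CrawlURI that carried the
            // robots.txt fetch attempt (successful or not).
            static void refreshRobots(CrawlServer server, CrawlURI robotsCuri) {
                // Applies the response-code policy above: 2xx is parsed, while
                // 3xx/4xx/5xx and failed fetches fall back to "full allow".
                server.updateRobots(robotsCuri);
                if (server.isValidRobots()) {
                    // Parsed rules are now available via getRobotstxt().
                    System.out.println("robots.txt applied for " + server.getName());
                }
            }
        }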
    • getName

      public String getName()
      Returns:
      The server string, which may include a port number.
    • getPort

      public int getPort()
      Get the port number for this server.
      Returns:
      the port number, or -1 if not known (the default for the protocol applies)
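      A small sketch of resolving the -1 sentinel to a protocol default; CrawlServer itself does not know the scheme, so the scheme parameter and the helper are assumptions of this sketch.

        import org.archive.modules.net.CrawlServer;

        class PortSketch {
            // CrawlServer reports -1 for "not known"; supplying the protocol
            // default is the caller's job.
            static int effectivePort(CrawlServer server, String scheme) {
                int port = server.getPort();
                return port != -1 ? port : ("https".equalsIgnoreCase(scheme) ? 443 : 80);
            }
        }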
    • incrementConsecutiveConnectionErrors

      public void incrementConsecutiveConnectionErrors()
    • resetConsecutiveConnectionErrors

      public void resetConsecutiveConnectionErrors()
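      These two methods carry no description; the bookkeeping they support is a consecutive-failure counter, sketched below. The outcome flag and helper are illustrative, not part of this API.

        import org.archive.modules.net.CrawlServer;

        class ConnectionErrorSketch {
            // Count connection failures in a row; any success clears the streak.
            static void recordConnectionOutcome(CrawlServer server, boolean connected) {
                if (connected) {
                    server.resetConsecutiveConnectionErrors();
                } else {
                    server.incrementConsecutiveConnectionErrors();
                }
            }
        }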
    • getCredentials

      public Set<Credential> getCredentials()
      Returns:
      Credential avatars for this server. Returns null if none.
    • hasCredentials

      public boolean hasCredentials()
      Returns:
      True if there are avatars attached to this instance.
    • addCredential

      public void addCredential(Credential cred)
      Add an avatar.
      Parameters:
      cred - Credential avatar to add to set of avatars.
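      A hedged sketch tying the three credential methods together; the import path for Credential is assumed from Heritrix 3's package layout.

        import java.util.Set;
        import org.archive.modules.credential.Credential;
        import org.archive.modules.net.CrawlServer;

        class CredentialSketch {
            // Attach an avatar, then read the set back, guarding against the
            // documented null return when no avatars are present.
            static void attach(CrawlServer server, Credential cred) {
                server.addCredential(cred);
                if (server.hasCredentials()) {
                    Set<Credential> avatars = server.getCredentials(); // null if none
                    System.out.println(avatars.size() + " credential(s) on " + server.getName());
                }
            }
        }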
    • isValidRobots

      public boolean isValidRobots()
      If true, valid robots.txt information has been retrieved. If false, either no attempt has been made to fetch robots.txt or the attempt failed.
      Returns:
      Returns the validRobots.
    • getServerKey

      public static String getServerKey(org.archive.net.UURI uuri) throws org.apache.commons.httpclient.URIException
      Get the key to use when looking up server instances.
      Returns:
      String to use as server key.
      Throws:
      org.apache.commons.httpclient.URIException
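      A sketch of deriving a lookup key from a raw URL; UURIFactory.getInstance is Heritrix's usual way to build a UURI, and the wrapper class is illustrative.

        import org.apache.commons.httpclient.URIException;
        import org.archive.net.UURI;
        import org.archive.net.UURIFactory;
        import org.archive.modules.net.CrawlServer;

        class ServerKeySketch {
            // Derive the key used to look up server instances.
            static String keyFor(String url) throws URIException {
                UURI uuri = UURIFactory.getInstance(url); // may throw URIException
                return CrawlServer.getServerKey(uuri);
            }
        }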
    • getSubstats

      public FetchStats getSubstats()
      Specified by:
      getSubstats in interface FetchStats.HasFetchStats
    • isRobotsExpired

      public boolean isRobotsExpired(int validityDuration)
      Is the robots policy expired? This method also returns true if no attempt has yet been made to fetch robots.txt for this server.
      Returns:
      true if the robots policy is expired.
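      Because the method also covers the never-fetched case, a single check suffices to decide on a (re)fetch; the unit of validityDuration is not stated here, and the helper below is illustrative.

        import org.archive.modules.net.CrawlServer;

        class RobotsExpirySketch {
            // One check covers both a stale policy and one never fetched,
            // per the note above.
            static boolean robotsFetchNeeded(CrawlServer server, int validityDuration) {
                return server.isRobotsExpired(validityDuration);
            }
        }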
    • autoregisterTo

      public static void autoregisterTo(org.archive.bdb.AutoKryo kryo)
    • getKey

      public String getKey()
      Specified by:
      getKey in interface org.archive.util.IdentityCacheable
    • makeDirty

      public void makeDirty()
      Specified by:
      makeDirty in interface org.archive.util.IdentityCacheable
    • setIdentityCache

      public void setIdentityCache(org.archive.util.ObjectIdentityCache<?> cache)
      Specified by:
      setIdentityCache in interface org.archive.util.IdentityCacheable
    • getHttpAuthChallenges

      public Map<String,String> getHttpAuthChallenges()
    • setHttpAuthChallenges

      public void setHttpAuthChallenges(Map<String,String> httpAuthChallenges)
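      These accessors carry no description; a hedged sketch of caching authentication challenges on the server follows. Keying the map by auth scheme and the realm value are assumptions of this sketch, not stated by this doc.

        import java.util.HashMap;
        import java.util.Map;
        import org.archive.modules.net.CrawlServer;

        class AuthChallengeSketch {
            // Remember a challenge so later requests can pre-authenticate.
            static void rememberChallenge(CrawlServer server) {
                Map<String, String> challenges = new HashMap<>();
                challenges.put("basic", "Basic realm=\"example\""); // hypothetical entry
                server.setHttpAuthChallenges(challenges);
                System.out.println("stored: " + server.getHttpAuthChallenges().keySet());
            }
        }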