Class CrawlServer

java.lang.Object
org.archive.modules.net.CrawlServer
All Implemented Interfaces:
Serializable, FetchStats.HasFetchStats, org.archive.util.IdentityCacheable

public class CrawlServer
extends Object
implements Serializable, FetchStats.HasFetchStats, org.archive.util.IdentityCacheable
Represents a single remote "server". A server is a service on a host; there may be more than one service on a host, differentiated by port number.
Author:
gojomo
See Also:
Serialized Form
  • Field Details

    • ROBOTS_NOT_FETCHED

      public static final long ROBOTS_NOT_FETCHED
      See Also:
      Constant Field Values
    • MIN_ROBOTS_RETRIES

      public static final long MIN_ROBOTS_RETRIES
      Only check whether a robots.txt fetch may be superfluous after this many tries.
      See Also:
      Constant Field Values
    • robotstxt

      protected Robotstxt robotstxt
    • robotsFetched

      protected long robotsFetched
    • validRobots

      protected boolean validRobots
    • substats

      protected FetchStats substats
    • consecutiveConnectionErrors

      protected int consecutiveConnectionErrors
  • Constructor Details

    • CrawlServer

      public CrawlServer(String h)
      Creates a new CrawlServer object.
      Parameters:
      h - the host string for the server.
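      A minimal usage sketch (not from the source): two CrawlServer instances for the same host, distinguished by port. The host strings are illustrative, and the assumption that a ":8080" suffix is reflected by getPort() follows getName()'s note that the server string may include a port number.

        import org.archive.modules.net.CrawlServer;

        public class CrawlServerExample {
            public static void main(String[] args) {
                // Two services on one host, differentiated by port (hosts illustrative).
                CrawlServer web = new CrawlServer("example.com");      // no explicit port
                CrawlServer alt = new CrawlServer("example.com:8080"); // explicit port

                System.out.println(web.getName() + " port=" + web.getPort()); // -1 expected
                System.out.println(alt.getName() + " port=" + alt.getPort()); // 8080 assumed
            }
        }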
  • Method Details

    • toString

      public String toString()
      Overrides:
      toString in class Object
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object
    • equals

      public boolean equals(Object obj)
      Overrides:
      equals in class Object
    • getRobotstxt

      public Robotstxt getRobotstxt()
    • updateRobots

      public void updateRobots(CrawlURI curi)
      Update the server's robotstxt.

      Heritrix's policy on robots.txt HTTP responses:

      • 2xx: conditional allow (parse robots.txt)
      • 3xx: full allow
      • 4xx: full allow
      • 5xx: full allow
      • Unsuccessful requests or incomplete data: full allow

      For comparison, Google's policy as of Oct 2017:

      • 2xx: conditional allow (parse robots.txt)
      • 3xx: conditional allow (attempt to follow redirect and parse robots.txt)
      • 4xx: full allow
      • 5xx: full disallow
      • "Unsuccessful requests or incomplete data: Handling of a robots.txt file which cannot be fetched due to DNS or networking issues such as timeouts, invalid responses, reset / hung up connections, HTTP chunking errors, etc. is undefined."
      https://developers.google.com/search/reference/robots_txt#handling-http-result-codes
      Parameters:
      curi - the crawl URI containing the fetched robots.txt
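      A hedged usage sketch follows: the caller hands updateRobots the CrawlURI that carried the robots.txt fetch attempt and then consults isValidRobots(). Producing that CrawlURI is outside this class, so the wrapper method and class name below are illustrative.

        import org.archive.modules.CrawlURI;
        import org.archive.modules.net.CrawlServer;

        class RobotsRefreshSketch {
            // robotsCuri is assumed to be the CrawlURI that carried the
            // robots.txt fetch attempt (successful or not).
            static void refreshRobots(CrawlServer server, CrawlURI robotsCuri) {
                // Applies the response-code policy above: 2xx is parsed, while
                // 3xx/4xx/5xx and failed fetches fall back to "full allow".
                server.updateRobots(robotsCuri);
                if (server.isValidRobots()) {
                    // Parsed rules are now available via getRobotstxt().
                    System.out.println("robots.txt applied for " + server.getName());
                }
            }
        }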
    • getName

      public String getName()
      Returns:
      The server string, which may include a port number.
    • getPort

      public int getPort()
      Get the port number for this server.
      Returns:
      the port number, or -1 if not known (the default for the protocol applies)
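      A small sketch of resolving the -1 sentinel to a protocol default; CrawlServer itself does not know the scheme, so the scheme parameter and the helper are assumptions of this sketch.

        import org.archive.modules.net.CrawlServer;

        class PortSketch {
            // CrawlServer reports -1 for "not known"; supplying the protocol
            // default is the caller's job.
            static int effectivePort(CrawlServer server, String scheme) {
                int port = server.getPort();
                return port != -1 ? port : ("https".equalsIgnoreCase(scheme) ? 443 : 80);
            }
        }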
    • incrementConsecutiveConnectionErrors

      public void incrementConsecutiveConnectionErrors()
    • resetConsecutiveConnectionErrors

      public void resetConsecutiveConnectionErrors()
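      These two methods carry no description; the bookkeeping they support is a consecutive-failure counter, sketched below. The outcome flag and helper are illustrative, not part of this API.

        import org.archive.modules.net.CrawlServer;

        class ConnectionErrorSketch {
            // Count connection failures in a row; any success clears the streak.
            static void recordConnectionOutcome(CrawlServer server, boolean connected) {
                if (connected) {
                    server.resetConsecutiveConnectionErrors();
                } else {
                    server.incrementConsecutiveConnectionErrors();
                }
            }
        }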
    • getCredentials

      public Set<Credential> getCredentials()
      Returns:
      Credential avatars for this server. Returns null if none.
    • hasCredentials

      public boolean hasCredentials()
      Returns:
      True if there are avatars attached to this instance.
    • addCredential

      public void addCredential(Credential cred)
      Add an avatar.
      Parameters:
      cred - Credential avatar to add to set of avatars.
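      A hedged sketch tying the three credential methods together; the import path for Credential is assumed from Heritrix 3's package layout.

        import java.util.Set;
        import org.archive.modules.credential.Credential;
        import org.archive.modules.net.CrawlServer;

        class CredentialSketch {
            // Attach an avatar, then read the set back, guarding against the
            // documented null return when no avatars are present.
            static void attach(CrawlServer server, Credential cred) {
                server.addCredential(cred);
                if (server.hasCredentials()) {
                    Set<Credential> avatars = server.getCredentials(); // null if none
                    System.out.println(avatars.size() + " credential(s) on " + server.getName());
                }
            }
        }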
    • isValidRobots

      public boolean isValidRobots()
      If true, valid robots.txt information has been retrieved. If false, either no attempt has been made to fetch robots.txt or the attempt failed.
      Returns:
      Returns the validRobots.
    • getServerKey

      public static String getServerKey(org.archive.net.UURI uuri) throws org.apache.commons.httpclient.URIException
      Get the key to use when looking up server instances.
      Returns:
      String to use as server key.
      Throws:
      org.apache.commons.httpclient.URIException
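      A sketch of deriving a lookup key from a raw URL; UURIFactory.getInstance is Heritrix's usual way to build a UURI, and the wrapper class is illustrative.

        import org.apache.commons.httpclient.URIException;
        import org.archive.net.UURI;
        import org.archive.net.UURIFactory;
        import org.archive.modules.net.CrawlServer;

        class ServerKeySketch {
            // Derive the key used to look up server instances.
            static String keyFor(String url) throws URIException {
                UURI uuri = UURIFactory.getInstance(url); // may throw URIException
                return CrawlServer.getServerKey(uuri);
            }
        }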
    • getSubstats

      public FetchStats getSubstats()
      Specified by:
      getSubstats in interface FetchStats.HasFetchStats
    • isRobotsExpired

      public boolean isRobotsExpired(int validityDuration)
      Is the robots policy expired? This method also returns true if no attempt has yet been made to fetch robots.txt for this server.
      Returns:
      true if the robots policy is expired.
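      Because the method also covers the never-fetched case, a single check suffices to decide on a (re)fetch; the unit of validityDuration is not stated here, and the helper below is illustrative.

        import org.archive.modules.net.CrawlServer;

        class RobotsExpirySketch {
            // One check covers both a stale policy and one never fetched,
            // per the note above.
            static boolean robotsFetchNeeded(CrawlServer server, int validityDuration) {
                return server.isRobotsExpired(validityDuration);
            }
        }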
    • autoregisterTo

      public static void autoregisterTo(org.archive.bdb.AutoKryo kryo)
    • getKey

      public String getKey()
      Specified by:
      getKey in interface org.archive.util.IdentityCacheable
    • makeDirty

      public void makeDirty()
      Specified by:
      makeDirty in interface org.archive.util.IdentityCacheable
    • setIdentityCache

      public void setIdentityCache(org.archive.util.ObjectIdentityCache<?> cache)
      Specified by:
      setIdentityCache in interface org.archive.util.IdentityCacheable
    • getHttpAuthChallenges

      public Map<String,String> getHttpAuthChallenges()
    • setHttpAuthChallenges

      public void setHttpAuthChallenges(Map<String,String> httpAuthChallenges)
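      These accessors carry no description; a hedged sketch of caching authentication challenges on the server follows. Keying the map by auth scheme and the realm value are assumptions of this sketch, not stated by this doc.

        import java.util.HashMap;
        import java.util.Map;
        import org.archive.modules.net.CrawlServer;

        class AuthChallengeSketch {
            // Remember a challenge so later requests can pre-authenticate.
            static void rememberChallenge(CrawlServer server) {
                Map<String, String> challenges = new HashMap<>();
                challenges.put("basic", "Basic realm=\"example\""); // hypothetical entry
                server.setHttpAuthChallenges(challenges);
                System.out.println("stored: " + server.getHttpAuthChallenges().keySet());
            }
        }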