Package org.archive.modules.net
Class CrawlServer
java.lang.Object
org.archive.modules.net.CrawlServer
- All Implemented Interfaces:
Serializable, FetchStats.HasFetchStats, org.archive.util.IdentityCacheable
public class CrawlServer extends Object implements Serializable, FetchStats.HasFetchStats, org.archive.util.IdentityCacheable
Represents a single remote "server".
A server is a service on a host. A single host may run more than one
service, each distinguished by its port number.
- Author:
- gojomo
- See Also:
- Serialized Form
-
Field Summary
Fields
Modifier and Type   Field                        Description
protected int       consecutiveConnectionErrors
static long         MIN_ROBOTS_RETRIES           Only check if robots-fetch is perhaps superfluous after this many tries.
static long         ROBOTS_NOT_FETCHED
protected long      robotsFetched
protected Robotstxt robotstxt
protected FetchStats substats
protected boolean   validRobots
-
Constructor Summary
Constructors
Constructor             Description
CrawlServer(String h)   Creates a new CrawlServer object.
-
Method Summary
Modifier and Type   Method                                    Description
void                addCredential(Credential cred)            Add an avatar.
static void         autoregisterTo(org.archive.bdb.AutoKryo kryo)
boolean             equals(Object obj)
Set<Credential>     getCredentials()
Map<String,String>  getHttpAuthChallenges()
String              getKey()
String              getName()
int                 getPort()                                 Get the port number for this server.
Robotstxt           getRobotstxt()
static String       getServerKey(org.archive.net.UURI uuri)   Get key to use doing lookup on server instances.
FetchStats          getSubstats()
boolean             hasCredentials()
int                 hashCode()
void                incrementConsecutiveConnectionErrors()
boolean             isRobotsExpired(int validityDuration)     Is the robots policy expired.
boolean             isValidRobots()                           If true then valid robots.txt information has been retrieved.
void                makeDirty()
void                resetConsecutiveConnectionErrors()
void                setHttpAuthChallenges(Map<String,String> httpAuthChallenges)
void                setIdentityCache(org.archive.util.ObjectIdentityCache<?> cache)
String              toString()
void                updateRobots(CrawlURI curi)               Update the server's robotstxt.
-
Field Details
-
ROBOTS_NOT_FETCHED
public static final long ROBOTS_NOT_FETCHED
- See Also:
- Constant Field Values
-
MIN_ROBOTS_RETRIES
public static final long MIN_ROBOTS_RETRIES
Only check if robots-fetch is perhaps superfluous after this many tries.
- See Also:
- Constant Field Values
-
robotstxt
-
robotsFetched
protected long robotsFetched
-
validRobots
protected boolean validRobots
-
substats
-
consecutiveConnectionErrors
protected int consecutiveConnectionErrors
-
-
Constructor Details
-
CrawlServer
Creates a new CrawlServer object.
- Parameters:
h - the host string for the server.
-
-
Method Details
-
toString
-
hashCode
public int hashCode()
-
equals
-
getRobotstxt
-
updateRobots
Update the server's robotstxt.
Heritrix policy on robots.txt HTTP responses:
- 2xx: conditional allow (parse robots.txt)
- 3xx: full allow
- 4xx: full allow
- 5xx: full allow
- Unsuccessful requests or incomplete data: full allow
For comparison, Google's policy as of Oct 2017:
- 2xx: conditional allow (parse robots.txt)
- 3xx: conditional allow (attempt to follow redirect and parse robots.txt)
- 4xx: full allow
- 5xx: full disallow
- "Unsuccessful requests or incomplete data: Handling of a robots.txt file which cannot be fetched due to DNS or networking issues such as timeouts, invalid responses, reset / hung up connections, HTTP chunking errors, etc. is undefined."
- Parameters:
curi - the crawl URI containing the fetched robots.txt
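The Heritrix policy listed above can be read as a small decision function. The following is a minimal sketch of that mapping only, not the body of updateRobots: the class name RobotsPolicySketch, the Outcome enum, and the convention that a non-positive status stands for an unsuccessful request are all hypothetical and introduced here for illustration.

```java
public class RobotsPolicySketch {
    /** Hypothetical outcome labels; these are not Heritrix types. */
    public enum Outcome { CONDITIONAL_ALLOW, FULL_ALLOW }

    /**
     * Maps the HTTP status of a robots.txt fetch to an outcome, following
     * the Heritrix list above. Assumption: a status of 0 or less stands
     * for an unsuccessful request or incomplete data.
     */
    public static Outcome decide(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300) {
            // 2xx: the robots.txt body is parsed and its rules applied.
            return Outcome.CONDITIONAL_ALLOW;
        }
        // 3xx, 4xx, 5xx and failed fetches are all treated as full allow.
        // (1xx is not covered by the list; this sketch lumps it in here.)
        return Outcome.FULL_ALLOW;
    }

    public static void main(String[] args) {
        System.out.println(decide(200));
        System.out.println(decide(503));
    }
}
```

Under the Google policy quoted above, the 5xx branch would instead map to a full disallow, which is the main behavioural difference between the two lists.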
-
getName
- Returns:
- The server string which might include a port number.
-
getPort
public int getPort()
Get the port number for this server.
- Returns:
- the port number or -1 if not known (uses default for protocol)
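Since getName may return a server string that includes a port, and getPort returns -1 when the port is not known, the relationship between the two can be illustrated as follows. This is a sketch under assumptions, not the CrawlServer implementation: the class ServerPortSketch and the method portOf are hypothetical, and the "host:port" form with a plain colon separator is assumed (IPv6 literals are not handled).

```java
public class ServerPortSketch {
    /**
     * Returns the port embedded in a "host:port" server string, or -1
     * when no usable port is present. Hypothetical helper for illustration.
     */
    public static int portOf(String serverName) {
        int colon = serverName.lastIndexOf(':');
        if (colon < 0) {
            return -1; // no port given; the caller falls back to the protocol default
        }
        try {
            return Integer.parseInt(serverName.substring(colon + 1));
        } catch (NumberFormatException e) {
            return -1; // text after the colon is not a number
        }
    }

    public static void main(String[] args) {
        System.out.println(portOf("example.org:8080"));
        System.out.println(portOf("example.org"));
    }
}
```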
-
incrementConsecutiveConnectionErrors
public void incrementConsecutiveConnectionErrors()
-
resetConsecutiveConnectionErrors
public void resetConsecutiveConnectionErrors()
-
getCredentials
- Returns:
- Credential avatars for this server. Returns null if none.
-
hasCredentials
public boolean hasCredentials()
- Returns:
- True if there are avatars attached to this instance.
-
addCredential
Add an avatar.
- Parameters:
cred - Credential avatar to add to set of avatars.
-
isValidRobots
public boolean isValidRobots()
If true, then valid robots.txt information has been retrieved. If false, either no attempt has been made to fetch robots.txt or the attempt failed.
- Returns:
- Returns the validRobots.
-
getServerKey
public static String getServerKey(org.archive.net.UURI uuri) throws org.apache.commons.httpclient.URIException
Get the key to use when looking up server instances.
- Returns:
- String to use as server key.
- Throws:
org.apache.commons.httpclient.URIException
-
getSubstats
- Specified by:
getSubstats in interface FetchStats.HasFetchStats
-
isRobotsExpired
public boolean isRobotsExpired(int validityDuration)
Is the robots policy expired. This method will also return true if we haven't tried to get the robots.txt for this server.
- Returns:
- true if the robots policy is expired.
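The expiry rule described for isRobotsExpired can be sketched as below. This is an illustration under stated assumptions rather than the actual method: the class RobotsExpirySketch is hypothetical, the sentinel value -1 for "never fetched" is assumed (this page does not give the value of ROBOTS_NOT_FETCHED), and validityDuration is assumed to be in seconds while the fetch time is in milliseconds. The current time is passed in explicitly so the logic is easy to test.

```java
public class RobotsExpirySketch {
    // Assumed sentinel for "robots.txt never fetched"; the real constant's value is not shown on this page.
    public static final long NOT_FETCHED = -1L;

    /**
     * Returns true if robots.txt was never fetched, or if it was fetched
     * more than validitySeconds before nowMillis.
     */
    public static boolean isExpired(long fetchedAtMillis, int validitySeconds, long nowMillis) {
        if (fetchedAtMillis == NOT_FETCHED) {
            return true; // nothing fetched yet counts as expired
        }
        long validityMillis = validitySeconds * 1000L; // assumed unit conversion
        return fetchedAtMillis + validityMillis < nowMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(isExpired(NOT_FETCHED, 3600, now));
        System.out.println(isExpired(now, 3600, now));
    }
}
```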
-
autoregisterTo
public static void autoregisterTo(org.archive.bdb.AutoKryo kryo)
-
getKey
- Specified by:
getKey in interface org.archive.util.IdentityCacheable
-
makeDirty
public void makeDirty()
- Specified by:
makeDirty in interface org.archive.util.IdentityCacheable
-
setIdentityCache
public void setIdentityCache(org.archive.util.ObjectIdentityCache<?> cache)
- Specified by:
setIdentityCache in interface org.archive.util.IdentityCacheable
-
getHttpAuthChallenges
-
setHttpAuthChallenges
-